WordPress Export XML & Parse Elements to CSV

🧟‍♀️ Julia Nash published 10 months ago

I have to parse a WordPress XML export for specific elements and write those elements to a CSV file.

This shouldn't be bad. I decided to do it in Python.

This is the format of the XML, with example values for the elements to be extracted.

# <item>
#         <title>70-Search-Blah</title>
#         <link>https://blah.blah.com/blah/deploying-blah/blah3/blah4/blah5/</link>
#         <pubDate>Fri, 08 Nov 2019 17:29:49 +0000</pubDate>
#         <dc:creator><![CDATA[blahblah@in.blah.com]]></dc:creator>
#         <guid isPermaLink="false">https://blah.com/image.png</guid>
#         <description></description>
#         <content:encoded><![CDATA[]]></content:encoded>
#         <excerpt:encoded><![CDATA[]]></excerpt:encoded>
#         <wp:post_type><![CDATA[attachment]]></wp:post_type>
#         <wp:post_password><![CDATA[]]></wp:post_password>
#         <wp:is_sticky>0</wp:is_sticky>
#         <wp:postmeta>
#             <wp:meta_key><![CDATA[_wp_attached_file]]></wp:meta_key>
#             <wp:meta_value><![CDATA[2019/11/70-blah.png]]></wp:meta_value>
#         </wp:postmeta>
#         <wp:postmeta>
#             <wp:meta_key><![CDATA[_wp_attachment_metadata]]></wp:meta_key>
#         <wp:postmeta>
#             <wp:meta_key><![CDATA[_amplitudeMeta]]></wp:meta_key>
#     </item>
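
Tags with a prefix, such as dc:creator above, live in an XML namespace, so ElementTree only finds them by name if it is given a prefix-to-URI mapping. Here is a minimal sketch of that lookup. It assumes a cut-down sample shaped like the item above, wrapped in rss and channel elements, and it assumes the standard Dublin Core URI for dc. The wp: prefix works the same way, but its URI depends on the export version, so copy it from the xmlns attributes at the top of your own file.

```python
import xml.etree.ElementTree as ET

# Assumption: a trimmed-down sample, not the real export file.
SAMPLE = """<rss xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <item>
      <title>70-Search-Blah</title>
      <pubDate>Fri, 08 Nov 2019 17:29:49 +0000</pubDate>
      <dc:creator><![CDATA[blahblah@in.blah.com]]></dc:creator>
    </item>
  </channel>
</rss>"""

# Map the prefix used in find() calls to the namespace URI declared in the file.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

root = ET.fromstring(SAMPLE)
item = root.find("channel/item")
creator = item.find("dc:creator", NS)

print(creator.tag)   # {http://purl.org/dc/elements/1.1/}creator
print(creator.text)  # blahblah@in.blah.com
```

The CDATA wrapper disappears during parsing, so .text is just the email address.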


Resources

XML tree and elements library - ElementTree

"XML is an inherently hierarchical data format, and the most natural way to represent it is with a tree. ET has two classes for this purpose - ElementTree represents the whole XML document as a tree, and Element represents a single node in this tree. Interactions with the whole document (reading and writing to/from files) are usually done on the ElementTree level. Interactions with a single XML element and its sub-elements are done on the Element level."
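
To make the two levels from that quote concrete, here is a small sketch. The XML string and its titles are made up for illustration, and io.StringIO stands in for a file on disk.

```python
import io
import xml.etree.ElementTree as ET

# Made-up document standing in for a real file.
xml_text = (
    "<channel>"
    "<item><title>First</title></item>"
    "<item><title>Second</title></item>"
    "</channel>"
)

# ElementTree level: the whole document, read from a file-like object.
tree = ET.parse(io.StringIO(xml_text))

# Element level: a single node, with its tag, text, and children.
root = tree.getroot()
titles = [item.findtext("title") for item in root.findall("item")]

print(root.tag, titles)  # channel ['First', 'Second']
```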

StackOverflow example on usage of ElementTree library.

CSV

I also used Python's built-in csv library to write the extracted elements out to the CSV file.
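
One reason to use the csv library rather than joining strings by hand is that the pubDate values contain a comma. A quick sketch of what the writer does with that, using an in-memory buffer in place of a real file:

```python
import csv
import io

# In-memory buffer standing in for the output file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Key", "Value"])
writer.writerow(["pubDate", "Fri, 08 Nov 2019 17:29:49 +0000"])

lines = buffer.getvalue().splitlines()
print(lines[1])  # pubDate,"Fri, 08 Nov 2019 17:29:49 +0000"
```

The writer quotes the date automatically, so it stays in a single column when the file is opened later.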


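import csv
import xml.etree.ElementTree as ET

# Namespace URI behind the dc: prefix on <dc:creator>.
NS = {'dc': 'http://purl.org/dc/elements/1.1/'}

tree = ET.parse('Desktop/recipe.xml')
root = tree.getroot()

# newline='' keeps the csv module from writing blank lines between rows on Windows.
with open('recipe_output.csv', 'w', newline='') as csvfile:
    fwriter = csv.writer(csvfile)
    fwriter.writerow(["Key", "Value"])

    # iter('item') finds every <item> wherever it is nested; in a WordPress
    # export the items sit under <rss><channel>, not directly under the root.
    for item in root.iter('item'):
        # Look each element up by name instead of position, so a missing
        # or reordered child cannot put the wrong value in a row.
        fwriter.writerow(('title', item.findtext('title')))
        fwriter.writerow(('link', item.findtext('link')))
        fwriter.writerow(('pubDate', item.findtext('pubDate')))
        fwriter.writerow(('creator', item.findtext('dc:creator', namespaces=NS)))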

# output from program parsing wp xml export looks like this:
# title   nsfpst3
# link   https://blah.blah.com/blah/?attachment_id=39203
# pubDate   Wed, 03 May 2017 12:08:37 +0000
# creator   spamalot@gmail.com

This is the data I needed to extract. We need to automate emailing authors about their older publications, but it's a lot of people! So I am trying to narrow the export down to just the data that requires. Hopefully I will have an API for the email and can use a cron job to schedule sending the emails every month or so.
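
As a rough sketch of the narrowing-down step, the pubDate strings can be parsed with the standard library and compared against a cutoff. The is_older_than helper, the 365-day threshold, and the fixed "now" are all assumptions for illustration. The rows reuse the example values shown earlier.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def is_older_than(pub_date, days, now):
    """Return True if an RFC 2822 date string is more than `days` before `now`."""
    published = parsedate_to_datetime(pub_date)
    return now - published > timedelta(days=days)


# (creator, pubDate) pairs as they would come out of the export.
rows = [
    ("spamalot@gmail.com", "Wed, 03 May 2017 12:08:37 +0000"),
    ("blahblah@in.blah.com", "Fri, 08 Nov 2019 17:29:49 +0000"),
]

# Assumption: a fixed reference date so the example is repeatable;
# a real job would use datetime.now(timezone.utc).
now = datetime(2020, 1, 1, tzinfo=timezone.utc)

stale = [email for email, pub in rows if is_older_than(pub, 365, now)]
print(stale)  # ['spamalot@gmail.com']
```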
