WordPress Export XML & Parse Elements to CSV
🧟♀️ Julia Nash published 7 months ago
I need to parse a WordPress export XML file for specific elements and write those elements out to a CSV.
This shouldn't be too bad. I decided to do it in Python.
This is the format of the data in the XML.
The example below shows the kind of values to be extracted from the XML file.
# <item>
#   <title>70-Search-Blah</title>
#   <link>https://blah.blah.com/blah/deploying-blah/blah3/blah4/blah5/</link>
#   <pubDate>Fri, 08 Nov 2019 17:29:49 +0000</pubDate>
#   <dc:creator><![CDATA[blahblah@in.blah.com]]></dc:creator>
#   <guid isPermaLink="false">https://blah.com/image.png</guid>
#   <description></description>
#   <content:encoded><![CDATA[]]></content:encoded>
#   <excerpt:encoded><![CDATA[]]></excerpt:encoded>
#   <wp:post_type><![CDATA[attachment]]></wp:post_type>
#   <wp:post_password><![CDATA[]]></wp:post_password>
#   <wp:is_sticky>0</wp:is_sticky>
#   <wp:postmeta>
#     <wp:meta_key><![CDATA[_wp_attached_file]]></wp:meta_key>
#     <wp:meta_value><![CDATA[2019/11/70-blah.png]]></wp:meta_value>
#   </wp:postmeta>
#   <wp:postmeta>
#     <wp:meta_key><![CDATA[_wp_attachment_metadata]]></wp:meta_key>
#   <wp:postmeta>
#     <wp:meta_key><![CDATA[_amplitudeMeta]]></wp:meta_key>
# </item>
Resources
XML tree and elements library - ElementTree
"XML is an inherently hierarchical data format, and the most natural way to represent it is with a tree. ET has two classes for this purpose - ElementTree represents the whole XML document as a tree, and Element represents a single node in this tree. Interactions with the whole document (reading and writing to/from files) are usually done on the ElementTree level. Interactions with a single XML element and its sub-elements are done on the Element level."
A Stack Overflow example of how to use the ElementTree library.
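To make the ElementTree versus Element distinction from the quote above a bit more concrete, here is a minimal sketch. The inline XML is a made-up, trimmed-down sample (not the real export), and `describe` is a hypothetical helper name used only for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down sample; the real export has many more elements.
SAMPLE = """<channel>
  <item>
    <title>70-Search-Blah</title>
    <pubDate>Fri, 08 Nov 2019 17:29:49 +0000</pubDate>
  </item>
</channel>"""

# ElementTree level: wraps the whole document. ET.parse() accepts a file name
# or a file-like object, so a StringIO stands in for a file on disk here.
tree = ET.parse(io.StringIO(SAMPLE))

# Element level: a single node. The root of this sample is <channel>.
root = tree.getroot()


def describe(element):
    """Return (tag, text) pairs for the direct children of one Element."""
    return [(child.tag, child.text) for child in element]


first_item = root[0]  # indexing an Element returns its child Elements
pairs = describe(first_item)
print(type(tree).__name__, type(root).__name__)
print(pairs)
```

Reading the file happens on the tree; everything after `getroot()` works with individual elements, which is the level the extraction script operates on.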
CSV
I also used the csv library to write the values extracted from the XML out to a CSV file.
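import csv
import xml.etree.ElementTree as ET

tree = ET.parse('Desktop/recipe.xml')
root = tree.getroot()

# newline='' keeps the csv module from writing blank lines between rows on Windows
with open('recipe_output.csv', 'w', newline='') as csvfile:
    fwriter = csv.writer(csvfile)
    fwriter.writerow(["Key", "Value"])
    # iter('item') finds every <item>, whether it sits directly under the root
    # or inside <channel> as in a full WordPress export
    for item in root.iter('item'):
        # positional lookups assume the child order shown in the sample above
        title = item[0].tag, item[0].text
        fwriter.writerow(title)
        link = item[1].tag, item[1].text
        fwriter.writerow(link)
        pubDate = item[2].tag, item[2].text
        fwriter.writerow(pubDate)
        # the dc:creator tag comes back namespace-qualified, so label it by hand
        email = 'creator', item[3].text
        fwriter.writerow(email)

# output from program parsing wp xml export looks like this:
# title nsfpst3
# link https://blah.blah.com/blah/?attachment_id=39203
# pubDate Wed, 03 May 2017 12:08:37 +0000
# creator spamalot@gmail.com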
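Looking elements up by position breaks as soon as an item has its children in a different order, so a variant that looks them up by tag name may be sturdier. The following is only a sketch under some assumptions: the inline XML is a made-up miniature export, `items_to_rows` and `rows_to_csv` are hypothetical helper names, and the `dc` namespace URI is the standard Dublin Core one, which should be checked against the `xmlns:dc` attribute in the actual export. It also writes one CSV row per item rather than key/value pairs.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Prefix -> URI map. Assumption: the export declares the standard Dublin Core
# namespace; compare with the xmlns:dc attribute on the <rss> element.
NS = {"dc": "http://purl.org/dc/elements/1.1/"}

# Hypothetical miniature export, for illustration only.
SAMPLE = """<rss xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Site title</title>
    <item>
      <title>70-Search-Blah</title>
      <link>https://blah.blah.com/blah/</link>
      <pubDate>Fri, 08 Nov 2019 17:29:49 +0000</pubDate>
      <dc:creator><![CDATA[blahblah@in.blah.com]]></dc:creator>
    </item>
  </channel>
</rss>"""

FIELDS = ["title", "link", "pubDate", "creator"]


def items_to_rows(root):
    """Collect one dict per <item>, looking fields up by tag name."""
    rows = []
    for item in root.iter("item"):
        rows.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "pubDate": item.findtext("pubDate", default=""),
            "creator": item.findtext("dc:creator", default="", namespaces=NS),
        })
    return rows


def rows_to_csv(rows, out):
    """Write the rows, with a header, to any text file-like object."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


rows = items_to_rows(ET.fromstring(SAMPLE))
buffer = io.StringIO()
rows_to_csv(rows, buffer)
print(buffer.getvalue())
```

One row per item should be easier to feed into a mail merge or email script later than alternating key/value rows, though either layout holds the same data.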
This is the data I needed to extract. We want to automate emailing authors about their older publications, but it's a lot of people! So I am using this extracted data to narrow the list down. Hopefully there is an API for the email side, and I can use a cron job to schedule the emails to go out every month or so.
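One possible way to do the narrowing down is to filter on `pubDate`, which is in the RFC 2822 date format that Python's `email.utils.parsedate_to_datetime` can parse. This is a rough sketch, not a finished solution: the 365-day cutoff, the fixed reference date, and the `is_older_than` helper are all assumptions made for illustration, and the rows simply reuse the sample values shown earlier.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime


def is_older_than(pub_date, days, now=None):
    """True if an RFC 2822 date string is more than `days` days before `now`."""
    now = now or datetime.now(timezone.utc)
    return now - parsedate_to_datetime(pub_date) > timedelta(days=days)


# Hypothetical rows in the same shape as the extracted data.
rows = [
    {"creator": "blahblah@in.blah.com", "pubDate": "Fri, 08 Nov 2019 17:29:49 +0000"},
    {"creator": "spamalot@gmail.com", "pubDate": "Wed, 03 May 2017 12:08:37 +0000"},
]

# Fixed reference date so the example is reproducible; a scheduled job would
# leave `now` unset and use the current time instead.
reference = datetime(2020, 1, 1, tzinfo=timezone.utc)

# Assumed cutoff: anything published more than a year before the reference date.
stale = [r["creator"] for r in rows if is_older_than(r["pubDate"], 365, now=reference)]
print(stale)
```

The resulting list of addresses is what an email API call, run from a monthly cron job, would eventually consume.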