XML Processing Modules in Python



XML stands for "Extensible Markup Language". It is used to store and exchange data that has a specific structure, for example in web services and configuration files. An XML document is made of elements, each delimited by a start-tag and an end-tag. A tag is a markup construct that begins with < and ends with >. The characters between the start-tag and the end-tag are the element's content. Elements can contain other elements, which are called "child elements".

Example

Below is the example XML file used throughout this tutorial. The code examples read it from E:\TutorialsList.xml.

<?xml version="1.0"?>
<Tutorials>
   <Tutorial id="Tu101">
      <author>Vicky, Matthew</author>
      <title>Geo-Spatial Data Analysis</title>
      <stream>Python</stream>
      <price>4.95</price>
      <publish_date>2020-07-01</publish_date>
      <description>Learn geo Spatial data Analysis using Python.</description>
   </Tutorial>
   <Tutorial id="Tu102">
      <author>Bolan, Kim</author>
      <title>Data Structures</title>
      <stream>Computer Science</stream>
      <price>12.03</price>
      <publish_date>2020-1-19</publish_date>
      <description>Learn Data structures using different programming lanuages.</description>
   </Tutorial>
   <Tutorial id="Tu103">
      <author>Sora, Everest</author>
      <title>Analytics using Tensorflow</title>
      <stream>Data Science</stream>
      <price>7.11</price>
      <publish_date>2020-1-19</publish_date>
      <description>Learn Data analytics using Tensorflow.</description>
   </Tutorial>
</Tutorials>
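To try the ideas without a file on disk, the sketch below parses an abbreviated copy of the sample document from a string. The XML_TEXT variable and the shortened data (two tutorials, two fields each) are assumptions made for illustration; ET.fromstring() is the standard-library call that returns the root element directly.

```python
import xml.etree.ElementTree as ET

# Abbreviated, illustrative copy of the sample document (not the full file).
XML_TEXT = """<?xml version="1.0"?>
<Tutorials>
   <Tutorial id="Tu101">
      <title>Geo-Spatial Data Analysis</title>
      <price>4.95</price>
   </Tutorial>
   <Tutorial id="Tu102">
      <title>Data Structures</title>
      <price>12.03</price>
   </Tutorial>
</Tutorials>"""

# fromstring() parses XML text and returns the root element.
root = ET.fromstring(XML_TEXT)

print(root.tag)    # name of the root element
print(len(root))   # number of direct child elements

# Child elements can be reached by index, like items in a list.
first = root[0]
print(first.tag, first.attrib)
```

Here root plays the same role as the object returned by getroot() when a file is parsed with ET.parse().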

Reading XML Using xml.etree.ElementTree

This module gives access to the root element of the XML file, from which we can reach the contents of the inner elements. In the example below, we read the text attribute of each inner element to get its content.

Example

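import xml.etree.ElementTree as ET

# Raw string, so the backslash in the Windows path is not read as an escape
xml_tree = ET.parse(r'E:\TutorialsList.xml')
xml_root = xml_tree.getroot()

# Header
print('Tutorial List :')
# Each child of the root is a Tutorial; each of its children holds one field
for xml_elmt in xml_root:
   for inner_elmt in xml_elmt:
      print(inner_elmt.text)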

Output

Running the above code gives us the following result:

Tutorial List :
Vicky, Matthew
Geo-Spatial Data Analysis
Python
4.95
2020-07-01
Learn geo Spatial data Analysis using Python.
Bolan, Kim
Data Structures
Computer Science
12.03
2020-1-19
Learn Data structures using different programming lanuages.
Sora, Everest
Analytics using Tensorflow
Data Science
7.11
2020-1-19
Learn Data analytics using Tensorflow.
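The printed values above carry no labels. As a possible variation, the sketch below pairs each element's tag with its text and collects one dictionary per tutorial. The XML_TEXT string is an abbreviated, illustrative copy of the sample data, and the records list is a name chosen here, not part of the module.

```python
import xml.etree.ElementTree as ET

# Abbreviated, illustrative copy of the sample document.
XML_TEXT = """<?xml version="1.0"?>
<Tutorials>
   <Tutorial id="Tu101">
      <title>Geo-Spatial Data Analysis</title>
      <price>4.95</price>
   </Tutorial>
   <Tutorial id="Tu102">
      <title>Data Structures</title>
      <price>12.03</price>
   </Tutorial>
</Tutorials>"""

root = ET.fromstring(XML_TEXT)

# Build one dict per Tutorial, mapping each child tag to its text content.
records = []
for tutorial in root:
    record = {child.tag: child.text for child in tutorial}
    records.append(record)

for record in records:
    print(record)
```

Note that text is always a string (or None for an empty element), so a value such as the price still needs converting before it can be used as a number.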

Getting the XML Attributes

We can get the attributes of any element, together with their values, through its attrib dictionary. In the example below, we read the attributes of each Tutorial element. Knowing the attributes makes it easier to navigate the XML tree.

Example

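import xml.etree.ElementTree as ET

# Raw string, so the backslash in the Windows path is not read as an escape
xml_tree = ET.parse(r'E:\TutorialsList.xml')
xml_root = xml_tree.getroot()

# Header
print('Tutorial List :')
# iter() walks the tree and yields every element with the given tag
for tutorial in xml_root.iter('Tutorial'):
   print(tutorial.attrib)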

Output

Running the above code gives us the following result:

Tutorial List :
{'id': 'Tu101'}
{'id': 'Tu102'}
{'id': 'Tu103'}
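When only one attribute is needed, the element's get() method reads it directly and accepts a default for attributes that are absent. The sketch below assumes an abbreviated copy of the sample data; the "lang" attribute is hypothetical and is used only to show the default value.

```python
import xml.etree.ElementTree as ET

# Abbreviated, illustrative copy of the sample document.
XML_TEXT = """<?xml version="1.0"?>
<Tutorials>
   <Tutorial id="Tu101"><title>Geo-Spatial Data Analysis</title></Tutorial>
   <Tutorial id="Tu102"><title>Data Structures</title></Tutorial>
</Tutorials>"""

root = ET.fromstring(XML_TEXT)

# get() returns the value of a single attribute.
ids = [tutorial.get("id") for tutorial in root.iter("Tutorial")]
print(ids)

# A second argument supplies a default when the attribute does not exist.
# "lang" is a hypothetical attribute that the sample data does not define.
lang = root[0].get("lang", "unknown")
print(lang)
```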

Filtering Results

We can also filter the elements of the XML tree with the findall() method, which accepts a limited XPath expression. In the example below, we find the id of the tutorial that has a price of 12.03.

Example

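import xml.etree.ElementTree as ET

# Raw string, so the backslash in the Windows path is not read as an escape
xml_tree = ET.parse(r'E:\TutorialsList.xml')
xml_root = xml_tree.getroot()

# Header
print('Tutorial List :')
# The predicate keeps Tutorial elements whose price child has the text 12.03
for tutorial in xml_root.findall("./Tutorial[price='12.03']"):
   print(tutorial.attrib)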

Output

Running the above code gives us the following result:

Tutorial List :
{'id': 'Tu102'}
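The XPath subset in ElementTree compares text for equality only, so a range condition such as "cheaper than 10" has to be checked in Python. The sketch below shows one way to do that, along with a lookup by attribute value. It assumes an abbreviated copy of the sample data in which every Tutorial has a price; the threshold of 10 is arbitrary.

```python
import xml.etree.ElementTree as ET

# Abbreviated, illustrative copy of the sample document.
XML_TEXT = """<?xml version="1.0"?>
<Tutorials>
   <Tutorial id="Tu101"><title>Geo-Spatial Data Analysis</title><price>4.95</price></Tutorial>
   <Tutorial id="Tu102"><title>Data Structures</title><price>12.03</price></Tutorial>
   <Tutorial id="Tu103"><title>Analytics using Tensorflow</title><price>7.11</price></Tutorial>
</Tutorials>"""

root = ET.fromstring(XML_TEXT)

# Numeric filter done in Python: convert the price text to float and compare.
# Assumes every Tutorial has a <price> child (true for this sample).
cheap_ids = [
    tutorial.get("id")
    for tutorial in root.findall("Tutorial")
    if float(tutorial.findtext("price")) < 10
]
print(cheap_ids)

# Attribute predicate: find() returns the first matching element, or None.
match = root.find("Tutorial[@id='Tu102']")
print(match.findtext("title"))
```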

Parsing XML with DOM APIs

We can also parse the file with minidom, a minimal DOM implementation in the xml.dom package. Its parse( file [,parser] ) function reads the XML file designated by file and returns a DOM tree object. From that object, documentElement gives the root element, and getElementsByTagName() returns the elements with a given tag name.

Example

import xml.dom.minidom

# Open XML document using minidom parser
# (raw string, so the backslash in the path is not read as an escape)
DOMTree = xml.dom.minidom.parse(r'E:\TutorialsList.xml')
collection = DOMTree.documentElement

# Get all the tutorials in the collection
tut_list = collection.getElementsByTagName("Tutorial")

print("*****Tutorials*****")
# Print details of each Tutorial.
for tut in tut_list:
   strm = tut.getElementsByTagName('stream')[0]
   print("Stream:", strm.childNodes[0].data)
   prc = tut.getElementsByTagName('price')[0]
   print("Price:", prc.childNodes[0].data)

Output

Running the above code gives us the following result:

*****Tutorials*****
Stream: Python
Price: 4.95
Stream: Computer Science
Price: 12.03
Stream: Data Science
Price: 7.11
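The DOM API can also read attributes and parse XML held in a string. The sketch below uses minidom's parseString() and the getAttribute() method on an abbreviated, illustrative copy of the sample data; the titles dictionary is a name chosen here for the example.

```python
from xml.dom.minidom import parseString

# Abbreviated, illustrative copy of the sample document.
XML_TEXT = """<?xml version="1.0"?>
<Tutorials>
   <Tutorial id="Tu101"><title>Geo-Spatial Data Analysis</title></Tutorial>
   <Tutorial id="Tu102"><title>Data Structures</title></Tutorial>
</Tutorials>"""

# parseString() builds a DOM tree from XML text instead of a file.
dom = parseString(XML_TEXT)

# Map each tutorial id to its title.
titles = {}
for tut in dom.getElementsByTagName("Tutorial"):
    tut_id = tut.getAttribute("id")  # attribute value as a string
    title_node = tut.getElementsByTagName("title")[0]
    titles[tut_id] = title_node.firstChild.data  # text inside <title>

for tut_id, title in titles.items():
    print(tut_id, "-", title)
```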
Updated on: 2021-01-25T07:51:34+05:30
