What is XML parsing in Python

Python course

Previous chapter: Turing machine
Next chapter: Python and SQL

Working with Python, XML and SAX

Preliminary remark

This chapter is still under development, so it is incomplete and may contain errors. Use it at your own risk.

As of March 17, 2014

Introduction

XML stands for "Extensible Markup Language". XML is a markup language for representing hierarchically structured data in the form of text files. XML is particularly suitable for exchanging data between computer systems independently of platform and implementation.

XML, for its part, is based on the metalanguage SGML, which stands for "Standard Generalized Markup Language".

Both XML and HTML are based on SGML. HTML can be described as an SGML dialect optimized for websites. While HTML is an application of SGML, XML is a subset of SGML, i.e. every XML document is a conforming SGML document. XML can also be seen as a further development of SGML.



Advantages of XML (these also apply to other markup languages):
  • Increase in productivity in the company
  • Reusability of the data
  • Improved data integrity
  • Data longevity
  • Improved data control
  • Problem-free data exchange
    (especially in heterogeneous systems)
  • Flexible data output
Unfortunately, we cannot give a comprehensive introduction to XML here, as this is beyond the scope of this chapter. The aim of this chapter is rather the automatic processing of XML documents with Python.

How SAX works

A SAX parser reads XML files as a data stream and calls certain callback functions for events defined in the standard.
An application can influence the evaluation of the XML data with its own callback functions.
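
The event flow described above can be made visible with a short sketch. The class name EventLogger and the inline sample document are assumptions chosen for illustration; xml.sax.parseString feeds the parser from a bytes object instead of a file.

```python
import xml.sax
from xml.sax import handler


class EventLogger(handler.ContentHandler):
    """Hypothetical handler that records every event it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Whitespace-only chunks between elements are skipped here.
        if content.strip():
            self.events.append(("text", content.strip()))


# Made-up sample document, passed as bytes.
SAMPLE = b"<greeting><to>World</to></greeting>"

logger = EventLogger()
xml.sax.parseString(SAMPLE, logger)
for event in logger.events:
    print(event)
```

The parser never builds a tree of the document. The handler only sees one event after the other, in document order, and has to keep whatever state it needs itself.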



Overview of callback methods

The examples in this chapter use the "startElement" and "endElement" methods. The SAX parser calls them automatically whenever it encounters a start or an end tag. Methods of this kind are called callback methods. The following is an overview of the callback methods used by SAX:

Callback method | Explanation
setDocumentLocator(locator) | Called at the beginning of parsing. Passes in the locator, which refers to the position in the XML document at which the parser is currently located.
startDocument() | Called once at the beginning of the parsing process.
endDocument() | Called at the end of processing or after an error that leads to termination.
startPrefixMapping(prefix, uri) | Called at the start of the scope of a namespace prefix mapping, i.e. before the start of the element that declares the prefix.
endPrefixMapping(prefix) | Called at the end of the scope of a namespace prefix mapping, i.e. after the end of the element that declared the prefix.
startElement(name, attrs) | Called for every opening tag with the tag name of the element in "name" and a dictionary-like object "attrs" holding the attributes.
endElement(name) | Called for every closing tag with the tag name of the element in "name".
ignorableWhitespace(whitespace) | Handles whitespace between elements, such as line breaks and tabs, which is usually ignored. The "whitespace" parameter contains the whitespace found.
characters(content) | Receives the text between the markup sections. The "content" parameter contains the text found, which may be delivered in several chunks.
processingInstruction(target, data) | Called for processing instructions in the document, but not for the XML declaration at the beginning of the document.
skippedEntity(name) | Called for every entity reference that the parser cannot resolve.
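
As a small illustration of "setDocumentLocator", the following sketch stores the locator and asks it for the current line number in every "startElement" call. The class name LineTracker and the sample document are assumptions made for this example; getLineNumber is part of the Locator interface in xml.sax.xmlreader.

```python
import xml.sax
from xml.sax import handler


class LineTracker(handler.ContentHandler):
    """Hypothetical handler that remembers the line of each start tag."""

    def __init__(self):
        super().__init__()
        self.locator = None
        self.lines = {}

    def setDocumentLocator(self, locator):
        # Called once by the parser before any other event.
        self.locator = locator

    def startElement(self, name, attrs):
        # The locator always reports the position of the current event.
        self.lines[name] = self.locator.getLineNumber()


# Made-up sample document with one start tag per line.
SAMPLE = b"<a>\n  <b/>\n  <c/>\n</a>"

tracker = LineTracker()
xml.sax.parseString(SAMPLE, tracker)
print(tracker.lines)
```

Position information of this kind is mainly useful for error messages that point the user to the offending place in the XML file.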


The SAX API defines four basic interfaces. These are implemented as classes in Python:

Class | Explanation
ContentHandler | Implements the main interface for document events
DTDHandler | Handles DTD events
EntityResolver | Resolves external entities
ErrorHandler | Handles errors and warnings
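
The following sketch shows one possible use of the ErrorHandler interface. The class name CollectingErrorHandler and the deliberately broken sample document are assumptions for illustration. The handler records every problem and re-raises fatal errors, so that parsing stops with a SAXParseException.

```python
import xml.sax
from xml.sax import handler


class CollectingErrorHandler(handler.ErrorHandler):
    """Hypothetical error handler that records every reported problem."""

    def __init__(self):
        self.messages = []

    def warning(self, exception):
        self.messages.append(("warning", exception.getMessage()))

    def error(self, exception):
        self.messages.append(("error", exception.getMessage()))

    def fatalError(self, exception):
        self.messages.append(("fatal", exception.getMessage()))
        # A fatal error means the document is not well-formed: stop parsing.
        raise exception


# Made-up document with a missing end tag for <b>.
BROKEN = b"<a><b></a>"

errors = CollectingErrorHandler()
stopped = False
try:
    xml.sax.parseString(BROKEN, handler.ContentHandler(), errors)
except xml.sax.SAXParseException as exc:
    stopped = True
    print("Parsing stopped:", exc.getMessage())
```

With a file-based parser created by make_parser, the same handler would be registered with the parser's setErrorHandler method.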


Simple example with SAX

For the following examples we use an XML file with information about books, which we save as "buecher.xml":

    <bookstore>
      <book>
        <title lang="en">The Language Instinct: How the Mind Creates Language (PS)</title>
        <author>Steven Pinker</author>
        <price>7.50</price>
      </book>
      <book>
        <title lang="de">The blank slate: The modern denial of human nature</title>
        <author>Steven Pinker</author>
        <price>29.80</price>
      </book>
      <book>
        <title lang="en">The Black Swan: The Impact of the Highly Improbable</title>
        <author>Nassim Nicholas Taleb</author>
        <price>8.10</price>
      </book>
      <book>
        <title lang="fr">L'Etranger</title>
        <author>Albert Camus</author>
        <price>6.10</price>
      </book>
      <book>
        <title lang="de">Introduction to Python 3</title>
        <author>Bernd Klein</author>
        <price>24.99</price>
      </book>
      <book>
        <title lang="de">Etrusker AG</title>
        <author>Bernd Klein</author>
        <price>19.99</price>
      </book>
      <book>
        <title lang="de">Evariste Galois or the tragic failure of a genius</title>
        <author>Bernd Klein</author>
        <price>9.99</price>
      </book>
    </bookstore>

First we want to scan this file with SAX and simply output all tags. For this we need the module "xml.sax" from the "xml" package. Actually we only need the function "make_parser" and the module "handler":

    from xml.sax import make_parser, handler

Our Python program ("buecher1.py") for scanning the XML file now looks like this:

    from xml.sax import make_parser, handler

    class BuecherHandler(handler.ContentHandler):

        def startElement(self, name, attrs):
            print(name)

    parser = make_parser()
    b = BuecherHandler()
    parser.setContentHandler(b)
    parser.parse("buecher.xml")

We have defined a class BuecherHandler that inherits from the class "handler.ContentHandler". In this class we override the "startElement" method. While the name of the class can be freely chosen, the method must have exactly this name and the corresponding number of parameters, since it replaces the method of the same name in handler.ContentHandler. (If you have any questions or problems about overriding methods and inheritance, it is best to work through the chapter Inheritance in our tutorial.)

The call "parser = make_parser()" creates a parser object. We then pass an instance b of our class BuecherHandler, which we created with the statement "b = BuecherHandler()", to the parser with the method "setContentHandler". Finally we start the parser with "parser.parse("buecher.xml")". The XML document is now parsed.
Every time the parser encounters a start tag, our startElement method is called. The name of the tag is passed to the parameter "name", and an object with the attributes of this tag is passed to "attrs". However, we are only interested in the name of the tag in this example. The output of our program looks like this:

    $ python3 buecher1.py
    bookstore
    book
    title
    author
    price
    book
    title
    author
    price
    book
    title
    author
    price
    book
    title
    author
    price
    book
    title
    author
    price
    book
    title
    author
    price
    book
    title
    author
    price

Instead of printing each tag name immediately, let's determine the set of distinct tags. To do this, we add an "__init__" method that defines an attribute self.tags, in which we collect the tag names. We also need a method (getTags) to return the tags found:

    from xml.sax import make_parser, handler

    class BuecherHandler(handler.ContentHandler):

        def __init__(self):
            self.tags = set()

        def startElement(self, name, attrs):
            self.tags.add(name)

        def getTags(self):
            return self.tags

    parser = make_parser()
    b = BuecherHandler()
    parser.setContentHandler(b)
    parser.parse("buecher.xml")
    print(b.getTags())

The output now looks like this:

    $ python3 buecher1.py
    {'price', 'author', 'bookstore', 'book', 'title'}

Now we want to collect all book titles and authors instead of the tag names:

    from xml.sax import make_parser, handler

    class BuecherHandler(handler.ContentHandler):

        def __init__(self):
            self.authors = set()
            self.titles = set()
            self.current_content = ""

        def startElement(self, name, attrs):
            self.current_content = ""

        def characters(self, content):
            self.current_content += content.strip()

        def endElement(self, name):
            if name == "title":
                self.titles.add(self.current_content)
            elif name == "author":
                self.authors.add(self.current_content)

        def getTitles(self):
            return self.titles

        def getAuthors(self):
            return self.authors

    parser = make_parser()
    b = BuecherHandler()
    parser.setContentHandler(b)
    parser.parse("buecher.xml")
    print("Authors:")
    print(b.getAuthors())
    print("Titles:")
    print(b.getTitles())

We have introduced three new attributes in the program: self.authors is a set in which we collect the names of the authors. self.titles is used analogously for the book titles. self.current_content temporarily stores the text between a start tag and its end tag. Since the parser may deliver this text in several pieces, for example when it extends over several lines, we append each piece with "+=".

We now also override the endElement method, which the parser calls whenever it reaches an end tag, e.g. "</title>" or "</author>". Whenever we reach an end tag "title", we add the text from self.current_content, i.e. the title of the book, to the set self.titles. Accordingly, we add the text from self.current_content to the set self.authors when we reach an end tag "author".
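
Collecting titles and authors in two separate sets loses the information which author belongs to which title. The following sketch is one possible variation that keeps one dictionary per "book" element. The class name BookRecords and the shortened inline sample are assumptions for illustration and not part of the programs above.

```python
import xml.sax
from xml.sax import handler


class BookRecords(handler.ContentHandler):
    """Hypothetical handler that builds one dictionary per book."""

    def __init__(self):
        super().__init__()
        self.books = []
        self.current_book = None
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "book":
            self.current_book = {}
        self.chunks = []

    def characters(self, content):
        # Text may arrive in several pieces, so the pieces are collected.
        self.chunks.append(content)

    def endElement(self, name):
        if name in ("title", "author") and self.current_book is not None:
            self.current_book[name] = "".join(self.chunks).strip()
        elif name == "book":
            self.books.append(self.current_book)
            self.current_book = None
        self.chunks = []


# Shortened, made-up excerpt in the style of buecher.xml.
SAMPLE = b"""<bookstore>
  <book><title>L'Etranger</title><author>Albert Camus</author></book>
  <book><title>Etrusker AG</title><author>Bernd Klein</author></book>
</bookstore>"""

records = BookRecords()
xml.sax.parseString(SAMPLE, records)
for book in records.books:
    print(book["author"], "-", book["title"])
```

Joining the collected pieces only once at the end tag also avoids stripping whitespace inside a text that happens to be split in the middle.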

The program produces the following output:

    $ python3 buecher2.py
    Authors:
    {'Steven Pinker', 'Nassim Nicholas Taleb', 'Albert Camus', 'Bernd Klein'}
    Titles:
    {'Introduction to Python 3', 'The Language Instinct: How the Mind Creates Language (PS)', "L'Etranger", 'Evariste Galois or the tragic failure of a genius', 'The Black Swan: The Impact of the Highly Improbable', 'Etrusker AG', 'The blank slate: The modern denial of human nature'}

If we take a closer look at our XML file, we can see that the tag "title" also has an attribute "lang". The following program shows how we can read this information as well. In the startElement method, we save the value of "lang" in the instance attribute self.language, if the attribute is present. In the endElement method, we then append the language to the title:

    from xml.sax import make_parser, handler

    class BuecherHandler(handler.ContentHandler):

        def __init__(self):
            self.authors = set()
            self.titles = set()
            self.current_content = ""
            self.language = ""

        def startElement(self, name, attrs):
            if "lang" in attrs:
                self.language = attrs["lang"]
            self.current_content = ""

        def characters(self, content):
            self.current_content += content.strip()

        def endElement(self, name):
            if name == "title":
                txt = self.current_content + " (" + self.language + ")"
                self.titles.add(txt)
            elif name == "author":
                self.authors.add(self.current_content)

        def getTitles(self):
            return self.titles

        def getAuthors(self):
            return self.authors

    parser = make_parser()
    b = BuecherHandler()
    parser.setContentHandler(b)
    parser.parse("buecher.xml")
    print("Authors:")
    print(b.getAuthors())
    print("Titles")
    print(b.getTitles())

Our XML file is then converted into the following output:

    $ python3 buecher2.py
    Authors:
    {'Steven Pinker', 'Nassim Nicholas Taleb', 'Albert Camus', 'Bernd Klein'}
    Titles
    {'The Black Swan: The Impact of the Highly Improbable (en)', 'Evariste Galois or the tragic failure of a genius (de)', 'Etrusker AG (de)', 'The blank slate: The modern denial of human nature (de)', 'The Language Instinct: How the Mind Creates Language (PS) (en)', "L'Etranger (fr)", 'Introduction to Python 3 (de)'}
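
In the program above, self.language keeps its previous value when a "title" tag has no "lang" attribute. The following sketch shows one way to avoid this by using the "get" method of the attributes object with a fallback value. The class name TitleLanguages, the fallback "unknown" and the inline sample are assumptions for illustration.

```python
import xml.sax
from xml.sax import handler


class TitleLanguages(handler.ContentHandler):
    """Hypothetical handler pairing each title with its language."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.language = None
        self.text = ""

    def startElement(self, name, attrs):
        if name == "title":
            # get() returns the fallback if the attribute is missing.
            self.language = attrs.get("lang", "unknown")
        self.text = ""

    def characters(self, content):
        self.text += content

    def endElement(self, name):
        if name == "title":
            self.titles.append((self.text.strip(), self.language))


# Made-up sample: the second title has no "lang" attribute.
SAMPLE = b"""<bookstore>
  <book><title lang="fr">L'Etranger</title></book>
  <book><title>Etrusker AG</title></book>
</bookstore>"""

languages = TitleLanguages()
xml.sax.parseString(SAMPLE, languages)
print(languages.titles)
```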

Another example of a SAX parser

    import sys
    from xml.sax import make_parser, handler

    class Counter(handler.ContentHandler):

        def __init__(self):
            self._elems = 0
            self._attrs = 0
            self._elem_types = {}
            self._attr_types = {}

        def startElement(self, name, attrs):
            self._elems = self._elems + 1
            self._attrs = self._attrs + len(attrs)
            self._elem_types[name] = self._elem_types.get(name, 0) + 1

            for name in attrs.keys():
                self._attr_types[name] = self._attr_types.get(name, 0) + 1

        def endDocument(self):
            print("There are:", self._elems, "elements.")
            print("There are:", self._attrs, "attributes.")

            print("--- Element types:")
            for pair in self._elem_types.items():
                print("%20s %d" % pair)

            print("--- Attribute types:")
            for pair in self._attr_types.items():
                print("%20s %d" % pair)

    parser = make_parser()
    parser.setContentHandler(Counter())
    parser.parse(sys.argv[1])

This program counts the elements and attributes of an XML document and lists how often each element type and each attribute type occurs. To test this program, we need an XML file. We save the following file under "addresses.xml":
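
    <address-book>
      <address>
        <name>Homer Simpson</name>
        <street>742 Evergreen Terrace</street>
        <zip>90701</zip>
        <location>Springfield</location>
      </address>
      <address>
        <name>Charles Montgomery Burns</name>
        <street>1000 Mammon Street</street>
        <zip>90701</zip>
        <location>Springfield</location>
      </address>
    </address-book>

The file additionally uses the attributes "id" and "country", each of which occurs twice; they are not shown in this listing.
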
If we call this program, we get the following output:

    $ python3 sax_bsp.py addresses.xml
    There are: 11 elements.
    There are: 4 attributes.
    --- Element types:
            address-book 1
                 address 2
                    name 2
                  street 2
                     zip 2
                location 2
    --- Attribute types:
                      id 2
                 country 2
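
The bookkeeping with "dict.get(name, 0) + 1" can also be expressed with collections.Counter from the standard library. The following is a sketch of this alternative, not the program shown above; the class name TypeCounter and the inline sample document are assumptions for illustration.

```python
import collections
import xml.sax
from xml.sax import handler


class TypeCounter(handler.ContentHandler):
    """Hypothetical variant of the counting handler."""

    def __init__(self):
        super().__init__()
        self.elem_types = collections.Counter()
        self.attr_types = collections.Counter()

    def startElement(self, name, attrs):
        self.elem_types[name] += 1
        # update() increments the count of every attribute name.
        self.attr_types.update(attrs.keys())


# Made-up sample document with two attribute types.
SAMPLE = b'<list><entry id="1"/><entry id="2" kind="x"/></list>'

counter = TypeCounter()
xml.sax.parseString(SAMPLE, counter)
print("Elements:", sum(counter.elem_types.values()))
print("Attributes:", sum(counter.attr_types.values()))
for name, count in counter.elem_types.items():
    print("%20s %d" % (name, count))
```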

Example: form letter

Let us assume that we want to send a letter to several addressees with the same text. We want to enter the address manually.

The text could look like this:

    To
    Erika Mustermann
    Heidestrasse 17
    51147 Cologne

    Hello Erika,
    We thank you for contacting us and offer you ....

The name and address should be replaced by the current address.
This problem can be solved elegantly with the string method "format", which we described in detail in the tutorial under Formatted output. Our chapter on parameter passing, in which we also describe passing a dictionary with a double asterisk (**), may also help with understanding. To solve the problem above, we create a dictionary with the data and pass it to the format string. We first show this with a reduced example in the interactive shell:

    >>> p = {"first name": "Erika"}
    >>> txt = "Hello {first name}"
    >>> print(txt.format(**p))
    Hello Erika
    >>>

We want to request the data from the user of the program by means of interactive input. We define the questions ("question") and the names of the corresponding variables ("var_name") in an XML file:

    <items>
      <item>
        <var_name>first name</var_name>
        <question>First name:</question>
      </item>
      <item>
        <var_name>last name</var_name>
        <question>Family name:</question>
        <condition>True</condition>
      </item>
      <item>
        <var_name>street</var_name>
        <question>Street:</question>
      </item>
      <item>
        <var_name>zip code</var_name>
        <question>Zip code:</question>
      </item>
      <item>
        <var_name>city</var_name>
        <question>Residence:</question>
      </item>
    </items>

We can also define conditions with the tag "condition". In our example, we require that the zip code has either 4 digits (Switzerland and Austria) or 5 digits (Germany). The expression for this condition is not shown in the listing above.

We now write our form letter as a format string, which we save in the file "ausgabe_format.txt":

    To
    {first name} {last name}
    {street}
    {zip code} {city}

    Hello {first name},
    we thank you for contacting us and offer you ....

Now we "only" have to parse the XML file and create the dictionary with the data. For the SAX parser we define a class Questions, which we save in "questions.py":

    from xml.sax import handler

    class Questions(handler.ContentHandler):

        def __init__(self):
            self.item = {}
            self.current_content = ""
            self.params = {}

        def askQuestion(self):
            if "question" in self.item:
                answer = input(self.item["question"] + " ")
                return answer

        def startElement(self, name, attrs):
            if name != "items":
                self.item[name] = ""

        def characters(self, content):
            self.current_content += content.strip()

        def endElement(self, name):
            if name == "item":
                value = self.askQuestion()
                self.params[self.item["var_name"]] = value
                # the condition may refer to the answers via the name p
                p = self.params
                if "condition" in self.item:
                    if not eval(self.item["condition"]):
                        print("warning: condition of " + self.item["var_name"] + " not satisfied")
                self.item = {}
            else:
                if name != "items":
                    self.item[name] = self.current_content
            self.current_content = ""

        def endDocument(self):
            pass

        def getParams(self):
            return self.params

In the following program, which we save as "QaA.py", we then parse our "questions.txt" file:

    from xml.sax import make_parser
    from questions import Questions

    parser = make_parser()
    q = Questions()
    parser.setContentHandler(q)
    parser.parse("questions.txt")
    params = q.getParams()

    txt = open("ausgabe_format.txt").read()
    txt = txt.format(**params)
    print(txt)

We get the desired result when we start our program with "python3 QaA.py":

    First name: Erika
    Family name: Mustermann
    Street: Musterstrasse 42
    Zip code: 42424
    Residence: Musterstadt
    To
    Erika Mustermann
    Musterstrasse 42
    42424 Musterstadt

    Hello Erika,
    we thank you for contacting us and offer you ....
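
The Questions class evaluates the text of a "condition" element with eval, which executes arbitrary Python code taken from the XML file. If the file does not come from a trusted source, a fixed table of check functions is a safer choice. The following is only a sketch of this idea; the names VALIDATORS, is_valid_zip and check are hypothetical and do not appear in the programs above.

```python
def is_valid_zip(value):
    """Accept 4 digits (Switzerland, Austria) or 5 digits (Germany)."""
    return value.isdigit() and len(value) in (4, 5)


# Hypothetical table: variable name -> check function.
VALIDATORS = {"zip code": is_valid_zip}


def check(var_name, value):
    """Return True if no check is registered or the check succeeds."""
    validator = VALIDATORS.get(var_name)
    if validator is None:
        return True
    return validator(value)


# Sample answers as they might be entered by the user.
for var_name, value in [("zip code", "42424"), ("zip code", "123")]:
    if not check(var_name, value):
        print("warning: condition of " + var_name + " not satisfied")
```

In such a design the XML file would only name the variable, while the rule itself lives in the program.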

Our next example is about currencies and their exchange rates. We read the XML file "currencies_quote.xml" with the SAX parser and create a dictionary with the exchange rates (rates as of November 11, 2017). All exchange rates are given relative to the US dollar. Each entry, for example the one for the Canadian dollar, consists of "field" elements that are distinguished by their "name" attribute. In the SAX program we only read the content of a "field" element if its "name" attribute has the value "name" or "price":

    from xml.sax import make_parser, handler

    class CurrencyExtractor(handler.ContentHandler):

        def __init__(self):
            self.currencies = {}
            self.flag = None
            self.currency = ""

        def startElement(self, name, attrs):
            if name == "field":
                if attrs["name"] == "name":
                    self.flag = "currency"
                elif attrs["name"] == "price":
                    self.flag = "price"

        def characters(self, content):
            if self.flag == "currency":
                self.flag = None
                # drop the first four characters of the name
                self.currency = content[4:]
            elif self.flag == "price":
                self.currencies[self.currency] = content
                self.flag = None

    parser = make_parser()
    extractor = CurrencyExtractor()
    parser.setContentHandler(extractor)
    res = parser.parse("currencies_quote.xml")
    print(extractor.currencies)

The program prints the following dictionary:

    {'KRW': '1119.369995', 'ER 1 OZ 999 NY': '0.059259', 'VND': '22702.000000', 'BOB': '6.860000',
     'MOP': '8.033700', 'BDT': '83.139999', 'MDL': '17.527000', 'VEF': '9.974500', 'GEL': '2.627000',
     'ISK': '103.400002', 'BYR': '20020.000000', 'THB': '33.110001', 'MXV': '3.253987',
     'TND': '2.499000', 'JMD': '125.650002', 'DKK': '6.378170',
     ...
     'SAR': '3.749900', 'UYU': '29.170000', 'GBP': '0.758220', 'UZS': '8050.000000',
     'GMD': '46.869999', 'AWG': '1.780000', 'MNT': '2448.000000', 'HKD': '7.799700',
     'ARS': '17.479000', 'HUX': '267.679993', 'BRX': '3.265500', 'ECS': '25000.000000'}
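
The extracted prices are strings. Since all rates are given relative to the US dollar, they can be converted to floats and combined into a cross rate between two other currencies. The following sketch assumes a small excerpt of the dictionary shown above; the function name convert is hypothetical.

```python
# Excerpt of the dictionary printed above (units per US dollar).
currencies = {"GBP": "0.758220", "DKK": "6.378170", "HKD": "7.799700"}

# Convert the price strings to floating point numbers.
rates = {code: float(price) for code, price in currencies.items()}


def convert(amount, source, target, rates):
    """Convert an amount from one currency to another via the US dollar."""
    in_usd = amount / rates[source]
    return in_usd * rates[target]


print("100 GBP in DKK: %.2f" % convert(100, "GBP", "DKK", rates))
```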