How to Parse XML in Python

Maryia Stsiopkina

2023-06-02, 6 min read

In this article, you’ll learn how to parse XML data in Python by exploring popular Python libraries. The article will cover the basics of XML, DOM representation, built-in Python libraries for parsing XML documents, and their differences. You’ll also learn the step-by-step procedure of parsing XML files, handling invalid XML, converting to a dictionary, and saving data to a CSV file. Let’s get started.

What is XML?

XML (Extensible Markup Language) is a popular markup language used in a wide range of applications and systems. It’s a structured and hierarchical data format that allows you to store and exchange data between different platforms and applications. XML files are commonly used for data exchange, configuration files, and web services. Take a look at the below example:

<?xml version="1.0" encoding="UTF-8"?>
<menu>
 <food>
   <item name="lunch" type="main">Chicken Biryani</item>
   <price currency="USD">$12.99</price>
   <description>
     Aromatic basmati rice cooked with tender chicken pieces, spices, and herbs.
   </description>
   <calories>780</calories>
   <ingredients>
     <ingredient>Basmati rice</ingredient>
     <ingredient>Chicken</ingredient>
     <ingredient>Spices</ingredient>
     <ingredient>Herbs</ingredient>
   </ingredients>
 </food>
</menu>

This XML file starts with an XML Declaration that lets the parser know the version and encoding of the file. The root element `menu` contains information about a food item. Notice how each of the properties and attributes is structured to convey the information in a hierarchy.
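To make the hierarchy concrete, here is a minimal sketch that reads a trimmed copy of the menu document with the standard library's `ElementTree` (introduced later in this article). The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the menu document shown above
menu_xml = """<?xml version="1.0" encoding="UTF-8"?>
<menu>
 <food>
   <item name="lunch" type="main">Chicken Biryani</item>
   <price currency="USD">$12.99</price>
   <calories>780</calories>
   <ingredients>
     <ingredient>Basmati rice</ingredient>
     <ingredient>Chicken</ingredient>
   </ingredients>
 </food>
</menu>
"""

root = ET.fromstring(menu_xml)       # root element: <menu>
item = root.find("food/item")        # path is relative to the root element
price = root.find("food/price")

# Element text and attributes are exposed separately
print(root.tag)
print(item.text, item.attrib)
print(price.get("currency"), price.text)

# Nested elements are reached by walking down the hierarchy
ingredients = [node.text for node in root.iter("ingredient")]
print(ingredients)
```

Element content (`text`) and attributes (`attrib`, `get`) are accessed separately, mirroring how the document itself distinguishes them.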

What is DOM?

The Document Object Model (DOM) provides a hierarchical representation of the document, allowing developers to interact with its elements programmatically. The DOM provides a standardized interface to interact with web documents. It also has versatile and wide browser support enabling the creation of dynamic, interactive, and responsive web applications. It’s a platform and language-neutral interface that allows you to dynamically access and update the content, structure, and style of XML and HTML documents.

<!DOCTYPE html>
<html>
<head>
   <title>DOM Example</title>
</head>
<body>
   <h1>Welcome to the DOM Example</h1>
   <p>This is a sample paragraph.</p>
   <ul>
       <li>Item 1</li>
       <li>Item 2</li>
       <li>Item 3</li>
   </ul>
</body>
</html>

The DOM serves as a powerful tool for web development. It enables efficient manipulation of the document's structure, allowing for the addition, removal, or modification of elements, attributes, and text. It provides a tree-like structure, where each element in the document is represented as a node. The root node represents the document itself, with child nodes representing elements, attributes, and text nodes. This hierarchical structure makes node traversal easy and manipulation of the document's content a breeze.
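As a rough illustration of the node tree, the sketch below uses Python's `xml.dom.minidom` to add an element to a small list and then walk the resulting tree. The `walk` helper is a hypothetical name made up for this example, not part of any library.

```python
from xml.dom import Node
from xml.dom.minidom import parseString

doc = parseString("<ul><li>Item 1</li><li>Item 2</li></ul>")

# Modify the tree: create a new element node with a text child and attach it
new_item = doc.createElement("li")
new_item.appendChild(doc.createTextNode("Item 3"))
doc.documentElement.appendChild(new_item)


def walk(node, depth=0, lines=None):
    """Collect one indented line per element node or non-blank text node."""
    if lines is None:
        lines = []
    if node.nodeType == Node.ELEMENT_NODE:
        lines.append("  " * depth + f"<{node.tagName}>")
    elif node.nodeType == Node.TEXT_NODE and node.data.strip():
        lines.append("  " * depth + repr(node.data))
    for child in node.childNodes:
        walk(child, depth + 1, lines)
    return lines


tree_lines = walk(doc.documentElement)
print("\n".join(tree_lines))
```

Each `li` element appears as a child of `ul`, and each piece of text is its own node one level deeper.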

What is XML parsing?

XML parsing is a fundamental part of working with XML data. It involves several key steps: checking the syntax of the XML document, tokenizing it, and building the document structure as a hierarchy. If you’ve worked with XML files before, you might already know that XML parsing is surprisingly difficult. Luckily, Python provides several libraries for parsing XML files. These tools come with different trade-offs: some are optimized for speed, others for memory. You can pick the one that best fits your requirements.

Built-in Python libraries for parsing XML

Almost all Python distributions ship a standard `xml` package that bundles abstract interfaces for parsing XML documents and lets you supply a concrete parser implementation. In practice, you’ll hardly ever write a parser of your own. Instead, you’ll rely on the Python bindings of existing parsers such as Expat, which the standard library wires up for you automatically. Let’s explore some of the sub-modules of Python’s standard XML library.
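For a sense of what the higher-level modules do underneath, here is a minimal sketch that drives the Expat binding directly through `xml.parsers.expat`. The handler function names and the `events` list are made up for this example.

```python
import xml.parsers.expat

events = []


def on_start(name, attrs):
    # Called for every opening tag; attrs is a dict of attribute values
    events.append(("start", name, attrs))


def on_end(name):
    # Called for every closing tag
    events.append(("end", name))


def on_text(data):
    # Called for character data; skip whitespace-only chunks
    if data.strip():
        events.append(("text", data))


parser = xml.parsers.expat.ParserCreate()
parser.buffer_text = True  # deliver contiguous text as a single chunk
parser.StartElementHandler = on_start
parser.EndElementHandler = on_end
parser.CharacterDataHandler = on_text

# The second argument marks this as the final chunk of input
parser.Parse("<book id='1'><title>The Great Gatsby</title></book>", True)

for event in events:
    print(event)
```

Expat only reports events as it reads the input. Building a tree out of those events is the job of modules such as `minidom` and `ElementTree`.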

The xml.dom.minidom library

This library enables you to parse XML files through the DOM interface. It’s a relatively old implementation of the W3C specification. All the common objects, such as Document, Element, Attr, etc., are available. The module’s documentation is sparse, which makes it harder to pick up than the alternatives.

from xml.dom.minidom import parseString
xml_string = """<?xml version="1.0" encoding="UTF-8"?>
<library>
 <book>
   <title>The Great Gatsby</title>
   <author>F. Scott Fitzgerald</author>
   <year>1925</year>
 </book>
</library>
"""
document = parseString(xml_string)

The above code will parse the `xml_string` and store it in the `document` object. Now, you can use the DOM interface to access the various nodes of this XML file. Let’s print the title.

print(document.getElementsByTagName("title")[0].firstChild.nodeValue)

The `getElementsByTagName` method returns a list of matching elements, from which we pick the first one. This outputs the title `The Great Gatsby`. Since the `minidom` library follows the old W3C specification of the DOM interface, its API feels dated and not very Pythonic. Moreover, the module has seen comparatively little development over the years.
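When a document contains several records, the same DOM calls can be applied per element. The sketch below assumes a slightly extended library document (the `id` attributes and the second book are sample data added for illustration) and collects one tuple per `book`.

```python
from xml.dom.minidom import parseString

# Sample data for illustration: two books, each with an id attribute
doc = parseString("""<library>
  <book id="b1"><title>The Great Gatsby</title><year>1925</year></book>
  <book id="b2"><title>Nineteen Eighty-Four</title><year>1949</year></book>
</library>""")

books = []
for book in doc.getElementsByTagName("book"):
    # getElementsByTagName also works on an element and searches its descendants
    title = book.getElementsByTagName("title")[0].firstChild.nodeValue
    year = book.getElementsByTagName("year")[0].firstChild.nodeValue
    books.append((book.getAttribute("id"), title, int(year)))

print(books)
```

Note how much navigation (`[0].firstChild.nodeValue`) is needed just to read an element's text; this verbosity is a large part of why the DOM API feels dated in Python.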

The xml.etree.ElementTree library

The ElementTree API is a lightweight, feature-rich interface for parsing and manipulating XML documents. The implementation is fast and elegant, which has led many third-party libraries to build on it. Its documentation is also better than that of `minidom`. When ElementTree was first introduced in Python 2.5, a faster C implementation shipped separately as `cElementTree`. Nowadays, you don’t have to bother: since Python 3.3, `xml.etree.ElementTree` uses the C accelerator automatically whenever it’s available, and `cElementTree` was removed in Python 3.9.

import xml.etree.ElementTree as ET
xml_string = """<?xml version="1.0" encoding="UTF-8"?>
<library>
 <book>
   <title>The Great Gatsby</title>
   <author>F. Scott Fitzgerald</author>
   <year>1925</year>
 </book>
</library>
"""
root = ET.fromstring(xml_string)

The `root` object holds the root element of the parsed XML document. Notice we created an alias `ET` to shorten the library name in the code. This is a common convention for ElementTree-based Python scripts. The `fromstring` function takes an XML string as an argument and returns the root `Element` of the parsed document (not a full `ElementTree` object). Next, you can iterate over the root node and all of its descendants and print their text using the code below:

for child in root.iter():
    # .text is None for elements without text, so check it before calling strip()
    if child.text and child.text.strip():
        print(child.text)

The output will be:

The Great Gatsby
F. Scott Fitzgerald
1925
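Beyond iterating over every node, ElementTree can look up elements by path. The following sketch uses `find`, `findall`, and `findtext` on an extended version of the library document; the `id` attributes and the second book are sample data added for illustration.

```python
import xml.etree.ElementTree as ET

# Sample data for illustration: two books, each with an id attribute
root = ET.fromstring("""<library>
  <book id="b1"><title>The Great Gatsby</title><year>1925</year></book>
  <book id="b2"><title>Nineteen Eighty-Four</title><year>1949</year></book>
</library>""")

# findtext returns the text of the first match
first_title = root.findtext("book/title")

# findall returns every direct match for the path
ids = [book.get("id") for book in root.findall("book")]

# A limited XPath subset is supported, including attribute predicates
second_title = root.find("book[@id='b2']/title").text

# findtext accepts a default for elements that do not exist
publisher = root.findtext("book/publisher", default="unknown")

print(first_title, ids, second_title, publisher)
```

Paths are relative to the element the method is called on, so `book/title` here means "a `title` inside a `book` directly under the root".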

How do I parse an XML file?

So far, you’ve learned how to parse XML files from Python string objects. Now, let’s learn how to parse XML files using these libraries. Fortunately, both `minidom` and `ElementTree` provide a built-in function to parse XML files.

1. Parsing XML from file

minidom

You can use the `parse` method of `minidom` to read XML content from a file.

from xml.dom.minidom import parse
document = parse("sample.xml")
print(document.getElementsByTagName("title")[0].firstChild.nodeValue)

ElementTree

The `ElementTree` library also has a `parse` method to read XML files.

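import xml.etree.ElementTree as ET
tree = ET.parse("sample.xml")  # parse() returns an ElementTree object
root = tree.getroot()  # the root Element of the document
parsed_dict = dict()
for child in root.iter():
    # skip elements that have no text or only whitespace
    if child.text and child.text.strip():
        parsed_dict[child.tag] = child.text
print(parsed_dict)

First, we parse the `sample.xml` file into an `ElementTree` using the `parse` function and take its root element with `getroot()`. Then, we iterate over the root and all of its descendants and store each element’s text in a `dict` object keyed by tag name. Because the tag is the key, a tag that appears more than once keeps only its last value.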
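If the file holds several records with the same tags, one option is to build a dictionary per record instead of a single flat one. This is a sketch under the assumption that each `book` element contains only simple child elements; an in-memory `io.StringIO` stands in for `sample.xml` so the example is self-contained.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for sample.xml; ET.parse accepts a filename or a file-like object
sample = io.StringIO("""<library>
  <book><title>The Great Gatsby</title><year>1925</year></book>
  <book><title>Nineteen Eighty-Four</title><year>1949</year></book>
</library>""")

tree = ET.parse(sample)
root = tree.getroot()

# Iterating over an element yields its direct children
records = [
    {field.tag: field.text for field in book}
    for book in root.findall("book")
]
print(records)
```

Each record keeps its own `title` and `year`, so repeated tags no longer overwrite one another.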

2. Converting XML to a dictionary

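You can use the `untangle` library to convert an XML file into a Python object with a single call. Strictly speaking, the result isn’t a `dict`: it’s an object whose attributes mirror the element names, with each element’s text available as `cdata`. The snippet below assumes `sample.xml` holds the library document from the earlier examples.

import untangle
parsed_obj = untangle.parse("sample.xml")
print(parsed_obj.library.book.title.cdata)

The cool thing about this library is you can pass a URL, a filename, or even an XML string to `parse`, and it’ll still work.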

3. Saving parsed XML data to a CSV

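You can use the `pandas` library to store the data in a CSV file. The code below reuses the `parsed_dict` built with `ElementTree` in step 1.

import pandas as pd
# wrap the dict in a list so pandas treats it as a single row
df = pd.DataFrame([parsed_dict])
df.to_csv("parsed_xml_data.csv", index=False)

If you run the above code, it’ll initialize a pandas DataFrame `df` from `parsed_dict` and then save the data in a CSV file named `parsed_xml_data.csv`.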
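If you’d rather not depend on pandas, the standard library’s `csv` module can do the same job. This is an alternative sketch, not the approach used above: it writes a hypothetical list of records to an in-memory buffer, which you could swap for `open("parsed_xml_data.csv", "w", newline="")` to write a real file.

```python
import csv
import io

# Hypothetical records, shaped like the dictionaries built from the XML
records = [
    {"title": "The Great Gatsby", "author": "F. Scott Fitzgerald", "year": "1925"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "author", "year"])
writer.writeheader()       # first row: column names
writer.writerows(records)  # one row per dictionary

csv_text = buffer.getvalue()
print(csv_text)
```

`DictWriter` takes the column order from `fieldnames`, so the output stays stable even if the dictionaries were built in a different key order.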

Parsing Invalid XML

Unfortunately, Python’s standard XML parsers are strict: they expect well-formed input and make no attempt to recover from errors. As a result, they can’t parse or extract data from XML files that contain illegal characters or broken tags. For example, take a look at the XML file below:

<?xml version="1.0" encoding="UTF-8"?>
<root>
 <person>
 <name> John Doe</name>
 <message>This is a message & an invalid XML example.</message>
 </person>
</root>

At first glance, it might look like valid XML, but if you try to parse this document, you’ll get an error. Notice the `&` symbol inside the `message` element. In XML, a bare `&` is reserved for starting an entity reference and must be written as `&amp;` in text, so all the standard XML libraries fail to parse this file. Now let’s try to parse this invalid XML using a couple of methods.
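To see the failure for yourself, the short sketch below feeds a one-line version of the document to `ElementTree` and catches the resulting `ParseError`. With `minidom`, the equivalent exception is `xml.parsers.expat.ExpatError`.

```python
import xml.etree.ElementTree as ET

invalid_xml = (
    "<root><message>This is a message & an invalid XML example.</message></root>"
)

error = None
try:
    ET.fromstring(invalid_xml)
except ET.ParseError as exc:
    error = exc
    # position is a (line, column) tuple pointing at the offending input
    line, column = exc.position
    print(f"Parse failed at line {line}, column {column}: {exc}")
```

The reported position is useful when deciding how to repair the input before parsing it again.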

Method 1: Preprocessing XML documents as strings

For simple XML files like the above, you can preprocess the XML document as a string to remove the invalid elements or symbols from the XML document before passing it to the parser.

from xml.dom.minidom import parseString
invalid_xml = """<?xml version="1.0" encoding="UTF-8"?>
<root>
<person>
<name> John Doe</name>
<message>This is a message & an invalid XML example.</message>
</person>
</root>
"""
# preprocessing
valid_xml = invalid_xml.replace("&", "&amp;")
parsed_data = parseString(valid_xml)

As you can see in the preprocessing step, the `replace` method replaces every `&` symbol in `invalid_xml` with `&amp;`. Now, if you run this script, it’ll parse the XML document without any errors. Keep in mind that a blanket replacement also rewrites ampersands that already belong to a valid entity (`&lt;` would become `&amp;lt;`), so it only suits input known to contain no entities. There are many ways to preprocess XML documents. For more complex cases, you can use Python’s `re` module to target the problem text with regular expressions. However, if the XML document is large or badly damaged, this method becomes cumbersome.
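A regular expression can narrow the replacement to ampersands that don’t already start an entity. The sketch below is one possible pattern, not a complete solution: `BARE_AMP` and `escape_bare_ampersands` are names made up for this example, and the pattern would also rewrite ampersands inside CDATA sections or comments, where they are legal.

```python
import re
import xml.etree.ElementTree as ET

# Match "&" unless it begins a named entity (&amp;), a decimal character
# reference (&#38;), or a hexadecimal character reference (&#x26;)
BARE_AMP = re.compile(r"&(?!(?:[A-Za-z_][\w.\-]*|#[0-9]+|#x[0-9A-Fa-f]+);)")


def escape_bare_ampersands(text):
    """Escape only the ampersands that are not part of an entity reference."""
    return BARE_AMP.sub("&amp;", text)


raw = "<message>Fish & chips &amp; tea &#38; more</message>"
fixed = escape_bare_ampersands(raw)
message = ET.fromstring(fixed).text

print(fixed)
print(message)
```

Only the first ampersand is rewritten; the existing `&amp;` and `&#38;` references pass through untouched and are decoded by the parser as usual.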

Method 2: Use a robust parsing Library

If the invalid XML document is too large, it’ll be hard to preprocess. In such cases, instead of using a strict XML parser that can’t handle broken XML files, you can use a more forgiving parsing library such as Beautiful Soup. It can automatically take care of invalid characters and missing or broken tags. The example below requires the third-party `beautifulsoup4` and `lxml` packages to be installed.

from bs4 import BeautifulSoup
invalid_xml = """<?xml version="1.0" encoding="UTF-8"?>
<root>
<person>
<name> John Doe</name>
<message>This is a message & an invalid XML example.</message>
</person>
</root>
"""
soup = BeautifulSoup(invalid_xml, features="lxml-xml")
print(soup.prettify())

You should keep in mind that Beautiful Soup is slower than other XML parsing libraries, such as ElementTree or lxml. If performance is an issue, you might have to use preprocessing or another robust library instead.

Conclusion

In this tutorial, we’ve covered the basics of XML and the DOM, explored the standard library parsers `minidom` and `ElementTree`, and looked at third-party options such as `untangle` and Beautiful Soup. We’ve also parsed XML from strings and files, saved the results to a CSV file, and handled invalid XML. With this knowledge, you can choose the most suitable XML parser for your needs and handle XML data effectively in your Python projects. Also, check our GitHub to see the code samples used in this tutorial.

Last but not least, pay extra attention whenever you deal with XML documents from untrusted sources, as some XML files can be malicious. Python’s standard XML libraries aren’t hardened against such maliciously crafted files. To learn more about the vulnerabilities and risks, read the official Python XML library documentation.
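As a rough illustration of one defensive measure, the sketch below refuses any document that carries a DOCTYPE declaration, since several known XML attacks rely on entity definitions placed there. `DTDForbidden` and `parse_untrusted` are hypothetical names made up for this example. It covers only this one class of problem, parses the input twice, and is not a substitute for the guidance in the official documentation.

```python
import xml.parsers.expat
import xml.etree.ElementTree as ET


class DTDForbidden(ValueError):
    """Raised when a document carries a DOCTYPE declaration."""


def _forbid_doctype(name, system_id, public_id, has_internal_subset):
    # Expat calls this handler as soon as it meets a DOCTYPE declaration
    raise DTDForbidden(f"DOCTYPE {name!r} is not allowed")


def parse_untrusted(xml_text):
    """Reject documents that declare a DOCTYPE, then parse with ElementTree."""
    checker = xml.parsers.expat.ParserCreate()
    checker.StartDoctypeDeclHandler = _forbid_doctype
    checker.Parse(xml_text, True)  # raises DTDForbidden or ExpatError
    return ET.fromstring(xml_text)


print(parse_untrusted("<a>ok</a>").text)

suspicious = '<!DOCTYPE lolz [<!ENTITY lol "lol">]><lolz>&lol;</lolz>'
try:
    parse_untrusted(suspicious)
except DTDForbidden as exc:
    print(f"Rejected: {exc}")
```

The check runs before `ElementTree` ever sees the input, so entity definitions in a rejected document are never expanded.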

Frequently asked questions

Is Python good for file parsing?

Python is one of the most popular general-purpose programming languages, and its rich set of libraries makes it great for file parsing. Since Python is an interpreted programming language, it doesn’t provide C- or C++-like performance out of the box. However, you can use Cython, faster Python implementations, or C-backed libraries to improve performance.

Which is the easiest XML parser?

Python’s built-in XML (Extensible Markup Language) library provides multiple sub-modules with different syntax and styles, giving you the flexibility to choose the one that best fits your preference. If you’re comfortable with DOM manipulation, you might find the `dom` module or `minidom` easier to learn. On the other hand, the `etree` module is also beginner-friendly. Among third-party libraries, `lxml` is popular for XML parsing in the Python community. And, if you’re familiar with Beautiful Soup, you can use the `BeautifulSoup` library as well. There is another library named `untangle`, which converts XML documents into Python objects with a single line of code. However, this library is not in active development, so it’s not recommended for production use.

Which is the fastest XML parsing library?

`lxml` is arguably the fastest parsing library, with support for the XPath, XSLT, and XML Schema standards. It's the Python binding for the C libraries libxml2 and libxslt. The library is fast and offers a familiar interface modeled on the ElementTree API, along with the broadest spectrum of functionality. It’s also largely compatible with Python’s ElementTree. Many libraries, such as Beautiful Soup, can also use the `lxml` parser under the hood to get a performance boost.

About the author

Maryia Stsiopkina

Senior Content Manager

Maryia Stsiopkina is a Senior Content Manager at Oxylabs. As her passion for writing was developing, she was writing either creepy detective stories or fairy tales at different points in time. Eventually, she found herself in the tech wonderland with numerous hidden corners to explore. At leisure, she does birdwatching with binoculars (some people mistake it for stalking), makes flower jewelry, and eats pickles.
