Adomas Sulcas
Although web scraping in its totality is a complex and nuanced field of knowledge, building your own basic web scraper isn’t all that difficult. And that’s mostly due to coding languages such as Python. This language makes the process much more straightforward thanks to its relative ease of use and the many useful libraries that it offers. In this tutorial, we’ll be focusing on one of these wildly popular libraries named Beautiful Soup, a Python package used for parsing HTML and XML documents.
If you want to build your first web scraper, we recommend checking our video tutorial below or our article that details everything you need to know to get started with Python web scraping. Yet, in this tutorial, we’ll focus specifically on parsing a sample HTML file in Python and using Selenium to render dynamic pages.
This tutorial is useful for those seeking to quickly grasp the value that Python and Beautiful Soup 4 offer. After following the provided examples, you should be able to understand the basic principles of how to parse HTML data. The examples will demonstrate traversing a document for HTML tags, printing the full content of the tags, finding elements by ID, extracting text from specified tags, and exporting it to a CSV file.
Before getting to the matter at hand, let’s first take a look at some of the fundamentals of this topic.
Data parsing is a process during which a piece of data gets converted into a different type of data according to specified criteria. It’s an important part of web scraping since it helps transform raw HTML data into a more easily readable format that can be understood and analyzed.
A well-built parser will identify the needed HTML string and the relevant information within it. Based on predefined criteria and the rules of the parser, it’ll filter and combine the needed information into CSV, JSON, or any other format.
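To make the idea concrete, here's a tiny illustrative sketch that takes a raw HTML snippet and combines the relevant bits into JSON, which is exactly the kind of transformation a parser performs. It uses the Beautiful Soup library covered in this tutorial, so feel free to revisit it after the installation steps below:
from bs4 import BeautifulSoup
import json

raw_html = "<ul><li>Residential proxies</li><li>Datacenter proxies</li></ul>"

# Identify the relevant elements and keep only their text
soup = BeautifulSoup(raw_html, "html.parser")
proxy_types = [li.text for li in soup.find_all("li")]

# Combine the filtered information into a structured JSON string
print(json.dumps({"proxy_types": proxy_types}, indent=2))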
Our previous article on what is parsing sums up this topic nicely.
Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed web pages that can be used to extract, navigate, search, and modify data from HTML, which is why it's so widely used for web scraping. Beautiful Soup 4 is supported on Python versions 3.6 and greater. Being a useful library, it can save programmers loads of time when collecting and parsing data.
Before following this tutorial, you should have a Python programming environment set up on your machine. For this tutorial, we’ll assume that PyCharm is used since it’s a convenient choice even for the less experienced with Python and is a great starting point. Otherwise, simply use your go-to IDE.
On Windows, when installing Python, make sure to tick the PATH installation checkbox. PATH installation adds executables to the default OS Command Prompt executable search. The OS will then recognize commands like pip or python without having to point to the directory of the executable, which makes things more convenient.
The next step is to install the Beautiful Soup 4 library on your system. No matter the OS, you can easily do it by using this command on the terminal to install the latest version of Beautiful Soup:
pip install beautifulsoup4
If you’re using Windows, it’s recommended to run the terminal as administrator to ensure that everything works out smoothly.
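Once the installation finishes, you can quickly confirm that the library is available by printing its version from the same terminal:
python -c "import bs4; print(bs4.__version__)"
If this prints a version number rather than an error, you're good to go.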
Finally, since this article explores working with a sample file written in HTML, you should be at least somewhat familiar with the HTML structure.
A sample HTML document will help demonstrate the main methods of how Beautiful Soup parses data. This file is much simpler than your average modern website; however, it'll be sufficient for the scope of this tutorial.
<!DOCTYPE html>
<html>
<head>
<title>What is a Proxy?</title>
<meta charset="utf-8">
</head>
<body>
<h2>Proxy types</h2>
<p>
There are many different ways to categorize proxies. However, two of
the most popular types are residential and data center proxies. Here is a list of the most common types.
</p>
<ul id="proxytypes">
<li>Residential proxies</li>
<li>Datacenter proxies</li>
<li>Shared proxies</li>
<li>Semi-dedicated proxies</li>
<li>Private proxies</li>
</ul>
</body>
</html>
For PyCharm to use this file, simply copy it to any text editor and save it with the .html extension to the directory of your PyCharm project. Alternatively, you can create an HTML file in PyCharm by right-clicking on the project area, then navigating to New > HTML File and pasting the HTML code from above.
Going further, you can create a new Python file by navigating to New > Python File. Congratulations, and welcome to your new playground!
First, you can use Beautiful Soup to extract a list of all the tags used in our sample HTML file. For this step, you can use the soup.descendants generator:
from bs4 import BeautifulSoup
with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, "html.parser")

for child in soup.descendants:
    if child.name:
        print(child.name)
Click the Run button, and you should get the below output:
html
head
title
meta
body
h2
p
ul
li
li
li
li
li
Beautiful Soup traversed our HTML file and printed all the HTML tags that it found sequentially. Let’s take a quick look at what each line did:
from bs4 import BeautifulSoup
This tells Python to import the Beautiful Soup library.
with open('index.html', 'r') as f:
    contents = f.read()
As you can probably guess, this snippet opens our sample HTML file, reads its contents, and stores them in the contents variable.
soup = BeautifulSoup(contents, "html.parser")
This line creates a Beautiful Soup object from the file contents using Python's built-in HTML parser. Other parsers, such as lxml, can also be used, but lxml is a separate external library, and for the purposes of this tutorial, the built-in parser will do just fine.
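If you later decide to try lxml instead, the switch is minimal: install the library with pip install lxml and change the parser name when creating the object, for example:
soup = BeautifulSoup(contents, "lxml")
Everything else in the code stays the same.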
for child in soup.descendants:
    if child.name:
        print(child.name)
The final piece of code loops over the soup.descendants generator, which yields every element in the document, and prints the name of each HTML tag in the PyCharm console. The results can also easily be exported to a CSV file, but we'll get to that later.
To extract the content of HTML tags, this is what you can do:
from bs4 import BeautifulSoup
with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, "html.parser")

print(soup.h2)
print(soup.p)
print(soup.li)
It’s a simple parsing instruction that outputs the HTML tag with its full content in the specified order. Here’s what the output should look like:
<h2>Proxy types</h2>
<p>
There are many different ways to categorize proxies. However, two of the most popular types are residential and data center proxies. Here is a list of the most common types.
</p>
<li>Residential proxies</li>
Additionally, you can remove the HTML tags and print the text only by adding .text:
print(soup.li.text)
Which gives the following output:
Residential proxies
Note that this only prints the first instance of the specified tag. Let’s continue to see how to find an HTML element by ID and use the find_all method to filter all elements by specific criteria.
You can use two similar ways to find elements by ID:
print(soup.find('ul', attrs={'id': 'proxytypes'}))
or
print(soup.find('ul', id='proxytypes'))
Both of these will output the same result in the Python Console:
<ul id="proxytypes">
<li>Residential proxies</li>
<li>Datacenter proxies</li>
<li>Shared proxies</li>
<li>Semi-dedicated proxies</li>
<li>Private proxies</li>
</ul>
The find_all method is a great way to extract all the data stored in specific elements from an HTML file. It accepts many criteria that make it a flexible tool allowing users to filter data in convenient ways. Let’s find all the items within the <li> tags and print them as text only:
for tag in soup.find_all('li'):
    print(tag.text)
This is what the full code should look like:
from bs4 import BeautifulSoup
with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, "html.parser")

for tag in soup.find_all('li'):
    print(tag.text)
And here’s the output:
Residential proxies
Datacenter proxies
Shared proxies
Semi-dedicated proxies
Private proxies
Beautiful Soup has excellent support for CSS selectors as it provides several methods to interact with HTML content using selectors. Under the hood, Beautiful Soup uses the soupsieve package. When you install Beautiful Soup with Python’s package-management system pip, it’ll automatically install the soupsieve dependency for you. Be sure to check out their documentation to learn more about the supported CSS selectors.
Beautiful Soup primarily provides two methods to interact with HTML web page content using CSS selectors: select and select_one. Let’s try out both of them.
You can grab the title from our HTML sample file using the select method. Your code should look like this:
print(soup.select('html head title'))
Simple, isn’t it? Notice how the CSS selector navigates the HTML by going through the hierarchy of the HTML elements sequentially.
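Keep in mind that select returns a list of every element matching the selector, so the printed result is wrapped in square brackets:
[<title>What is a Proxy?</title>]
If you only need the text of the first match, you can index into the list and add .text:
print(soup.select('html head title')[0].text)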
This method is useful when you need to grab only one element using a CSS selector that matches multiple elements. For instance, our HTML sample has several <li> elements. If you want to grab only the first one, you can use the following CSS selector:
print(soup.select_one('body ul li'))
This will pick the first <li> element of the <ul> tag, which has several other <li> elements.
To extract a specific <li> element, you can add :nth-of-type(n) to your CSS selector. For instance, you can extract the third <li> element, which in our HTML file is <li>Shared proxies</li>, using the following line:
print(soup.select_one('body ul li:nth-of-type(3)'))
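Running this line should output the following:
<li>Shared proxies</li>
You can also combine select with a loop, much like find_all, to print the text of every list item matched by a selector:
for item in soup.select('ul#proxytypes li'):
    print(item.text)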
Most websites these days tend to load content dynamically, meaning data can be left out if JavaScript isn't triggered to load the content. The requests and Beautiful Soup libraries aren't equipped to handle JavaScript-rendered web pages. Consequently, using these libraries to download the HTML document of a website would exclude any dynamically loaded content.
You’ll have to use other libraries that can render the website by executing JavaScript to parse dynamic elements. Python’s Selenium package offers powerful capabilities to interact with and manipulate DOM elements. In a nutshell, its WebDriver utilizes popular web browsers and renders JavaScript-based dynamic websites quickly. By combining Beautiful Soup with Selenium WebDriver, you can easily parse dynamic content from any website.
Additionally, there are other ways you can scrape dynamic websites that we have explored in our Playwright and Scrapy Splash tutorials.
First, install Selenium with the below command:
pip install selenium
As of Selenium 4.6, the browser driver is downloaded automatically. Yet, if you’re using an older version of Selenium or the driver wasn’t found, you’ll have to manually download the WebDriver. Visit this page to find the driver download links for the supported web browsers.
Now that you’ve installed all the required dependencies, you can jump right into writing the code. Let’s begin by importing the newly installed library and Beautiful Soup:
from selenium import webdriver
from bs4 import BeautifulSoup
Next, you’ll have to initiate a browser instance using the below code:
driver = webdriver.Chrome()
The above code uses the Chrome() driver to launch an instance of a Chrome browser.
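If you'd rather not have a browser window pop up every time the script runs, Chrome can also be started in headless mode. A minimal optional sketch, assuming a reasonably recent Chrome version (older versions use the plain --headless flag):
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")  # run Chrome without opening a visible window
driver = webdriver.Chrome(options=options)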
Now, you can use this driver object to fetch dynamic content. So let’s extract the HTML of this JavaScript-rendered dummy website http://quotes.toscrape.com/js/:
driver.get("http://quotes.toscrape.com/js/")
js_content = driver.page_source
As soon as you execute the above code, you’ll notice the Chrome browser instance automatically navigating to the desired website and rendering the JavaScript-based content. The new object named js_content contains the HTML content of the website.
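Some pages need a moment to finish executing their JavaScript, so reading page_source immediately can still return incomplete HTML. A common safeguard is an explicit wait that pauses until a known element appears; here's a sketch that waits for the quote elements, which on this site carry the class name text:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get("http://quotes.toscrape.com/js/")
# Wait up to 10 seconds for the first quote to be rendered by JavaScript
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "text"))
)
js_content = driver.page_source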
Now that you’ve got the HTML content in a string format, you can simply use the BeautifulSoup() constructor to create the Beautiful Soup object with parsed data:
soup = BeautifulSoup(js_content, "html.parser")
You can now navigate the soup object with Beautiful Soup and parse any HTML element using the methods outlined previously. For example, let's extract the first quote found on our target website. Every quote sits inside a <span> tag with the attribute class="text", so the code to extract the content of the quote can look like this:
quote = soup.find("span", class_="text")
print(quote.text)
Note the trailing underscore in class_="text" – it's required because class is a reserved keyword in Python, and using it as a plain argument name would cause a syntax error.
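And if you'd like every quote on the page rather than just the first one, find_all works on the Selenium-rendered HTML exactly as it did on the local file:
for quote in soup.find_all("span", class_="text"):
    print(quote.text)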
When parsing dynamic websites, keep in mind that some websites have strong anti-bot measures that can easily detect Selenium-based web scrapers. Mostly, this is achieved by identifying the Selenium web driver's common request patterns and using various other fingerprinting techniques. Thus, it’s extremely difficult to avoid such anti-bot measures. In case your IP address gets blocked, you might want to consider using proxies and implementing other anti-detection methods.
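For reference, routing the Selenium-controlled browser through a proxy usually comes down to a single Chrome argument. The address below is only a placeholder for your own proxy endpoint:
options = webdriver.ChromeOptions()
options.add_argument("--proxy-server=http://proxy.example.com:8080")  # placeholder proxy address
driver = webdriver.Chrome(options=options)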
By now, you should have a basic understanding of how Beautiful Soup can be used to parse and extract data. Keep in mind that the information presented in this article is introductory material; real-world web scraping and parsing with Beautiful Soup is usually much more complicated than this. For a more in-depth look at Beautiful Soup, you'll hardly find a better source than its official documentation, so be sure to check it out too.
A very common real-world application would be exporting data to a CSV file for later analysis. Although this is outside the scope of this tutorial, let’s take a quick look at how this might be achieved.
First, you would need to install an additional Python library called pandas that helps Python create structured data. This can be easily done by entering the following line in your terminal:
pip install pandas
You should also add this line to the beginning of your code to import the library:
import pandas as pd
Going further, let's add some lines that'll export the list we extracted earlier to a CSV file. This is what your full code should look like:
from bs4 import BeautifulSoup
import pandas as pd
with open('index.html', 'r') as f:
    contents = f.read()

soup = BeautifulSoup(contents, "html.parser")

results = soup.find_all('li')
df = pd.DataFrame({'Names': results})
df.to_csv('names.csv', index=False, encoding='utf-8')
What happens here exactly? Let’s take a look:
results = soup.find_all('li')
This line finds all instances of the <li> tag and stores them in the results object.
df = pd.DataFrame({'Names': results})
df.to_csv('names.csv', index=False, encoding='utf-8')
And here, we see the pandas library at work, storing our results into a table (DataFrame) and exporting it to a CSV file.
If all goes well, a new file titled names.csv should appear in the running directory of your Python project, and inside, you should see a table with the proxy types list. That’s it! Now you not only know how data extraction from an HTML document works, but you can also programmatically export the data to a new file.
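One small caveat: since results holds whole Tag objects, the exported CSV will contain the surrounding <li> markup as well. If you'd prefer plain text in the file, you can strip the tags before building the DataFrame, for example:
df = pd.DataFrame({'Names': [tag.text for tag in results]})
df.to_csv('names.csv', index=False, encoding='utf-8')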
As you can see, Beautiful Soup is a highly useful HTML parser. With a relatively low learning curve, you can quickly grasp how to navigate, search, and modify the parse tree. With the addition of libraries such as pandas, you can further manipulate and analyze the data, making for a powerful toolkit that covers a nearly endless range of data collection and analysis use cases.
And if you’d like to expand your knowledge on Python web scraping in general and get familiar with other Python libraries, we recommend heading over to What is Python used for? and Python Requests blog posts. Also, don't miss out on a 1-week free trial of our advanced public data collection solution – Web Scraper API. Try it out and decide whether it fits your data-gathering needs.
Yes, Beautiful Soup is relatively easy to learn. It offers a straightforward way to extract data by navigating and searching through the HTML structure. In addition, the Beautiful Soup documentation offers in-depth explanations with examples, so you can be sure to find most of the answers to your questions.
While the Beautiful Soup library is pretty simple to use, it still requires you to have, at the very least, basic Python coding knowledge and an understanding of HTML structure.
The answer really depends on what you’re trying to achieve. Beautiful Soup is a lightweight Python library that focuses on data parsing, while Scrapy is a full-fledged web scraping infrastructure that allows users to make HTTP requests, scrape data, and parse it.
In essence, Beautiful Soup is better when working with small-scale web scraping projects that don't require complex web scraping techniques. Scrapy, on the other hand, is better suited for medium to large-scale operations. It offers many more features, such as web crawling, the ability to follow links, concurrency and asynchronous web scraping, cookie management, and more. Using Scrapy for larger projects generally results in better overall performance and speed.
Take a look at our blog post on Web Scraping with Scrapy to learn more and see the tool in action.
Yes, Beautiful Soup is highly regarded for most web scraping projects. Its ease of use through intuitive functions makes it one of the most popular Python parsing libraries. It offers all the fundamentals required to parse HTML and XML files and allows users to search for elements based on HTML tags, attributes, text, and more.
While it lacks some functionality for more complex web scraping tasks, it’s certainly one of the better web scraping libraries for beginner and advanced programmers.
About the author
Adomas Sulcas
PR Team Lead
Adomas Sulcas is a PR Team Lead at Oxylabs. Having grown up in a tech-minded household, he quickly developed an interest in everything IT and Internet related. When he is not nerding out online or immersed in reading, you will find him on an adventure or coming up with wicked business ideas.