Pandas is one of the most popular Python libraries for data analysis. Among its many useful functions is pandas read_html, which can efficiently convert HTML tables into pandas DataFrame objects.
This tutorial will show you how useful pd.read_html can be, especially when combined with other helpful functions.
As mentioned, the function we’ll be using is pd.read_html. For details on reading HTML tables from a variety of sources, you can also head over to the official documentation.
First of all, we’re going to install pandas.
Pandas can be installed with the pip command, or with the conda command if you’re using Anaconda:
pip3 install pandas
conda install pandas
You must also install the lxml, html5lib, BeautifulSoup4, and Matplotlib libraries to facilitate reading & parsing the HTML and plotting the information.
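With pip, all four can be installed in one command (these are the package names as published on PyPI):
pip3 install lxml html5lib beautifulsoup4 matplotlib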
Once pandas is installed, you’ll need to create a new Python file. Alternatively, you can also create a Jupyter Notebook.
Let’s start with extracting data from HTML tables in strings.
In the following code, a variable contains HTML. Note that we’re using Python’s triple quotes to easily store a multiline string in a variable:
html = '''
<table>
<thead>
<tr>
<th>Sequence</th>
<th>Country</th>
<th>Population</th>
<th>Updated</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>China</td>
<td>1,439,323,776</td>
<td>1-Dec-2020</td>
</tr>
<tr>
<td>2</td>
<td>India</td>
<td>1,380,004,385</td>
<td>1-Dec-2020</td>
</tr>
<tr>
<td>3</td>
<td>United States</td>
<td>331,002,651</td>
<td>1-Dec-2020</td>
</tr>
</tbody>
</table>'''
The next step is to import pandas and call the read_html function, passing it the html variable created above. The pandas read_html function will extract data from HTML tables and return a list of all the tables it finds.
import pandas as pd
df_list = pd.read_html(html)
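A quick side note: on newer pandas releases (2.1 and later), passing a literal HTML string directly to read_html is deprecated in favor of a file-like object, so the following variant is more future-proof:
from io import StringIO
df_list = pd.read_html(StringIO(html))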
Note that the pandas read_html function returns a list of pandas DataFrame objects.
In this case, there’s only one table. This means that the variable df_list will be a list of pandas DataFrame objects, with a length of one.
This can be verified by checking the length of the df_list variable.
print(len(df_list))
# OUTPUT: 1
Let’s check the content of the DataFrame by printing it. Add the following print statement and run the file from the terminal.
print(df_list[0])
When you run the file from the terminal, the data from the HTML table will be extracted and displayed as follows:
$ python3 read_html.py
   Sequence        Country  Population     Updated
0         1          China  1439323776  1-Dec-2020
1         2          India  1380004385  1-Dec-2020
2         3  United States   331002651  1-Dec-2020
If you’re using a Jupyter Notebook, the output of the same command will have a better appearance.
[Image: pandas DataFrame created from a string]
Now, it’s time to perform a basic cleanup.
The first thing that looks a bit off is the index column.
Whenever a new pandas DataFrame is created, pandas adds a numeric index starting at 0 by default. In our HTML table, however, the first column, Sequence, is the actual index.
This index column can be easily updated by calling the set_index() function of the DataFrame.
By design, most changes to a pandas DataFrame aren’t permanent. Pandas is made for handling large amounts of data, so most functions have an optional parameter inplace, whose default value is False. Rather than modifying the original, set_index() returns a new pandas DataFrame, which can be stored in another variable as follows:
population = df_list[0].set_index('Sequence')
print(population)
Once again, let’s take a look at this new DataFrame’s output in the Jupyter Notebook.
[Image: pandas DataFrame after updating the index column]
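If you’d rather modify the existing DataFrame instead of creating a new one, the inplace parameter mentioned above can be used; a minimal sketch:
# Set the index in place instead of assigning a new DataFrame
population = df_list[0]
population.set_index('Sequence', inplace=True)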
As mentioned, pandas can handle a large amount of data. It’s also optimized for numeric and date data types. Therefore, it’s important to ensure that the pandas read_html function has correctly interpreted data from HTML tables.
The data types can be checked by calling the info() function of the DataFrame as follows:
print(population.info())
The output will be as follows:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3 entries, 1 to 3
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   Country     3 non-null      object
 1   Population  3 non-null      int64
 2   Updated     3 non-null      object
dtypes: int64(1), object(2)
Note that the Dtype of the Updated column is object. This means that the pandas read_html function didn’t recognize that this column contains dates.
There are multiple ways to fix this. The easiest is to use one more parameter of the pandas read_html function: parse_dates.
Pandas read_html has one mandatory parameter: io. This parameter accepts a string, as we’ve used so far, but it can also accept a file path or a URL. We’ll cover those later in this tutorial.
What’s more interesting is that the pandas read_html function has more than a dozen optional parameters. Out of these, let’s focus on one: parse_dates. This parameter takes a list of column numbers or column names. Even if you’re parsing only one column of the HTML table, the value must still be a list.
Looking at the HTML table that we’re working with, the Updated column is at index 3, as indexing in Python begins with 0. Thus, both of the following statements will ensure that the Updated column is correctly parsed into a date-time data type:
pd.read_html(html, parse_dates=[3])
pd.read_html(html, parse_dates=['Updated'])
Before we put this together, let’s discuss one more useful parameter: index_col. It sets the correct index directly, eliminating the need to call the set_index function.
In the following example, the pandas read_html function receives three arguments:
df_list = pd.read_html(html, parse_dates=['Updated'], index_col=0)
population = df_list[0]
print(population)
The immediate change is that the values in the Updated column are now shown in a different format than in the HTML table. And if the .info() function is called again, the DataFrame will have the correct data types:
[Image: DataFrame with date-time data type]
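You can also confirm this for a single column without the full info() output:
print(population['Updated'].dtype)
# OUTPUT: datetime64[ns]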
The HTML that we used in the previous example is valid: the column headings are in <th> tags, and the heading row is wrapped in a <thead> tag. In practice, this won’t always be the case.
If the heading in the HTML table is embedded in regular <tr> and <td> tags, the DataFrame will be created with default numeric columns.
[Image: Column headings as rows]
In such cases, you can use another optional parameter of the pandas read_html method as follows:
pd.read_html(html_no_head, header=0)
This will set the correct column headings.
If some <td> tags are missing, the corresponding cells will be filled with NaN. Optionally, you can set the keep_default_na parameter to False, and these cells will instead be left blank in the DataFrame.
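Here’s a minimal sketch of both behaviors, reusing the html_no_head string assumed above:
# Missing cells become NaN by default
df = pd.read_html(html_no_head, header=0)[0]
# With keep_default_na=False, empty cells stay as empty strings instead
df = pd.read_html(html_no_head, header=0, keep_default_na=False)[0]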
For further cleanup, you can also use the pandas replace function. For details on this, see the documentation.
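For instance, a one-liner like the following (the replacement value here is just an example) swaps empty strings for a more explicit placeholder:
df = df.replace('', 'unknown')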
Extracting data from HTML tables stored in HTML files is almost the same as reading from strings. There’s just one main difference to note.
Here’s one local file opened in Chrome:
[Image: HTML file on the local disk]
Instead of an HTML string, the pandas read_html function is given the file path, which can be relative or absolute.
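To follow along locally, you can first save the html string from earlier to a file; the /tmp path below simply matches the example that follows:
# Write the sample HTML table to a local file
with open('/tmp/population.html', 'w') as f:
    f.write(html)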
Assuming that the population.html file containing the HTML table with population information is located in the /tmp folder, you can read it as follows:
population_file = pd.read_html("/tmp/population.html", parse_dates=['Updated'], index_col=0)
print(population_file[0])
[Image: HTML file converted to a DataFrame]
Now the HTML file has been read, the index is set correctly, and the dates have been parsed, just as was done for the HTML table stored in a string.
Pandas can connect directly to web URLs and read HTML tables, a functionality that can be used for Python web scraping. In this section, we’ll work on an example web scraping project that involves extracting a list of the best sci-fi novels.
Extracting data from HTML tables is the same, whether the source is a string containing HTML, a local HTML file path, or, in this case, any website that contains HTML. In this practical example, the URL will be Top 100 Best Sci-Fi Novels.
The first step is to extract the list of tables using pandas read_html function. Next, we’ll check the length of the tables returned.
import pandas as pd
list_of_df = pd.read_html("https://en.wikipedia.org/w/index.php?title=Science_Fiction:_The_100_Best_Novels&oldid=1091082777")
print(len(list_of_df))
# OUTPUT: 7
To get to the exact table, multiple approaches are possible. The first is to simply check every index and find which one contains the required table. However, this method isn’t reliable: the number of tables on the page can change, and the web scraping script we’re creating would then fail.
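If you still want to eyeball the tables, a quick loop over their shapes is enough, though it remains a manual step:
# Print the position and dimensions of every table found on the page
for i, table in enumerate(list_of_df):
    print(i, table.shape)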
There are two better ways to extract the required table. The first method is to use a regular expression. To do so, we first need to identify a pattern inside the <table> that we want to scrape. Open the URL in a browser, right-click the table, and select Inspect.
[Image: HTML table markup]
When examining the HTML markup of the site, we can see that the <caption> tag contains the text: “The 100 Best Novels.” This <caption> is inside the <table> tag that we need.
The first step is to create a regular expression. In this case, the regular expression can simply be the text “The 100 Best Novels.” Note that this is case-sensitive and space-sensitive.
This regular expression can now be supplied to the optional parameter match of the pandas read_html function.
import pandas as pd
list_of_df = pd.read_html("https://en.wikipedia.org/w/index.php?title=Science_Fiction:_The_100_Best_Novels&oldid=1091082777", match='The 100 Best Novels')
print(len(list_of_df))
# OUTPUT: 1
Note that in this particular case, the literal text “The 100 Best Novels” worked. However, the match parameter can take any regular expression. For example:
[\w\s]+(\d){3}[\w\s]+
This means that the following line of code will also extract the same table:
pd.read_html("https://en.wikipedia.org/w/index.php?title=Science_Fiction:_The_100_Best_Novels&oldid=1091082777", match='[\w\s]+(\d){3}[\w\s]+')
One more way to extract the required table is by using specific attributes. For example, the table that we need has the CSS class wikitable applied to it. This information can be supplied to the attrs parameter of the pandas read_html function:
pd.read_html("https://en.wikipedia.org/w/index.php?title=Science_Fiction:_The_100_Best_Novels&oldid=1091082777", attrs={'class':"wikitable"})
Now that we have the data, it’s time for analysis.
This web scraping project can be taken further by analyzing the data. Let’s find the author who has written the most books in this Top 100 list:
df = list_of_df[0]
df.value_counts(subset=['Author'])
This will print the following pandas Series:
Author
Philip K. Dick        6
J. G. Ballard         4
Robert A. Heinlein    3
Brian Aldiss          3
Thomas M. Disch       3
..
This tells us that Philip K. Dick has written 6 of the 100 best books. If needed, you can also plot charts to represent the same information.
First, this Series can be converted to a pandas DataFrame for further analysis. This can be done by calling the reset_index function and giving the count of books a name, in this case BookCount:
df = df.value_counts(subset=['Author']).reset_index(name='BookCount')
The next step is to make a subset of this DataFrame containing the authors who have three or more books in this Top 100:
top_df = df[df['BookCount'] >= 3]
print(top_df)
The output will be the following DataFrame:
               Author  BookCount
0      Philip K. Dick          6
1       J. G. Ballard          4
2  Robert A. Heinlein          3
3        Brian Aldiss          3
4     Thomas M. Disch          3
And finally, this data can be plotted as a horizontal bar chart:
top_df.plot.barh(x='Author', y='BookCount', figsize=(12, 5))
[Image: Authors with three or more books on the Top 100 list]
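Note that if you’re running this as a plain script rather than in a Jupyter Notebook, you’ll also need Matplotlib’s show call to display the chart:
import matplotlib.pyplot as plt

top_df.plot.barh(x='Author', y='BookCount', figsize=(12, 5))
plt.show()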
The complete code used in this article is available in a repository on GitHub.
Pandas read_html is a powerful yet easy-to-use tool for web scraping, and what makes it stand out is its simplicity combined with pandas’ data analysis capabilities. However, no library is perfect, and pandas is no exception. One limitation of read_html is that the User-Agent and other request headers cannot be changed, so many websites will reject its requests.
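A common workaround, sketched below, is to fetch the page yourself with the requests library (an extra dependency, and both the URL and the User-Agent string here are only examples) and hand the downloaded HTML to read_html:
import requests
import pandas as pd

url = 'https://example.com/table-page'  # hypothetical URL
headers = {'User-Agent': 'Mozilla/5.0'}  # example User-Agent string

response = requests.get(url, headers=headers)
df_list = pd.read_html(response.text)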
The biggest limitation, though, is that read_html can only retrieve HTML tables that are present in the page source; tables loaded with JavaScript are out of reach, and this is quite often the case on modern websites.
Oxylabs’ web scraper can help retrieve tables that are loaded with JavaScript. These solutions can handle almost any kind of website without having to worry about User-Agent or rendering.