The demand for digital content has grown exponentially, and with it the competition, so websites change and update their structure rapidly.
Quick updates benefit general consumers. However, they're a considerable hassle for businesses that collect public data, since web scraping relies on routines tailored to the specific conditions of individual websites, and frequent updates tend to break them. This is where RegEx comes into play, simplifying some of the more complex parts of the data acquisition and parsing process.
RegEx stands for Regular Expressions, a way to describe text patterns that can be matched against input and used as filters to extract the desired output.
RegEx can be used to validate all types of character combinations, including special characters like line breaks. One of the biggest advantages of Regular Expressions is that no matter the type or size of the input, it's always compared against the same single expression, which keeps the code concise and efficient.
Regular Expressions are universal, with support built into virtually every programming language.
Token | Matches
^ | Start of a string
$ | End of a string
. | Any character (except \n)
| | Either expression on each side of the symbol (alternation)
\ | Escapes special characters
Char | The literal character given
* | Zero or more of the preceding character
? | Zero or one of the preceding character
+ | One or more of the preceding character
{n} | Exactly n of the preceding character
{n,m} | Between n and m of the preceding character
\d | Any digit
\s | Any whitespace character
\w | Any word character
\b | Word boundary
\D | Inverse of \d
\S | Inverse of \s
\W | Inverse of \w
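To illustrate, here is a quick sketch of a few of these tokens using Python's built-in re module (the sample strings are made up for demonstration):

import re

# \d matches any digit; + means one or more of the preceding token.
print(re.findall(r'\d+', 'Order 66 shipped 2 items'))  # ['66', '2']
# ^ anchors the match to the start of the string; \w+ is one or more word characters.
print(re.search(r'^\w+', 'hello world').group())  # hello
# ? makes the preceding character optional; $ anchors to the end of the string.
print(bool(re.match(r'colou?r$', 'color')))  # True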
In this tutorial, the RegEx scraping target is product titles and prices from a dummy website intended for training purposes.
The latest version of Python.
Beautiful Soup 4 library to parse HTML.
Requests library to make HTTP requests.
Let’s begin with creating a virtual environment for the project:
python3 -m venv scrapingdemo
Activate the newly created virtual environment (this example is for Linux or macOS):
source ./scrapingdemo/bin/activate
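If you're on Windows, the equivalent activation command (for Command Prompt) is:
scrapingdemo\Scripts\activate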
Now, install the required Python modules.
Requests is a library responsible for sending HTTP requests to websites and returning their responses. To install Requests, enter the following:
pip install requests
Beautiful Soup is a module used to parse and extract data from the HTML response. To install Beautiful Soup, enter the following:
pip install beautifulsoup4
re is a built-in Python module responsible for working with Regular Expressions.
Next, create an empty Python file, for example, demo.py.
To import the required libraries, enter the following:
import requests
from bs4 import BeautifulSoup
import re
Use the Requests library to send a request to the web page from which you want to scrape the data; in this case, https://sandbox.oxylabs.io/products. To begin, enter the following:
page = requests.get('https://sandbox.oxylabs.io/products')
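Optionally, you can make sure the request succeeded before moving on. A minimal check, continuing from the snippet above, uses the raise_for_status() method from the Requests library:

# Raises an exception if the server returned a 4xx or 5xx status code.
page.raise_for_status()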
First, create a Beautiful Soup object and pass the page content received from your request during initialization, along with the parser type. Since you're working with HTML code, select html.parser as the parser type.
By inspecting the elements (right-click and select Inspect) in a browser, you can see that each game title and price is presented inside a div element with the class product-card. Use Beautiful Soup to get all of these elements; each one will be converted to a string later so the regular expressions can be applied to it:
soup = BeautifulSoup(page.content, 'html.parser')
products = soup.find_all("div", class_="product-card")
Since the acquired content contains a lot of markup you don't need, create two regular expressions to extract only the desired data.
Finding the pattern
First, inspect the title of the product to find the pattern. You can see above that every title appears inside an h4 tag with the same class, in the <h4 class="title css-7u5e79 eag3qlw7">The Legend of Zelda: Ocarina of Time</h4> format.
Generating the expression
Then, create an expression that captures the text between the opening and closing tags using the non-greedy group (.*?).
The first expression is as follows:
re_titles = r'class="title css-7u5e79 eag3qlw7">(.*?)<\/h4>'
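If you want to verify the expression before running the full script, a quick standalone test against the sample markup quoted above could look like this:

import re

re_titles = r'class="title css-7u5e79 eag3qlw7">(.*?)<\/h4>'
sample = '<h4 class="title css-7u5e79 eag3qlw7">The Legend of Zelda: Ocarina of Time</h4>'
# The non-greedy group stops at the first closing tag.
print(re.findall(re_titles, sample))  # ['The Legend of Zelda: Ocarina of Time']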
Finding the pattern
Next, inspect the price of the product. Every price is presented inside a div tag in the <div class="price-wrapper css-li4v8k eag3qlw4">91,99 €</div> format.
Generating the expression
Then, create an expression that returns data inside the div element.
The second expression is as follows:
re_prices = r'class="price-wrapper css-li4v8k eag3qlw4">(.*?)<\/div>'
To conclude, loop over the product elements, convert each one to a string, and use the expressions with re.findall to find the substrings matching the patterns. Save the resulting pairs in a data list:
data = []
for product in products:
    product_html = str(product)
    title = re.findall(re_titles, product_html)
    price = re.findall(re_prices, product_html)
    data.append((title, price))
To save the output, loop over the collected pairs and write them to the output.txt file. Since re.findall returns a list, take the first match of each and skip products where either pattern found nothing:
with open("output.txt", "w", encoding="utf-8") as f:
    for titles, prices in data:
        if titles and prices:
            f.write(f"{titles[0]}\t{prices[0]}\n")
Putting everything together, this is the complete code that can be run by calling python demo.py:
# Importing the required libraries.
import requests
from bs4 import BeautifulSoup
import re

# Requesting the HTML from the target website.
url = "https://sandbox.oxylabs.io/products"
page = requests.get(url)

# Selecting data.
soup = BeautifulSoup(page.content, "html.parser")
products = soup.find_all("div", class_="product-card")

# Processing data using Regular Expressions.
re_titles = r'class="title css-7u5e79 eag3qlw7">(.*?)<\/h4>'
re_prices = r'class="price-wrapper css-li4v8k eag3qlw4">(.*?)<\/div>'

data = []
for product in products:
    product_html = str(product)
    title = re.findall(re_titles, product_html)
    price = re.findall(re_prices, product_html)
    data.append((title, price))

# Saving the output.
with open("output.txt", "w", encoding="utf-8") as f:
    for titles, prices in data:
        if titles and prices:
            f.write(f"{titles[0]}\t{prices[0]}\n")
This article explained what Regular Expressions are, how to use them, and what the most commonly used tokens do. It also provided an example of scraping titles and prices from a web page using Python and Regular Expressions. If you're looking for an advanced web scraping solution, feel free to explore the features of our Web Scraper API.
Don’t forget to check our blog for more step-by-step tutorials on web scraping with Python, PHP, Ruby, Golang, and many more, or take a look at a guide on how to use Wget with proxy.
About the author
Augustas Pelakauskas
Senior Copywriter
Augustas Pelakauskas is a Senior Copywriter at Oxylabs. Coming from an artistic background, he is deeply invested in various creative ventures - the most recent one being writing. After testing his abilities in the field of freelance journalism, he transitioned to tech content creation. When at ease, he enjoys sunny outdoors and active recreation. As it turns out, his bicycle is his fourth best friend.