Not so long ago, OpenAI achieved a major breakthrough in conversational AI by making an advanced chatbot, ChatGPT, publicly available. ChatGPT is powered by a large language model from OpenAI’s GPT family. It is trained on massive amounts of data and can help solve problems in several domains. In this article, we’ll demonstrate how to use ChatGPT to develop fully functional Python web scrapers. We’ll also discuss some tips and tricks for improving the quality of a scraper’s code.
Before moving to the actual topic, let’s briefly introduce our demo target for this tutorial. We will extract data from the Oxylabs Sandbox, a dummy e-commerce store that maintains video game listings in several categories. Here is what the landing page of the store looks like:
Now, let’s delve into the steps required to scrape data from this webpage using ChatGPT.
1. Create a ChatGPT account: Visit ChatGPT’s login page and hit Sign-up. You also have the option to sign up using your Google account. On successful sign-up, you will be redirected to the chat window. You can initiate a chat by entering your query in the text field.
2. Locate the elements to scrape: Before prompting ChatGPT, let’s first locate the elements we need to extract from the target page. Assume that we need only the video game titles and prices.
Right-click one of the game titles and select “Inspect.” This will open the HTML code for this element in the Developer Tools window.
Right-click the element that contains the game title and select “Copy selector.” The following figure explains it all.
Write down the selector and repeat the same to find the selector for the Price element.
3. Prepare the ChatGPT prompt: The prompt should be well-explained, specifying the code’s programming language, the tools and libraries to be used, the element selectors, the output, and any special instructions the code must comply with. Here is a sample prompt that you can use to create the web scraper using Python and BeautifulSoup:
Write a web scraper using Python and BeautifulSoup.
Sample Target: https://sandbox.oxylabs.io/products
Rationale: Scrape the video game titles and prices of all the games on the target page.
CSS selectors are as follows:
1. Title: #__next > main > div > div > div > div:nth-child(2) > div > div:nth-child(1) > a.card-header.css-o171kl.eag3qlw2 > h4
2. Price: #__next > main > div > div > div > div:nth-child(2) > div > div:nth-child(1) > div.price-wrapper.css-li4v8k.eag3qlw4
Output: Save all the Titles and Prices for all the video games in a CSV file
Additional Instructions: Handle character encoding and remove undesirable symbols in the output CSV.
Notice we have provided CSS selectors for prices and titles that we copied in the earlier step.
The scraped data might contain a few odd characters if not handled properly due to encoding issues. It happens when the web page and the Python script interpret the character encoding differently. Therefore, we can add relevant Additional Instructions to avoid these encoding issues.
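To make the encoding mismatch concrete, here is a minimal sketch that uses only the Python standard library. The price string is an illustrative assumption rather than data taken from the target page; the point is only to show how the same bytes turn into odd characters when they are decoded with the wrong encoding.

```python
# Illustrative value only (assumption): a price containing a non-ASCII symbol.
price_text = "9,99 €"

# A web server sends bytes, not text. Here the text is encoded as UTF-8.
raw_bytes = price_text.encode("utf-8")

# Decoding with the same encoding recovers the original text.
decoded_correctly = raw_bytes.decode("utf-8")

# Decoding with a different encoding (Latin-1 here) yields garbled characters,
# because each byte of the multi-byte euro sign becomes its own character.
decoded_wrongly = raw_bytes.decode("latin-1")

print(repr(decoded_correctly))
print(repr(decoded_wrongly))
```

Asking ChatGPT to handle character encoding explicitly is meant to keep the reading side and the writing side of the scraper in agreement, so that this kind of mismatch does not reach the CSV file.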
4. Review the code: Once ChatGPT replies with code, it’s always a good idea to review it first. Read through the code to check that it appears to do what you want. Double-check whether it uses any libraries or packages you may wish to exclude. If you find any glitches, write back to ChatGPT and ask it to generate the code again.
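As one possible aid for the library check, the sketch below lists the modules a piece of generated code imports without running it. It relies on Python’s standard `ast` module; the function name `list_imported_modules` and the sample snippet are hypothetical and only serve as an illustration.

```python
import ast


def list_imported_modules(source: str) -> list[str]:
    """Return the sorted top-level module names imported by `source`."""
    modules = set()
    # ast.parse only builds a syntax tree; the code is never executed.
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module.split(".")[0])
    return sorted(modules)


# Hypothetical stand-in for code pasted from a ChatGPT reply.
generated_code = """
import requests
from bs4 import BeautifulSoup
import csv
"""

print(list_imported_modules(generated_code))
```

If the list contains a package you would rather avoid, you can name it in a follow-up prompt and ask ChatGPT to regenerate the code without it.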
5. Execute and Test: Copy the ChatGPT-generated code and check that it runs without errors. Here is the code that ChatGPT generated for us:
import requests
from bs4 import BeautifulSoup
import csv
# URL of the target page
url = "https://sandbox.oxylabs.io/products"
# Send a GET request to the URL
response = requests.get(url)
# Parse the HTML content using BeautifulSoup
soup = BeautifulSoup(response.content, "html.parser")
# CSS selectors for title and price
title_selector = "a.card-header h4"
price_selector = "div.price-wrapper"
# Find all game titles and prices
titles = soup.select(title_selector)
prices = soup.select(price_selector)
# Create a list to store the data
data = []
# Iterate through the titles and prices
for title, price in zip(titles, prices):
    game_title = title.get_text(strip=True)
    game_price = price.get_text(strip=True)
    data.append((game_title, game_price))
# Define the CSV filename
filename = "game_data.csv"
# Write the data to a CSV file
with open(filename, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Price"])  # Write header
    for game in data:
        writer.writerow(game)
print(f"Data scraped successfully and saved to '{filename}'.")
Note: Remember to install the `BeautifulSoup` library and `requests` library before running the code. This can be done by opening the terminal and running:
pip install requests beautifulsoup4
Here is the snippet of the output CSV file generated after executing the code:
Congratulations! We just scraped the target website with very little effort. But what if the code raises exceptions or the output is not what you wanted? Let’s address these concerns in the next section.
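A quick way to notice that the output is not what you wanted is to check the scraped rows programmatically. The following sketch assumes the two-column Title/Price layout described in the prompt; the function name `find_problem_rows` and the sample data are hypothetical.

```python
import csv
import io


def find_problem_rows(csv_text: str) -> list[int]:
    """Return 1-based data-row numbers with a wrong column count or an empty field."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header != ["Title", "Price"]:
        raise ValueError(f"Unexpected header: {header!r}")
    problems = []
    for row_number, row in enumerate(reader, start=1):
        if len(row) != 2 or not all(field.strip() for field in row):
            problems.append(row_number)
    return problems


# Hypothetical sample standing in for the contents of the output CSV file.
sample = 'Title,Price\nExample Game A,"9,99 €"\nExample Game B,\n'
print(find_problem_rows(sample))
```

Rows flagged by a check like this are good candidates to paste back into ChatGPT together with a description of what went wrong.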
There are certain tips and tricks that you can follow to get the most accurate and desired information from ChatGPT.
1. Get Code Editing Assistance
One of ChatGPT’s most useful features is code editing. If the generated code doesn’t meet your requirements or produces incorrect output, you can ask ChatGPT to edit it based on your needs.
Specify the changes you want to make, such as changing which elements are scraped, making the code more efficient, or modifying the data extraction procedure. ChatGPT can offer alternative code options or suggest modifications to improve the web scraping process.
2. Linting Code for Quality
Linting code improves its readability and maintainability. ChatGPT can assist with linting by recommending best practices, spotting potential syntax problems, and improving code readability.
To adhere to coding standards and practices, you can ask ChatGPT to review the code and provide recommendations. You can even paste your own code and ask ChatGPT to lint it, or add the phrase “lint the code” to the additional instructions of the prompt.
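To give a sense of the kind of issue a linter reports, the sketch below flags imports that are never used. It is a deliberately small illustration built on the standard `ast` module, not a replacement for a real linter such as Flake8 or Pylint; the function name and the sample source are hypothetical.

```python
import ast


def find_unused_imports(source: str) -> list[str]:
    """Return imported names that are never referenced in `source`."""
    tree = ast.parse(source)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                if alias.name == "*":
                    continue  # star imports cannot be tracked by name
                # The bound name is the alias if given, else the first dotted part.
                imported.add(alias.asname or alias.name.split(".")[0])
    # Every bare identifier that appears anywhere in the code.
    used = {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}
    return sorted(imported - used)


# Hypothetical scraper fragment with one leftover import.
sample_source = """
import csv
import json
import requests

response = requests.get("https://sandbox.oxylabs.io/products")
writer = csv.writer(open("out.csv", "w"))
"""

print(find_unused_imports(sample_source))
```

Findings like these can be pasted into the chat so that ChatGPT cleans up the specific lines instead of rewriting the whole script.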
3. Code Optimization Assistance
When it comes to web scraping, efficiency is critical, especially when working with large datasets or challenging web scraping tasks. ChatGPT can provide tips on how to increase the performance of your code.
You can ask for advice on how to use frameworks and packages that speed up web scraping, use caching techniques, exploit concurrency or parallel processing, and minimize pointless network calls.
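As an illustration of two of these ideas, the sketch below skips duplicate URLs and fetches the remaining ones concurrently with a thread pool. The `fetch` callable is injected so that the example runs without network access; `fake_fetch` and the `?page=N` URL pattern are assumptions made for the demonstration only.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fetch_all(
    urls: list[str],
    fetch: Callable[[str], str],
    max_workers: int = 4,
) -> dict[str, str]:
    """Fetch each distinct URL once, in parallel, and return {url: body}."""
    unique_urls = list(dict.fromkeys(urls))  # drop duplicates, keep order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        bodies = list(pool.map(fetch, unique_urls))
    return dict(zip(unique_urls, bodies))


# Stand-in for a real network call (assumption). In a real scraper this could
# be a function that returns requests.get(url).text.
def fake_fetch(url: str) -> str:
    return f"<html>{url}</html>"


pages = fetch_all(
    [
        "https://sandbox.oxylabs.io/products?page=1",
        "https://sandbox.oxylabs.io/products?page=2",
        "https://sandbox.oxylabs.io/products?page=1",  # duplicate, fetched once
    ],
    fake_fetch,
)
print(len(pages))
```

Threads suit this task because the scraper spends most of its time waiting on the network, but a high `max_workers` value also makes it easier to trip a site’s rate limits.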
Some websites generate their content dynamically using JavaScript libraries or AJAX requests. ChatGPT can help you navigate such complex web content. You can ask ChatGPT about techniques for extracting dynamic content from such JavaScript-rendered pages.
ChatGPT can offer suggestions on using headless browsers, parsing dynamic HTML, or even automating interactions using simulated user actions.
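One technique that sometimes avoids a headless browser altogether is reading the JSON data that some JavaScript frameworks embed in the page source. The sketch below assumes the page carries a `<script id="__NEXT_DATA__">` tag, a convention used by some Next.js sites; whether a given page does so has to be verified in Developer Tools. The JSON structure in the sample is entirely hypothetical.

```python
import json
from html.parser import HTMLParser


class ScriptJsonExtractor(HTMLParser):
    """Collect the text content of the <script> tag with a given id."""

    def __init__(self, script_id: str):
        super().__init__()
        self.script_id = script_id
        self._inside = False
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("id") == self.script_id:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_embedded_json(html: str, script_id: str = "__NEXT_DATA__"):
    """Return the parsed JSON from the matching script tag, or None if absent."""
    parser = ScriptJsonExtractor(script_id)
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks).strip()
    return json.loads(text) if text else None


# Hypothetical page source; the keys below are invented for the example.
sample_html = """
<html><body><div id="__next"></div>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"products": [{"title": "Example Game", "price": "9,99"}]}}
</script></body></html>
"""

data = extract_embedded_json(sample_html)
print(data["props"]["products"][0]["title"])
```

When no such embedded data exists, a headless browser remains the more general option, and ChatGPT can be asked to generate that variant instead.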
Large language models (LLMs), such as the GPT models that power ChatGPT, are fundamentally prone to hallucination. This means ChatGPT can return responses that are factually incorrect or inconsistent with reality.
Understanding the limitations of ChatGPT is crucial. Although it has been trained extensively on a massive amount of data, there are instances where it may produce code snippets unsuitable for direct execution. Therefore, reviewing and verifying the ChatGPT response and the resulting code before executing it is imperative.
There are also other limitations to using ChatGPT for web scraping. Many websites implement strong security measures to block automated scrapers; CAPTCHAs and request rate limiting are common examples. As a result, simple ChatGPT-generated scrapers may fail on these sites. However, Web Unblocker by Oxylabs can help in these scenarios.
Web Unblocker provides features such as rotating proxies, bypassing CAPTCHAs, managing requests, etc. Such measures can help minimize the chances of triggering automated bot detection.
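On the scraper’s own side, a common way to cope with rate limiting is to retry with an increasing delay when the server answers with HTTP 429 (Too Many Requests). The sketch below shows that idea only; it does nothing about CAPTCHAs. The `fetch` callable, which returns a status code and a body, is a hypothetical stand-in for a real HTTP call, and the `sleep` parameter is injectable so the example runs instantly.

```python
import time
from typing import Callable


def fetch_with_backoff(
    fetch: Callable[[], tuple[int, str]],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call `fetch` until it stops returning HTTP 429, doubling the wait each time."""
    for attempt in range(max_attempts):
        status, body = fetch()
        if status != 429:
            return body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError(f"Still rate-limited after {max_attempts} attempts")


# Simulated server (assumption): rate-limits twice, then succeeds.
responses = iter([(429, ""), (429, ""), (200, "<html>ok</html>")])
waits: list[float] = []
body = fetch_with_backoff(lambda: next(responses), sleep=waits.append)
print(body, waits)
```

Backing off reduces the load on the target site as well, which makes it a reasonable default even when a proxy solution is in place.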
ChatGPT has made writing simple web scrapers a trivial task. However, because the underlying AI model can hallucinate, the generated code may be incorrect and should always be reviewed. Additionally, ChatGPT does not provide any significant help in bypassing CAPTCHAs, and it offers no infrastructure, such as web proxies, for more scalable scraping.
Being an AI language model, ChatGPT can’t directly scrape public website data. However, it can help write web scraping code.
The degree of anonymity while web scraping may vary. In most cases, it depends on the proxy use and scraping patterns.
There are multiple ways to enhance anonymity while web data scraping, including using proxy servers, rotating IP addresses, and implementing scraping logic that will help a scraper resemble an organic user.
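The sketch below illustrates two of these ideas: rotating through a list of proxies and pausing for a randomized interval between requests. The proxy addresses are placeholders (an assumption), and the function only plans the requests rather than sending them, so it runs without network access.

```python
import itertools
import random

# Hypothetical proxy endpoints (assumption); replace with real ones.
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
]


def plan_requests(urls, proxies, min_delay=1.0, max_delay=3.0, seed=None):
    """Pair each URL with the next proxy in rotation and a randomized pause."""
    if not proxies:
        raise ValueError("At least one proxy is required")
    rng = random.Random(seed)
    rotation = itertools.cycle(proxies)
    return [
        {
            "url": url,
            "proxy": next(rotation),
            "delay": rng.uniform(min_delay, max_delay),
        }
        for url in urls
    ]


for step in plan_requests(["https://sandbox.oxylabs.io/products"] * 4, PROXIES):
    print(step["proxy"], round(step["delay"], 2))
```

With the `requests` library, each planned step could then be sent by sleeping for `step["delay"]` seconds and passing `proxies={"http": step["proxy"], "https": step["proxy"]}` to `requests.get`.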
About the author
Maryia Stsiopkina
Senior Content Manager
Maryia Stsiopkina is a Senior Content Manager at Oxylabs. As her passion for writing was developing, she was writing either creepy detective stories or fairy tales at different points in time. Eventually, she found herself in the tech wonderland with numerous hidden corners to explore. At leisure, she does birdwatching with binoculars (some people mistake it for stalking), makes flower jewelry, and eats pickles.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.