In this tutorial, you’ll learn how to scrape Yellow Pages using Python. Yellow Pages listings typically contain valuable business details such as addresses, ratings, services offered, business emails, and phone numbers. In today’s data-driven world, acquiring such information from diverse sources is valuable for many businesses: it provides insights for market analysis, competitor profiling, lead generation, and more. So, let’s get started.
First, you’ll need to install Python. Visit the official Python website (python.org) and download the latest release.
Now that you’ve installed Python, use pip to install the necessary libraries and their dependencies:
pip install requests bs4
The above command installs two libraries: Requests and Beautiful Soup.
Once installed, you can import those libraries by typing the following code in your favorite code editor.
from bs4 import BeautifulSoup
import requests
To bypass the anti-bot protection challenges of Yellow Pages, you’ll have to use Oxylabs’ Yellow Pages Scraper API. It’s a powerful AI-driven tool that can handle proxy rotation and management. It also has various useful features, such as mimicking network requests of different devices, JavaScript rendering, etc.
Once you sign up and create an Oxylabs account, you’ll get the sub-account credentials. Take note of the username and password; you’ll use them in the next step.
Next, you’ll fetch a Yellow Pages listing using the Web Scraper API. To do so, send a POST request to the API with a payload and your credentials.
url = "https://www.yellowpages.ca/bus/Ontario/North-York/The-Burger-Cellar/6835043.html"
payload = {
    'source': 'universal',
    'render': 'html',
    'url': url,
}
credential = ('USERNAME', 'PASSWORD')
response = requests.post(
    'https://realtime.oxylabs.io/v1/queries',
    auth=credential,
    json=payload,
)
print(response.status_code)
To scrape Yellow Pages, the source must be set to universal. The render parameter tells the API to execute JavaScript while rendering the HTML content. Don’t forget to replace USERNAME and PASSWORD with your credentials; otherwise, you’ll get authentication errors from the API. If everything works as expected, you’ll get a status_code of 200 when you run the code.
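If you plan to fetch more than one listing, the request above can be wrapped in a small function. The sketch below is one possible arrangement, not an official Oxylabs client: the names build_payload and fetch_listing, the 180-second timeout, and the idea of passing in a session object (for example, requests.Session()) are assumptions made for illustration. The endpoint and payload fields are the ones already used in this tutorial.

```python
API_ENDPOINT = "https://realtime.oxylabs.io/v1/queries"


def build_payload(url):
    """Build the payload used in this tutorial: universal source, rendered HTML."""
    return {"source": "universal", "render": "html", "url": url}


def fetch_listing(session, url, credentials, timeout=180):
    """POST the job to the API and return the decoded JSON response.

    `session` is any object with a requests-style `post` method, such as
    `requests.Session()`. The timeout is an assumed default, not a
    documented recommendation.
    """
    response = session.post(
        API_ENDPOINT,
        auth=credentials,
        json=build_payload(url),
        timeout=timeout,
    )
    # Raise an exception on 4xx/5xx responses (for example, an
    # authentication failure) instead of continuing with an error body.
    response.raise_for_status()
    return response.json()
```

With Requests installed, a call might look like data = fetch_listing(requests.Session(), url, ('USERNAME', 'PASSWORD')).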
Now, from the response object, you can extract the HTML content of the web page. Scraper API returns a JSON response that contains the HTML content, so you can take advantage of the json() method.
content = response.json()["results"][0]["content"]
soup = BeautifulSoup(content, "html.parser")
The soup object holds the parsed HTML content, from which you can extract the Yellow Pages data using CSS selectors.
The easiest way to find the CSS selector of an element is to use the Developer Tools of a web browser. In this tutorial, we’re using Google Chrome, but Firefox and other browsers have similar tools. Open the target URL in your browser, right-click on the page, and select Inspect. Alternatively, press the keyboard shortcut CTRL + SHIFT + I.
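Indexing straight into the decoded JSON raises a KeyError or IndexError if the response contains no results. The helper below is a minimal sketch (the name extract_content is hypothetical) that checks only the shape this tutorial relies on, results[0]["content"], and fails with a clearer message otherwise.

```python
def extract_content(data):
    """Return the HTML string from a decoded API response.

    Raises ValueError when the expected results[0]["content"] path
    is missing or empty.
    """
    results = data.get("results") or []
    if not results:
        raise ValueError("API response contains no results")
    content = results[0].get("content")
    if not content:
        raise ValueError("First result has no content")
    return content
```

It could then feed the parser directly: soup = BeautifulSoup(extract_content(response.json()), "html.parser").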
Now, if you inspect the source code and locate the name element, you’ll notice the business name is wrapped in a span tag.
This <span> element has an itemprop attribute set to name. You can use this attribute to locate the element with the find() method, as shown below.
name = soup.find('span', {'itemprop': 'name'}).get_text(strip=True)
print(name)
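Note that find() returns None when no element matches, so chaining .get_text() onto it raises an AttributeError on pages that lack the field. A small wrapper can return None instead. This is a sketch under the assumption that the fields of interest are tags carrying an itemprop attribute, as on the example page; the helper name find_text is hypothetical.

```python
def find_text(soup, itemprop, tag="span"):
    """Return the stripped text of the first <tag itemprop="...">, or None.

    `soup` is a BeautifulSoup object (or any tag) exposing find().
    """
    element = soup.find(tag, {"itemprop": itemprop})
    if element is None:
        # The page has no such element; let the caller decide what to do.
        return None
    return element.get_text(strip=True)
```

For example, name = find_text(soup, "name") yields the business name, or None if the listing has no matching element.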
Next, let’s inspect the address element by finding the <div> element that wraps the address. Inside this <div> element, you’ll find several <span> tags.
Since the address is split across several <span> elements, you can use a for loop to extract the text of each one and then use the join() method to reconstruct the address string, as shown below.
itemprops = ["streetAddress", "addressLocality", "addressRegion", "postalCode"]
address_text = []
for itemprop in itemprops:
    address_text.append(soup.find('span', {'itemprop': itemprop}).get_text(strip=True))
address = ', '.join(text for text in address_text if text)
print(address)
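The loop above assumes all four address parts are present. Some listings may omit one, such as the postal code (an assumption; the example page has all of them). A variant that skips missing parts could look like the sketch below, where extract_address is a hypothetical name.

```python
ADDRESS_PROPS = ["streetAddress", "addressLocality", "addressRegion", "postalCode"]


def extract_address(soup):
    """Join the address parts that exist on the page with ', '."""
    parts = []
    for prop in ADDRESS_PROPS:
        element = soup.find("span", {"itemprop": prop})
        if element is None:
            # This part is not on the page; skip it rather than crash.
            continue
        text = element.get_text(strip=True)
        if text:
            parts.append(text)
    return ", ".join(parts)
```

If none of the parts are found, the function returns an empty string.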
Similarly, you can find the phone element, which is also a span tag with itemprop set to telephone.
The code to extract phone numbers is similar to the code for extracting business names.
phone = soup.find('span', {'itemprop': 'telephone'}).get_text(strip=True)
print(phone)
Finally, inspect the ratings element. Notice that all the stars are wrapped in a <span> element.
This <span> element doesn’t have an itemprop attribute, so you can select it by its class, jsReviewsChart, and read the rating from its aria-label attribute.
ratings = soup.find('span', {'class': 'jsReviewsChart'})['aria-label']
print(ratings)
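The aria-label value is a text string. If you need the rating as a number, one option is to pull the first number out of that string. The exact wording of the label isn't shown in this tutorial, so the sketch below assumes only that the rating is the first number in the label; the function name parse_rating is hypothetical.

```python
import re


def parse_rating(label):
    """Return the first number in an aria-label string as a float, or None."""
    if not label:
        return None
    # Match an integer or a decimal such as "4" or "4.5".
    match = re.search(r"\d+(?:\.\d+)?", label)
    return float(match.group()) if match else None
```

You would call it as parse_rating(ratings). Check the result against the page, since a label that starts with some other number (for example, a review count) would need a different pattern.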
And that’s it! You’ve successfully extracted the business’s name, phone number, rating, and address. If you want to extract other elements, you can inspect them to find the appropriate CSS selectors and modify the source code accordingly.
In conclusion, you’ve learned how to extract information from the Yellow Pages with Python using Oxylabs’ Yellow Pages Scraper API. The API made the whole process smooth and hassle-free by handling anti-bot protection bypass, proxy management, and JavaScript rendering so that you can focus on the necessary business data instead of dealing with scraping hurdles. You can also use this API and the techniques to bypass other complex websites’ anti-bot protection and extract data.
Yes, it’s possible to scrape Yellow Pages and collect business information. The Yellow Pages website is a directory of business contact details, and this data is generally considered publicly available information. However, our legal team strongly recommends consulting a legal professional before engaging in any scraping activity.
To extract and export data from Yellow Pages to Excel, you can use Python. After scraping the desired data from Yellow Pages, you can save it in Excel format using the pandas library.
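As a rough illustration of the Excel export, the sketch below writes a list of scraped records to an .xlsx file with pandas. The column names, the helper names to_rows and save_to_excel, and the default file name are assumptions for this example. Writing .xlsx files with pandas also requires an Excel engine such as openpyxl to be installed.

```python
COLUMNS = ["name", "address", "phone", "ratings"]


def to_rows(businesses):
    """Give every record the same columns so the sheet lines up."""
    return [
        {column: business.get(column, "") for column in COLUMNS}
        for business in businesses
    ]


def save_to_excel(businesses, path="yellow_pages.xlsx"):
    """Write the records to an Excel file, one business per row."""
    # Imported here so to_rows() stays usable without pandas installed.
    import pandas as pd

    frame = pd.DataFrame(to_rows(businesses), columns=COLUMNS)
    frame.to_excel(path, index=False)
```

For the single listing scraped in this tutorial, the call could be save_to_excel([{"name": name, "address": address, "phone": phone, "ratings": ratings}]).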
You can write your own custom web scraping tool to scrape Yellow Pages. Python libraries such as Scrapy, Beautiful Soup, Requests, and Selenium will come in handy. However, you’ll also have to deal with anti-bot protection yourself, which includes maintaining a pool of proxies and rotating them periodically.
Alternatively, you can also leverage Oxylabs’ Yellow Pages Scraper API, which will make things simpler as the API will handle proxy pool management and anti-bot protection bypass, so you don’t have to worry about it anymore.
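For readers taking the do-it-yourself route, manual proxy rotation can be sketched as a generator that cycles through a pool and yields a proxies mapping in the format Requests accepts. The proxy addresses below are placeholders, and the name proxy_rotation is hypothetical; a real setup would also need to handle failed or banned proxies.

```python
from itertools import cycle

# Placeholder addresses; replace them with proxies you actually control.
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]


def proxy_rotation(pool):
    """Yield a requests-style proxies dict for each proxy, round-robin, forever."""
    for proxy in cycle(pool):
        yield {"http": proxy, "https": proxy}
```

Each request then takes the next entry, for example: rotation = proxy_rotation(PROXY_POOL), followed by requests.get(url, proxies=next(rotation), timeout=30).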
About the author
Maryia Stsiopkina
Senior Content Manager
Maryia Stsiopkina is a Senior Content Manager at Oxylabs. As her passion for writing was developing, she was writing either creepy detective stories or fairy tales at different points in time. Eventually, she found herself in the tech wonderland with numerous hidden corners to explore. At leisure, she does birdwatching with binoculars (some people mistake it for stalking), makes flower jewelry, and eats pickles.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.