Yelp is not only a place to find your next dinner spot. With years of compiled crowd-sourced content, Yelp is the perfect place to look for business data that can provide insights into local economic trends.
In this article, we'll look at why businesses scrape Yelp data and what the benefits are. We'll also show you a solution that can efficiently extract Yelp data at any scale.
As mentioned above, throughout years of operating, Yelp has created a unique dataset filled with business details that can be scraped and used for various purposes – from journalism and academic research to business operations. On the business side, you might want to consider Yelp data for such things as:
Customer sentiment analysis
Market research
Competitor analysis
Location planning
Reputation management
There is a variety of business data that can be gathered from Yelp, but you first need to be familiar with the types of pages that you will be scraping – the search page and the business page.
The search page will provide you with an overview of all the local businesses that fit your search criteria. This means that you will be able to collect such data as:
Business name
URL
Reviews count
Rating
Tags
The business page, on the other hand, offers more detailed information about one specific business. This means that you get the information from the search page plus a more in-depth look at such things as reviews. In a nutshell, you can scrape:
Business name
URL
Extended review information
Contact information
Opening times
Amenities information
So, Yelp offers a variety of data that can be used to uncover excellent business opportunities.
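To make this concrete, a single scraped business record could end up looking something like the Python dictionary below (an illustrative sketch with invented values):

# An illustrative business record; the values are invented
business = {
    "name": "Memento SF",
    "url": "https://www.yelp.com/biz/memento-sf-san-francisco-3",
    "rating": 4.5,
    "reviews_count": 120,
    "contact": "(415) 555-0100",
    "opening_times": "Mon - Sun: 11:00 AM - 10:00 PM",
}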
For this tutorial, you'll be using Python, so download the latest version from the official Python website if you don't have it already.
Next, you'll have to install a few libraries. All of them are available in the Python Package Index, so you can install them with the single command below:
pip install requests beautifulsoup4 pandas
Now, import all the freshly installed libraries:
from bs4 import BeautifulSoup
import pandas as pd
import requests
All three libraries are now imported and ready to use. The requests module will send the network requests. Once the server responds, the BeautifulSoup module will parse the HTML content from the response object. Finally, the pandas library will export the parsed data to a CSV file.
To make things easier, we will use Oxylabs' Web Scraper API, which allows users to extract data from almost any website. Its main advantages are a built-in proxy rotator, custom device types, and JavaScript rendering, which means there is significantly less chance that your scraping operations will encounter IP blocks or CAPTCHAs.
Let’s quickly look at the various parameters available to you.
| Parameter | Description |
|---|---|
| `source` | Data source. For Yelp, it should be set to `universal`. This parameter is required. |
| `url` | Yelp or any other website URL. This parameter is also required. |
| `user_agent_type` | Configures the device type and browser. |
| `geo_location` | Routes the request through a proxy in the specified geolocation. |
| `locale` | Configures the `Accept-Language` header. |
| `render` | Enables JavaScript-based rendering. |
| `callback_url` | URL of your callback endpoint (if any). |
| `parse` | If set to `true`, returns structured data according to the given `parsing_instructions`. |
| `parsing_instructions` | Defines custom parsing and data transformation logic to be executed on the HTML. |
| `context:headers` | Sets custom headers. |
| `context:http_method` | Sets a custom HTTP method, e.g., `POST`. |
| `context:session_id` | Reuses the same proxy for multiple requests for up to 10 minutes. |
| `context:cookies` | Sets custom cookies. |
Also, check out the complete list of parameters here.
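To see how these parameters fit together, here's an example payload that combines several of them. The geo_location and locale values below are illustrative, so check the documentation for the accepted formats:

# An example payload combining several optional parameters;
# the geo_location and locale values are illustrative
payload = {
    "source": "universal",
    "url": "https://www.yelp.com/search?find_desc=Restaurants&find_loc=San%20Francisco%2C%20CA",
    "render": "html",
    "user_agent_type": "desktop",
    "geo_location": "United States",
    "locale": "en-us",
}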
To use the Oxylabs Web Scraper API, you’ll need an Oxylabs account. Use your API user credentials and prepare a payload. The code will be similar to the one below:
page = "https://www.yelp.com/biz/memento-sf-san-francisco-3"
payload = {
"source": "universal",
"render": "html",
"user_agent_type": "desktop",
"url": page,
}
credentials = ("USERNAME", "PASSWORD")
Once the payload is ready with the page URL, you must send a POST request to the Web Scraper API. Don't forget to pass the authentication credentials:
response = requests.post(
"https://realtime.oxylabs.io/v1/queries",
auth=credentials,
json=payload,
)
print(response.status_code)
If everything works as expected, you should see the response code 200. If you get any other response code, please refer to the documentation.
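Since everything that follows assumes a successful response, it's worth failing fast on anything else. A minimal sketch:

# Stop early if the API didn't return a successful response; the response
# body usually contains an error message explaining what went wrong
if response.status_code != 200:
    raise RuntimeError(f"Request failed with {response.status_code}: {response.text}")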
Before you start parsing the business page, let's open it in a web browser and use the developer tools to identify the necessary CSS selectors. You can right-click and select Inspect, or press CTRL + SHIFT + I (Windows) or ⌥ + ⌘ + I (macOS) to open developer tools. It should look like this:
Next, let’s find the CSS selectors for the name, reviews, rating, working hours and location elements.
As you can see, the name is available in an <h1> tag. Similarly, the rating is available in a span element with class css-1fdy0l5:
The review is available in an <a> tag with class css-19v1rkv:
And the address element has an <address> tag:
Last but not least, the working hours are wrapped in a <table> element:
Now, using all this information, you can start writing the parser with Beautiful Soup. The Web Scraper API returns a JSON response in which the rendered HTML is available in the content property.
For both location and working_hours, you’ll have to extract the text of all the child elements of the parent element. Fortunately, Beautiful Soup has a get_text() method which you can use for such cases.
data = []
# The rendered HTML is in the "content" property of the first result
soup = BeautifulSoup(response.json()["results"][0]["content"], "html.parser")
# Extract the elements using the CSS selectors identified above
name = soup.find("h1").text
rating = soup.find("span", class_="css-1fdy0l5").text
review = soup.find("a", class_="css-19v1rkv").text
# get_text() gathers the text of all child elements
location = soup.find("address").get_text(strip=True)
working_hours = soup.find("table").get_text(strip=True)
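Note that Yelp's auto-generated class names (such as css-1fdy0l5) change over time, so these selectors may need updating. If you'd rather have the script skip a missing element than crash, you can wrap the lookups in a small helper. A sketch, where safe_text is a name of our own:

def safe_text(element):
    # Return stripped text when the element exists, or an empty string,
    # so one missing element doesn't crash the whole scrape
    return element.get_text(strip=True) if element else ""

name = safe_text(soup.find("h1"))
rating = safe_text(soup.find("span", class_="css-1fdy0l5"))
# ...and similarly for the review, location, and working hours fields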
After parsing the data, you can save everything in a structured CSV file by appending the parsed data to a data list and then using pandas to export the list into CSV:
data.append({
"name": name,
"rating": rating,
"review": review,
"location": location,
"working hours": working_hours,
})
df = pd.DataFrame(data)
df.to_csv("yelp_business_data.csv", index=False)
Here's the complete code for scraping a Yelp business page:

from bs4 import BeautifulSoup
import pandas as pd
import requests
page = "https://www.yelp.com/biz/memento-sf-san-francisco-3"
payload = {
"source": "universal",
"render": "html",
"user_agent_type": "desktop",
"url": page,
}
credentials = ("USERNAME", "PASSWORD")
response = requests.post(
"https://realtime.oxylabs.io/v1/queries",
auth=credentials,
json=payload,
)
print(response.status_code)
data = []
soup = BeautifulSoup(response.json()["results"][0]["content"], "html.parser")
name = soup.find("h1").text
rating = soup.find("span", class_="css-1fdy0l5").text
review = soup.find("a", class_="css-19v1rkv").text
location = soup.find("address").get_text(strip=True)
working_hours = soup.find("table").get_text(strip=True)
data.append({
"name": name,
"rating": rating,
"review": review,
"location": location,
"working hours": working_hours,
})
df = pd.DataFrame(data)
df.to_csv("yelp_business_data.csv", index=False)
And that’s it! You’ve successfully extracted the content of a Yelp Business page.
You can also extract data from the Yelp search results page using the Web Scraper API. As before, all you need to do is inspect the elements using the developer tools and gather the appropriate CSS selectors from the Yelp search page.
Open the Yelp search results page in a web browser and use the developer tools to inspect the elements. Notice that each search result is wrapped inside a div with a unique attribute, data-testid="serp-ia-card":
Now, you can inspect each of the elements and find the CSS selectors for the name, review count, rating, neighborhood, and URL. For example, the name is wrapped in an <a> tag enclosed within an <h3> tag:
Similarly, you can find the rest of the CSS selectors using the developer tools. For your convenience, all of them are listed below:
| Data | CSS selector |
|---|---|
| name | `h3 a` |
| rating | `span.css-gutk1c` |
| review count | `span.css-chan6m` |
| neighborhood | `div.css-1kiyre6 span.css-chan6m` |
| url | `h3 a` |
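To verify a selector before running the full loop, you can pass it directly to Beautiful Soup's select_one() method, which accepts CSS selector strings. A quick sketch, assuming soup already holds a parsed search page as in the full script below:

# Print the name from the first search result card
# (assumes `soup` holds a parsed Yelp search page)
card = soup.select_one("div[data-testid='serp-ia-card']")
if card:
    print(card.select_one("h3 a").get_text(strip=True))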
Now that you have all the necessary CSS selectors, you can start parsing the data from the HTML content. You can use the Web Scraper API the same way you did for the Yelp business page.
Once the HTML content is retrieved, use Beautiful Soup to parse it further and extract the div elements. Then, use a for loop to extract the content from each div. The full source code is given below:
from bs4 import BeautifulSoup
import pandas as pd
import requests
page = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=San%20Francisco%2C%20CA"
payload = {
"source": "universal",
"render": "html",
"url": page,
}
credentials = ("USERNAME", "PASSWORD")
response = requests.post(
"https://realtime.oxylabs.io/v1/queries",
auth=credentials,
json=payload,
)
print(response.status_code)
data = []
for result in response.json()["results"]:
    soup = BeautifulSoup(result["content"], "html.parser")
    # Each search result is wrapped in a div with data-testid="serp-ia-card"
    for div in soup.find_all("div", {"data-testid": "serp-ia-card"}):
        name = div.find("h3").find("a").get_text(strip=True)
        rating = div.find("span", class_="css-gutk1c").get_text(strip=True)
        # Keep only the number from the "(... reviews)" text
        review_count = div.find("span", class_="css-chan6m").get_text(strip=True).replace("(", "").replace(" reviews)", "")
        neighborhood = div.find("div", class_="css-1kiyre6").find("span", class_="css-chan6m").get_text(strip=True)
        url = div.find("h3").find("a")["href"]
        data.append({
            "name": name,
            "rating": rating,
            "review": review_count,
            "neighborhood": neighborhood,
            "url": url,
        })
Using the pandas library, you can easily export the extracted data to a CSV file. First, convert the data list into a DataFrame, then use the to_csv() method as shown below:
df = pd.DataFrame(data)
df.to_csv("yelp_data.csv", index=False)
In this tutorial, you've learned how to use the Web Scraper API to bypass anti-bot protection and extract Yelp data effortlessly. You also learned how to export the data and store it in a CSV file. Using the Web Scraper API and the techniques described in this article, you can also scrape similarly complex websites without getting blocked.
Downloading Yelp reviews is possible. To do it quickly and efficiently, you can use our Yelp Scraper API, which can deliver localized Yelp data in a matter of seconds.