Baidu is a leading search engine in China. Its results are displayed much like those of other search engines: a list of websites and web pages matching the user's search query.
This blog post covers the process of scraping publicly available Baidu organic search results using Python and Oxylabs' Baidu Scraper API.
The Baidu Search Engine Results Page (SERP) consists of various elements that help users find the required information quickly. Paid search results, organic search results, and related searches might appear when entering a search query in Baidu.
Similarly to other search engines, Baidu's organic search results are listed to provide users with the most relevant and helpful information related to their search query.
When you enter a search query on Baidu, you'll see some results marked as "广告" (advertisement). Companies pay for these results to appear at the top of the search results page.
Baidu's paid results example
Baidu's related search feature helps users find additional information related to their search queries. Usually, this feature can be found at the end of the search results page.
If you've ever tried gathering public information from Baidu, you know it's not an easy task. Baidu uses various anti-scraping techniques, such as CAPTCHAs, blocking suspicious user agents and IP addresses, and dynamic page elements that make it difficult for automated bots to access content.
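For illustration, here's a minimal sketch of what a direct request to Baidu's search endpoint looks like (the wd parameter carries the query). Depending on Baidu's defenses at the time, such a request may receive a CAPTCHA or verification page instead of real results:
import requests

# A bare-bones request to Baidu's search endpoint; 'wd' carries the query.
# Without realistic browser headers and a trusted IP, Baidu may respond
# with a CAPTCHA or verification page instead of actual search results.
response = requests.get('https://www.baidu.com/s', params={'wd': 'nike'})
print(response.status_code)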
Baidu's search results page is also dynamic, meaning the HTML code can change often. This makes it hard for web scraping tools to locate and gather specific Baidu search results, so you need to constantly maintain and update your web scraper to keep gathering public information hassle-free. This is where a ready-made web intelligence tool, such as our own SERP Scraper API, comes in to save time, effort, and resources.
While the legality of web scraping is a widely discussed topic, gathering publicly available data from the web, including Baidu search results, may be considered legal. Of course, there are a few rules you must follow when web scraping, such as:
A web scraping tool shouldn't log in to websites and then download data.
Even though there may be fewer restrictions on collecting public data than on private information, you must still ensure you're not breaching any laws that may apply to such data, e.g., laws protecting copyrighted content.
If you're considering web scraping, especially for the first time, it's best to get professional legal advice to make sure your public data gathering activities don't breach any laws or regulations. For additional information, you can also check our extensive article about the legality of web scraping.
Let's start from the basics. First, you need to create a project environment. For this, you need to install Python on your computer.
Open your terminal or command prompt, and follow the instructions:
1. Navigate to the directory where you want to create your virtual environment.
2. Run the following command to create a new virtual environment:
python -m venv env
This will create a new directory called env that contains the virtual environment. You can replace env with any name you like.
3. Activate the virtual environment by running the appropriate command for your operating system:
On macOS and Linux:
source env/bin/activate
On Windows:
env\Scripts\activate
4. You'll now see (env) at the beginning of your command prompt, indicating that you're working in the virtual environment.
5. To install packages inside the virtual environment, use the pip command as you normally would.
pip install requests
This will install the requests package inside the virtual environment without affecting your global Python installation.
6. If you need to exit the virtual environment, run the command:
deactivate
That's it! You've now created and activated a virtual environment in Python using the venv module. You can use this environment to work on your Python projects without interfering with your global Python installation.
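As a quick optional check, you can confirm which interpreter your shell will now run; the printed path should point inside the env directory (on Windows, use where python instead):
which python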
One more thing you should understand before diving into the step-by-step process of scraping Baidu search results using Baidu Scraper API – custom query parameters. Let’s discuss what query parameters are and why they're used.
Query parameters are key-value pairs added to a URL's end to modify a request or retrieve specific information. They’re used to customize web requests and retrieve certain Baidu search results. You can use them to set limits and offsets, specify search queries, and more.
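For example, in the Baidu URL below, everything after the question mark is a query parameter: wd carries the search term, and rn sets the number of results per page:
https://www.baidu.com/s?wd=nike&rn=10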
The most common parameters you can use when scraping Baidu search results are listed below:
'source' defines the data source you want to gather results from.
'source': 'baidu_search',
'domain' defines the domain localization for the Baidu search engine. If you don't specify this parameter, it defaults to 'com'.
'domain': 'com',
'query' defines the search query of a search results page you want to scrape public data from. It's supposed to be a specific keyword, for example:
'query': 'cat food',
'user_agent_type' specifies what kind of User-Agent header value should be used to fulfill your request. If you don't specify this parameter, it defaults to 'desktop'.
'user_agent_type': 'desktop',
'start_page' defines the search results page to start scraping publicly available information from. If you don't specify this parameter, it defaults to 1.
'start_page': 5,
'pages' defines the total number of pages you want to extract public information from. If you don't specify this parameter, it defaults to 1.
'pages': 4,
'limit' defines the number of results you want to scrape from each page. If you don't specify this parameter, it defaults to 10.
'limit': 8,
Note that not all of these parameters are supported by every search engine. There may be additional parameters that can be used to customize search results.
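To illustrate how the pagination parameters combine, the sketch below (reusing the example values above) would request pages 5 through 8 of the "cat food" results, with eight results per page:
params = {
    'source': 'baidu_search',
    'query': 'cat food',
    'start_page': 5,  # begin at page 5 of the search results
    'pages': 4,       # fetch 4 consecutive pages: 5, 6, 7, and 8
    'limit': 8,       # collect 8 results from each page
}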
Now that you have set up your environment and have a basic idea of query parameters, it's time to overview a step-by-step process of scraping Baidu search results using Oxylabs’ Baidu Scraper API.
When you purchase our Baidu Scraper API or start a free trial, you get the unique credentials needed to gather public data from Baidu. When you have all the information, you can start the web scraping process with Python.
Install the requests library in your Python environment using the pip command, then import it in your Python file. You'll also import the pprint and json modules, both of which ship with Python's standard library.
import requests
from pprint import pprint
import json
import requests imports the Python requests library, which allows you to send HTTP requests and receive responses.
from pprint import pprint imports the pprint function from the Python pprint module. This function is used to pretty-print Python data structures such as dictionaries and lists.
import json imports the json library, which provides methods to encode and decode JSON data.
The API endpoint URL is the target you'll send your requests to. Define it as follows:
url = 'https://realtime.oxylabs.io/v1/queries'
You also need to obtain authorization credentials from us. Once you've received them, you can use them to make API requests. Define your authentication as follows:
auth = ('your_api_username', 'your_api_password')
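A quick tip: rather than hardcoding credentials in your script, you can read them from environment variables. A minimal sketch, assuming you've set hypothetical OXYLABS_USERNAME and OXYLABS_PASSWORD variables in your shell:
import os

# Hypothetical environment variable names; export them in your shell first.
auth = (os.environ['OXYLABS_USERNAME'], os.environ['OXYLABS_PASSWORD'])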
Create a dictionary containing the query parameters you want to use to customize your search. These can include parameters such as the source, domain, query, etc.
Here's how you can create a Python dictionary called params, containing all the parameters you want to pass to the API.
params = {
    'source': 'baidu_search',
    'domain': 'com',
    'query': 'nike',
    'start_page': 1,
    'pages': 1,
    'limit': 10,
    'user_agent_type': 'desktop',
}
Check our documentation for a full list of available parameters.
Once you've declared everything, you can pass it as a JSON object in your request body.
response = requests.post(url, json=params, auth=auth)
The requests.post() method sends a POST request with the search parameters and authentication credentials to our SERP Scraper API.
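Optionally, you can also pass a timeout so a slow or stalled request doesn't hang indefinitely; the 180 seconds below is just an example value:
# Optional: give up if the API doesn't respond within 180 seconds.
response = requests.post(url, json=params, auth=auth, timeout=180)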
The json_data variable holds the API's response, parsed from the JSON text using the json.loads() method. Finally, it's printed to the console using the print() function.
json_data = json.loads(response.text)
print(json_data)
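As a side note, the requests library can decode JSON directly, so this one-liner is equivalent:
json_data = response.json()  # same result as json.loads(response.text)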
Here's the full code example of how to scrape Baidu search results with Python and our Baidu Scraper API:
import requests
from pprint import pprint
import json

# Query parameters for the Baidu Scraper API.
params = {
    'source': 'baidu_search',
    'domain': 'com',
    'query': 'nike',
    'start_page': 1,
    'pages': 1,
    'limit': 10,
    'user_agent_type': 'desktop',
}

# API endpoint and your Oxylabs credentials.
url = 'https://realtime.oxylabs.io/v1/queries'
auth = ('your_api_username', 'your_api_password')

# Send the POST request and print the parsed JSON response
# (use pprint(json_data) instead for a more readable layout).
response = requests.post(url, json=params, auth=auth)
json_data = json.loads(response.text)
print(json_data)
The code above imports the necessary libraries, builds the search parameters around the keyword "nike," sends them together with your credentials as a JSON payload to the API endpoint, and waits for the response. Once the response arrives, the code parses and prints the data in JSON format.
In the output of the above code, a 'status_code': 200 indicates that the query was executed successfully. The response also includes the Baidu URL that was scraped: https://www.baidu.com/s?ie=utf-8&wd=nike&rn=1
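The API typically nests each job's output under a top-level 'results' list, so you can loop over it to reach these fields. Treat the field names below as a sketch based on the output described above:
# A sketch based on the fields described above; adjust if your response differs.
for result in json_data.get('results', []):
    print(result.get('status_code'))  # e.g. 200 on success
    print(result.get('url'))          # the Baidu URL that was scraped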
Gathering search results from Baidu can be challenging, but we hope this step-by-step guide helps you scrape public data from Baidu more easily. With the assistance of Baidu Scraper API, you can bypass various anti-bot measures and extract Baidu organic search results at scale.
If you have any questions or want to know more about gathering public data from Baidu, contact us via email or live chat. We also offer a free trial for our SERP Scraper API, so feel free to test whether this advanced web scraping solution works for you.
About the author
Iveta Vistorskyte
Lead Content Manager
Iveta Vistorskyte is a Lead Content Manager at Oxylabs. Growing up as a writer and a challenge seeker, she decided to welcome herself to the tech-side, and instantly became interested in this field. When she is not at work, you'll probably find her just chillin' while listening to her favorite music or playing board games with friends.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.