Danielius Radavicius
Google News is a personalized news aggregator that curates and highlights relevant stories from around the world based on user interests. It compiles news articles and headlines from various sources, ensuring easy access from any device. An essential feature is Full Coverage, which delves deeper into a story by presenting diverse perspectives from different outlets and mediums. In this tutorial, you'll create a Google News scraper from scratch using Python. By following the steps outlined below, you'll also learn how to mitigate the anti-bot scraping challenges of Google News. Before continuing, check out this article to learn more about news scraping.
Our scraper aims to ensure that your current and future scraping projects are significantly streamlined while all the possible hassles are dealt with efficiently. The Oxylabs SERP API will manage everything from gathering real-time data to accessing search results from almost any location, so you don't have to worry about anti-bot measures. Last but not least, Oxylabs also provides a 1-week free trial so you can thoroughly test and develop your scraper and explore all the functionalities.
Sign up and log in to the dashboard. From there, you can create and grab your user credentials for the SERP API. They will be needed in later steps.
Install the requests, bs4, and pandas modules. Using pandas, you'll create a CSV file to store the headlines of Google News results.
pip install requests bs4 pandas
Now, let's prepare the payload and credentials for sending the API requests. Since you need to render JavaScript, you'll have to set render to html; this tells the SERP API to render JavaScript. Apart from that, you'll also have to set source to google and pass the target URL as url. Also, don't forget to replace the USERNAME and PASSWORD with your sub-account credentials.
payload = {
    'source': 'google',
    'render': 'html',
    'url': 'https://news.google.com/home',
}
credential = ('USERNAME', 'PASSWORD')
Next, using the post() method of the requests module, you’ll POST the payload and credential to the API.
response = requests.post(
    'https://realtime.oxylabs.io/v1/queries',
    auth=credential,
    json=payload,
)
print(response.status_code)
If everything works, you should see the status code 200. If you get any other response code, please refer to the documentation.
Before you begin parsing the news headlines, you'll have to locate the target HTML elements using a web browser. Open the Google News homepage in a web browser and right-click anywhere on the page. Now, select Inspect. Alternatively, you can press CTRL + SHIFT + I to open the developer tools. It'll look similar to what is shown below:
Thoroughly check the source HTML. You should be able to see the tags and properties of the elements in the Elements tab. In the above screenshot, you can see that the Top Stories headlines are wrapped in <h4> tags. You can use the browser's developer tools in this way to inspect the source HTML and plan the parser accordingly.
To parse these headlines, you can use the Beautiful Soup module installed in the previous steps. Let's create a list, data, to store all the headlines.
from bs4 import BeautifulSoup

data = []
soup = BeautifulSoup(response.json()["results"][0]["content"], "html.parser")
for headline in soup.find_all("h4"):
    data.append(headline.text)
By using the find_all() method, you can grab all the headlines in one go. You can then add them to the data list for exporting to CSV.
Now, let’s store the data into a data frame object first. Then, you can export it to a CSV file using the to_csv() method. You can also set the index to False so that the CSV file won’t include an extra index column.
import pandas as pd

df = pd.DataFrame(data)
df.to_csv("google_news_data.csv", index=False)
Using Oxylabs web scraping solutions, you can keep up to date with the latest news from Google News. Take advantage of Oxylabs' powerful Scraper API to enhance your overall scraping experience. By using the techniques described in this article, you can harness the power of Google News data without worrying about proxy rotation or anti-bot challenges.
About the author
Danielius Radavicius
Copywriter
Danielius Radavičius is a Copywriter at Oxylabs. Having grown up around films, music, and books, and with a keen interest in the defense industry, he decided to move his career toward tech-related subjects and quickly became interested in all things technology. In his free time, you'll probably find Danielius watching films, listening to music, and planning world domination.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.