News Scraping: Everything You Need to Know

Iveta Vistorskyte

2021-10-18, 7 min read

Public news data can be beneficial for various companies to stay ahead of their competition. However, for companies whose core business isn’t news aggregation or analysis, reading and analyzing articles from thousands of news outlets worldwide is bound to take a lot of unnecessary time, regardless of the articles’ importance. Fortunately, news scraping addresses this problem.

This article discusses everything you need to know about news scraping, including the benefits and use cases of news scraping as well as how you can use Python to create an article scraper.

What is news scraping?

News scraping is a subset of web scraping that mainly targets public online media websites. It refers to automatically extracting news updates and releases from news articles and websites. It also relates to extracting public news data from the news results tab on SERPs or dedicated news aggregator platforms.

Web scraping (or web data extraction), by contrast, is the broader practice of automatically retrieving data from any website using a tool such as a web scraper.

From a business point of view, news websites contain plenty of crucial public data, from reviews about newly released products to coverage of a company’s financial results and other vital announcements. These websites also cover several topics and industries, including technology, finance, fashion, science, health, politics, and more.

Benefits of news scraping

The benefits of news scraping include:

Risk identification and mitigation
Source of up-to-date, reliable, and verified information
Improves operations
Improves compliance

Risk identification and mitigation

A recent McKinsey article discussing risk and resilience proposed the use of digital technologies that integrate real-time data from several sources, including weather forecasts, to run scenarios to come up with the most effective solution to a problem. In doing so, the article indirectly recommended using news scraping as a source of real-time public data that can then be used to identify and mitigate risks.

Scraping public news websites increases a company’s ability to anticipate, predict, and observe threats more accurately and quickly.

Source of up-to-date, reliable, and verified information

News websites mainly strive to maintain credibility through their coverage of emerging news. They often have fact-checking departments and libraries against which to verify certain aspects of their updates. In this regard, public news scraping provides companies with access to up-to-date, accurate, and reliable information.

Improves operations

Companies don’t operate in a vacuum, meaning external factors can easily impact them. In this regard, scraping public news websites is a critical tool that ensures they constantly stay updated on emerging trends. It acts as a platform to make informed improvements to operations in a way that leverages favorable trends or counters unfavorable ones.

Improves compliance

News websites cover a wide range of topics, including regulations that have already been passed and those still awaiting enactment. In some cases, the author of a news article also discusses the implications of such laws for whole industries and interviews experts to give a fuller picture.

Thus, when companies scrape public news articles and gather news regarding proposed or newly enacted regulations, they can better prepare for their implications, thereby improving compliance.

Use cases of news scraping

News scraping provides access to real-time updates on several issues and topics, which can be used in the following ways:

Reputation monitoring
Obtain competitive intelligence
Discover industry trends
Unearth fresh ideas
Content strategy improvement

Reputation monitoring

According to a 2020 Weber Shandwick study, companies with strong reputations enjoy customer loyalty, competitive advantage, better relationships with partners and suppliers, the attraction of high-quality talent, high employee retention, new market opportunities, higher stock price, and more. More specifically, 76% of a company’s market value is attributed to company reputation.

Media coverage may be positive or negative. Although the saying goes that ‘any publicity is good publicity,’ bad publicity can easily damage people’s perception of a company, significantly affecting its reputation. It could tank the market value substantially. Further, with most companies (87%) holding that customers’ perceptions are the most important to their reputation, it’s important to arrest a problem before it develops even further. Online reputation management and review monitoring are considered crucial processes for every company.

News scraping allows companies to monitor every newly published public news article and, therefore, their reputation.

Obtain competitive intelligence

The business world is synonymous with competition. This makes avenues of collecting the much-needed competitive intelligence all the more important.

Multiple news articles cover topics such as product launches, rebranding initiatives, mergers and acquisitions, financial results, and more. Thus, scraping news websites that cover such business-oriented topics offers insights about competitors. It’s a convenient way of obtaining competitive intelligence.

Discover industry trends

Many factors and events can affect a company’s operations. As such, businesses must develop a mechanism that enables them to monitor trends and emerging issues.

Public news articles are a perfect place to start. They contain information that highlights where a particular industry is headed. For instance, articles summarizing market research reports offer insights into the current status of the industry and factors that are likely to promote growth throughout the forecast period. By web scraping all the public news articles containing such information, companies can discover new industry trends that, in turn, enhance competitiveness.

Additionally, by web scraping articles containing news data about their competitors, businesses can easily establish operational similarities, which automatically point to the industry trends.

Unearth fresh ideas

News websites publish insightful articles that contain input from industry experts or that are authored by acclaimed figures in their respective fields. For companies, such posts can be a source of ideas regarding new opportunities. They can also contain pointers on how to leverage such opportunities. Such articles can help businesses augment their ideation process.

Scraping public news websites provides a reliable way to automatically access these vital resources and, therefore, unearth fresh ideas.

Content strategy improvement

News websites aren’t limited only to conventional media outlets but also include newswire sites and public relations (PR) websites that distribute press releases and provide regular article-based coverage of client companies.

In this regard, companies can use news scraping to gain insights into how to improve their communication and content strategy. Simply put, this process highlights industry best practices and what can make a company’s PR stand out.

How to scrape news data?

When it comes to public news scraping, Python is one of the easiest ways to get started, thanks to its readable syntax and extensive libraries. Scraping public news data involves two basic steps: downloading the web page and parsing the HTML.

One of the most popular libraries for downloading web pages is Requests. This library can be installed using the pip command on Windows. On Mac and Linux, we suggest using the pip3 command to ensure that you’re using Python 3. Open the terminal and run the following command:

pip3 install requests

Create a new Python file and enter the following code:

import requests
response = requests.get('https://quotes.toscrape.com')
print(response.status_code)

If you run this code, it will print the HTTP status code. If the web page is successfully downloaded, the status code will be 200. To get the HTML of the web page, use the text attribute of the response object.

print(response.text) # Prints the entire HTML of the webpage.
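
In practice, a download can fail or return an error status, so it can help to wrap the request in a small helper. The following is a minimal sketch rather than part of the original tutorial code: the function name fetch_html, its session parameter, and the 10-second timeout are illustrative assumptions.

```python
import requests


def fetch_html(url, session=None, timeout=10.0):
    """Return the page HTML as a string, or None if the download fails.

    `session` is an optional requests.Session (assumed here so the helper
    can reuse connections across calls); `timeout` is in seconds.
    """
    session = session or requests.Session()
    try:
        response = session.get(url, timeout=timeout)
        # Raises requests.HTTPError for 4xx and 5xx status codes.
        response.raise_for_status()
    except requests.RequestException as error:
        # Covers connection errors, timeouts, and HTTP error statuses.
        print(f"Could not download {url}: {error}")
        return None
    return response.text
```

Calling fetch_html('https://quotes.toscrape.com') should return the same string as response.text above when the download succeeds, and None otherwise.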

The HTML returned by response.text is a string. It needs to be parsed into a Python object that can be queried for specific data. Several parsing libraries are available for Python. This example uses lxml together with the Beautiful Soup library. Beautiful Soup works as a wrapper over the parser, which makes extracting data from HTML more convenient.

To install these libraries, use the pip command. You should open the terminal and enter the following:

pip3 install lxml beautifulsoup4

In the code file, import Beautiful Soup and create an object as follows:

import requests
from bs4 import BeautifulSoup

# Download the page and parse it with the lxml parser.
response = requests.get('https://quotes.toscrape.com')
soup = BeautifulSoup(response.text, 'lxml')

In this example, we’re working with a website with quotes. If you’re working with any other site, this method will still work. The only thing that will change is how you locate the element. To locate an HTML element, use the find() method. This method takes the tag name and returns the first match.

title = soup.find('title')

The text inside this tag can be extracted using the get_text() method.

print(title.get_text()) # Prints page title.
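
One detail worth keeping in mind is that find() returns None when nothing matches, so calling get_text() on the result can raise an AttributeError on pages that lack the element. Below is a small sketch of a guard for that case. The helper name first_text and the inline SAMPLE_HTML string are assumptions made for illustration, and the snippet uses Python's built-in html.parser so that it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page.
SAMPLE_HTML = "<html><head><title>Quotes to Scrape</title></head><body></body></html>"


def first_text(soup, tag_name, default=""):
    """Return the stripped text of the first matching tag, or a default."""
    tag = soup.find(tag_name)
    if tag is None:  # find() returns None when nothing matches
        return default
    return tag.get_text(strip=True)


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(first_text(soup, "title"))                        # Quotes to Scrape
print(first_text(soup, "h1", default="(no h1 found)"))  # (no h1 found)
```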

To narrow the search further, you can also filter by attributes such as class, id, or itemprop.

soup.find('small',itemprop="author")

Note that to filter by the class attribute, you should use class_ instead, because class is a reserved keyword in Python.

soup.find('small',class_="author")
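
Beautiful Soup offers a few equivalent ways to express the same class filter. The sketch below compares three of them on a small inline HTML fragment. The fragment is an assumption modeled on the markup of the quotes site, and html.parser is used here only so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on a single quote block.
SAMPLE_HTML = """
<div class="quote">
  <span class="text" itemprop="text">An example quote.</span>
  <small class="author" itemprop="author">Jane Doe</small>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keyword argument with the trailing underscore.
by_keyword = soup.find("small", class_="author")
# Attribute dictionary, which avoids the reserved keyword altogether.
by_attrs = soup.find("small", attrs={"class": "author"})
# CSS selector syntax.
by_selector = soup.select_one("small.author")

for tag in (by_keyword, by_attrs, by_selector):
    print(tag.get_text())  # Jane Doe
```

All three calls locate the same element, so the choice is mostly a matter of readability.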

Similarly, to get more than one element, use the find_all() method. If you treat these quotes as news headlines, you can collect all of them with the following statement:

headlines = soup.find_all(itemprop="text")

You should note that the object headlines is a list of tags. To extract the text from these tags, a for loop can help you:

for headline in headlines:
    print(headline.get_text())
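
For a real news scraper, you will usually want each headline stored together with related fields rather than printed on its own. The following sketch shows one possible way to do that by iterating over each container element and collecting a dictionary per item. The function name extract_items, the dictionary keys, and the inline SAMPLE_HTML are illustrative assumptions, and html.parser is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modeled on the structure of the quotes page.
SAMPLE_HTML = """
<div class="quote">
  <span class="text" itemprop="text">First headline</span>
  <small class="author" itemprop="author">Author One</small>
</div>
<div class="quote">
  <span class="text" itemprop="text">Second headline</span>
  <small class="author" itemprop="author">Author Two</small>
</div>
"""


def extract_items(html):
    """Return a list of {'headline', 'author'} dictionaries from the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for block in soup.find_all("div", class_="quote"):
        text_tag = block.find(itemprop="text")
        author_tag = block.find(itemprop="author")
        if text_tag is None:
            continue  # skip blocks without a headline
        items.append({
            "headline": text_tag.get_text(strip=True),
            "author": author_tag.get_text(strip=True) if author_tag else None,
        })
    return items


for item in extract_items(SAMPLE_HTML):
    print(f"{item['headline']} ({item['author']})")
```

Searching inside each container keeps a headline paired with its own author, which two separate find_all() calls over the whole page would not guarantee when a field is missing.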

Scraping public news data isn’t very difficult. However, when collecting large amounts of public data, you can face issues such as IP blocks or CAPTCHAs. International news websites also serve different content depending on the visitor’s country. In these cases, you should consider using Residential or Datacenter proxies.
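
With Requests, a proxy is passed through the proxies argument as a mapping from URL scheme to proxy URL. The sketch below shows the general shape only. The host proxy.example.com, the port, and the USERNAME and PASSWORD placeholders are hypothetical values, not real endpoints, and the helper names are assumptions for illustration.

```python
import requests


def build_proxies(host, port, username=None, password=None):
    """Build the scheme-to-URL mapping that requests expects for proxies."""
    credentials = f"{username}:{password}@" if username and password else ""
    proxy_url = f"http://{credentials}{host}:{port}"
    # Assumes the same proxy endpoint handles both HTTP and HTTPS traffic.
    return {"http": proxy_url, "https": proxy_url}


def fetch_via_proxy(url, proxies, timeout=10.0):
    """Download a page through the given proxies and return its HTML."""
    response = requests.get(url, proxies=proxies, timeout=timeout)
    response.raise_for_status()
    return response.text


# Placeholder values; replace them with the details from your proxy provider.
proxies = build_proxies("proxy.example.com", 8080, "USERNAME", "PASSWORD")
print(proxies["https"])  # http://USERNAME:PASSWORD@proxy.example.com:8080
```

If a username or password contains special characters, it would need to be URL-encoded before being placed in the proxy URL.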

Is it legal to scrape news websites?

Web scraping is one of the least time-consuming methods to access large amounts of the latest public news articles and monitor multiple news websites. In fact, with the increased sophistication of article scrapers, it has become increasingly possible to bypass anti-scraping measures that websites put in place to stop web scraping APIs.

The unmatched convenience of news scraping, or web scraping in general, however, doesn’t negate the existence of a few legal questions regarding the practice. So, is it legal to scrape news websites or is web scraping legal?

Well, as our legal team would say, it depends. Web scraping isn’t illegal as such, but it totally depends on the intention behind the practice. As long as web scraping news websites doesn’t violate any laws or infringe any intellectual property rights, regarding the data you intend to scrape or the source target, it should be considered as a legal activity. Accordingly, before engaging in any scraping activities, you should get appropriate professional legal advice regarding your specific situation.

Conclusion

Web scraping news websites provides a convenient and fast route of extracting real-time, reliable, and accurate data about competitors, the weather, economic environment, and more. To create tools that scrape news articles, Python is an ideal programming language that provides this capability, on top of multiple other benefits such as its extensive libraries and more. And with news scraping being legal and ethical when used appropriately and for the right purpose, companies can enjoy the benefits of this noble practice, all the while using it to monitor their reputation, gather competitive intelligence, unearth fresh ideas, and more.

The complete code used in this article is available in a repository on GitHub.

About the author

Iveta Vistorskyte

Lead Content Manager

Iveta Vistorskyte is a Lead Content Manager at Oxylabs. Growing up as a writer and a challenge seeker, she decided to welcome herself to the tech-side, and instantly became interested in this field. When she is not at work, you'll probably find her just chillin' while listening to her favorite music or playing board games with friends.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
