Perhaps your company has decided to gather data to fulfil its business needs more efficiently. Or maybe you are midway through this process and not sure what steps to take next. Whatever stage you are at, it is always nice to have a checklist to help your project run as smoothly as possible.
This may be an obvious one, but building and maintaining a working infrastructure is difficult for an army of one. Naturally, it is hard to estimate how many people you will need to build and maintain the whole project. This checklist aims to help you figure out the resources you will need and what the overall web scraping flow looks like.
The first steps will determine the rest of your scraping project. Choosing the right language to build your scraping infrastructure will determine what kind of development team you need to hire.
The most popular languages for web scraping are Python and NodeJS. You can also build scrapers with PHP, C++, or Ruby, but these options come with some drawbacks. You can read why we think Python is the best choice for this use case in our blog post on what Python is used for, but for a general summary, check the table below comparing Python to other languages.
Hiring your team members will depend on their language proficiency and, of course, their skills. Our recommendation would be to choose Pythonists.
Your team will be working with various libraries, integration tools, etc. We have already written several tutorials on the most popular libraries and tools you might need when building your infrastructure. So here is a list of libraries you will most likely need:
Puppeteer tutorial for JavaScript-heavy websites. If you are scraping hotel listings, e-commerce product pages, or similar – this will become your main headache. Many modern sites use JavaScript to load content asynchronously (i.e., part of the content is not visible during the initial page load). The easiest way to manage JavaScript-heavy sites is to use a headless browser – a browser without a graphical user interface. This is where Puppeteer comes into the picture.
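Puppeteer itself is a NodeJS library, so as a minimal Python-flavored sketch of the same idea, here is an example using pyppeteer, a community Python port with a near-identical API (the target URL is a placeholder):

```python
import asyncio

from pyppeteer import launch  # pip install pyppeteer


async def fetch_rendered_html(url):
    # Launch headless Chromium, let the page execute its JavaScript,
    # then grab the fully rendered HTML
    browser = await launch(headless=True)
    page = await browser.newPage()
    await page.goto(url, waitUntil='networkidle0')
    html = await page.content()
    await browser.close()
    return html


html = asyncio.get_event_loop().run_until_complete(
    fetch_rendered_html('https://example.com'))
print(len(html))
```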
Selenium tutorial. Similarly to Puppeteer, it is a solution that helps control headless browsers. It is one of the more popular browser automation tools out there, so experimenting with both is suggested.
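A minimal headless-browser sketch with Selenium's Python bindings might look like this (assuming Chrome is installed; recent Selenium versions fetch the matching driver for you):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless=new')  # run Chrome without a GUI
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://example.com')  # placeholder target
    print(driver.title)
    html = driver.page_source  # rendered HTML, after JavaScript has run
finally:
    driver.quit()
```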
lxml tutorial. lxml is one of the fastest and most feature-rich libraries for processing XML and HTML in Python. By using the lxml library, XML and HTML documents can be created, parsed, and queried.
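For instance, a quick sketch of parsing an HTML snippet and querying it with XPath:

```python
from lxml import html

tree = html.fromstring(
    '<div class="product"><span class="price">19.99</span></div>'
    '<div class="product"><span class="price">24.99</span></div>'
)

# XPath queries run directly against the parsed tree
prices = tree.xpath('//div[@class="product"]/span[@class="price"]/text()')
print(prices)  # ['19.99', '24.99']
```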
Beautiful Soup for parsing. We will cover parsing a little bit later in this article, but to put it simply, there is no real point to data scraping without being able to parse your data to make it more readable. Beautiful Soup is a Python package used for parsing HTML and XML documents.
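A minimal sketch of extracting structured data from raw HTML with Beautiful Soup:

```python
from bs4 import BeautifulSoup

raw = '<html><body><a href="/page1">One</a><a href="/page2">Two</a></body></html>'
soup = BeautifulSoup(raw, 'html.parser')

# Pull every link's text and href into a readable, structured list
links = [(a.get_text(), a['href']) for a in soup.find_all('a')]
print(links)  # [('One', '/page1'), ('Two', '/page2')]
```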
One of the bigger challenges of web scraping is browser fingerprinting. It is already impacting web scraping, and it will only get harder to bypass (you can learn what browser fingerprinting is in our blog post). Luckily, some integrations help overcome it.
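Full fingerprinting countermeasures are beyond a short snippet, but a common first step is sending realistic, rotating request headers. A minimal sketch, where the user-agent strings are illustrative placeholders rather than maintained values:

```python
import random

import requests

# Small illustrative pool of desktop user agents (placeholders only)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
]

headers = {
    'User-Agent': random.choice(USER_AGENTS),  # vary the browser signature
    'Accept-Language': 'en-US,en;q=0.9',
}
response = requests.get('https://example.com', headers=headers)
print(response.status_code)
```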
Establishing a crawling path is the first thing you must do in data gathering. To better understand why this is the first step, let us visualize the web scraping process as a value chain:
As you can see, web scraping consists of four distinct actions:
Crawling path building and URL collection.
Scraper development and its support.
Proxy acquisition and management.
Data fetching and parsing.
So why is the first step building a crawling path and collecting URLs? Very simply, there is no way you can build a scraper without knowing your targets. Well, at least not a functional one.
So what is a crawling path? It is a library of URLs from which the data will be extracted. The biggest challenge will be obtaining all the necessary URLs of the initial targets. That could mean dozens, if not hundreds of URLs that will need to be scraped, parsed, and identified as important URLs for your case. Of course, at the beginning of creating your scraper, several main targets will do the trick.
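As a minimal sketch of URL collection, assuming a hypothetical category page whose product links carry a product class:

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

SEED_URL = 'https://example.com/category'  # hypothetical seed page

response = requests.get(SEED_URL, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

# Turn every matching relative link into an absolute URL; a set de-duplicates
crawling_path = {
    urljoin(SEED_URL, a['href'])
    for a in soup.select('a.product')
    if a.get('href')
}
print(len(crawling_path), 'URLs collected')
```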
Once you’ve decided on your team’s language, hired developers, researched several libraries, and built a URL path, the fun part begins – building a scraper. We have written a whole tutorial on how to start web scraping with Python, so you can study in greater detail how to build a scraper from scratch.
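To give a flavor of the end result, here is a bare-bones sketch that fetches each collected URL, parses one field, and stores the output as CSV (the URLs and the h1 selector are placeholders):

```python
import csv
import time

import requests
from bs4 import BeautifulSoup

urls = ['https://example.com/item/1', 'https://example.com/item/2']  # placeholders

with open('results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['url', 'title'])
    for url in urls:
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.text, 'html.parser')
        title = soup.find('h1')
        writer.writerow([url, title.get_text(strip=True) if title else ''])
        time.sleep(1)  # be polite: throttle requests to the target site
```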
When it comes to maintenance, it will be a daily process for your development team. This includes updating the infrastructure, fixing bugs, and maintaining stable system monitoring, which might require putting your development team on night duty (to fix any crashes in the system).
The few main things to keep in mind in this stage are:
Build with the future in mind. Analyze the current systems and inner workings. Anti-bot measures are getting smarter, and so should your future scraper tool.
Be aware that it takes time. Like any development project, it will probably take more time than you think, whether due to unforeseen challenges, changing business needs, and so on.
Create a simple testing area for higher-ups to understand what you are building. Showcasing the struggles you might be facing may help convince superiors to give more time or resources.
Make it scalable. Ensure that your tool is scalable and its features do not cause issues in other areas (e.g., data storage).
Have a dedicated crisis response team. Breakdowns are inevitable.
We have our own in-house scraper tools, Scraper APIs, that we built from scratch. If you are curious about the challenges we encountered during the whole process (and still encounter), we shared how we built our very first tool in a featured article on Towards Data Science.
There is no web scraping without proxies. Choosing the right proxy provider can be a little bit of a hassle, but so are most things when you start digging into different providers and available solutions. Here are the general steps for a good provider analysis:
See what is on the market. Several review sites concentrate on proxies; one of our favorites is Proxyway. The most important things to compare are success rates, proxy pool sizes, dashboard functionality, price, and support.
Check what others say. Whether reading case studies or Trustpilot reviews, see what their current clients have to say about them.
Check their documentation. This might be an obvious one, but see how their proxies work, how they are integrated, how difficult integration will be, etc.
Check for any additional resources. Do they have any quick-start guides, webinars, or guides that will make your life easier?
Ask for a demo. In most cases, especially if it is for a company, proxy providers will give a free trial to let you test out their solutions.
When it comes to proxy management, it will be a challenge, especially for those new to scraping. There are many small mistakes that can get batches of proxies blocked before you reach the desired scraping result. A good practice is proxy rotation, but rotation alone does not make all issues disappear; constant management and upkeep of the infrastructure will be needed. The best practices for keeping your proxies block-free will most likely be provided in the documentation or by the support team or dedicated Account Managers.
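A minimal sketch of per-request proxy rotation with requests (the proxy endpoints are placeholders; your provider's documentation will give the real format):

```python
import random

import requests

# Placeholder proxy endpoints; a real pool comes from your provider
PROXY_POOL = [
    'http://user:pass@proxy1.example.com:8080',
    'http://user:pass@proxy2.example.com:8080',
    'http://user:pass@proxy3.example.com:8080',
]

def fetch(url):
    proxy = random.choice(PROXY_POOL)  # pick a different proxy per request
    return requests.get(url, proxies={'http': proxy, 'https': proxy}, timeout=10)

response = fetch('https://example.com')
print(response.status_code)
```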
We have briefly mentioned data parsing in this article already. It is the process of making acquired data understandable and usable. Most data gathering methods return results that are incredibly hard to read as they come in raw code format. This is why parsing is necessary to turn raw results into structured data that is ready to use.
Creating a parser is not too difficult. However, like most of the other issues mentioned, maintenance will cause the biggest problems down the road. Adapting to different page formats and website changes will be a constant struggle that takes up time from your development team's day more often than you would expect.
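One way to soften that maintenance burden is to parse defensively, with a fallback chain of selectors and an explicit failure signal. A sketch, where the selectors are illustrative assumptions:

```python
from bs4 import BeautifulSoup

def extract_price(html_doc):
    """Try several selectors so a minor layout change does not break the parser."""
    soup = BeautifulSoup(html_doc, 'html.parser')
    # Fallback chain; these selectors are assumptions for illustration
    for selector in ('span.price', 'div.product-price', 'meta[itemprop="price"]'):
        node = soup.select_one(selector)
        if node:
            return node.get('content') or node.get_text(strip=True)
    return None  # explicit failure so monitoring can flag the page for review
```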
We hope this checklist will help you dot the i's and cross the t's in your scraping project, whether you are starting a new one or looking for tips midway through.
If you are curious to see how our in-house crawler looks, check out Web Scraper API. On that page you will find a free playground to test how it works.
About the author
Gabija Fatenaite
Lead Product Marketing Manager
Gabija Fatenaite is a Lead Product Marketing Manager at Oxylabs. Having grown up on video games and the internet, she found the tech side of things more and more interesting over the years. So if you ever find yourself wanting to learn more about proxies (or video games), feel free to contact her - she'll be more than happy to answer your questions.