Proxy locations

Europe

North America

South America

Asia

Africa

Oceania

See all locations

Network status Careers

hello@oxylabs.io

English (EN)

English

中文

Proxies

Proxies & Advanced Proxy Solutions

Residential Proxies

Human-like scraping without IP blocking

Mobile Proxies

Harness the power of IP addresses from real mobile devices

Rotating ISP Proxies

Extract the required data without the fear of getting blocked

Web Unblocker

AI-powered proxy solution for block-free scraping

Shared Datacenter Proxies

Fast and reliable proxies for cost-effective scraping

Dedicated Datacenter Proxies

The highest performing proxies on the market

Static Residential Proxies

Combined power of Datacenter and Residential IPs

Tools & Addons

Oxy Proxy Extension for Chrome

Free Chrome proxy manager extension that works with any proxy provider.

Oxy Proxy Manager for Android

Free Android proxy manager app that works with any proxy provider.

Proxy RotatorAdd-on

Rotates your Datacenter Proxies to help increase success rates.

Scraper APIs

SERP Scraper APIFREE TRIAL

Scalable SERP data delivery from major search engines

E-Commerce Scraper APIFREE TRIAL

Enterprise-level data from largest e-commerce marketplaces

Real Estate Scraper APIFREE TRIAL

Real-time data from popular real estate websites

Web Scraper APIFREE TRIAL

Public data delivery from a majority of websites

Features

Web Crawler

Discovers all pages on a website and fetches data at scale.

Scheduler

Schedules multiple scraping and parsing jobs at specified frequencies.

Custom Parser

Parses scraped documents by executing given parsing instructions.

Headless BrowserNEW

Render JavaScript and execute browser instructions.

DatasetsNew

Datasets

Company Data

Comprehensive datasets for business profiling

E-Commerce Product Data

Datasets for product catalog insights from E-Commerce stores

Job Postings Data

Datasets for labour market research and insights

Community and Code Data

Datasets for developer community trends

Product Review Data

Fresh datasets for user sentiment analysis

Pricing

Proxies

Residential Proxies

Human-like scraping

Starts from

$10

Pay as you go

Mobile Proxies

3G/4G/5G Mobile Proxies

Starts from

$22

Pay as you go

Rotating ISP Proxies

Extended sessions

Starts from

$340/month

Shared Datacenter Proxies

Cost-effective solution

Starts from

$50/month

Dedicated Datacenter Proxies

Superior performance

Starts from

$50/month

Scraper APIs

SERP Scraper API

Scalable SERP data delivery

Starts from

$49/month

E-Commerce Scraper API

Enterprise-level product page data

Starts from

$49/month

Web Scraper API

Data from a majority of websites

Starts from

$49/month

Real Estate Scraper API

Real-time real estate data

Starts from

$49/month

Advanced Proxy Solutions

Web Unblocker

AI-powered proxy solution

Starts from

$75/month

Learn

Getting Started

Knowledge Base

Read the latest articles about the world of web scraping, proxies, and more

Webinars

Check our webinars to learn more about data gathering issues and solutions

White papers

Get extensive white papers to understand the most complex scraping topics

OxyCon

Join inspiring discussions at Oxylabs’ annual web scraping conference

Scraping Experts

Watch lessons by industry-leading experts to gain insights on data gathering

Useful Information

Quick Start Guides

Featured

Explore tutorials and code samples to build a web scraping infrastructure with Oxylabs solutions.

Solutions

By Industry

E-Commerce

Get access to valuable e-commerce data with the help of advanced scraping solutions

Cybersecurity

Collect threat intelligence and inspect risky activities anonymously with reliable proxies

Brand protection

Monitor the web on a large scale to ensure no unauthorized product seeped into the market

SERP Monitoring

Monitor SERPs to enhance your business strategy

Travel and hospitality

Gather real-time flight and hotel data to and build a solid strategy for your travel business.

By Use Case

View all

By Target

View all

Back to blog

Tutorials Data acquisition

Crawlee Tutorial: Easy Web Scraping and Browser Automation

Yelyzaveta Nechytailo

2023-04-046 min read

Web scraping and browser automation have emerged as essential tools for businesses looking to stay competitive in the digital market. This easy tutorial covers everything you need to get started with Crawlee – a tool for web scraping and browser automation.

What is Crawlee?

Crawlee is a Node.JS package that offers a straightforward and adaptable interface for web scraping and browser automation. Users can retrieve web pages, apply CSS selectors to extract data from them, and navigate the DOM tree to follow links and scrape several sites.

Crawlee is a versatile tool that provides a uniform interface for web crawling via HTTP and headless browser approaches. It has an integrated persistent queue for handling URLs to crawl in either breadth-first or depth-first order.

Users can also benefit from integrated proxy rotation and session management, pluggable storage solutions for files and tabular data, and other features. Moreover, Crawlee offers hook-based customized lifecycles, programmable routing, error handling, and retries.

For speedy project setup, a CLI is accessible, and Dockerfiles are included to streamline deployment. Crawlee is a robust and effective tool for web scraping and crawling written in TypeScript using generics.

The benefits of using Crawlee for web scraping and browser automation

The following are some of the most common pros of using Crawlee for browser automation and web scraping:

Single interface: Crawlee offers a single interface for headless browser crawling as well as HTTP crawling, making it simple to switch between the two based on your needs.
Customizable lifecycles: Crawlee allows developers to alter their crawlers' lifecycles using hooks. These hooks can be used to carry out operations before or after specific events, such as before a request is made or after data is collected.
Pluggable storage: Crawlee supports pluggable storage methods for both tabular data and files, making storing and managing the data you extract simple.
Proxy rotation and session management: Crawlee has built-in support for these features, which can be used to manage complex web interactions and avoid IP blocking.
Configurable request routing, error handling, and retries: Crawlee enables developers to control request routing, deal with errors, and retry requests as needed, making it simpler to handle edge cases and unexpected issues.
Docker-capable: Crawlee comes with Dockerfiles that let developers rapidly and easily deploy their crawlers to production environments.

Crawlee web scraping tutorial

This section discusses the installation steps of Crawlee and explains how it works. Additionally, it contains a working example of scraping a website using Crawlee.

Installation

To use Crawlee on your system, you must install Node.JS version 16.0 or above. Along with it, NPM should also be installed.

The Crawlee CLI is the quickest and most efficient way to build new projects with Crawlee. The following command will create a new Crawlee project inside the “my-crawler” directory:

npx crawlee create my-crawler

The npx CLI tool runs the crawlee package locally without installing it globally on your machine. Running the above command will show you a prompt message to choose a template, as shown in the following snippet:

Since we are doing Node.JS development, we can select Getting Started Example (Javascript). This option will install all required dependencies and create a new directory named my-crawler in your current working directory. It will also add a package.json to this folder. Additionally, it will include example source code that you may use immediately. The image below shows the completed installation message:

Remember, the Crawlee project was created inside the my-crawler folder. You first need to change your current directory to this folder.

cd my-crawler

Next, run the following command to start the Crawlee project:

npm start

The start command will start the default crawler of the project. This crawler crawls the Crawlee website and outputs the titles of all the links on the website.

Congratulations! You have successfully installed Crawlee and run one of the crawlers with it.

Crawlee working

Crawlee has three types of crawlers: CheerioCrawler, PuppeteerCrawler, and PlaywrightCrawler. They all share some basic characteristics.

Every crawler is designed to visit a webpage, carry out specific tasks, save the results, navigate to the next page, and repeat this cycle until the job is finished. Each crawler must therefore respond to two queries: Where should I go? What should I do there?

The crawler can start working when these queries are resolved since most other settings are pre-configured for the crawlers.

Crawlee web scraping example

Our target for this web scraping demonstration is books.toscrape.com. We will be scraping the titles of the books listed on the website.

Open the main.js file from the src folder of your project and overwrite the following code in it:

import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page }) => {
        
        // Waiting for book titles to load
        await page.waitForSelector('h3');
        
        // Execute a function in the browser that targets
        // the book title elements and allows their manipulation
        const bookTitles = await page.$$eval('h3', (els) => {
        
            // Extract text content from the titles list
            return els.map((el) => el.textContent);
        });
        
         bookTitles.forEach((text, i) => {
            console.log(`Book_${i + 1}: ${text}\n`);
        });
    },
});

await crawler.run(['https://books.toscrape.com/']);

This code illustrates how to scrape data from a website using Crawlee's PlaywrightCrawler class.

The code first imports the PlaywrightCrawler class from the crawlee package. Then it creates a new crawler of the PlaywrightCrawler type.

Instantiating the PlaywrightCrawler class requires an options object which includes a requestHandler function. The requestHandler function is an asynchronous function executed for every page visited by the crawler.

The requestHandler function first waits for the page's <h3> elements, representing the book titles, to be rendered using the page. waitForSelector() call. The page.$$eval() method is executed in the browser's context, extracting the text content from all of the <h3> elements on the page. Lastly, the crawler code logs the book titles to the console.

The last line of code initiates the crawling operation using the crawler.run() method.

The following command runs the project:

npm start

Here is what the code output when once you run the crawlee project:

Using headless browsers with Crawlee

Crawlee supports headless control over the browsers like Chromium, Firefox, and WebKit. You can combine headless browsers with Crawlee's PuppeteerCrawler and PlaywrightCrawler classes to perform true browser crawling and extract valuable data from complex websites.

You just need to set up a few variables, such as the browser type, start options, and context options, to use headless browsers with Crawlee.

Here is a simple code example to launch a Firefox headless instance and scrape the titles of the books (as we did in the earlier section):

import { PlaywrightCrawler } from 'crawlee';
import { firefox } from 'playwright';


const crawler = new PlaywrightCrawler({
     launchContext: {
        // Set the Firefox browser to be used by the crawler.
      
        launcher: firefox,
    },
    requestHandler: async ({ page }) => {
        // Wait for the actor cards to render.
        await page.waitForSelector('h3');
        // Execute a function in the browser which targets
        // the actor card elements and allows their manipulation.
        const bookTitles = await page.$$eval('h3', (els) => {
            // Extract text content from the actor cards
            return els.map((el) => el.textContent);
        });
        bookTitles.forEach((text, i) => {
            console.log(`Book_${i + 1}: ${text}\n`);
        });
    },

});

await crawler.run(['https://books.toscrape.com/']);

Crawlee offers rich features for HTTP and real browser crawling. These features include:

HTTP crawling:

Automation configuration of browser-like headers
Replication of browser TLS fingerprints
Integrated fast HTML parsers
Zero config HTTP2 support, even for proxies
Scraping JSON APIs

Real browser crawling

JavaScript rendering and screenshots
Headless and headful support
Zero-config generation of human-like fingerprints
Automatic browser management
Use Playwright and Puppeteer with the same interface
Chrome, Firefox, Webkit and many others

How to manage proxies with Crawlee

Crawlee includes built-in support for managing proxies. It lets you quickly choose between a list of proxies to avoid IP-based restrictions or website blocking.

You can construct a new instance of a crawler and give in a ProxyConfiguration object with the list of proxies. You can optionally specify the rotation technique. For example, you can set proxies to rotate every request or after a specified number of requests.

Moreover, you can also use a third-party solution such as Oxylabs' Web Unblocker to ensure that your Crawlee web scraping is not blocked or restricted.

Integrating proxies with Crawlee

To integrate the proxy endpoints with Crawlee, you can use the ProxyConfiguration class. You can create an instance of this class using the constructor and provide the necessary options. You can visit the ProxyConfigurationOptions class page to learn about proxy configuration options.

The following code demonstrates setting a proxy list Crawlee. You can get a list of residential proxy endpoints by registering an account at Oxylabs' Web Unblocker page.

import { PlaywrightCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration(
{
	proxyUrls: [
		'http://username:password@unblock.oxylabs.io:60000',
		'https://username:password@unblock.oxylabs.io:60000',
		],
	},);

proxyConfiguration.isManInTheMiddle = true;

const crawler = new PlaywrightCrawler({
	proxyConfiguration,
	requestHandler: async ({page}) => {
		// Waiting for book titles to load
		await page.waitForSelector('h3');

		// Execute a function in the browser that targets
		// the book title elements and allows their manipulation
		const bookTitles = await page.$$eval('h3', (els) => {

		// Extract text content from the titles list
		return els.map((el) => el.textContent);
	});

	bookTitles.forEach((text, i) => {
	console.log(`Book_${i + 1}: ${text}\n`);
	});
},
navigationTimeoutSecs: 120,
});

await crawler.run(['https://books.toscrape.com/']);

The above code snippet creates a new instance of the ProxyConfiguration class by passing a list of Oxylabs’ Web Unblocker proxy endpoints. Then it sets the isManInTheMiddle property to true. This property indicates that the proxy server will be used as a Man-in-the-Middle (MITM) proxy.

After that, it uses this ProxyConfiguration object (i.e., stored in the proxyList) to initialize a Playwright crawler instance. The proxyList contains a list of the Oxylabs’ Web Unblocker proxies.

Make sure to replace the username and password with your account credentials.

The rest of the crawler code remains the same as we wrote in the previous example.

Conclusion

Crawlee is a powerful web scraping and browser automation solution with a unified interface for HTTP and headless browser crawling. It supports pluggable storage, headless browsing, automatic scaling, integrated proxy rotation and session management, customized lifecycles, and much more. Crawlee is an effective solution for developers and data analysts who want to automate browser actions and retrieve and extract data effectively.

For your convenience, we also have this article presented on our GitHub. Have questions about this tutorial or any other web scraping topic? Don’t hesitate to contact us at hello@oxylabs.io or via our 24/7 live chat.

About the author

Yelyzaveta Nechytailo

Senior Content Manager

Yelyzaveta Nechytailo is a Senior Content Manager at Oxylabs. After working as a writer in fashion, e-commerce, and media, she decided to switch her career path and immerse in the fascinating world of tech. And believe it or not, she absolutely loves it! On weekends, you’ll probably find Yelyzaveta enjoying a cup of matcha at a cozy coffee shop, scrolling through social media, or binge-watching investigative TV series.

Learn more about Yelyzaveta Nechytailo

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.

Scrapers Tutorials