Yelyzaveta Nechytailo
In today's data-driven world, effective web scraping is crucial. However, managing a fleet of servers for scraping operations isn't always the most convenient or cost-effective option for engineers and developers. This is where serverless computing and web scraping frameworks come into play. This article explores how to use Scrapy, a popular web scraping framework, with AWS Lambda, a serverless computing platform from Amazon Web Services.
Serverless web scraping combines a serverless computing platform, such as AWS Lambda, with a web crawling framework, such as Scrapy, to efficiently extract data from the web. By pairing these technologies, developers can create robust, scalable, and cost-effective web scraping solutions without needing to manage any servers or pay for idle time.
What is AWS Lambda?
AWS Lambda is a serverless computing service provided by Amazon Web Services. It lets developers run code without managing servers, automatically scaling to handle any workload. In simpler terms, it takes care of all the infrastructure needed to run your code on a highly available, distributed compute platform, allocating the right resources as demand changes.
The serverless aspect ensures optimal allocation of resources, allowing the application to scale up during high demand and scale down during quieter periods. This elasticity makes serverless web scraping a smart choice for projects with unpredictable load or where high-scale operations are needed sporadically.
On the other hand, using a powerful web crawling framework like Scrapy offers comprehensive tools to scrape websites effectively and with greater control. This framework enables handling complex data extraction and storing the scraped data in the desired format.
What is Scrapy?
Scrapy is an open-source, collaborative web crawling framework written in Python. It's designed to handle a wide range of tasks, from crawling websites and extracting data to processing the scraped results. With built-in functionality for extracting and storing data in your preferred structure and format, Scrapy stands out as a robust web scraping framework.
To perform web scraping with Scrapy effectively, we recommend integrating it with Oxylabs’ Residential or Datacenter Proxies. The biggest advantage of using proxies with Scrapy is that they hide your actual IP address from the target site's server while scraping. Proxies protect your privacy and reduce the risk of getting banned when you collect data with an automated tool rather than copying and pasting it manually.
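If you want to try this out, the simplest way to route Scrapy requests through a proxy is to set the proxy on each request via request.meta, which Scrapy's built-in HttpProxyMiddleware picks up automatically. Below is a minimal sketch; the endpoint, username, and password are placeholders, so substitute the connection details from your own proxy dashboard.

import scrapy


class ProxiedBooksSpider(scrapy.Spider):
    name = "books_proxied"
    start_urls = ["https://books.toscrape.com/"]

    def start_requests(self):
        # Placeholder endpoint and credentials -- replace with the values
        # from your proxy provider's dashboard.
        proxy = "http://USERNAME:PASSWORD@YOUR-PROXY-ENDPOINT:PORT"
        for url in self.start_urls:
            # HttpProxyMiddleware routes any request that carries a
            # "proxy" key in its meta dict through that proxy.
            yield scrapy.Request(url, meta={"proxy": proxy})

    def parse(self, response):
        self.logger.info("Fetched %s through the proxy", response.url)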
Setting up a Scrapy crawler is the first step towards serverless web scraping. Scrapy uses spiders, which are self-contained crawlers that are given a set of instructions.
Here's an example of a simple Scrapy spider:
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for s in response.css("article"):
            yield {
                "title": s.css("h3 a::attr(title)").get(),
                "price": s.css("p.price_color::text").get(),
            }

        next_page = response.css("li.next > a::attr(href)").get()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page))
In this example, we've created a simple spider that will start at https://books.toscrape.com, extract the title and price of all the books, go to the next page, and repeat. The final result is scraped data from 1,000 books.
If you run the spider on your machine, you can use the -o switch to redirect the output to a file. For example, the following command creates a books.json file:
scrapy runspider books.py -o books.json
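If you'd rather launch the crawl from Python code instead of the scrapy CLI, a minimal sketch looks like the following; it assumes the spider above is saved as books.py in the current directory.

from scrapy.crawler import CrawlerProcess

from books import BooksSpider  # assumes the spider is saved as books.py

# Write the items to a local JSON file; FEEDS is the consolidated
# feed setting available in Scrapy 2.1 and later.
process = CrawlerProcess(settings={"FEEDS": {"books.json": {"format": "json"}}})
process.crawl(BooksSpider)
process.start()  # blocks until the crawl finishes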
When we run the Scrapy spider as an AWS Lambda function, we can't access the terminal or a persistent file system, which means we can't simply write the output to a local file and retrieve it afterward. The alternative is to store the output in an S3 bucket.
Modify the spider code to add these custom settings:
class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    custom_settings = {
        "FEED_URI": "s3://YOUR-BUCKET-NAME/items.json",
        "FEED_FORMAT": "json",
    }

    # rest of the code
Replace YOUR-BUCKET-NAME with the name of the S3 bucket you created.
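One note on versions: FEED_URI and FEED_FORMAT were deprecated in Scrapy 2.1 in favor of the consolidated FEEDS setting. They still work in the version we pin later (2.6.0), but on a newer release the equivalent configuration looks roughly like this; the bucket name remains a placeholder.

custom_settings = {
    # FEEDS maps each output URI to its own feed options.
    "FEEDS": {
        "s3://YOUR-BUCKET-NAME/items.json": {"format": "json"},
    },
}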
Before configuring the local environment for AWS Lambda, install the following tools:
Docker
AWS CLI
Serverless Framework
boto3 package
Configuring the Lambda function requires creating a Docker image of our Scrapy spider and uploading it to the AWS platform. Docker allows you to package an application with its environment and dependencies into a container, which can be quickly shipped and run anywhere. This process ensures that our application will run the same, regardless of any customized settings or previously installed software on the AWS Lambda server that could differ from our local development environment.
You can download and install Docker Personal from docker.com. Ensure that you can run the docker executable from the command line.
The AWS Command Line Interface (AWS CLI) is a powerful tool that allows users to interact with Amazon Web Services (AWS) using the command-line interface of your operating system. It provides a convenient and efficient way to manage various AWS services and resources without a graphical user interface.
To install AWS CLI, visit https://aws.amazon.com/cli/ and download the package for your operating system.
The Serverless Framework handles packaging and deploying the Lambda function, so we don't have to configure the AWS resources by hand.
You can install the Serverless Framework using npm. If you don't have Node.js set up yet, go to the official Node.js website at https://nodejs.org/, download the LTS release, and install it. After that, run the following command:
$ npm install -g serverless
Boto3 is the AWS SDK for Python. It is built on top of Botocore, a low-level library developed by Amazon Web Services (AWS) that provides the core functionality for interacting with AWS services from Python code.
Ensure that you have created and activated a virtual environment, then run the following:
(venv) $ pip install boto3
Note that this virtual environment should also have Scrapy installed. If you haven't installed it yet, do so with pip:
(venv) $ pip install scrapy
To keep the Docker container lean and limited to the necessary dependencies, it's crucial to create a requirements.txt file listing all the Python packages your Scrapy spider needs to run. This file might look like this:
requests==2.31.0
requests-file==1.5.1
Scrapy==2.6.0
boto3==1.28.14
service-identity==21.1.0
cryptography==38.0.4
# more packages
The next step is to create a Docker image. Create a file and name it Dockerfile. This file will have the following contents:
FROM public.ecr.aws/lambda/python:3.10
# Required for lxml
RUN yum install -y gcc libxml2-devel libxslt-devel
COPY . ${LAMBDA_TASK_ROOT}
RUN pip3 install -r requirements.txt
CMD [ "lambda_function.handler" ]
In this Dockerfile, we start from the official AWS Lambda Python base image, install the system libraries required by lxml, copy our application into the Lambda task root, install the Python requirements, and set the command to the handler function defined in lambda_function.py, which we'll create next.
Create a new file and save it as lambda_function.py. Enter the following in this file:
import subprocess


def handler(event, context):
    # Run the Scrapy spider as a child process; the spider's
    # custom_settings take care of writing the output to S3.
    subprocess.run(["scrapy", "runspider", "books.py"])

    return {
        "statusCode": 200,  # a valid HTTP status code
        "body": "Lambda function invoked",
    }
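As an optional refinement (not part of the original setup), you can avoid hard-coding the bucket in books.py by building the feed target from the BUCKET environment variable defined in the serverless.yml file shown in the next step, and passing it to the spider with Scrapy's -s flag, which takes precedence over custom_settings. A hedged sketch of that variation:

import os
import subprocess


def handler(event, context):
    # Build the feed target from the BUCKET environment variable that
    # serverless.yml injects, instead of hard-coding it in books.py.
    bucket = os.environ.get("BUCKET", "YOUR-BUCKET-NAME")
    feed_uri = f"s3://{bucket}/items.json"

    # Settings passed with -s take precedence over the spider's
    # custom_settings, so this overrides the hard-coded FEED_URI.
    subprocess.run(
        ["scrapy", "runspider", "books.py", "-s", f"FEED_URI={feed_uri}"],
        check=True,
    )
    return {
        "statusCode": 200,
        "body": f"Scrape finished, output written to {feed_uri}",
    }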
Finally, for deployment, we will need a YML file. Add a new file, save it as serverless.yml, and add the following code:
service: scrapy-lambda

provider:
  name: aws
  runtime: python3.9
  stage: dev
  region: us-east-1
  environment:
    BUCKET: my-bucket
  iamRoleStatements:
    - Effect: "Allow"
      Action:
        - "s3:*"
      Resource: "arn:aws:s3:::${self:provider.environment.BUCKET}/*"

functions:
  scrapyFunction:
    image: YOUR_REPO_URI:latest
    events:
      - http:
          path: scrape
          method: post
          cors: true
We will update YOUR_REPO_URI in the next section. Also, note that scrapy-lambda is just the name of the Docker image we'll build in the next section.
The first step is to create a user using AWS IAM. Take note of the Access Key and Secret Access Key.
Execute the following and enter these keys when prompted:
aws configure
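To confirm that these credentials work from Python as well, you can run a quick sanity check with boto3. This is just a convenience step, not required by the deployment.

import boto3

# List the S3 buckets visible to the credentials stored by `aws configure`.
s3 = boto3.client("s3")
response = s3.list_buckets()
print([bucket["Name"] for bucket in response["Buckets"]])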
Next, create a new ECR repository by running the following:
$ aws ecr create-repository --repository-name YOUR_REPO_NAME
From the JSON output, take note of the repositoryUri value. It will look like 76890223446.dkr.ecr.us-east-1.amazonaws.com/scrapy-images.
Replace the YOUR_REPO_URI in the serverless.yml file with this value.
If you haven't created an S3 bucket yet, do so from the AWS console.
Take note of the bucket name and update the Scrapy spider code: replace YOUR-BUCKET-NAME in books.py with the actual S3 bucket name.
Now, build your Docker image with the following command:
$ docker build -t scrapy-lambda .
Tag and push your Docker image to Amazon ECR using the following commands:
$ aws ecr get-login-password --region region | docker login --username AWS --password-stdin YOUR_REPO_URI
$ docker tag scrapy-lambda:latest YOUR_REPO_URI:latest
$ docker push YOUR_REPO_URI:latest
Replace region with your AWS region and YOUR_REPO_URI with your Amazon ECR repository URI.
Finally, deploy the images using the following command:
$ sls deploy
The output of this command should be as follows:
Service deployed to stack scrapy-lambda-dev (79s)
endpoint: POST - https://abcde.execute-api.us-east-1.amazonaws.com/dev/scrape
When you run the sls deploy command, you will see the service URL endpoint in the command output.
To execute this function, send a POST request to this URL as follows:
curl -X POST https://abcde.execute-api.us-east-1.amazonaws.com/dev/scrape
The Lambda function execution will begin, and the output will be stored in the S3 bucket.
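To inspect the results locally, you can download the feed file back from S3 with boto3; YOUR-BUCKET-NAME is again a placeholder for your bucket.

import json

import boto3

# Download the scraped items from the bucket configured in the feed URI.
s3 = boto3.client("s3")
s3.download_file("YOUR-BUCKET-NAME", "items.json", "items.json")

with open("items.json") as f:
    items = json.load(f)

print(f"Downloaded {len(items)} scraped items")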
This article covered the topic of serverless web scraping and how to run Scrapy as a Lambda function. We discussed the prerequisites, setting up a Scrapy crawler, configuring AWS Lambda for serverless scraping, and storing the scraped data in an AWS S3 bucket. With this knowledge, you should be able to harness the power of serverless web scraping, helping you perform more efficient and cost-effective data collection.
If you’re looking for more similar content, check out our Scrapy Splash tutorial and Puppeteer on AWS Lambda, covering the main challenges of getting Puppeteer to work properly on AWS Lambda. As always, we’re ready to answer any questions you have via the live chat or at hello@oxylabs.io.
About the author
Yelyzaveta Nechytailo
Senior Content Manager
Yelyzaveta Nechytailo is a Senior Content Manager at Oxylabs. After working as a writer in fashion, e-commerce, and media, she decided to switch her career path and immerse in the fascinating world of tech. And believe it or not, she absolutely loves it! On weekends, you’ll probably find Yelyzaveta enjoying a cup of matcha at a cozy coffee shop, scrolling through social media, or binge-watching investigative TV series.