There are a few challenges when it comes to getting Puppeteer to work properly on AWS Lambda, and we’ll address all of them in this post.
But first, let’s start with introducing both Puppeteer and AWS Lambda.
Jordan Hansen, Owner of Cobalt Intelligence
Simply put, Puppeteer is software for controlling a (headless) browser. It's an open-source Node.js library developed and maintained by Google's developer tools team, and it lets you simulate user interaction with a browser through a simple API.
This is very helpful for doing things like automated tests or, my personal use case, web scraping.
A picture's worth a thousand words, so how much is a gif worth? With the little bit of code shown in the gif below, I can log in to a Google account. You simply click, enter text, paginate, and scrape whatever publicly available data you need.
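The gif itself isn't reproduced here, but a minimal sketch of the same idea looks like this. The URL and CSS selectors are hypothetical placeholders, not a real login flow:

```javascript
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  // Hypothetical login page and selectors -- swap in your target's real ones
  await page.goto('https://example.com/login');
  await page.type('#username', 'my-user');
  await page.type('#password', 'my-password');

  // Click and wait for the resulting navigation in parallel
  await Promise.all([
    page.waitForNavigation(),
    page.click('button[type="submit"]'),
  ]);

  // Scrape some publicly available text from the resulting page
  const heading = await page.$eval('h1', (el) => el.textContent);
  console.log(heading);

  await browser.close();
})();
```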
AWS Lambda is what Amazon describes as a way to “run code without thinking about servers or clusters.” You simply create a function on Lambda and then execute it. It's that easy.
Simply put, you can do everything on AWS Lambda. Okay, everything is a strong word, but almost. For example, I scrape thousands of public web pages every night with AWS Lambda functions and insert the results into databases. All of the back-end API routes that handle sign-ups for my services are hosted on AWS Lambda as well.
Getting started with AWS Lambda is simple and inexpensive. You only pay for what you use, and there is also a generous free tier.
AWS Lambda has a 50 MB limit on the zip file you push directly to it. Because it downloads a full Chromium build, the Puppeteer package is significantly larger than that. However, this 50 MB limit doesn't apply when you load the function from S3! See the AWS Lambda quotas documentation for details.
AWS Lambda quotas can be tight for Puppeteer:

- 50 MB zipped deployment package when uploading directly
- 250 MB unzipped deployment package, including layers
The 50 MB zipped limit only applies to direct uploads and can be bypassed by uploading from an S3 bucket (the unzipped package still has to fit within 250 MB). So I create a bucket in S3, use npm scripts to zip and upload the function, and then update my Lambda code from that bucket. The scripts in my package.json look something like this:
"zip": "npm run build && 7z a -r function.zip ./dist/* node_modules/",
"sendToLambda": "npm run zip && aws s3 cp function.zip s3://chrome-aws && rm function.zip && aws lambda update-function-code --function-name puppeteer-examples --s3-bucket chrome-aws --s3-key function.zip"
By default, the Linux environment that AWS Lambda runs on doesn't include the shared libraries Chromium needs, so Puppeteer can't launch a browser out of the box.
Fortunately, there is already a package, chrome-aws-lambda, that ships a Chromium build compiled to run on Lambda. You will need to install it, along with puppeteer-core, in the function you are sending to Lambda.
The regular Puppeteer package isn't needed and, in fact, its bundled Chromium would count against your 250 MB limit.
`npm i --save chrome-aws-lambda puppeteer-core`
Then, when you launch a browser from Puppeteer, use the executable path and arguments that chrome-aws-lambda provides:

```javascript
// chrome-aws-lambda bundles puppeteer-core plus a Lambda-compatible Chromium
const chromium = require('chrome-aws-lambda');

const browser = await chromium.puppeteer.launch({
  args: chromium.args,
  defaultViewport: chromium.defaultViewport,
  executablePath: await chromium.executablePath,
  headless: chromium.headless,
});
```
Puppeteer requires more memory than a regular script, so keep an eye on your function's max memory usage. When using Puppeteer, I recommend giving your AWS Lambda function at least 512 MB of memory.
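If you manage the function from the command line, you can raise the memory allocation with the AWS CLI; the function name below is the one from the deploy script above:

```bash
aws lambda update-function-configuration --function-name puppeteer-examples --memory-size 512
```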
Also, don't forget to run `await browser.close()` at the end of your script. Otherwise, the browser may stay alive waiting for commands and keep your function running until it times out.
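Putting those pieces together, a minimal handler sketch might look like the following. The target URL is a placeholder, and `browser.close()` sits in a `finally` block so it runs even if scraping throws:

```javascript
const chromium = require('chrome-aws-lambda');

exports.handler = async (event) => {
  const browser = await chromium.puppeteer.launch({
    args: chromium.args,
    defaultViewport: chromium.defaultViewport,
    executablePath: await chromium.executablePath,
    headless: chromium.headless,
  });

  try {
    const page = await browser.newPage();
    // Placeholder URL -- swap in the page you want to scrape
    await page.goto('https://example.com');
    const title = await page.title();
    return { statusCode: 200, body: title };
  } finally {
    // Always close the browser, or the function may run until timeout
    await browser.close();
  }
};
```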
About the author
Jordan Hansen
Founder of Cobalt Intelligence
Jordan Hansen is a professional web scraper who lives in Eagle, ID in the United States. His company, Cobalt Intelligence, gets Secretary of State business data for banks and business lenders via API.