Roberta Aukstikalnyte
Splash is a lightweight headless browser with an HTTP API, and Scrapy Splash is the plugin that lets Scrapy drive it; together, they’re used to scrape websites that render data with JavaScript or AJAX calls. In today’s article, we’ll demonstrate how to use Scrapy Splash to your advantage when scraping data from websites with JavaScript rendering.
First, let’s take a look at the steps for installing Scrapy Splash.
Scrapy Splash talks to the Splash API, so you’ll also need to install Splash itself or use its Docker image (in this tutorial, we’re going to use Docker).
If you want to configure Splash manually without Docker, you can check the detailed instructions in the official installation documentation.
Docker is an open-source containerization technology that’ll help us run the Splash instance in a virtual container. To install Docker on Linux, you can use the following commands:
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
If you’re using Windows, macOS, or any other operating system, you can find the relevant installation information on the Docker website.
Once Docker is installed, you can use the `docker` command to pull the Splash Docker image from Docker Hub.
docker pull scrapinghub/splash
The above command will download the Splash Docker image. Now, you’ll have to run it using the below command:
docker run -it -p 8050:8050 --rm scrapinghub/splash
The Splash instance will be available and ready to use on port 8050. So, if you visit localhost:8050 in your browser, you’ll see the default Splash page.
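You can also talk to the Splash HTTP API directly: its documented render.html endpoint returns a page’s HTML after JavaScript has executed. As a minimal sketch (the target page below is just an illustration; any JavaScript-heavy URL works), here’s how such a request URL is built using only the standard library:

```python
from urllib.parse import urlencode

# Splash exposes plain HTTP endpoints; render.html returns the page's HTML
# after JavaScript has run. This only builds the request URL -- open it in
# your browser or any HTTP client once the Docker container is running.
SPLASH_URL = "http://localhost:8050"

params = urlencode({
    "url": "http://quotes.toscrape.com/js/",  # a JavaScript-rendered page
    "wait": 1,  # seconds Splash waits for rendering before responding
})
render_url = f"{SPLASH_URL}/render.html?{params}"
print(render_url)
# http://localhost:8050/render.html?url=http%3A%2F%2Fquotes.toscrape.com%2Fjs%2F&wait=1
```

The same `wait` parameter reappears later as the `args={'wait': 1}` option of SplashRequest, which forwards its arguments to this API.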
Next, you’ll have to install Scrapy and the `scrapy-splash` plugin. Both can be installed with a single command:
pip install scrapy scrapy-splash
The `pip` command will download all the necessary dependencies from the Python Package Index (PyPI) and install them.
The next step is to create a Scrapy project. You’ll need to run the following command:
scrapy startproject splashscraper
It’s as simple as that! You’ve now created your first Scrapy project named splashscraper. Depending on your Scrapy version, this command will create a project structure similar to the below:
splashscraper
├── scrapy.cfg
└── splashscraper
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py

2 directories, 7 files
Now, you’ll have to update the `settings.py` file to include some Splash-specific settings. First, you’ll need to create a `SPLASH_URL` variable that points to the Splash instance you started in the previous step.
SPLASH_URL = 'http://localhost:8050'
Then, you’ll have to modify the `DOWNLOADER_MIDDLEWARES` to include the Splash middlewares (the `scrapy-splash` documentation also recommends adjusting the priority of Scrapy’s built-in `HttpCompressionMiddleware`).

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
After that, you’ll have to update the `SPIDER_MIDDLEWARES` with another Splash middleware, which is necessary for deduplication, and add a Splash-aware duplicate filter.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
The rest of the settings can be left with their default values for now. (If you later enable Scrapy’s HTTP cache, the `scrapy-splash` documentation also recommends setting `HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'`.)
Scrapy has a built-in command to generate spiders. You’ll use this command to create the boilerplate of the spider. We’ll use the quotes.toscrape.com website as the target.
As you can see, this website has a lot of quotes. You can extract all the information available; also, you can use the pagination to navigate to all the other pages. First, let’s generate the spider using the below command:
scrapy genspider quotes quotes.toscrape.com
Once you execute this command, it’ll create a new file in the spiders directory. Let’s take a look at this file.
In the spiders directory, you’ll find a file named quotes.py. Open it up in any text editor. It’ll look like this:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        pass
Notice the class variables of the `QuotesSpider` class: the `allowed_domains` list restricts the spider to the listed domains, ensuring it won’t send network requests to websites or domains outside its scope.
When the spider starts, it’ll request all the URLs listed in `start_urls` asynchronously. The generator also created the `parse` method, which will be invoked with the response of each URL by default.
Right now, the spider isn’t doing anything. You’ll have to update the parse method to bring it to life.
First, let’s check the HTML source of the website in a web browser using the developer tools.
As you can see, each quote is wrapped in a `div` tag with the class name `quote`. The quote’s text is enclosed in a `span` tag with the class `text`, the author’s name in a `small` tag with the class `author`, and the quote’s tags in the `content` attribute of a `meta` tag with the class `keywords`.
Now, you’ll have to create a class to map the data into Scrapy items. By default, Scrapy uses the items.py file to store the class definitions. Fortunately, you don’t have to write it from scratch – Scrapy already generated this file for you. Navigate to the items.py file in the project directory. Then, modify it to include three fields: author, text, and tags.
import scrapy


class SplashscraperItem(scrapy.Item):
    author = scrapy.Field()
    text = scrapy.Field()
    tags = scrapy.Field()
Next, you’ll have to implement the parse method. You’ll begin by importing the `SplashscraperItem` class that you created in the previous step:
from splashscraper.items import SplashscraperItem
Then, you’ll have to implement the parse method as shown below:
def parse(self, response):
    for quote in response.css("div.quote"):
        text = quote.css("span.text::text").extract_first("")
        author = quote.css("small.author::text").extract_first("")
        tags = quote.css("meta.keywords::attr(content)").extract_first("")

        item = SplashscraperItem()
        item['text'] = text
        item['author'] = author
        item['tags'] = tags
        yield item
In this `parse()` method, the code uses the response object to select all the quote `div` elements. It then iterates over each one to extract the text, author, and tags, creates a `SplashscraperItem()` with that data, and yields the `item` object.
The code that you’ve written so far only works on a single page. So, let’s modify it further to navigate through all the pages using the pagination of the website.
Head on to the website again using a web browser and scroll to the bottom, where you’ll see the “Next” button. As soon as you click it, it loads the second page.
Now, you’ll need some code to automate navigating to the next page.
By taking a closer look, we can see that the website implements the “Next” button as an anchor tag nested inside an `li` element with the class `next`. You’ll have to add the following lines after the for loop you wrote earlier:
next_url = response.css("li.next>a::attr(href)").extract_first("")
if next_url:
    yield scrapy.Request(response.urljoin(next_url), self.parse)
The next_url variable will contain the (relative) URL of the next page. When the spider reaches the last page, next_url will be an empty string, so the if statement checks this condition to handle that case properly.
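One detail worth noting: the href extracted from the “Next” link is relative (e.g., /page/2/), so it has to be resolved against the current page URL before it can be requested. Scrapy responses provide `response.urljoin()` for exactly this; a quick sketch of the resolution using the standard library function it mirrors:

```python
from urllib.parse import urljoin

# The "Next" link's href on quotes.toscrape.com is relative, e.g. /page/3/.
# Scrapy's response.urljoin() resolves it against the current page URL the
# same way the standard library's urljoin does:
page_url = "http://quotes.toscrape.com/page/2/"
next_href = "/page/3/"

absolute = urljoin(page_url, next_href)
print(absolute)  # http://quotes.toscrape.com/page/3/
```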
To use SplashRequest, you’ll have to make some changes to the current spider.
First, you’ll have to import the SplashRequest from the scrapy_splash library.
from scrapy_splash import SplashRequest
When the start_urls list is used, Scrapy issues plain scrapy.Request objects by default. So, you’ll have to replace it with a start_requests method that yields SplashRequest objects instead.
def start_requests(self):
    url = 'https://quotes.toscrape.com/'
    yield SplashRequest(url, self.parse, args={'wait': 1})
Last but not least, you’ll need to update the parse method to use SplashRequest as well. Once you make these changes, the code will look like the below:
import scrapy
from scrapy_splash import SplashRequest

from splashscraper.items import SplashscraperItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']

    def start_requests(self):
        url = 'https://quotes.toscrape.com/'
        yield SplashRequest(url, self.parse, args={'wait': 1})

    def parse(self, response):
        for quote in response.css("div.quote"):
            text = quote.css("span.text::text").extract_first("")
            author = quote.css("small.author::text").extract_first("")
            tags = quote.css("meta.keywords::attr(content)").extract_first("")

            item = SplashscraperItem()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

        next_url = response.css("li.next>a::attr(href)").extract_first("")
        if next_url:
            yield SplashRequest(response.urljoin(next_url), self.parse, args={'wait': 1})
Splash responses have all the properties and attributes of a standard Scrapy Response, so you don’t have to make any changes to the parse method. You can now run the spider with `scrapy crawl quotes` from the project directory.
Scrapy Splash returns various Response subclasses for Splash requests depending on the request type, for example:
SplashResponse - returned for binary Splash responses that contain media files, e.g., image, video, audio, etc.;
SplashTextResponse - returned when the result is text;
SplashJsonResponse - returned when the result is a JSON object.
You can use Scrapy’s built-in Selector classes to parse Splash responses. In the last example, you used the response.css() method with CSS selectors to extract the desired data.
text = quote.css("span.text::text").extract_first("")
author = quote.css("small.author::text").extract_first("")
tags = quote.css("meta.keywords::attr(content)").extract_first("")
Note the `::text` pseudo-element inside the CSS selectors: it tells Scrapy to extract the text content of the element. Similarly, `::attr(content)` denotes the content attribute of the meta tag.
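To make that mapping concrete, here’s a rough standard-library-only sketch of what those selectors pull out of one quote block. The markup below is a simplified, hand-written stand-in for the real page (assumed to be well-formed so `xml.etree` can parse it); in the spider itself, Scrapy’s selectors do this work:

```python
import xml.etree.ElementTree as ET

# A simplified, well-formed version of one quote block from quotes.toscrape.com.
QUOTE_HTML = """
<div class="quote">
    <span class="text">A quote about the world and thinking.</span>
    <small class="author">Albert Einstein</small>
    <meta class="keywords" content="change,deep-thoughts,thinking,world" />
</div>
"""

root = ET.fromstring(QUOTE_HTML)

# span.text::text -> the text content of the span element
text = root.find("span[@class='text']").text
# small.author::text -> the text content of the small element
author = root.find("small[@class='author']").text
# meta.keywords::attr(content) -> the value of the content attribute
tags = root.find("meta[@class='keywords']").get("content")

print(author)  # Albert Einstein
print(tags)    # change,deep-thoughts,thinking,world
```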
Scrapy Splash is an excellent tool for extracting content from dynamic websites in bulk. It enables Scrapy to leverage the Splash headless browser and scrape websites with JavaScript rendering at ease.
If you enjoyed this blog post, make sure to check out our post on dynamic web scraping with Python and Selenium.
Scrapy Splash is an interface to the lightweight headless browser Splash. Meanwhile, Selenium is a web testing and automation framework. Scrapy Splash doesn’t require a third-party browser since it uses the Splash browser instance, but Selenium relies on third-party web drivers, such as ChromeDriver and GeckoDriver, to control a real browser.
If you’re curious to learn more about the differences between Scrapy and Selenium, see our detailed Scrapy vs. Selenium comparison.
Beautiful Soup is a parsing library. Beautiful Soup can parse HTML pages using various parsers such as html.parser, lxml, etc. However, it can’t make any network requests, so it depends on network libraries such as requests, httpx, etc. Similarly, Scrapy Splash can interact with the Splash browser using the API but can not parse the content of the response.
Just like Selenium, Playwright is also primarily a testing framework that emphasizes software testing and automation. It’s a great tool for website interaction and human browsing simulation.
Because it drives full third-party browser engines, Playwright is somewhat more resource-heavy than Scrapy Splash. On the flip side, those real browser engines can make it stealthier than Scrapy Splash and better at bypassing complex anti-bot measures, making web scraping easier.
About the author
Roberta Aukstikalnyte
Senior Content Manager
Roberta Aukstikalnyte is a Senior Content Manager at Oxylabs. Having worked various jobs in the tech industry, she especially enjoys finding ways to express complex ideas in simple ways through content. In her free time, Roberta unwinds by reading Ottessa Moshfegh's novels, going to boxing classes, and playing around with makeup.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.