
Web Scraping with Selenium and Python

Gabija Fatenaite

2023-08-10 · 6 min read

To understand the fundamentals of data scraping with Python and what web scraping is in general, it's important to learn how to leverage different frameworks and request libraries. Understanding the main HTTP methods (primarily GET and POST) also makes web scraping a lot easier.

For instance, Selenium is one of the better-known and often-used tools that help automate web browser interactions. By using it together with other technologies (e.g., Beautiful Soup), you can get a better grasp of web scraping basics.

How does Selenium work? It automates the processes described in your script, driving a browser to perform repetitive tasks such as clicking and scrolling. As described on Selenium's official web page, it's “primarily for automating web applications for testing purposes, but is certainly not limited to just that.”

In this guide on how to web scrape with Selenium, we'll be using Python 3.x as our main language, as it's not only the most common scraping language but also the one we work with most closely.

Setting up Selenium 

Firstly, to download the Selenium WebDriver package, execute this pip command in your terminal:

pip install selenium 

Before moving on, make sure you have a preferred web browser installed on your device, as it's one of the essential components of Selenium. In this article, we'll use the Chrome browser. Selenium will automatically install the corresponding browser driver, which enables Python to control the browser at the operating-system level.

However, if you're having trouble during installation, you can also download the drivers for Chrome, Firefox, Edge, and other browsers manually. Once downloaded, the driver can be made accessible to Python by adding it to the PATH environment variable.
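If you'd prefer to point Selenium at a manually downloaded driver instead of relying on automatic driver management, you can pass the driver's location explicitly through a Service object. Here's a minimal sketch, assuming a hypothetical path to the ChromeDriver executable:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Hypothetical path to a manually downloaded ChromeDriver executable
service = Service(executable_path="/path/to/chromedriver")
driver = webdriver.Chrome(service=service)
driver.get("https://sandbox.oxylabs.io/products")
driver.quit()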

Quick starting Selenium

Let's begin the automation process by starting up your browser:

  • Open up a new browser window (in this instance, Chrome) 

  • Load the web page of your choice (our provided URL)

from selenium import webdriver
browser = webdriver.Chrome()
browser.get("http://oxylabs.io/")

This will launch Chrome in headful mode. To run the browser in headless mode instead (for example, on a server), the code can be written like this:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # the options.headless attribute is deprecated in newer Selenium versions
options.add_argument("--window-size=1920,1200")

driver = webdriver.Chrome(options=options)
driver.get("https://www.oxylabs.io/")
print(driver.page_source)
driver.quit()

Data extraction with Selenium by locating elements

find_element 

Locating elements in web pages can be tricky. Thankfully, Selenium provides two methods that you can use to extract data from one or multiple elements. These are:

  • find_element

  • find_elements

As an example, let’s locate the h1 tag on a dummy E-Commerce website, https://sandbox.oxylabs.io/products. Selenium offers various methods to select HTML elements, such as by tag names, class names, XPath expressions, CSS selectors, and others:

from selenium.webdriver.common.by import By

# Navigate to the sandbox website before locating elements
driver.get("https://sandbox.oxylabs.io/products")

h1_tag = driver.find_element(By.TAG_NAME, "h1").text
h1_class = driver.find_element(By.CLASS_NAME, "css-1128ess").text
h1_xpath = driver.find_element(By.XPATH, "//h1").text
h1_css = driver.find_element(By.CSS_SELECTOR, "h1").text

print('\n'.join([h1_tag, h1_class, h1_xpath, h1_css]))

You can also use find_elements (the plural form) to find and return a list of all matching elements. For example:

all_h4 = driver.find_elements(By.TAG_NAME, "h4")
for h4 in all_h4:
    print(h4.text)

This way, you’ll get all product titles that are under the h4 tag on the page.  

However, some elements aren’t easily accessible with an ID or a simple class. This is why you’ll need XPath.

XPath

XPath is a query language that helps locate a specific node in the DOM. An XPath expression finds the node from the root element either through an absolute path or by using a relative path, e.g.:

  • / : Selects a node from the root. /html/body/div[1] will find the first div

  • // : Selects nodes anywhere in the document, no matter where they are. //form[1] will find the first form element

  • [@attributename='value'] : a predicate. It looks for a node with a specific attribute or attribute value.

Example:

//input[@name='email'] will find the first input element with the name "email".

<html> 
 <body> 
   <div class="content-login"> 
     <form id="loginForm"> 
         <div> 
            <input type="text" name="email" value="Email Address:"> 
            <input type="password" name="password" value="Password:"> 
         </div> 
        <button type="submit">Submit</button> 
     </form> 
   </div> 
 </body> 
</html>

WebElement

A WebElement in Selenium represents an HTML element on a web page. Here are the most commonly used actions, followed by a short example:

  • element.text – access the element's text;

  • element.click() – click the element;

  • element.get_attribute('class') – access a specific attribute;

  • element.send_keys('mypassword') – send text to an input.
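Putting these actions together, here's a minimal sketch that fills out the login form from the HTML example above. The URL is a placeholder, and the field names follow the sample markup:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/login")  # placeholder URL for the sample form

# Locate the inputs using the XPath predicates shown earlier
email_input = driver.find_element(By.XPATH, "//input[@name='email']")
password_input = driver.find_element(By.XPATH, "//input[@name='password']")

email_input.send_keys("user@example.com")  # send text to an input
password_input.send_keys("mypassword")
print(email_input.get_attribute("name"))  # access a specific attribute

driver.find_element(By.XPATH, "//button[@type='submit']").click()  # click the element
driver.quit()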

Slow website render solutions

Some websites use a lot of JavaScript and AJAX calls to render dynamic web page content, which can make them tricky to deal with. There are a few ways to handle this:

  • time.sleep(ARBITRARY_TIME)

  • WebDriverWait()

Example:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "h1"))
    )
    print(element.text)
finally:
    driver.quit()

This waits up to 10 seconds for the element to appear before raising a timeout error. To dig deeper into this topic, go ahead and check out the official Selenium documentation.

Executing JavaScript with Selenium

To execute JavaScript, we can use the execute_script method of the WebDriver instance. We can pass the JavaScript code as a string argument to the method, as shown below:

import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://sandbox.oxylabs.io/products")
driver.execute_script('alert("Hello World")')
time.sleep(5)

In the above code, we initiate a WebDriver instance of a Chrome browser. Then, we navigate to our desired website. Once the website loads, we use the execute_script method to run a simple JavaScript snippet that shows an alert box with the text “Hello World” on the page.

The execute_script method also accepts additional arguments passed to the JavaScript. So, for example, if we want to click a button using JavaScript, we can do it with the following code snippet:

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://sandbox.oxylabs.io/products")
button = driver.find_element(By.CSS_SELECTOR, '[aria-label="Next page"]')
driver.execute_script("arguments[0].click();", button)
time.sleep(5)

As you can see, we're simply grabbing the button element using a CSS selector and then passing it to execute_script, which uses JavaScript to click the button. Note that we're using arguments[0] inside the JavaScript to reference the first argument passed to execute_script.

Capture Screenshots using Selenium

Selenium WebDriver also provides an option to capture screenshots of websites. These screenshots can be stored locally for later inspection. For example:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://sandbox.oxylabs.io/products")
driver.save_screenshot("screenshot.png")
driver.close()

We use the save_screenshot() method to take a screenshot of the website. We also pass the argument "screenshot.png" to name the image file that'll be saved in the current folder. Selenium will automatically save this image in the PNG format based on the file extension used.

Scrape Multiple URLs using Selenium

We can leverage Selenium to scrape multiple URLs with Python. This way, we can use the same WebDriver instance to browse multiple websites or web pages and gather data in one go. Let’s take a look at the following example:

urls = ["https://sandbox.oxylabs.io/products?page={}".format(i) for i in range(1, 11)]
for url in urls:
    driver.get(url)
    # do something

We want to browse the first ten pages of the website, so we use Python's list comprehension to create a list of 10 URLs. After creating the list, we simply iterate over it in a for loop and use Selenium to navigate to each URL. A fuller sketch is shown below.
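For instance, a minimal sketch that visits all ten product pages and collects the h4 product titles from each (assuming the page structure matches the earlier examples) could look like this:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
urls = ["https://sandbox.oxylabs.io/products?page={}".format(i) for i in range(1, 11)]

titles = []
for url in urls:
    driver.get(url)
    # Collect every product title rendered under an h4 tag on this page
    titles.extend(h4.text for h4 in driver.find_elements(By.TAG_NAME, "h4"))

driver.quit()
print(len(titles), "titles collected")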

Scroll Down using Selenium

To scroll down a website using Selenium and Python, we can take advantage of Selenium's JavaScript support and use the execute_script method to run JavaScript code that scrolls the page. See the following example:

time.sleep(2)
driver.execute_script("window.scrollBy(0, 1000);")
time.sleep(3)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
time.sleep(3)

The scrollBy method takes two arguments: the horizontal and vertical offsets in pixels. So, in this example, we're first instructing Selenium to scroll the page 1,000 pixels down and later to scroll to the bottom of the page.
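If a page keeps loading new content as you scroll (lazy loading or infinite scroll), one common approach is to scroll to the bottom repeatedly and stop once the page height stops growing. Here's a minimal sketch of that idea, reusing the driver from the previous snippets:

import time

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
    time.sleep(2)  # give the page time to load new content
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # no new content appeared, so we've reached the bottom
    last_height = new_height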

Selenium vs. Puppeteer

The biggest reason for Selenium's popularity and complexity is that it supports writing tests in multiple programming languages. This includes C#, Groovy, Java, Perl, PHP, Python, Ruby, Scala, and even JavaScript. It supports multiple browsers, including Chrome, Firefox, Edge, Internet Explorer, Opera, and Safari. 

However, web scraping with Selenium is perhaps more complex than it needs to be. Remember that Selenium's real purpose is functional testing. For effective functional testing, it mimics what a human would do in a browser. Selenium thus needs three different components:

  • A driver for each browser

  • Installation of each browser

  • The package/library depending on the programming language used

In the case of Puppeteer, though, the Node package bundles Chromium, so no separate browser or driver is needed, which makes setup simpler. It also supports the Chrome browser if that's what you need.

On the other hand, multi-browser support is missing, and Firefox support is limited. Google announced Puppeteer for Firefox, but it was soon deprecated, and as of this writing, Firefox support remains experimental. So, to sum up, if you need a lightweight and fast headless browser to perform web scraping, Puppeteer would be the best choice. You can check our Puppeteer tutorial for more information.

Selenium vs. scraping tools

Selenium is great if you want to learn web scraping. We recommend using it together with Beautiful Soup, while also focusing on learning HTTP fundamentals: the methods the server and the browser use to exchange data, and how cookies and headers work. Another option is to use Selenium together with Scrapy for larger-scale projects that require dynamic rendering. To learn more, check out this blog post on Scrapy vs. Selenium.
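To illustrate that combination, you can let Selenium render the page and hand the resulting HTML to Beautiful Soup for parsing. A minimal sketch, assuming the beautifulsoup4 package is installed and the page structure matches the earlier examples:

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://sandbox.oxylabs.io/products")

# Hand the rendered HTML over to Beautiful Soup for parsing
soup = BeautifulSoup(driver.page_source, "html.parser")
for title in soup.find_all("h4"):
    print(title.get_text(strip=True))

driver.quit()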

However, if you're seeking easier data collection methods, there are various tools to help you out with this process. Depending on the scale of your scraping project and targets, implementing a web scraping tool will save you a lot of time and resources.

At Oxylabs, we provide a group of tools called Scraper APIs.

  • SERP Scraper API –  focuses on scraping SERP data from the major search engines.

  • E-Commerce Scraper API – focuses on e-commerce and allows you to receive structured data in JSON.

  • Real Estate Scraper API – designed for effortless data extraction from popular real estate websites.

  • Web Scraper API – allows you to carry out scraping projects on most websites and receive the results as HTML.

Our tools are also easy to integrate – here's an example in Python:

import requests
from pprint import pprint

# Structure payload.
payload = {
    'source': 'universal',
    'url': 'https://stackoverflow.com/questions/tagged/python',
    'user_agent_type': 'desktop',
}

# Get response.
response = requests.request(
    'POST',
    'https://realtime.oxylabs.io/v1/queries',
    auth=('user1', 'pass1'),
    json=payload,
)

# This will return the JSON response with results.
pprint(response.json())

More integration examples for other languages (shell, PHP, cURL) are available in the documentation, and you can learn how to use cURL with a proxy in our blog post.

The main benefits of Scraper APIs compared to Selenium are:

  • All web scraping processes are automated

  • No need for extra coding

  • Easily scalable 

  • Guaranteed 100% success rate, as you're only charged for successful requests

  • Has a built-in proxy rotation tool

Conclusion

Web scraping using Selenium is a great practice, especially when learning the basics. But, depending on your goals, it's sometimes easier to choose an already-built tool that does web scraping for you. Building your own scraper is a long and resource-costly procedure that might not be worth the time and effort.

Learn how to bypass CAPTCHA with Selenium and Python or deal with infinite scroll. To dive deeper into Scraper APIs and how to integrate them, you can check out our quick start guides for SERP Scraper API, E-Commerce Scraper API, Real Estate Scraper API, and Web Scraper API, or if you have any product-related questions, contact us at hello@oxylabs.io.

Frequently asked questions

What is Selenium?

Selenium is a set of three open-source tools: Selenium IDE, Selenium WebDriver, and Selenium Grid.

Selenium IDE is a browser automation software that allows you to record browser actions and play them back. You can use it for web testing or automation of routine tasks. 

Selenium WebDriver also allows you to control and automate actions on a web browser. However, it's designed to do so programmatically through the operating system. As a result, WebDriver is faster and can remotely control browsers for web testing.

Selenium Grid is a tool that allows web testing and browser automation through Selenium WebDriver to be run on multiple devices simultaneously, on different browser versions, and across various platforms.

What is Selenium used for?

Selenium is mainly used for browser automation and web testing. Selenium is an excellent tool for testing website and web application performance under various traffic loads, across different browsers, operating systems, and their versions. With such tools, website owners can provide an unhindered user experience.

While Selenium web scraping is a possible use case, it's still better suited for web automation and testing purposes.

How to use proxies with Selenium?

You can read this Selenium proxies article to use proxies with Selenium. You'll learn how to set up Selenium, authenticate proxies, test the connection, and how the full code should look in Python.
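For a quick idea of what this looks like, an unauthenticated proxy can be passed to Chrome through a command-line argument. A minimal sketch, where the proxy address is a placeholder (authenticated proxies typically require extra setup, as covered in the article above):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
# Placeholder proxy endpoint – replace with your own host and port
options.add_argument("--proxy-server=http://127.0.0.1:8080")

driver = webdriver.Chrome(options=options)
driver.get("https://sandbox.oxylabs.io/products")  # requests are now routed through the proxy
print(driver.title)
driver.quit()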

About the author

Gabija Fatenaite

Lead Product Marketing Manager

Gabija Fatenaite is a Lead Product Marketing Manager at Oxylabs. Having grown up on video games and the internet, she grew to find the tech side of things more and more interesting over the years. So if you ever find yourself wanting to learn more about proxies (or video games), feel free to contact her - she’ll be more than happy to answer you.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.

