
Web Scraping With PowerShell: The Ultimate Guide


Roberta Aukstikalnyte

2022-10-17 · 8 min read

PowerShell is a configuration management and task automation engine designed by Microsoft. It consists of a command-line shell and a scripting language with object-oriented support. Users, especially system administrators, can automate, configure, and manage their network-related tasks using this engine.

PowerShell Core is an advanced version of Windows PowerShell with open-source and cross-platform properties. Windows PowerShell is only compatible with Windows, whereas the Core version also works well with UNIX-compliant operating systems, including macOS and Linux.

PowerShell is often used in the data acquisition field. Today’s tutorial explains why PowerShell is a reliable engine for web scraping and goes through each step of using it for our data acquisition needs – let’s get started.

Can you scrape the web with PowerShell?

Web scraping refers to extracting and saving useful information from online sources, including web pages. It is the art of parsing the HTML contents to retrieve specific information.

Designing a good web scraping tool requires sufficient knowledge of HTML and the target website’s structure. A web scraping tool is reliable if it is robust to minor changes in the target web pages.

Python and Java offer several libraries for performing complex web scraping tasks. Libraries like AutoScraper are trivial to use, allowing an absolute beginner to perform highly robust web scraping tasks without any in-depth understanding of HTML or web page structure.

PowerShell provides two cmdlets for scraping HTML data from a target web page: Invoke-WebRequest and Invoke-RestMethod – both will be explained later in the article. However, one must have a sufficient background in HTML and regular expressions to design a robust and reliable web scraping tool.

If you’re a process automation engineer or a DevOps professional, chances are, you’d like everything automated with PowerShell scripts that can work in cross-platform contexts – that’s why the PowerShell engine is a great choice for web scraping.

How to scrape the web with PowerShell

This section provides a practical hands-on approach to web scraping with PowerShell scripts. We’ll learn how to scrape public URLs, relevant information, and images from web pages using Invoke-WebRequest and Invoke-RestMethod cmdlets. Moreover, we’ll also discover content parsing with simple regular expressions and PowerHTML. 

If you prefer, here's a link to the same tutorial on GitHub.

This tutorial takes Books to Scrape as a target for our web scraping tool. The target website features hundreds of books under 52 categories. The link to each category is available on the index page, as shown here:

Using Invoke-WebRequest

The Invoke-WebRequest cmdlet tells PowerShell to get a web page. It sends a request to a web page or service and receives a response including the contents, the HTTP status code, and metadata, just like any web browser would.

For instance, let’s look at a very basic use case where we invoke a web request to www.google.com.

Invoke-WebRequest 'www.google.com' 

The output of Invoke-WebRequest is an object that has StatusCode, StatusDescription, RawContent, Links, and all other metadata as its properties, as shown in the following output snippet:
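For instance, once the response is stored in a variable, these properties can be read directly – a quick sketch:

$response = Invoke-WebRequest 'https://www.google.com'
$response.StatusCode   # e.g., 200 on success
$response.Links.Count  # number of hyperlinks found on the page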

Scraping book category URLs

To scrape the links of all the categories from the target’s index page using Invoke-WebRequest, we can use the following script:

$scraped_links = (Invoke-WebRequest -Uri 'https://books.toscrape.com/').Links.Href | Sort-Object -Unique
$reg_expression = 'catalogue/category/books/.*'
$all_matches = ($scraped_links | Select-String $reg_expression -AllMatches).Matches
 
$urls = foreach ($url in $all_matches){
    $url.Value
}
$urls

Here, Invoke-WebRequest returns an object holding all the content of the target URL. The Links.Href property filters out all the hyperlink references in the contents. Then, we pipe them to Sort-Object -Unique to keep only the unique links (the similar Get-Unique cmdlet only removes adjacent duplicates, so it needs sorted input anyway). As a result, the $scraped_links object holds all the unique links present on the target URL.

To get links for categories only, we further parse $scraped_links with the regular expression stored in $reg_expression. As a result, the $all_matches object holds match objects for category links only.

Finally, we extract values for link objects from $all_matches and store them in the $urls list. Let’s see how it looks on the output console:

Let’s look at another example where Invoke-WebRequest is used to scrape all the URLs for images on the target web page.

(Invoke-WebRequest -Uri 'https://books.toscrape.com/').Images | Select-Object src 

Similar to the earlier example, we first invoke Invoke-WebRequest and take its Images property. The result is then piped to the Select-Object cmdlet, which fetches the source links.

The output of the above command is:
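As a follow-up, we could also download one of the scraped images to disk – a minimal sketch, assuming the relative src paths used on the target site and an output file name of our choosing:

$first_src = ((Invoke-WebRequest -Uri 'https://books.toscrape.com/').Images | Select-Object -First 1).src
Invoke-WebRequest -Uri ('https://books.toscrape.com/' + $first_src) -OutFile 'first_cover.jpg'  # save the image locally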

Scraping links isn’t the only use case for the Invoke-WebRequest method – we can certainly scrape contents and related data as well. However, for the sake of demonstration, the next subsection discusses Invoke-RestMethod for web scraping.

Using Invoke-RestMethod

Invoke-RestMethod is also used to send requests to web pages or web services, including web APIs. Like the Invoke-WebRequest cmdlet, it retrieves the HTML or content of the target URI. However, in contrast to Invoke-WebRequest, Invoke-RestMethod does not return the metadata section.

Invoke-RestMethod is particularly useful for requesting APIs where the response data is usually in JSON format. The Invoke-RestMethod method automatically parses the JSON responses into objects. 
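For instance, requesting a JSON endpoint yields a ready-made object instead of raw text – a minimal sketch, using a public placeholder API as a stand-in target:

$post = Invoke-RestMethod 'https://jsonplaceholder.typicode.com/posts/1'
$post.title  # the JSON fields arrive already parsed into object properties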

Let’s request google.com with Invoke-RestMethod and see what we get:

Invoke-RestMethod 'www.google.com'  

Response:

As expected, the output shows that the response of Invoke-RestMethod contains only the HTML content of the target URL.
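A quick way to see the difference between the two cmdlets is to compare their return types – a small sketch:

(Invoke-WebRequest 'https://www.google.com').GetType().Name  # BasicHtmlWebResponseObject (HtmlWebResponseObject in Windows PowerShell)
(Invoke-RestMethod 'https://www.google.com').GetType().Name  # String – just the content itself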

Scraping book information from a single webpage

As discussed at the start of this article, the Invoke-RestMethod cmdlet can also be used to scrape web pages. Now, let’s see it in action.

Assume that we want to scrape some specific information from a book page at the target bookstore’s website; Invoke-RestMethod can do it in the following way:

$book_html = Invoke-RestMethod 'https://books.toscrape.com/catalogue/libertarianism-for-beginners_982/index.html'
 
$reg_exp = '<li class="active".*>(?<name>.*)</li>(.|\n)*<th>UPC</th><td.*>(?<upc_id>.*)</td>(.|\n)*<th>Product Type</th><td.*>(?<product_type>.*)</td>(.|\n)*<th>Price.*</th><td.*>(?<price>.*)</td>(.|\n)* <th>Availability</th>(.|\n)*<td.*>(?<availability>.*)</td>'
 
$all_matches = ($book_html | Select-String $reg_exp -AllMatches).Matches
 
$BookDetails =[PSCustomObject]@{
  'Name' = ($all_matches.Groups.Where{$_.Name -like 'name'}).Value
  'UPC_id' = ($all_matches.Groups.Where{$_.Name -like 'upc_id'}).Value
  'Product Type' = ($all_matches.Groups.Where{$_.Name -like 'product_type'}).Value
  'Price' = ($all_matches.Groups.Where{$_.Name -like 'price'}).Value
  'Availability' = ($all_matches.Groups.Where{$_.Name -like 'availability'}).Value
}
$BookDetails 

The example above requests the target book page and stores the received HTML content in the $book_html object. Next, it creates a regular expression to parse the name, UPC id, product type, price, and availability information from the HTML content stored in $book_html. Let’s have a look at the image of the target page along with its source to understand the formulation of this regular expression:

Note that the book’s title is inside a <li> tag with an active class. Moreover, the product information is inside a <table> tag where each type of information (e.g., UPC, price, etc.) is in a <td> tag which is preceded by a relevant <th> tag. 

Keeping the above observations in mind, we designed the regular expression to match the <li> tag with an active class and capture everything enclosed in this tag to a name group. After that, the regular expression skips everything, including newlines, until it finds a <th> tag with UPC as its inner text. The inner text of the adjacent <td> tag is then captured in the upc_id group. Similarly, we follow the same pattern for the remaining product information.     
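To illustrate how named capture groups work in isolation, here is a minimal, self-contained sketch with a made-up HTML fragment:

$fragment = '<th>UPC</th><td>abc123</td>'  # made-up HTML fragment
$match = ($fragment | Select-String '<th>UPC</th><td.*>(?<upc_id>.*)</td>').Matches
($match.Groups.Where{$_.Name -eq 'upc_id'}).Value  # prints: abc123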

Sidenote: we should take the utmost care when designing a regular expression. A minor mistake can cause the web scraper to extract undesired information or even nothing at all. For example, in the case of our previous scraper script, missing a single symbol can cause the regular expression to match nothing, making the scraper fail to scrape anything. Therefore, it’s recommended to use an online regular expression tester, such as Regex Tester, to check the validity of the expression against the example page source.

Once the regular expression is applied to the received HTML content, the Select-String cmdlet, along with the -AllMatches flag, returns a MatchInfo object with detailed information about all the matching strings.

Finally, the resultant $all_matches MatchInfo object is converted into a PowerShell custom object, containing only the desired information. The output of the above script is as follows:

Let’s apply the above script to another book page URL and see the results – for instance, swapping in the page of A Light in the Attic (the same book we use later in this tutorial), while the rest of the script stays unchanged:

$book_html = Invoke-RestMethod 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'

Now that we know how to extract information from a specific web page, let’s see how scraping data from a specific category would work. 

Scraping all the books of a specific category

Say that we want the titles and prices of all the books in a specific category – this can also be achieved with the PowerShell engine. However, before looking at the script, we need to look at one of the category pages at the target bookstore (i.e., Books to Scrape) along with its source code.

The above snippet shows the web page for Sports and Games and the corresponding HTML source code. The web page has a total of five books; the price and title of each book are available on the page.

If we look closely at the web page's source, the full book title is provided as the title attribute of the <a> tag linking to the book's page. Moreover, the price is in a paragraph tag with a price_color class.

Now, having a sufficient understanding of the underlying page structure, we can introduce the script for our web scraper.

$category_page_html = Invoke-RestMethod 'https://books.toscrape.com/catalogue/category/books/sports-and-games_17/index.html'
 
$reg_exp = '<h3><a href=.* title=\"(?<title>.*)\">.*<\/a><\/h3>(\n.*){13}<p class="price_color">(?<price>.*)<\/p>'
 
$all_matches = ($category_page_html | Select-String $reg_exp -AllMatches).Matches
 
$BookList = foreach ($book in $all_matches)
{
    [PSCustomObject]@{
        'title' = ($book.Groups.Where{$_.Name -like 'title'}).Value
        'price' = ($book.Groups.Where{$_.Name -like 'price'}).Value
    }
}
$BookList

The above script first retrieves the HTML of the Sports and Games category’s web page. Then, it applies the regular expression (stored in $reg_exp) to the retrieved HTML to select all the matching strings. As the target page has only five books, $all_matches (a MatchInfo object) will have a length of 5 and will hold detailed information on all the matches along with the matched strings.

We don’t need the details associated with the matches; rather, we are concerned with the titles and prices of the books. So, the script creates a list of PowerShell custom objects, where each PSCustomObject holds just the title and price of a particular match.

The output of the above script is a list of five objects, each holding the title and price of one book.
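From here, a natural next step is persisting the results – a minimal sketch using Export-Csv, with a file name of our choosing:

$BookList | Export-Csv -Path 'sports_and_games_books.csv' -NoTypeInformation  # write the titles and prices to a CSV file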

Let's try using our scraper on another target by replacing the link in the Invoke-RestMethod call with a link to the Travel category, as shown below. As expected, the script will output the titles and prices of the books on the Travel category page.
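Only the first line of the script changes – assuming the Travel category keeps its current URL slug on the target site:

$category_page_html = Invoke-RestMethod 'https://books.toscrape.com/catalogue/category/books/travel_2/index.html'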

Parsing data with PowerHTML

Until now, we’ve been using regular expressions to extract the required information. Designing a robust regular expression to extract relevant strings is very tricky. Moreover, modifying a pre-written regular expression is always problematic due to its poor readability.

Thanks to PowerHTML, we have a more robust, readable, and maintainable way to make PowerShell parse HTML data. PowerHTML is a powerful wrapper over HtmlAgilityPack that supports XPath syntax to parse the HTML. It’s particularly useful in scenarios where the HTML Document Object Model (DOM) is unavailable, as in the case of content received in response to an Invoke-WebRequest call.

We can install the PowerHTML module using the following command:

Install-Module -Name PowerHTML 
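If admin rights are unavailable, or the module needs to be loaded explicitly in the current session, the following variants may help – a sketch using standard PowerShellGet parameters:

Install-Module -Name PowerHTML -Scope CurrentUser  # install for the current user only, no admin rights needed
Import-Module PowerHTML                            # load the module into the current session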

Scrape information from a book’s web page

Assume we want Product Information from a book web page, A Light in the Attic. The required information is inside the striped table, as depicted in the following snippet:

We can use the following PowerHTML-based script to retrieve the Product Information from this web page:

$web_page = Invoke-WebRequest 'https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html'
 
# Convert the raw HTML string into an HtmlAgilityPack node so that XPath can be used
$html = ConvertFrom-Html $web_page.Content
 
$BookDetails = [System.Collections.ArrayList]::new()
 
# The book title lives in the <li> tag carrying the "active" class
$name_of_book = $html.SelectNodes('//li') | Where-Object { $_.HasClass('active') }
$name = $name_of_book.ChildNodes[0].innerText
$n = New-Object -TypeName psobject
$n | Add-Member -MemberType NoteProperty -Name Name -Value $name
$BookDetails += $n
 
# The product information table carries the "table-striped" class
$table = $html.SelectNodes('//table') | Where-Object { $_.HasClass('table-striped') }
 
foreach ($row in $table.SelectNodes('tr'))
{
    $name = $row.SelectSingleNode('th').innerText.Trim()
    $value = $row.SelectSingleNode('td').innerText.Trim() -replace "\?", " "  # drop stray '?' characters left by mis-encoded symbols
    $new_obj = New-Object -TypeName psobject
    $new_obj | Add-Member -MemberType NoteProperty -Name $name -Value $value
    $BookDetails += $new_obj
}
 
Write-Output 'Extracted Table Information'
$table
 
Write-Output 'Extracted Book Details Parsed from HTML table'
$BookDetails

The above code first retrieves the HTML contents of the target web page using the Invoke-WebRequest method. Then, the HTML content is converted to an HtmlAgilityPack HtmlNode object using the ConvertFrom-Html command and stored in $html. This conversion allows us to use XPath syntax for further content parsing.

Afterward, the book title is parsed by selecting the <li> tag with the active class. This <li> tag has the book name as its inner text. We fetch this inner text, add it to a new PSObject under a Name property, and append it to the $BookDetails array list.

The product information is displayed in a <table> tag on the webpage. The below figure shows the product information table rendered by the browser on the left and the corresponding HTML source on the right.

To get the product information, we further parse the $html object and select the <table> tag using the $html.SelectNodes('//table') | Where-Object { $_.HasClass('table-striped') } command.  

Then, a foreach loop selects all the rows (<tr> tags) of the table and builds an object for each row, using the <th> tag’s innerText as the object’s property name and the <td> tag’s innerText as that property’s value.

Further, the loop also appends all the product information objects to the $BookDetails array list. The last line of the script displays this list.

The output of the above script is as follows:
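If needed, the collected details could then be exported, for example, to JSON – a small sketch with a file name of our choosing:

$BookDetails | ConvertTo-Json | Out-File 'book_details.json'  # serialize the parsed details to a JSON file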

Integrating proxies with PowerShell

Requesting a web page without using a proxy address has several risks associated with it. It can reveal our IP address, exposing our location information. Moreover, we may want to scrape some region-specific data that can only be accessed from IP addresses of a particular region. Luckily, using a proxy server can help in both cases. If you're dealing with especially difficult targets, we'd recommend choosing a Residential Proxy.

Both Invoke-WebRequest and Invoke-RestMethod support using a proxy. These cmdlets accept the -Proxy flag to provide the URI of the proxy. We can also pass the proxy credentials along with the proxy address using the -ProxyCredential flag.

For example, the following snippets showcase the use of proxy endpoints while requesting www.google.com.

Invoke-RestMethod 'http://www.google.com' -Proxy 'PROXY_ENDPOINT' 
Invoke-WebRequest 'http://www.google.com' -Proxy 'PROXY_ENDPOINT' 

PROXY_ENDPOINT refers to the URI of a proxy, which comprises a protocol, optional authentication information, an IP address or a hostname, and an optional port number (e.g., http://user:pass@127.0.0.1:8081 or https://127.0.0.1:8081).
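If the proxy requires authentication, a PSCredential object can be passed via -ProxyCredential – a minimal sketch with placeholder credentials and endpoint:

$password = ConvertTo-SecureString 'PROXY_PASSWORD' -AsPlainText -Force  # placeholder password
$credential = New-Object System.Management.Automation.PSCredential('PROXY_USERNAME', $password)
Invoke-WebRequest 'http://www.google.com' -Proxy 'PROXY_ENDPOINT' -ProxyCredential $credential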

Conclusion

PowerShell is a powerful cross-platform task automation tool that can also be used for public web data acquisition. We can scrape data using either the Invoke-WebRequest or the Invoke-RestMethod cmdlet, in conjunction with classic regular expressions or robust parsing tools like PowerHTML that process the data retrieved by these request cmdlets.

If you want to make web scraping simple and block-free, take a look at our advanced web intelligence solutions, such as Web Scraper API.

People also ask

Why does PowerShell Core offer cross-platform support, while Windows PowerShell doesn’t?

Windows PowerShell was built on the .NET Framework, which makes the end products Windows-specific. PowerShell Core, however, is based on .NET Core, a cross-platform application development framework. Therefore, the latter supports cross-platform compatibility.

What is a cmdlet in PowerShell?

Cmdlets are lightweight commands provided by PowerShell (e.g., Invoke-WebRequest, Invoke-RestMethod, Write-Output, etc.). On executing these cmdlets, the PowerShell runtime invokes the relevant APIs to process the commands.
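For instance, the cmdlets available in a session can be listed with Get-Command – a quick sketch:

Get-Command -CommandType Cmdlet -Name Invoke-*  # list all cmdlets whose names start with Invoke-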

About the author

Roberta Aukstikalnyte

Senior Content Manager

Roberta Aukstikalnyte is a Senior Content Manager at Oxylabs. Having worked various jobs in the tech industry, she especially enjoys finding ways to express complex ideas in simple ways through content. In her free time, Roberta unwinds by reading Ottessa Moshfegh's novels, going to boxing classes, and playing around with makeup.

