YouTube is one of the largest content-sharing platforms in the world, with more than 500 hours of content uploaded each minute. In November 2022, it even ranked as the second most visited website globally, with 74.8 billion monthly visits, according to Statista.
The sheer volume of public data and traffic on YouTube unlocks various research opportunities for businesses and individuals. Web scraping is the go-to method for extracting data from publicly available YouTube pages, such as video details, comments, channel information, and search results. In this guide, you’ll learn how to leverage Python, Oxylabs’ YouTube Scraper API, and Custom Parser to scrape YouTube videos and harness the potential of YouTube data.
First, install the latest version of Python, which you can download from the official Python website.
Next, run the following command in your terminal to install the necessary modules:
pip install yt-dlp requests
To use Oxylabs’ YouTube Scraper API, you’ll need an Oxylabs account. Head to the Oxylabs dashboard and sign up to create a new account. Once you do, you’ll get a one-week free trial together with your user credentials. You’ll need these credentials later to extract channel information, subscriber counts, and search results.
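The examples below pass credentials as a plain ("USERNAME", "PASSWORD") tuple for brevity. In practice, consider keeping them out of your source code — a minimal sketch using environment variables (the names OXYLABS_USERNAME and OXYLABS_PASSWORD are our own choice):

import os

# Read the API credentials from environment variables instead of
# hard-coding them; the variable names here are arbitrary
credentials = (os.environ["OXYLABS_USERNAME"], os.environ["OXYLABS_PASSWORD"])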
Please note that all information provided herein is for informational purposes only and does not grant you any rights with regard to the described data, videos, or images, which may be protected by copyright, intellectual property, or other rights. Before engaging in scraping activities, you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
Now, let’s download a YouTube video using yt-dlp, a popular library for the task. For this example, you can use this video as your target URL.
To download this video, you’ll first need to import the library. Then, use the download() method as shown below:
from yt_dlp import YoutubeDL

video_url = "https://www.youtube.com/watch?v=mDveiNIpqyw"
opts = dict()

with YoutubeDL(opts) as yt:
    # Downloads the video into the current working directory
    yt.download([video_url])
When you run this code, the script will download the video and store it in the current folder of your project.
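By default, yt-dlp derives the file name from the video metadata. If you want to control where the file goes and what it’s called, you can pass yt-dlp’s outtmpl option — a short sketch (the downloads folder is our own choice):

from yt_dlp import YoutubeDL

video_url = "https://www.youtube.com/watch?v=mDveiNIpqyw"

# Save the video to a "downloads" folder, named after its title
opts = {"outtmpl": "downloads/%(title)s.%(ext)s"}

with YoutubeDL(opts) as yt:
    yt.download([video_url])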
Scraping YouTube videos is also possible with the yt-dlp library. You can extract public video data like the title, video dimensions, and the language used.
Let’s extract video details from the video we’ve downloaded previously. For this task, you can use the extract_info() method with the download=False parameter so that it doesn’t download the video file again. This method will return a dictionary with all the video-related info:
from yt_dlp import YoutubeDL

video_url = "https://www.youtube.com/watch?v=mDveiNIpqyw"
opts = dict()

with YoutubeDL(opts) as yt:
    # Fetch metadata only, without downloading the video file again
    info = yt.extract_info(video_url, download=False)

video_title = info.get("title", "")
width = info.get("width", "")
height = info.get("height", "")
language = info.get("language", "")
print(video_url, video_title, width, height, language)
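The info dictionary holds far more fields than the four shown above. To explore everything yt-dlp extracts for a video, you can dump the whole dictionary to a JSON file — a short sketch using yt-dlp’s sanitize_info() helper, which makes the dictionary JSON-serializable:

import json
from yt_dlp import YoutubeDL

video_url = "https://www.youtube.com/watch?v=mDveiNIpqyw"

with YoutubeDL() as yt:
    info = yt.extract_info(video_url, download=False)
    # Write the full metadata dictionary to a file for inspection
    with open("video_info.json", "w") as f:
        json.dump(yt.sanitize_info(info), f, indent=2)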
Please note that all information provided herein is for informational purposes only and does not grant you any rights with regard to the described data, which may be protected by corresponding privacy rights or other rights. Before engaging in scraping activities, you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
To extract all the video comments, you’ll need to pass the additional getcomments option when initializing YoutubeDL.
Once you set getcomments to True, the extract_info() method will fetch all the comment threads along with the other information about the video, so you can extract just the comments from the info dictionary as shown below:
from yt_dlp import YoutubeDL
from pprint import pprint

video_url = "https://www.youtube.com/watch?v=mDveiNIpqyw"

# getcomments makes extract_info() fetch all comment threads as well
opts = {
    "getcomments": True,
}

with YoutubeDL(opts) as yt:
    info = yt.extract_info(video_url, download=False)

comments = info["comments"]
comment_count = info["comment_count"]

print(f"Number of comments: {comment_count}")
pprint(comments)
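Each entry in the comments list is itself a dictionary. As a rough sketch of how to pick out individual fields (key names such as "author", "text", and "parent" reflect current yt-dlp output and may change between versions):

from yt_dlp import YoutubeDL

video_url = "https://www.youtube.com/watch?v=mDveiNIpqyw"
opts = {"getcomments": True}

with YoutubeDL(opts) as yt:
    info = yt.extract_info(video_url, download=False)

# Print the author and text of each top-level comment; replies carry
# the parent comment's ID instead of "root"
for comment in info.get("comments") or []:
    if comment.get("parent") == "root":
        print(comment.get("author"), "-", comment.get("text"))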
For this example, let’s use the Oxylabs channel's “About” section to extract the channel name and description. Here, you’ll have to use your YouTube Scraper API credentials to authenticate with the API.
The first step is to find the necessary XPath selectors to extract the channel name and description. If you want to use CSS selectors, visit our Custom Parser documentation for more information.
So, open the “About” page in a web browser and use the Developer Tools to inspect elements. You can simply press CTRL + SHIFT + I on Windows or Option + Command + I on macOS to open the Developer Tools:
By inspecting the elements, you can easily construct the relative XPath selector using the IDs associated with the elements. Thus, the XPath selectors are:
Channel name XPath
//ytd-channel-name[@id="channel-name"]/div/div/yt-formatted-string[@id="text"]
Description XPath
//yt-formatted-string[@id="description"]
Now, using the XPath selectors, you can prepare the parsing instructions for YouTube Scraper API. It’s a dictionary that lists all the functions to execute when parsing the data from the HTML content. Let’s begin by importing the requests module and defining the variable instructions that'll contain the parsing instructions:
import requests

url = "https://www.youtube.com/@oxylabs/about"

instructions = {
    "Channel Name": {
        "_fns": [{
            "_fn": "xpath_one",
            "_args": ['//ytd-channel-name[@id="channel-name"]/div/div/yt-formatted-string[@id="text"]/text()']
        }]
    },
    "Description": {
        "_fns": [{
            "_fn": "xpath_one",
            "_args": ['//yt-formatted-string[@id="description"]/text()']
        }]
    }
}
Note the xpath_one function, which tells the API to select only the first matched element when parsing.
Create a new variable payload that'll contain the scraping parameters and parsing instructions that you’ll send to the API:
payload = {
    "source": "universal",
    "render": "html",
    "parse": "true",
    "parsing_instructions": instructions,
    "url": url,
}
The render parameter is set to html, so the API will execute JavaScript to render all dynamic content. parse is also set to true to tell the API that the payload includes parsing_instructions.
To POST the payload to the API, you’ll have to use the credentials that you’ve obtained from the Oxylabs dashboard:
credentials = ("USERNAME", "PASSWORD")

response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=credentials,
    json=payload,
)
print(response.status_code)
Replace USERNAME and PASSWORD with your credentials and run the code. If everything works as expected, you’ll get a status code of 200.
YouTube Scraper API sends a JSON response from which you can extract the parsed channel name and description, as showcased below:
channel_name = response.json()["results"][0]["content"]["Channel Name"]
description = response.json()["results"][0]["content"]["Description"]
print(channel_name)
print(description)
Here’s the complete code:
import requests

url = "https://www.youtube.com/@oxylabs/about"

instructions = {
    "Channel Name": {
        "_fns": [{
            "_fn": "xpath_one",
            "_args": ['//ytd-channel-name[@id="channel-name"]/div/div/yt-formatted-string[@id="text"]/text()']
        }]
    },
    "Description": {
        "_fns": [{
            "_fn": "xpath_one",
            "_args": ['//yt-formatted-string[@id="description"]/text()']
        }]
    }
}

payload = {
    "source": "universal",
    "render": "html",
    "parse": "true",
    "parsing_instructions": instructions,
    "url": url,
}

credentials = ("USERNAME", "PASSWORD")

response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=credentials,
    json=payload,
)
print(response.status_code)

channel_name = response.json()["results"][0]["content"]["Channel Name"]
description = response.json()["results"][0]["content"]["Description"]
print(channel_name)
print(description)
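In practice, you may also want to guard against failed requests before indexing into the JSON. A minimal sketch, reusing the payload and credentials defined above (the timeout value is an arbitrary choice):

# Continuing from the script above: fail early on non-2xx responses
response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=credentials,
    json=payload,
    timeout=180,  # rendered pages can take a while to process
)
response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx codes
content = response.json()["results"][0]["content"]
print(content)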
You can extract the subscriber count of a YouTube channel using the same approach. Let’s again use the Oxylabs channel’s “About” page:
By inspecting the elements with Developer Tools, you can see the element has the ID subscriber-count, so building the XPath is straightforward: //*[@id="subscriber-count"]. With this information, you can create parsing instructions as follows:
instructions = {
    "subscribers": {
        "_fns": [{
            "_fn": "xpath_one",
            "_args": ['//*[@id="subscriber-count"]/text()'],
        }]
    },
}
And, just like before, the xpath_one function picks only the first match. The rest of the code is almost the same. Here’s the full source code:
import requests

url = "https://www.youtube.com/@oxylabs/about"

instructions = {
    "subscribers": {
        "_fns": [{
            "_fn": "xpath_one",
            "_args": ['//*[@id="subscriber-count"]/text()'],
        }]
    },
}

payload = {
    "source": "universal",
    "render": "html",
    "parse": "true",
    "parsing_instructions": instructions,
    "url": url,
}

credentials = ("USERNAME", "PASSWORD")

response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=credentials,
    json=payload,
)
print(response.status_code)

subscribers = response.json()["results"][0]["content"]["subscribers"]
print(subscribers)
As the data arrives in a JSON response, you can extract the parsed subscriber count from it and print it as output.
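Note that the subscriber count comes back as display text rather than a number (for example, something like "21.5K subscribers"; this exact format is our assumption and may vary by locale). If you need an integer, a rough conversion sketch might look like this:

def parse_subscriber_count(text):
    # Assumes English abbreviations like "21.5K subscribers"; this format
    # is an assumption, not something the API guarantees
    multipliers = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}
    value = text.split()[0]
    if value and value[-1].upper() in multipliers:
        return int(float(value[:-1]) * multipliers[value[-1].upper()])
    return int(value.replace(",", ""))

print(parse_subscriber_count("21.5K subscribers"))  # 21500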
You can also use YouTube Scraper API to scrape public data from search results.
To scrape the video titles and links of every search result, first find the related XPath selectors, and then modify the instructions as shown below:
instructions = {
    "titles": {
        "_fns": [{
            "_fn": "xpath",
            "_args": ['//*[@id="video-title"]/yt-formatted-string/text()']
        }]
    },
    "links": {
        "_fns": [{
            "_fn": "xpath",
            "_args": ['//*[@id="video-title"]/@href']
        }]
    }
}
In this instance, we’re using xpath instead of xpath_one because there are multiple search results, and we want to extract all of them. The complete code for scraping the search page looks like this:
import requests

url = "https://www.youtube.com/results?search_query=oxylabs"

instructions = {
    "titles": {
        "_fns": [{
            "_fn": "xpath",
            "_args": ['//*[@id="video-title"]/yt-formatted-string/text()']
        }]
    },
    "links": {
        "_fns": [{
            "_fn": "xpath",
            "_args": ['//*[@id="video-title"]/@href']
        }]
    }
}

payload = {
    "source": "universal",
    "render": "html",
    "parse": "true",
    "parsing_instructions": instructions,
    "url": url,
}

credentials = ("USERNAME", "PASSWORD")

response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=credentials,
    json=payload,
)
print(response.status_code)

titles = response.json()["results"][0]["content"]["titles"]
links = response.json()["results"][0]["content"]["links"]

base_url = "https://www.youtube.com"
for title, link in zip(titles, links):
    full_url = f"{base_url}{link}"
    print(title, full_url)
Since both the titles and links variables are Python lists, you can simply use the zip() function to pair each title with its link.
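If you’d like a quick way to persist these pairs, you could write them to a CSV file with Python’s built-in csv module — a minimal sketch that continues from the script above:

import csv

# titles and links come from the search results script above
with open("search_results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "url"])
    for title, link in zip(titles, links):
        writer.writerow([title, f"https://www.youtube.com{link}"])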
Feel free to expand the source code with additional functionality and adjust the target URLs for your YouTube data needs. If you want to store your scraped public data in a CSV or Excel file, check out this in-depth Python web scraping guide for more details. Additionally, visit our API documentation to find more information about the payload parameters and other code examples.
In case you prefer visual tutorials, take a look at this extensive playlist of Oxylabs’ video guides to get an even easier head start into web scraping.
Need to collect data from other sources? See these detailed guides on how to scrape Google Search Results, Bing Search Results, Google News, Google Shopping, as well as Amazon data.
The legality of web scraping YouTube videos depends on what data you gather and how you use it. It’s important to follow all the regulations and laws that govern online data, including privacy laws and copyright. In addition, it’s always best to seek professional legal advice before engaging in scraping activities.
It’s also recommended to adhere to the website’s terms of use and follow web scraping best practices. To better understand this topic, we recommend reading this article about the legal frameworks behind web scraping.
YouTube may block suspicious requests coming from web scrapers. It uses various anti-scraping measures and constantly monitors incoming web requests for any indication of bot-like behavior.
If you want to learn more about web scraping and bot detection systems, check out this great article on 13 tips for block-free scraping and hear about the bypassing methods from our scraping expert in this free webinar.
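As a simple mitigation for transient failures on your side, you can wrap requests in a retry loop with exponential backoff — a rough sketch (retry counts and delays are arbitrary choices, not official recommendations):

import time
import requests

def post_with_retries(url, retries=3, **kwargs):
    # Retry a POST request with exponential backoff (1 s, 2 s, 4 s, ...)
    for attempt in range(retries):
        try:
            response = requests.post(url, **kwargs)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            time.sleep(2 ** attempt)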