Have you considered creating a sitemap manually? Hopefully not, as it would quickly become tedious, error-prone, and unsustainable. Instead, it’s best to use automated tools, such as web crawlers, to generate sitemaps.
Writing a crawler from scratch undoubtedly requires a certain level of technical expertise and computing resources. Fortunately, Oxylabs’ Web Crawler can make this task easier without straining your resources.
This step-by-step guide demonstrates how to create a sitemap using Oxylabs’ Web Crawler and then upload it to Google Search Console.
Let’s dive in!
A sitemap is a resource that lists information about the contents of your website – from videos to text – and the relationships among them. But why do we need sitemaps in the first place? Well, it depends on the type of sitemap – XML or HTML – we’re talking about, as both have different purposes.
An XML sitemap, which is written in a machine-readable format, is the most relevant for search engines. It lists the website’s URLs along with the last modified date, change frequency, and priority of each page. This information helps search engine crawlers index your website more efficiently.
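For illustration, a minimal XML sitemap with a single URL entry might look roughly like this (the values shown are placeholders, not taken from a real site):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>https://example.com/</loc>
        <lastmod>2023-01-01</lastmod>
        <changefreq>weekly</changefreq>
        <priority>0.8</priority>
    </url>
</urlset>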
This also means that not everything should be included in your sitemap – only the most important pages and the ones you want to be discoverable via search engines. There’s no need to include page duplicates, admin pages, thank-you pages, or similar.
An HTML sitemap, or a visual sitemap, is meant to ease navigation for the users of a website. HTML sitemaps consist of formatted links, usually placed at the bottom of the page. A visual sitemap also presents the site’s structure and hierarchy so that users can find what they need more easily. As such, an HTML sitemap isn’t all that useful for robots.
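As a rough, purely illustrative sketch, an HTML sitemap is often just a structured list of links in the page footer:

<ul>
    <li><a href="/about">About us</a></li>
    <li><a href="/products">Products</a></li>
    <li><a href="/blog">Blog</a></li>
</ul>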
There are numerous ways to create a sitemap, but you should first figure out whether you need one at all. If your site is relatively small and properly interlinked (which should be the case anyway), search engines should be able to crawl and index it without a sitemap. However, there are some cases where you might need to create one. For example:
Large sites
What counts as a 'large site' varies by search engine – Google, for instance, considers sites with 500 pages or fewer to be small. Maintaining proper interlinking across all pages of a large site can be quite challenging, so a sitemap proves invaluable by helping search engines crawl and index the website effectively.
New sites
If your site is new, there’s a good chance that few external links lead to your website, making it difficult for search engines to discover and index it. In cases like these, submitting a sitemap to Google Search Console might help.
Sites containing rich media
You might need to create a sitemap when your website has rich media (video, images, news articles) that you want to appear in search results. Indeed, a sitemap helps Googlebot find and understand the rich media files on your website.
If you decide that your website does need an XML sitemap after all, there are several approaches you can consider. For those using a content management system (CMS), chances are the platform can generate a sitemap automatically. If not, you can also use a third-party XML sitemap generator. Finally, you can also employ a web crawler to perform the job.
A web crawler is a tool that allows you to select the required website content, crawl all the URLs, or index all the web pages on your website. A web crawler isn’t used for generating sitemaps exclusively, but it certainly does the job well. In fact, the Web Crawler feature that comes with Oxylabs’ Scraper APIs can deliver three types of data output: a URL list (sitemap), parsed results, or HTML files. These outputs can be applied to a variety of use cases, from gathering competitor intelligence to cybersecurity.
To follow this tutorial as smoothly as possible, make sure Python 3.7 or later is installed on your system. For demonstration, we’ll generate a sitemap for oxylabs.io.
1. Start by installing the requests module using the following command:
pip install requests
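The snippets in the following steps also assume that the required modules are imported and that your Oxylabs API credentials are defined. A minimal setup along these lines (with placeholder credentials) is enough:

import json
import time

import requests
from requests.auth import HTTPBasicAuth

# Your Oxylabs API credentials go here.
USERNAME = "Username"
PASSWORD = "Password"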
2. Then, create a new payload for the request. You can set multiple parameters depending on your requirements (see complete documentation).
payload = {
    "url": "https://oxylabs.io/",
    "filters": {"crawl": [".*"], "process": [".*"], "max_depth": 10},
    "scrape_params": {"source": "universal", "user_agent_type": "desktop"},
    "output": {"type_": "sitemap"},
}
The url parameter sets the base URL of the website for which you want to get URLs. The filters parameter controls the scope and extent of the web crawling job – for an in-depth explanation, check out the Web Crawler documentation on filters. The scrape_params object modifies the scraping tasks; for example, you can specify the region for proxies or enable JavaScript rendering when crawling a website. Finally, the output object describes the output type – since we need a sitemap, type_ must be set to sitemap.
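As an illustrative sketch of those options, a payload that routes requests through a specific proxy location and enables JavaScript rendering could look like the one below. Note that geo_location and render are assumed parameter names here – verify the exact names and accepted values in the Scraper APIs documentation. The rest of this tutorial uses the simpler payload above.

# Hypothetical variation – confirm supported scrape_params values in the documentation.
payload = {
    "url": "https://oxylabs.io/",
    "filters": {"crawl": [".*"], "process": [".*"], "max_depth": 10},
    "scrape_params": {
        "source": "universal",
        "user_agent_type": "desktop",
        "geo_location": "United States",  # assumed name for the proxy location parameter
        "render": "html",  # assumed name/value for enabling JavaScript rendering
    },
    "output": {"type_": "sitemap"},
}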
3. Once the payload is ready, send a POST request to the API endpoint to create a new crawling job.
headers = {"Content-Type": "application/json"}

response = requests.request(
    "POST",
    "https://ect.oxylabs.io/v1/jobs",
    auth=(USERNAME, PASSWORD),
    headers=headers,
    json=payload,
)
json_resp = response.json()
The job may take some time to complete. You can periodically check its status by requesting the job information; once the status is done, you can retrieve the sitemap.
4. You can get the job-related information by sending a GET request to the https://ect.oxylabs.io/v1/jobs/{id} endpoint. Make sure to replace {id} with your actual job ID.
print("Waiting for Sitemap to create.. This may take a while..")
STATUS = False
while not STATUS:
url = "https://ect.oxylabs.io/v1/jobs/" + json_resp["id"]
time.sleep(10)
info = requests.request(
"GET",
url,
auth=(USERNAME, PASSWORD), # Your credentials go here.
)
information = info.json()
if len(information["events"]) == 0:
continue
for event in information["events"]:
if event["event"] == "job_results_aggregated" and event["status"] == "done":
STATUS = True
In each while loop iteration, the information variable stores the latest response in JSON format. The events key in this response holds status information about the job’s crawling and aggregation stages. When the job aggregation is finished, the response will contain an event named job_results_aggregated with the status done. At that point, we set the STATUS flag to True, which exits the loop.
5. Once the crawling job finishes, send a GET request to https://ect.oxylabs.io/v1/jobs/{id}/aggregate to get the aggregated results. Again, make sure to replace {id} with your job ID.
url = "https://ect.oxylabs.io/v1/jobs/" + json_resp["id"] + "/aggregate"
sitemap = requests.request(
"GET",
url,
auth=(USERNAME, PASSWORD),
)
urls = sitemap.json()
After executing the above script, the urls variable contains information about the aggregated result chunks. Depending on the chosen output type, these chunks hold one of the following:
A list of URLs (if the output type is “sitemap”)
A file containing all the parsed results (if the output type is “parsed”)
A file containing all the HTML results (if the output type is “HTML”)
Since we have selected the sitemap output type, the aggregated result will be a list of URLs.
6. In the next step, we will loop through these chunks. For each chunk, you must request the relevant chunk endpoint and parse the response to get the URLs.
no_of_chunks = urls["chunks"]
site_urls = []

for n in range(0, no_of_chunks):
    chunk_url = urls["chunk_urls"][n]["href"]
    chunk_data = requests.get(
        chunk_url,
        auth=HTTPBasicAuth(USERNAME, PASSWORD),
    )
The request to chunk_url retrieves the data of that particular chunk. The response is in JSON Lines format (as raw bytes), with information about a single URL encoded in each line. Here is what the response in chunk_data looks like:
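As a simplified sketch, the decoded content follows the JSON Lines convention of one JSON object per line; the actual response may include more fields per URL than shown here:

{"url": "https://oxylabs.io/"}
{"url": "https://oxylabs.io/products"}
{"url": "https://oxylabs.io/blog"}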
For each chunk, decode the JSONL byte response and parse each line as JSON. Then, extract the URL from every line and append it to the site_urls list. Here is how you would do this:
    # Still inside the chunk loop from the previous step.
    lines = chunk_data.content.decode('utf-8').split("\n")
    json_objects = [json.loads(line) for line in lines if line]
    for obj in json_objects:
        site_urls.append(obj['url'])
7. In the next step, you can create an Extensible Markup Language (XML) file from the obtained list of links. The following function generates such an XML file from the URLs:
def generate_sitemap(site_urls):
    sitemap_template = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{}
</urlset>"""
    url_template = """
    <url>
        <loc>{}</loc>
    </url>"""
    url_elements = [url_template.format(site_url) for site_url in site_urls]
    sitemap_content = sitemap_template.format("\n".join(url_elements))
    with open("sitemap.xml", "w") as file:
        file.write(sitemap_content)
8. Now, calling this generate_sitemap() function will create a sitemap.xml file in your current working directory.
generate_sitemap(site_urls)
Here is the complete code:
# Import the required modules
import json
import requests
from requests.auth import HTTPBasicAuth
import time


# Function to generate the sitemap XML file
def generate_sitemap(site_urls):
    sitemap_template = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
{}
</urlset>"""
    url_template = """
    <url>
        <loc>{}</loc>
    </url>"""
    url_elements = [url_template.format(site_url) for site_url in site_urls]
    sitemap_content = sitemap_template.format("\n".join(url_elements))
    with open("sitemap.xml", "w") as file:
        file.write(sitemap_content)


# Set the content type to JSON.
headers = {"Content-Type": "application/json"}

# Your credentials go here.
USERNAME = "Username"
PASSWORD = "Password"

# Crawl all URLs inside the target URL.
payload = {
    "url": "https://oxylabs.io/",
    "filters": {"crawl": [".*"], "process": [".*"], "max_depth": 2},
    "scrape_params": {"source": "universal", "user_agent_type": "desktop"},
    "output": {"type_": "sitemap"},
}

# Create a job and store the JSON response.
response = requests.request(
    "POST",
    "https://ect.oxylabs.io/v1/jobs",
    auth=(USERNAME, PASSWORD),
    headers=headers,
    json=payload,
)
json_resp = response.json()

print("Waiting for the sitemap to be created... This may take a while...")

# Poll the job endpoint until the results have been aggregated.
STATUS = False
while not STATUS:
    url = "https://ect.oxylabs.io/v1/jobs/" + json_resp["id"]
    time.sleep(10)
    info = requests.request(
        "GET",
        url,
        auth=(USERNAME, PASSWORD),
    )
    information = info.json()
    if len(information["events"]) == 0:
        continue
    for event in information["events"]:
        if event["event"] == "job_results_aggregated" and event["status"] == "done":
            STATUS = True

# Create the endpoint to get the list of aggregated chunks.
url = "https://ect.oxylabs.io/v1/jobs/" + json_resp["id"] + "/aggregate"
sitemap = requests.request(
    "GET",
    url,
    auth=(USERNAME, PASSWORD),
)
urls = sitemap.json()

# Fetch each chunk and collect the crawled URLs.
no_of_chunks = urls["chunks"]
site_urls = []
for n in range(0, no_of_chunks):
    chunk_url = urls["chunk_urls"][n]["href"]
    chunk_data = requests.get(
        chunk_url,
        auth=HTTPBasicAuth(USERNAME, PASSWORD),
    )
    lines = chunk_data.content.decode('utf-8').split("\n")
    json_objects = [json.loads(line) for line in lines if line]
    for obj in json_objects:
        print(obj['url'])
        site_urls.append(obj['url'])

# Write the sitemap.xml file to the current working directory.
generate_sitemap(site_urls)
The final XML file looks like this:
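Based on the template in generate_sitemap(), it will have roughly the following structure (the URLs shown here are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">

    <url>
        <loc>https://oxylabs.io/</loc>
    </url>

    <url>
        <loc>https://oxylabs.io/products</loc>
    </url>
</urlset>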
Once we have the sitemap.xml file, we can proceed further to upload it to Google Search Console.
Google Search Console is a free service that monitors your site's visibility in Google’s search engine results pages (SERPs). While it isn't mandatory to register your website with Search Console, doing so can help you understand and improve how Google perceives your site.
To upload the sitemap to Google Search Console, follow these steps:
1. Upload the sitemap.xml file to your website and note down its URL.
2. Navigate to Google Search Console and log in using your Google account.
3. Choose the web property for which you wish to submit a sitemap.
4. Click Sitemaps on the left-hand menu.
5. Select Add/Test Sitemap from the menu.
6. Enter the URL of your sitemap file in the Add a new sitemap area. For example, if your sitemap file’s address is "https://example.com/sitemap.xml", you would type "sitemap.xml" in the form.
7. Click Submit. The uploaded sitemap will be shown in the Submitted sitemaps area.
8. Click the sitemap to check its performance status. It may take a few days for tracking to start.
Once your sitemap has been processed, Google Search Console will display the results, along with any issues or warnings that may have arisen. Please note that it may take some time for Google to crawl and index the pages listed in your sitemap. Properly following the XML sitemap format and entering the correct URL for each page of your website ensures complete and faster indexing.
Uploading XML sitemaps to Google Search Console is a great way to improve your visibility in search engines, but creating XML sitemaps for large websites can be tricky. Oxylabs’ Web Crawler can easily generate complete sitemaps at any scale, saving you significant time and resources.
If you found this tutorial useful, you might also want to read our lxml tutorial or learn about improving your SERP rankings with data.