Doing competitor or benchmark analysis for SEO can be a burdensome task, as it requires taking into account many factors that are usually extracted from different data sources.
The purpose of this article is to help you automate the data extraction process as much as possible. After learning how to do this, you can dedicate your time to what matters: the analysis itself and coming up with actionable insights to shape your strategy.
Daniel Heredia Mejias, SEO Marketer at Casumo
The logic that we’ll follow to automate this process is as follows:
We’ll use Oxylabs’ SERP Scraper API to gather the public data from SERPs and get the top results for a specific keyword.
We’ll scrape the URLs that are ranking in the first positions of the SERPs and obtain the required on-page content that we’ll use for our analysis.
We’ll connect with MOZ API and obtain the necessary off-page metrics.
We’ll use PageSpeed Insights API to put together some metrics related to Core Web Vitals.
Finally, we’ll convert the Python list into a dataframe and export it as an Excel file.
In short, we’ll be able to export up to 18 metrics from the best-performing pages for a keyword into our Excel sheet for further analysis and get an idea of what it takes to rank in the top SERP spots. The metrics that we’re going to obtain as columns in our Excel sheet are:
Meta Title SERPs: the meta title appearing on the SERPs;
Meta Title On Page: the meta title that is written on-page;
Meta Title Equal: True or False depending on whether the meta title in the SERPs matches the on-page one;
Meta Description: on-page meta description;
H1: on-page H1;
Paragraphs: content contained in <p> tags;
Text length: number of characters from the paragraphs;
Keyword Occurrences Paragraphs: how many times the keyword is used in the paragraphs;
Meta Title Occurrence: whether the keyword is used in the meta title;
Meta Description Occurrence: whether the keyword is used in the meta description;
Equity Backlinks MOZ: the backlinks that are giving value according to MOZ;
Total Backlinks MOZ: total number of backlinks from MOZ;
Domain Authority: metric used by MOZ to show how authoritative a domain is;
FCP (First Contentful Paint): measures the time from when a page starts loading to when any part of that page’s content is rendered on the screen;
FID (First Input Delay): the time it takes for the browser to respond to the user’s first interaction;
LCP (Largest Contentful Paint): the amount of time to render the largest content element visible in the viewport, from when the user requests the URL;
CLS (Cumulative Layout Shift): proportion of the viewport that was impacted by layout shifts and the movement distance of the elements that were moved;
Overall PSI Score: page speed overall score that ranges from 0 to 100.
So, we’ve already explained the logic of the code and the variables that we’re going to obtain. Let’s get started with the process of getting the required public data for the automated analysis.
First of all, we’ll use Oxylabs’ SERP Scraper API to extract the top results from the SERPs for an inputted keyword. Remember that you’ll need to get your API username and password to use this piece of code. You’ll also need to input a keyword.
import requests

keyword = "<your_keyword>"

payload = {
    "source": "SEARCH_ENGINE_search",  # specify the exact source, e.g. the largest search engine
    "domain": "com",
    "query": keyword,
    "parse": "true",
}

# Send the query to Oxylabs' Realtime SERP Scraper API.
response = requests.request(
    "POST",
    "https://realtime.oxylabs.io/v1/queries",
    auth=("<your_username>", "<your_password>"),
    json=payload,
)

# Keep the URL and the SERP title of each organic result.
list_comparison = [
    [x["url"], x["title"]]
    for x in response.json()["results"][0]["content"]["results"]["organic"]
]
* You’ll need to specify the exact source, for example, the largest search engine.
Example content of list_comparison:
>>> print(list_comparison)
[
["https://example.com/result/example-link", "Example Link - Example"],
["https://more-examples.net", "Homepage - More Examples"],
["https://you-searched-for.com/query=your_keyword", "You Searched for 'your_keyword'. Analyze your search now!"],
]
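One practical note: if the API call fails (wrong credentials, exhausted quota, a typo in the payload), the dictionary lookups above will raise a KeyError. A minimal, optional guard before building list_comparison could look like this (the error message is just illustrative):

# Optional sanity check before parsing the response above.
if response.status_code != 200:
    raise SystemExit("SERP Scraper API returned " + str(response.status_code) + ": " + response.text)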
From my point of view, Oxylabs offers a very competitive and robust service for scraping the SERPs. I’ve previously written on my blog about Oxylabs and how you can get the most out of it for SERP scraping.
After scraping the SERPs with Oxylabs’ scraping solution and getting the best-performing pages that rank for a particular keyword, we’ll scrape their URLs and extract their on-page contents with the Python library Requests. We’ll also parse the content with BeautifulSoup.
import requests
from bs4 import BeautifulSoup

for y in list_comparison:
    try:
        print("Scraping: " + y[0])
        html = requests.request("get", y[0])
        soup = BeautifulSoup(html.text, "lxml")

        # On-page meta title, meta description and H1 (empty string if missing).
        try:
            metatitle = soup.find("title").get_text()
        except Exception:
            metatitle = ""
        try:
            metadescription = soup.find("meta", attrs={"name": "description"})["content"]
        except Exception:
            metadescription = ""
        try:
            h1 = soup.find("h1").get_text()
        except Exception:
            h1 = ""

        # Paragraph content, its length, and keyword usage.
        paragraph = [a.get_text() for a in soup.find_all("p")]
        text_length = sum(len(a) for a in paragraph)
        text_counter = sum(a.lower().count(keyword.lower()) for a in paragraph)
        metatitle_occurrence = keyword.lower() in metatitle.lower()
        metadescription_occurrence = keyword.lower() in metadescription.lower()
        metatitle_equal = metatitle == y[1]

        y.extend([metatitle, metatitle_equal, metadescription, h1, paragraph, text_length, text_counter, metatitle_occurrence, metadescription_occurrence])
    except Exception as e:
        print(e)
        y.extend(["No data"] * 9)
Now, we’ll use MOZ’s API to obtain the off-page metrics. Of course, you’ll need to get your MOZ username and password and input them in the code below. It’s also worth mentioning that MOZ enables customers to make up to 2,000 API requests for free.
Install the Moz API library as follows:
pip install "git+https://github.com/seomoz/SEOmozAPISamples.git#egg=mozscape&subdirectory=python"
import time
from mozscape import Mozscape

client = Mozscape("<MOZ username>", "<MOZ password>")

for y in list_comparison:
    try:
        print("Getting MOZ results for: " + y[0])
        # ueid: equity backlinks, uid: total backlinks, pda: domain authority.
        metrics = client.urlMetrics(y[0])
        y.extend([metrics["ueid"], metrics["uid"], metrics["pda"]])
    except Exception as e:
        print(e)
        time.sleep(10)  # Wait out the rate limit and retry once.
        try:
            metrics = client.urlMetrics(y[0])
            y.extend([metrics["ueid"], metrics["uid"], metrics["pda"]])
        except Exception:
            y.extend(["No data"] * 3)
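If you query more than a handful of URLs, you may keep bumping into MOZ’s rate limit, which is what the retry above works around. A gentler alternative, using only the calls already shown, is to swap the loop above for a version that pauses after every request (adjust the delay to whatever your MOZ plan allows):

# Alternative version of the loop above: pause after every call
# instead of only retrying after a failure.
for y in list_comparison:
    print("Getting MOZ results for: " + y[0])
    metrics = client.urlMetrics(y[0])
    y.extend([metrics["ueid"], metrics["uid"], metrics["pda"]])
    time.sleep(10)  # stay comfortably within the rate limit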
Finally, we’ll obtain the page speed metrics using the PageSpeed Insights API. To use it, you’ll need to set up a project and get an API key from the Google Cloud Platform.
pagespeed_key = "<your page speed key>"

for y in list_comparison:
    try:
        print("Getting results for: " + y[0])
        url = (
            "https://www.googleapis.com/pagespeedonline/v5/runPagespeed?url="
            + y[0] + "&strategy=mobile&locale=en&key=" + pagespeed_key
        )
        response = requests.request("GET", url)
        data = response.json()

        # Lighthouse performance score (0-100) and field (CrUX) metrics.
        overall_score = data["lighthouseResult"]["categories"]["performance"]["score"] * 100
        fcp = data["loadingExperience"]["metrics"]["FIRST_CONTENTFUL_PAINT_MS"]["percentile"] / 1000  # seconds
        fid = data["loadingExperience"]["metrics"]["FIRST_INPUT_DELAY_MS"]["percentile"] / 1000  # seconds
        lcp = data["loadingExperience"]["metrics"]["LARGEST_CONTENTFUL_PAINT_MS"]["percentile"] / 1000  # seconds
        cls = data["loadingExperience"]["metrics"]["CUMULATIVE_LAYOUT_SHIFT_SCORE"]["percentile"] / 100
        y.extend([fcp, fid, lcp, cls, overall_score])
    except Exception as e:
        print(e)
        y.extend(["No data"] * 5)
As a final step, you can download all the data from your notebook as an Excel file with pandas:
import pandas as pd

df = pd.DataFrame(list_comparison)
df.columns = [
    "URL", "Metatitle SERPs", "Metatitle Onpage", "Metatitle Equal", "Metadescription", "H1",
    "Paragraphs", "Text Length", "Keyword Occurrences Paragraph", "Metatitle Occurrence",
    "Metadescription Occurrence", "Equity Backlinks MOZ", "Total Backlinks MOZ", "Domain Authority",
    "FCP", "FID", "LCP", "CLS", "Overall Score",
]
df.to_excel("<filename>.xlsx", header=True, index=False)
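Note that to_excel() relies on an external writer engine to create .xlsx files; if pandas complains about a missing module at this step, installing openpyxl is usually all that’s needed:

pip install openpyxl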
Check out the repository on GitHub to find the complete code used in this article.
That’s it! You don’t even need to know how to code. You can use the input forms and insert your credentials to run the code and extract all the publicly available data for the best-performing sites ranking for particular keywords.
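If you run the code as a plain Python script rather than in a notebook with input forms, a simple substitute is to collect the keyword and credentials at runtime with input() and getpass (the variable names below are just illustrative):

from getpass import getpass

# Illustrative way to collect the inputs at runtime instead of hard-coding them.
keyword = input("Keyword to analyze: ")
oxylabs_username = input("Oxylabs API username: ")
oxylabs_password = getpass("Oxylabs API password: ")  # hidden while typing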
About the author
Daniel Heredia Mejias
SEO Marketer at Casumo
Daniel Heredia is a Spanish SEO manager who lives in Barcelona and works for Casumo. In love with SEO and especially automation, he was bitten by the Python snake for the very first time around three years ago. He writes about SEO and Python on his site, and in his spare time he runs some SEO side projects.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.