
Scraping the Web With a High Success Rate

Gabija Fatenaite

2019-10-10 · 3 min read

OxyCon, Oxylabs’ very first annual web data harvesting conference, was packed with in-depth talks and workshops. On the second day of the event, Eivydas Vilcinskas, Software Engineer at Oxylabs, took the stage to share some tactical advice on how to reach a high success rate using Oxylabs Scraper APIs (formerly known as Real-Time Crawler). Scraper APIs include SERP Scraper API, E-Commerce Scraper API, Real Estate Scraper API, and Web Scraper API.

According to Eivydas, 99.7% of the time, Scraper APIs successfully deliver data. However, as with all services, there is always that 0.3% chance of system downtime. Fortunately, in his workshop, Eivydas walked through all possible issues and explained how to solve each and every one of them.

How to access Oxylabs’ web crawling tools, Scraper APIs

Before we go into error codes, let’s quickly recap how you could access the service of Scraper APIs.

Using Proxy Endpoint

This method is the simplest (but also the most limited) way of accessing Scraper APIs. Apart from providing the target URL, you can only provide headers to select the desired user-agent type and geo-location to spoof. We also don’t allow the use of our JavaScript rendering service or submitting jobs in batches via this method.

Proxy Endpoint acts as a standard proxy with some added functionality and returns the body of the response from the target verbatim. We don’t wrap it in any structures like JSON or add any additional data.
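For a rough illustration, here is what a Proxy Endpoint request could look like in Python with the requests library. The proxy hostname, port, and the X-Oxylabs-* header names below are assumptions made for this sketch; check the official documentation for the exact values.

```python
# A minimal sketch of a Proxy Endpoint request with Python's requests library.
# The proxy host/port and the X-Oxylabs-* header names are assumptions made
# for illustration; consult the official documentation for the exact values.
import requests

PROXY = "http://USERNAME:PASSWORD@realtime.oxylabs.io:60000"  # hypothetical endpoint

response = requests.get(
    "https://example.com/target-page",
    proxies={"http": PROXY, "https": PROXY},
    headers={
        "X-Oxylabs-User-Agent-Type": "desktop",  # hypothetical header name
        "X-Oxylabs-Geo-Location": "Germany",     # hypothetical header name
    },
    timeout=120,
)

# The body comes back verbatim, with no JSON wrapper around it
print(response.status_code)
print(response.text[:500])
```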

Via real-time data delivery method

When using the real-time data delivery method, you POST the job, and Scraper APIs return the requested data on an open connection. If done correctly, the data should come back with the HTTP status code 200 and should contain a JSON with the data you requested.
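Here is a minimal sketch of that flow in Python. The endpoint URL, the source value, and the layout of the returned JSON are illustrative assumptions, not the documented interface.

```python
# A minimal sketch of the real-time delivery method: POST a job and read the
# result on the same open connection. The endpoint URL and payload field names
# are assumptions for illustration only.
import requests

payload = {
    "source": "universal",               # hypothetical source name
    "url": "https://example.com/page",
}

response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",  # hypothetical endpoint
    json=payload,
    auth=("USERNAME", "PASSWORD"),
    timeout=120,  # the connection stays open until the job is done
)

response.raise_for_status()           # anything but 200 means something went wrong
data = response.json()                # results arrive wrapped in a JSON structure
print(data["results"][0]["content"])  # hypothetical result layout
```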

Via callback data delivery method

The callback method allows you to decide when to retrieve the requested data (but no later than 24 hours after the job is done) and lets you manage the full range of options as well as request/response timings. Check our previous blog post on callback vs. real-time data delivery methods to learn more.
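A rough sketch of the callback flow might look like this. The endpoint URLs, payload fields, and job-result layout are assumptions made for illustration.

```python
# A minimal sketch of the callback delivery method: submit a job with a
# callback URL, then fetch the result when notified. Endpoint URLs and field
# names are assumptions for illustration only.
import requests

job = requests.post(
    "https://data.oxylabs.io/v1/queries",  # hypothetical endpoint
    json={
        "source": "universal",             # hypothetical source name
        "url": "https://example.com/page",
        "callback_url": "https://your-server.example/oxylabs-callback",
    },
    auth=("USERNAME", "PASSWORD"),
    timeout=30,
).json()

# Later, when your callback endpoint is notified, retrieve the stored result
# (remember: it is available for no longer than 24 hours).
result = requests.get(
    f"https://data.oxylabs.io/v1/queries/{job['id']}/results",  # hypothetical layout
    auth=("USERNAME", "PASSWORD"),
    timeout=30,
)
print(result.json())
```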

Scraping the web with Scraper APIs: error types

According to Eivydas, the majority of errors that you might encounter while integrating or using Scraper APIs fall into three categories: request, response, and content.

Request errors

These types of errors are related to the request path and usually arise when the signal doesn’t reach the intended destination. It might mean that the Scraper API servers are physically not reachable over the network and/or that the services are not running correctly.
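Since request errors surface as failed connections rather than HTTP responses, a client typically catches network exceptions and retries after a pause. This is a generic sketch, not Oxylabs-specific code; with retries=5 and pause=60, it gives up after roughly 5 minutes, which is when the table in the next section suggests escalating.

```python
# A generic sketch: request-path failures show up as exceptions, not HTTP
# status codes, so catch them and retry with a pause before escalating.
import time
import requests

def post_with_retries(url, payload, auth, retries=5, pause=60):
    for attempt in range(1, retries + 1):
        try:
            return requests.post(url, json=payload, auth=auth, timeout=120)
        except (requests.ConnectionError, requests.Timeout) as exc:
            print(f"Attempt {attempt} failed: {exc}")
            if attempt == retries:
                raise  # still unreachable after ~5 minutes: contact your account manager
            time.sleep(pause)
```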

Response errors

Usually, a response error means that the network is running smoothly and the services are available to return something; the issue is most probably related to the way a request is made and the data it contains. For example, you might be using the wrong HTTP method to contact the service endpoint, or you might be requesting data that we cannot provide.

Content errors

Once you get the data from the Scraper APIs, you can process it further. We report the job as completed successfully, but during data analysis on your part, you find out that the data is not exactly what you requested, or that it has some flaws.

The most common errors and how to solve them

Depending on which access method you are using and which kind of error you get, there might be different ways to solve an issue. We summarized all of them in the table below.

| Access type | Error type | Error code | Solution | Plan B |
|---|---|---|---|---|
| Proxy Endpoint / Real-Time / Callback | Request | Servers are not reachable | Wait a few minutes before retrying | If the server is still down after 5 minutes, contact your account manager |
| Proxy Endpoint / Real-Time / Callback | Request | Scraper API is not reachable | Check that you’re not hitting the wrong endpoint | Troubleshoot your connection. If the connection is OK, contact Oxylabs |
| Proxy Endpoint / Real-Time / Callback | Response | 400 | Look for the message in the body to see the reason | |
| Real-Time / Callback | Response | 401 | Check that you’re using the correct credentials and that your user wasn’t disabled. To fix this, contact your account manager | You might get this error code if the source you’ve given to the Scraper API is not supported or is disabled for you. Contact your account manager to discuss implementing the necessary source in our system or to have it enabled for you |
| Proxy Endpoint / Real-Time / Callback | Response | 404 | Check the documentation for correct endpoints | |
| Real-Time / Callback | Response | 405 | Use POST to submit your jobs. Any other HTTP method will return a response with this status code | |
| Proxy Endpoint | Response | 407 | Check that you’re not using incorrect credentials | If you want to reset your credentials, contact your account manager |
| Proxy Endpoint / Real-Time | Response | 408 | Increase the default timeout value to 120 s and try again | If the timeout comes from the Scraper API’s side, contact Oxylabs |
| Callback | Response | 408 | Increase the default timeout value to 30 s and try again | If the timeout comes from the Scraper API’s side, contact Oxylabs |
| Proxy Endpoint / Real-Time / Callback | Response | 429 | You have reached the limit of requests per week/month/etc. Reach out to your account manager to increase the limit | You might be making too many requests per minute. Contact your account manager to increase this limit |
| Real-Time / Callback | Response | 5xx | If this error appears for more than 5 minutes, contact Oxylabs | |
| Proxy Endpoint / Real-Time / Callback | Content | Status code is not 200 | Retry the job | The reason for this error might be incorrect job parameters. This status code should be handled on your side, but you can always contact Oxylabs for assistance |
| Proxy Endpoint / Real-Time / Callback | Content | 200, but data for given parameters is incorrect | The target website might have changed its algorithms. There might be a way to get the required data by using different parameters. Contact Oxylabs for assistance | |
| Proxy Endpoint / Real-Time / Callback | Content | Corrupted content | Have you decoded or decompressed the data? If yes and the data is still corrupted, contact Oxylabs | |
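To tie the table together, here is one way a client could dispatch on these status codes. The groupings mirror the table above; the exception types are arbitrary choices made for the sketch.

```python
# A sketch of dispatching on the response codes from the table above. The
# groupings mirror the table; the exception types are arbitrary choices.
class RetryJob(Exception):
    """Raised for codes worth retrying, possibly after a pause."""

RETRYABLE = {408, 500, 502, 503, 504}   # timeouts and 5xx: retry, escalate after ~5 min
FIX_REQUEST = {400, 404, 405}           # wrong payload, endpoint, or HTTP method
CHECK_ACCOUNT = {401, 407, 429}         # credentials, disabled user, or usage limits

def handle(response):
    code = response.status_code
    if code == 200:
        return response.json()
    if code in RETRYABLE:
        raise RetryJob(code)
    if code in FIX_REQUEST:
        # The body usually explains the reason (see the 400 row above)
        raise ValueError(f"{code}: {response.text}")
    if code in CHECK_ACCOUNT:
        raise PermissionError(f"{code}: contact your account manager")
    raise RuntimeError(f"Unexpected status code: {code}")
```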

Structured data errors

While accessing Scraper APIs via the callback data delivery method, you might get structured data errors. According to Eivydas, the results for structured data contain two separate status codes. The one in the root of the object holds the status code of the HTTP response that the target has given us. The other, in content.parse_status_code, marks the status of the parsing effort (see the sketch after this list):

  • 12000 – parse successful. The content should be pristine and contain all the fields that are expected to be parsed.

  • 12004 – parse successful with errors. Some fields somewhere in the tree might not have been parsed correctly. Such fields contain the text “Could not parse xyz: ReasonForFailure”.

  • 12003 – we could not parse the content because it is not supported. For now, we only attempt to parse target responses with a 200 status code.

  • 12002 – the parse attempt failed. A real failure on our side. This might be caused by a significant change in the HTML structure, meaning we have to adapt our code to continue parsing the format.
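A minimal sketch of such a check, assuming the result object exposes the root status code as status_code and the parse status under content.parse_status_code as described above (the rest of the layout is an assumption):

```python
# A sketch of checking both status codes on a structured-data result. The
# status-code values follow the article; the surrounding dict layout is an
# assumption made for the example.
PARSE_OK = 12000           # all expected fields parsed
PARSE_PARTIAL = 12004      # some fields contain "Could not parse ..." messages
PARSE_UNSUPPORTED = 12003  # content not supported (only 200 responses are parsed)
PARSE_FAILED = 12002       # parsing failed on Oxylabs' side; report it

def check_result(result):
    if result["status_code"] != 200:  # root code: the target's HTTP response
        return "target-error"
    parse_code = result["content"]["parse_status_code"]
    if parse_code == PARSE_OK:
        return "ok"
    if parse_code == PARSE_PARTIAL:
        return "partial"              # scan the tree for unparsed fields
    if parse_code == PARSE_UNSUPPORTED:
        return "unsupported"
    if parse_code == PARSE_FAILED:
        return "report-to-oxylabs"
    return "unknown"
```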

Wrapping up

So, we’ve covered the most common errors that you can encounter while integrating and using Scraper APIs. Reaching a high success rate should be a piece of cake now! Moreover, detailed documentation covering what Eivydas presented in his workshop is currently in the works and will be published on our website soon. In the meantime, if you have any questions regarding Scraper APIs, feel free to contact us.

About the author

Gabija Fatenaite

Lead Product Marketing Manager

Gabija Fatenaite is a Lead Product Marketing Manager at Oxylabs. Having grown up on video games and the internet, she grew to find the tech side of things more and more interesting over the years. So if you ever find yourself wanting to learn more about proxies (or video games), feel free to contact her - she’ll be more than happy to answer you.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
