Vytautas Kirjazovas
During the first day of OxyCon, our guest speaker Slavcho Ivanov, CTO at Intelligence Node, gave a great presentation on using cloud-based Chrome browsers for web data gathering. He described the method his company has developed and briefly discussed its limitations. In this article, let’s take a look at how Intelligence Node set it up.
First, the company relies on the C++, JavaScript, Go and Ruby programming languages. As for the tools, you need Ubuntu 18.04, a VNC server and XFCE. Combined, these allow for smooth window management in the cloud.
The requirements for this setup are:
Crawlers need to be able to send HTTP requests to browsers
Browsers need to send back HTML
The number of crawlers can change
The number of browsers can change
To communicate with the cloud browsers, JavaScript (plus Chrome libraries) and HTTP are used, since browsers already excel at both. Finally, Go (a.k.a. Golang) acts as the middle-man between crawlers and browsers: it is built for concurrency (goroutines, channels), is well suited to microservices, and has extensive HTTP support, including a built-in web server.
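To make the middle-man role more concrete, here is a minimal in-process sketch of the requirements above. It is not Intelligence Node's implementation: the talk's middle layer was written in Go and spoke HTTP, while this sketch uses Python threads and queues for brevity. The names Broker, submit, next_job, complete and fake_browser are all hypothetical, and the "browser" only fabricates a placeholder page instead of driving Chrome.

```python
import queue
import threading
import uuid
from concurrent.futures import Future


class Broker:
    """Hypothetical middle-man: crawlers submit URLs, browsers pull them and return HTML."""

    def __init__(self):
        self._jobs = queue.Queue()   # pending (job_id, url) pairs, shared by all browsers
        self._pending = {}           # job_id -> Future the crawler is waiting on
        self._lock = threading.Lock()

    def submit(self, url):
        """Called by a crawler; returns a Future that resolves to the page HTML."""
        job_id = uuid.uuid4().hex
        fut = Future()
        with self._lock:
            self._pending[job_id] = fut
        self._jobs.put((job_id, url))
        return fut

    def next_job(self, timeout=None):
        """Called by a browser; returns (job_id, url), or None if nothing arrives in time."""
        try:
            return self._jobs.get(timeout=timeout)
        except queue.Empty:
            return None

    def complete(self, job_id, html):
        """Called by a browser to hand the HTML back to the waiting crawler."""
        with self._lock:
            fut = self._pending.pop(job_id, None)
        if fut is not None:
            fut.set_result(html)


def fake_browser(broker, name, stop):
    """Stand-in for a cloud Chrome instance; a real one would load the URL and return its DOM."""
    while not stop.is_set():
        job = broker.next_job(timeout=0.1)
        if job is None:
            continue
        job_id, url = job
        broker.complete(job_id, f"<html><!-- {url} rendered by {name} --></html>")


if __name__ == "__main__":
    broker = Broker()
    stop = threading.Event()
    # Two browsers here; more can be started or stopped at any time.
    workers = [
        threading.Thread(target=fake_browser, args=(broker, f"browser-{i}", stop))
        for i in range(2)
    ]
    for w in workers:
        w.start()
    futures = [broker.submit(f"https://example.com/page/{i}") for i in range(5)]
    for fut in futures:
        print(fut.result(timeout=5))
    stop.set()
    for w in workers:
        w.join()
```

Because browsers pull work from a shared queue rather than being addressed individually, crawlers and browsers never need to know how many of the other there are, which is what lets both numbers change freely.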
It is also worth pointing out that, according to Mr. Ivanov, cookies should be cleared periodically to reduce the risk of fingerprinting.
During the Q&A, some OxyCon participants noted that the Chrome browser has some limitations, such as hogging RAM and offering only low-level support for proxies.
Addressing the first issue, Mr. Ivanov recommended restarting each browser every 30 minutes. As for proxies, there are three options: launch a new browser with a different proxy at set intervals, point the browser at a single endpoint and handle proxy rotation yourself, or use a proxy rotation service offered by your proxy provider.
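The restart advice and the first proxy option fit together naturally, so here is a small sketch of how they might be combined. It is an illustration under stated assumptions, not the setup shown in the talk: BrowserSlot, chrome_command, the google-chrome binary name and the proxy addresses are all hypothetical. The --proxy-server, --user-data-dir and --no-first-run switches are standard Chrome command-line flags. The sketch only builds the command line; actually starting and stopping the process (for example with subprocess.Popen) is left to the caller.

```python
import itertools
import shutil
import tempfile
import time

MAX_AGE_SECONDS = 30 * 60  # the 30-minute restart interval recommended in the talk


def chrome_command(proxy, profile_dir, binary="google-chrome"):
    """Build a Chrome command line that uses one proxy and an isolated profile.

    The binary name is an assumption; it differs between systems.
    """
    return [
        binary,
        f"--proxy-server={proxy}",
        f"--user-data-dir={profile_dir}",
        "--no-first-run",
    ]


class BrowserSlot:
    """Tracks one browser's age and hands out a fresh proxy and profile per launch."""

    def __init__(self, proxies, clock=time.monotonic):
        self._proxies = itertools.cycle(proxies)  # simple round-robin rotation
        self._clock = clock                       # injectable so the logic is testable
        self.profile_dir = None
        self.started_at = None

    def due_for_restart(self):
        """True if the browser was never started or has run for 30 minutes or more."""
        if self.started_at is None:
            return True
        return self._clock() - self.started_at >= MAX_AGE_SECONDS

    def next_launch(self):
        """Delete the old profile, pick the next proxy, and return the new command line."""
        if self.profile_dir is not None:
            shutil.rmtree(self.profile_dir, ignore_errors=True)
        self.profile_dir = tempfile.mkdtemp(prefix="chrome-profile-")
        self.started_at = self._clock()
        return chrome_command(next(self._proxies), self.profile_dir)


if __name__ == "__main__":
    slot = BrowserSlot(["http://proxy-a.example:8080", "http://proxy-b.example:8080"])
    if slot.due_for_restart():
        print(" ".join(slot.next_launch()))
    shutil.rmtree(slot.profile_dir, ignore_errors=True)  # tidy up the demo profile
```

A side effect of starting each browser with a brand-new profile directory is that cookies from the previous session are discarded, which lines up with the advice to clear cookies periodically.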
Mr. Ivanov also noted that this solution is only viable for limited-scale scraping. It would not work for workloads in the range of thousands of requests per second.
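The talk did not go into the arithmetic, but a rough back-of-the-envelope estimate suggests why scale becomes a problem. The sketch below assumes each browser renders one page at a time and applies Little's law (concurrent browsers = request rate x time per page). The 5 seconds per page and 0.5 GB per browser are purely illustrative assumptions, not figures from the presentation.

```python
import math


def browsers_needed(requests_per_second, seconds_per_page):
    """Little's law: browsers busy at once = arrival rate x time each page occupies a browser."""
    return math.ceil(requests_per_second * seconds_per_page)


def memory_needed_gb(browsers, gb_per_browser):
    """Total RAM if every browser uses roughly the same amount."""
    return browsers * gb_per_browser


if __name__ == "__main__":
    # Illustrative assumptions only: 5 s per page, 0.5 GB of RAM per browser.
    for rate in (10, 1000):
        n = browsers_needed(rate, seconds_per_page=5.0)
        print(f"{rate} req/s -> {n} browsers, about {memory_needed_gb(n, 0.5)} GB RAM")
```

Under these assumptions, 10 requests per second needs about 50 browsers, while 1,000 requests per second needs about 5,000 browsers and roughly 2,500 GB of RAM, which gives a sense of why a browser-per-page approach is hard to stretch that far.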
The Intelligence Node CTO also pointed out that, although Selenium and headless browsers could achieve essentially the same result, cloud-based Chrome browsers are easier and more convenient to use.
About the author
Vytautas Kirjazovas
Head of PR
Vytautas Kirjazovas is Head of PR at Oxylabs, and he takes a strong personal interest in technology due to its potential to make everyday business processes easier and more efficient. Vytautas is fascinated by new digital tools and approaches, in particular for web data harvesting purposes, so feel free to drop him a message if you have any questions on this topic. He appreciates a tasty meal, enjoys traveling and writing about himself in the third person.