Augustas Pelakauskas
The web is a living, breathing organism – it constantly adapts and changes. In this dynamic environment, gathering time-sensitive data such as e-commerce listings only once is of little use, as it quickly becomes obsolete. To stay competitive, you must keep your data fresh by running your web scraping scripts repeatedly and regularly.
The easiest way is to run a script in the background. In other words, run it as a service. Fortunately, no matter the operating system in use – Linux or Windows – you have great tools at your disposal. This guide will detail the process in a few simple steps.
This article scrapes information from a list of book URLs. When the process reaches the end of the list, it starts over from the beginning and refreshes the data again and again.
First, make a request and retrieve the HTML content of a page. Use the Requests module to do so:
import requests

urls = [
    'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
    'https://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html',
    'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
]

index = 0
while True:
    # Cycle through the list: the modulo wraps the index back to zero
    url = urls[index % len(urls)]
    index += 1

    print('Scraping url', url)
    response = requests.get(url)
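The index-and-modulo bookkeeping above can also be expressed with itertools.cycle from the standard library. The following is a minimal sketch rather than the article's script: the helper name take_urls is hypothetical, and it only yields URLs without requesting them, so it runs offline.

```python
from itertools import cycle, islice

urls = [
    'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
    'https://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html',
    'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
]


def take_urls(url_list, count):
    """Return the first `count` URLs from an endless round-robin over url_list."""
    return list(islice(cycle(url_list), count))


# Five steps over three URLs: the sequence wraps around after the third one.
for url in take_urls(urls, 5):
    print('Would scrape', url)
```

In a long-running script you would iterate over cycle(urls) directly instead of slicing it; the slice is only there to keep the demonstration finite.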
Once the content is retrieved, parse it using the Beautiful Soup library:
soup = BeautifulSoup(response.content, 'html.parser')
book_name = soup.select_one('.product_main').h1.text
rows = soup.select('.table.table-striped tr')
product_info = {row.th.text: row.td.text for row in rows}
Make sure the data directory exists (the snippet below creates it if it is missing), and then save the book information there in JSON format.
Protip: make sure to use the pathlib module to automatically convert Python path separators into a format compatible with both Windows and Linux systems.
data_folder = Path('./data')
data_folder.mkdir(parents=True, exist_ok=True)

# Replace apostrophes, colons, and spaces so the title is a safe file name
json_file_name = re.sub('[\': ]', '-', book_name)
json_file_path = data_folder / f'{json_file_name}.json'
with open(json_file_path, 'w') as book_file:
    json.dump(product_info, book_file)
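To try the saving step in isolation, it can be wrapped in two small functions. This is a sketch under assumptions: the names safe_file_name and save_book are hypothetical, the product data is made up, and the demonstration writes into a temporary directory instead of ./data.

```python
import json
import re
import tempfile
from pathlib import Path


def safe_file_name(book_name):
    """Replace apostrophes, colons and spaces with hyphens."""
    return re.sub(r"[': ]", '-', book_name)


def save_book(data_folder, book_name, product_info):
    """Write product_info as JSON into data_folder and return the file path."""
    data_folder = Path(data_folder)
    data_folder.mkdir(parents=True, exist_ok=True)
    json_file_path = data_folder / f'{safe_file_name(book_name)}.json'
    with open(json_file_path, 'w', encoding='utf-8') as book_file:
        json.dump(product_info, book_file)
    return json_file_path


# Demonstration in a throwaway directory; the product data is made up.
with tempfile.TemporaryDirectory() as tmp:
    path = save_book(Path(tmp) / 'data', "Shakespeare's Sonnets", {'UPC': 'example'})
    print(path.name)  # Shakespeare-s-Sonnets.json
    saved = json.loads(path.read_text(encoding='utf-8'))
```

Because each book always maps to the same file name, repeated passes overwrite the previous file, which is what keeps the data fresh.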
Since this script is long-running and never exits, you must also handle any requests from the operating system attempting to shut down the script. This way, you can finish the current iteration before exiting. To do so, you can define a class that handles the operating system signals:
import signal


class SignalHandler:
    shutdown_requested = False

    def __init__(self):
        # Handle Ctrl+C (SIGINT) and termination requests (SIGTERM)
        signal.signal(signal.SIGINT, self.request_shutdown)
        signal.signal(signal.SIGTERM, self.request_shutdown)

    def request_shutdown(self, *args):
        print('Request to shutdown received, stopping')
        self.shutdown_requested = True

    def can_run(self):
        return not self.shutdown_requested
Instead of having a loop condition that never changes (while True), you can ask the newly built SignalHandler whether any shutdown signals have been received:
signal_handler = SignalHandler()

# ...

while signal_handler.can_run():
    # run the code only if you don't need to exit
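The loop above moves to the next URL immediately. If you want a pause between passes, a plain time.sleep() would delay shutdown until the pause ends. One possible variation, sketched here with the standard threading.Event class, keeps the pause interruptible. The names stop_event, run_until_stopped, and install_handlers are hypothetical and not part of the article's script.

```python
import signal
import threading

stop_event = threading.Event()


def request_shutdown(signum, frame):
    """Signal handler: ask the loop to stop after the current iteration."""
    stop_event.set()


def run_until_stopped(work, pause_seconds):
    """Call work() repeatedly, pausing between calls, until a stop is requested."""
    iterations = 0
    while not stop_event.is_set():
        work()
        iterations += 1
        # wait() returns as soon as the event is set, so shutdown is not
        # delayed by the full pause the way it would be with time.sleep().
        stop_event.wait(pause_seconds)
    return iterations


def install_handlers():
    """Register the handler for Ctrl+C (SIGINT) and termination (SIGTERM)."""
    signal.signal(signal.SIGINT, request_shutdown)
    signal.signal(signal.SIGTERM, request_shutdown)
```

A script would call install_handlers() once at startup and then something like run_until_stopped(scrape_next_url, 60), where scrape_next_url is a hypothetical function holding the loop body.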
Here’s the code so far:
import json
import re
import signal
from pathlib import Path

import requests
from bs4 import BeautifulSoup


class SignalHandler:
    shutdown_requested = False

    def __init__(self):
        signal.signal(signal.SIGINT, self.request_shutdown)
        signal.signal(signal.SIGTERM, self.request_shutdown)

    def request_shutdown(self, *args):
        print('Request to shutdown received, stopping')
        self.shutdown_requested = True

    def can_run(self):
        return not self.shutdown_requested


signal_handler = SignalHandler()
urls = [
    'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
    'https://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html',
    'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
]

index = 0
while signal_handler.can_run():
    url = urls[index % len(urls)]
    index += 1

    print('Scraping url', url)
    response = requests.get(url)

    soup = BeautifulSoup(response.content, 'html.parser')
    book_name = soup.select_one('.product_main').h1.text
    rows = soup.select('.table.table-striped tr')
    product_info = {row.th.text: row.td.text for row in rows}

    data_folder = Path('./data')
    data_folder.mkdir(parents=True, exist_ok=True)

    json_file_name = re.sub('[\': ]', '-', book_name)
    json_file_path = data_folder / f'{json_file_name}.json'
    with open(json_file_path, 'w') as book_file:
        json.dump(product_info, book_file)
The script will refresh JSON files with newly collected book information.
If you’re wondering how to run a Python script on Linux at startup, there are multiple ways to do it. Many distributions have built-in GUI tools for this purpose.
Let’s use one of the most popular distributions, Linux Mint, as an example. It uses a desktop environment called Cinnamon that provides a startup application utility.
System settings
It allows you to add your script and specify a startup delay.
Adding a script
However, this approach gives you little control over the script. For example, what happens when you need to restart it?
This is where systemd comes in. Systemd is a service manager that allows you to manage user processes using easy-to-read configuration files.
To use systemd, let’s first create a file in the /etc/systemd/system directory:
cd /etc/systemd/system
touch book-scraper.service
Add the following content to the book-scraper.service file using your favorite editor:
[Unit]
Description=A script for scraping the book information
After=syslog.target network.target
[Service]
WorkingDirectory=/home/oxylabs/Scraper
ExecStart=/home/oxylabs/Scraper/venv/bin/python3 scrape.py
Restart=always
RestartSec=120
[Install]
WantedBy=multi-user.target
Here’s the basic rundown of the parameters used in the configuration file:
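After – starts your Python script only after the listed units. Note that network.target only indicates that the network stack has been set up; if the script must wait for a working connection, network-online.target (together with Wants=) is the usual choice.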
RestartSec – sleep time before restarting the service.
Restart – describes what to do if a service exits, is killed, or a timeout is reached.
WorkingDirectory – current working directory of the script.
ExecStart – the command to execute.
Now, it’s time to tell systemd about the newly created daemon. Run the daemon-reload command:
systemctl daemon-reload
Then, start your service:
systemctl start book-scraper
And finally, check whether your service is running:
$ systemctl status book-scraper
book-scraper.service - A script for scraping the book information
Loaded: loaded (/etc/systemd/system/book-scraper.service; disabled; vendor preset: enabled)
Active: active (running) since Thu 2022-09-08 15:01:27 EEST; 16min ago
Main PID: 60803 (python3)
Tasks: 1 (limit: 18637)
Memory: 21.3M
CGroup: /system.slice/book-scraper.service
60803 /home/oxylabs/Scraper/venv/bin/python3 scrape.py
Sep 08 15:17:55 laptop python3[60803]: Scraping url https://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html
Sep 08 15:17:55 laptop python3[60803]: Scraping url https://books.toscrape.com/catalogue/sharp-objects_997/index.html
Sep 08 15:17:55 laptop python3[60803]: Scraping url https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html
Protip: use journalctl -S today -u book-scraper.service to view today's logs, and add the -f flag to follow them in real time.
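Congrats! Now you can control your service via systemd. Note that the status output above lists the unit as disabled, so run systemctl enable book-scraper if you want it to start automatically at boot.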
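One detail worth knowing when reading the journal: under systemd the script's standard output is not a terminal, so Python may buffer print() output and lines can reach the journal late. A possible alternative, sketched below, is the standard logging module, whose StreamHandler flushes after every record. The logger name 'book-scraper' and the helper build_logger are assumptions for illustration, not part of the article's script.

```python
import io
import logging
import sys


def build_logger(stream=None):
    """Create a logger that writes one flushed line per message."""
    logger = logging.getLogger('book-scraper')
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    target = stream if stream is not None else sys.stderr
    handler = logging.StreamHandler(target)
    handler.setFormatter(logging.Formatter('%(levelname)s %(message)s'))
    logger.addHandler(handler)
    logger.propagate = False
    return logger


# Demonstration against an in-memory stream instead of the real stderr.
buffer = io.StringIO()
log = build_logger(buffer)
log.info('Scraping url %s', 'https://books.toscrape.com/')
print(buffer.getvalue(), end='')
```

In the scraper, build_logger() with no argument would write to stderr, and log.info(...) would take the place of the print call.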
Running a Python script as a Windows service is not as straightforward as one might expect. Let’s start with the script changes.
To begin, change how the script is executed based on the number of arguments it receives from the command line.
If sys.argv contains only the script name (that is, no additional arguments were passed), assume that the Windows Service Control Manager is attempting to start the script, which means you have to run the initialization code. Otherwise, hand the command-line arguments (such as install, start, or stop) over to win32serviceutil.HandleCommandLine:
if __name__ == '__main__':
    if len(sys.argv) == 1:
        servicemanager.Initialize()
        servicemanager.PrepareToHostSingle(BookScraperService)
        servicemanager.StartServiceCtrlDispatcher()
    else:
        win32serviceutil.HandleCommandLine(BookScraperService)
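Because sys.argv always holds the program name as its first element, a length of 1 means that no arguments were given. The branch can be checked without pywin32 by isolating it in a plain function. This is only an illustrative sketch: choose_mode and its return values are hypothetical.

```python
def choose_mode(argv):
    """Mirror the branch above: argv[0] is always the script or executable name."""
    if len(argv) == 1:
        # No extra arguments: the service manager launched the program.
        return 'run-as-service'
    # Extra arguments such as 'install', 'start' or 'stop' came from a user.
    return 'handle-command-line'


print(choose_mode(['scrape.exe']))             # run-as-service
print(choose_mode(['scrape.exe', 'install']))  # handle-command-line
```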
Next, extend the special utility class and set some properties. The service name, display name, and description will all be visible in the Windows services utility (services.msc) once your service is up and running.
class BookScraperService(win32serviceutil.ServiceFramework):
    _svc_name_ = 'BookScraperService'
    _svc_display_name_ = 'BookScraperService'
    _svc_description_ = 'Constantly updates the info about books'
Finally, implement the SvcDoRun and SvcStop methods to start and stop the service. Here’s the script so far:
import sys
import servicemanager
import win32event
import win32service
import win32serviceutil
import json
import re
from pathlib import Path

import requests
from bs4 import BeautifulSoup


class BookScraperService(win32serviceutil.ServiceFramework):
    _svc_name_ = 'BookScraperService'
    _svc_display_name_ = 'BookScraperService'
    _svc_description_ = 'Constantly updates the info about books'

    def __init__(self, args):
        win32serviceutil.ServiceFramework.__init__(self, args)
        # Event that SvcStop sets to tell the loop in SvcDoRun to exit
        self.event = win32event.CreateEvent(None, 0, 0, None)

    def GetAcceptedControls(self):
        result = win32serviceutil.ServiceFramework.GetAcceptedControls(self)
        result |= win32service.SERVICE_ACCEPT_PRESHUTDOWN
        return result

    def SvcDoRun(self):
        urls = [
            'https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
            'https://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html',
            'https://books.toscrape.com/catalogue/sharp-objects_997/index.html',
        ]

        index = 0

        while True:
            # Wait up to 5 seconds; stop if the event was set in the meantime
            result = win32event.WaitForSingleObject(self.event, 5000)
            if result == win32event.WAIT_OBJECT_0:
                break

            url = urls[index % len(urls)]
            index += 1

            print('Scraping url', url)
            response = requests.get(url)

            soup = BeautifulSoup(response.content, 'html.parser')
            book_name = soup.select_one('.product_main').h1.text
            rows = soup.select('.table.table-striped tr')
            product_info = {row.th.text: row.td.text for row in rows}

            data_folder = Path('C:\\Users\\User\\Scraper\\dist\\scrape\\data')
            data_folder.mkdir(parents=True, exist_ok=True)

            json_file_name = re.sub('[\': ]', '-', book_name)
            json_file_path = data_folder / f'{json_file_name}.json'
            with open(json_file_path, 'w') as book_file:
                json.dump(product_info, book_file)

    def SvcStop(self):
        self.ReportServiceStatus(win32service.SERVICE_STOP_PENDING)
        win32event.SetEvent(self.event)


if __name__ == '__main__':
    if len(sys.argv) == 1:
        servicemanager.Initialize()
        servicemanager.PrepareToHostSingle(BookScraperService)
        servicemanager.StartServiceCtrlDispatcher()
    else:
        win32serviceutil.HandleCommandLine(BookScraperService)
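The service above writes to a hard-coded absolute path. A relative path such as ./data is risky here because a service usually does not start in the script's own folder. As one possible alternative, the data folder can be derived from the location of the running program. The sketch below assumes the script is bundled with PyInstaller, which sets sys.frozen on the bundled application; the helper name application_folder is hypothetical.

```python
import sys
from pathlib import Path


def application_folder():
    """Return the folder holding the bundled executable, or the script otherwise."""
    if getattr(sys, 'frozen', False):
        # Set by PyInstaller: sys.executable is the bundled .exe itself.
        return Path(sys.executable).resolve().parent
    # Plain interpreter run: fall back to the location of the started script.
    return Path(sys.argv[0]).resolve().parent


data_folder = application_folder() / 'data'
print(data_folder)
```

With this approach the data folder follows the executable if you move the dist directory, at the cost of depending on how the script was launched.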
Now that you have the script, open a Windows terminal of your preference.
Protip: if you’re using PowerShell, make sure to include the .exe extension when running binaries to avoid unexpected errors.
Terminal
Once the terminal is open, change the directory to the location of your script with a virtual environment, for example:
cd C:\Users\User\Scraper
Next, install the experimental Python Windows extensions module, pypiwin32. You’ll also need to run the post-install script:
.\venv\Scripts\pip install pypiwin32
.\venv\Scripts\pywin32_postinstall.py -install
Unfortunately, if you attempt to install your Python script as a Windows service with the current setup, you’ll get the following warning:
**** WARNING ****
The executable at "C:\Users\User\Scraper\venv\lib\site-packages\win32\PythonService.exe" is being used as a service.
This executable doesn't have pythonXX.dll and/or pywintypesXX.dll in the same
directory, and they can't be found in the System directory. This is likely to
fail when used in the context of a service.
The exact environment needed will depend on which user runs the service and
where Python is installed. If the service fails to run, this will be why.
NOTE: You should consider copying this executable to the directory where these
DLLs live - "C:\Users\User\Scraper\venv\lib\site-packages\win32" might be a good place.
However, if you follow the instructions in the warning, you’ll be met with a new issue when trying to launch your script:
Error starting service: The service did not respond to the start or control request in a timely fashion.
To solve this issue, you can add the Python libraries and interpreter to the Windows path. Alternatively, bundle your script and all its dependencies into an executable by using pyinstaller:
venv\Scripts\pyinstaller --hiddenimport win32timezone -F scrape.py
The --hiddenimport win32timezone option is critical as the win32timezone module is not explicitly imported but is still needed for the script to run.
Finally, let’s install the script as a service and run it by invoking the executable you’ve built previously:
PS C:\Users\User\Scraper> .\dist\scrape.exe install
Installing service BookScraper
Changing service configuration
Service updated
PS C:\Users\User\Scraper> .\dist\scrape.exe start
Starting service BookScraper
PS C:\Users\User\Scraper>
And that’s it. Now, you can open the Windows services utility and see your new service running.
Protip: you can read more about specific Windows API functions here.
The newly created service is running
As evident, you can use win32serviceutil to develop a Windows service. But the process is definitely not that simple – you could even say it sucks! Well, this is where the NSSM (Non-Sucking Service Manager) comes into play.
Let’s simplify the script by only keeping the code that performs web scraping:
import json
import re
from pathlib import Path

import requests
from bs4 import BeautifulSoup

urls = ['https://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html',
        'https://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html',
        'https://books.toscrape.com/catalogue/sharp-objects_997/index.html', ]

index = 0

while True:
    url = urls[index % len(urls)]
    index += 1

    print('Scraping url', url)
    response = requests.get(url)

    soup = BeautifulSoup(response.content, 'html.parser')
    book_name = soup.select_one('.product_main').h1.text
    rows = soup.select('.table.table-striped tr')
    product_info = {row.th.text: row.td.text for row in rows}

    data_folder = Path('C:\\Users\\User\\Scraper\\data')
    data_folder.mkdir(parents=True, exist_ok=True)

    json_file_name = re.sub('[\': ]', '-', book_name)
    json_file_path = data_folder / f'{json_file_name}.json'
    with open(json_file_path, 'w') as book_file:
        json.dump(product_info, book_file)
Next, build a binary using pyinstaller:
venv\Scripts\pyinstaller -F simple_scrape.py
Now that you have a binary, it’s time to install NSSM by visiting the official website. Extract it to a folder of your choice and add the folder to your PATH environment variable for convenience.
NSSM in a folder
Then, run the terminal as an admin.
Running as an admin
Once the terminal is open, change the directory to your script location:
cd C:\Users\User\Scraper
Finally, install the script using NSSM and start the service:
nssm.exe install SimpleScrape C:\Users\User\Scraper\dist\simple_scrape.exe
nssm.exe start SimpleScrape
Protip: if you have issues, redirect the standard error output of your service to a file to see what went wrong:
nssm set SimpleScrape AppStderr C:\Users\User\Scraper\service-error.log
NSSM ensures that a service keeps running in the background, and if it isn’t running, you at least get to know why.
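The error log is most useful when the script reports failures clearly. In the simplified script, any exception (for example, a page whose layout changed) ends the whole process. A possible refinement, sketched below, catches errors for a single pass, writes the traceback to standard error (the stream that the AppStderr setting above redirects to a file), and lets the loop continue. The helper run_safely and the failing broken_scrape function are hypothetical.

```python
import sys
import traceback


def run_safely(work, *args):
    """Run work(*args); on failure, print the traceback to stderr and return False."""
    try:
        work(*args)
        return True
    except Exception:
        # Catching Exception (not BaseException) leaves KeyboardInterrupt
        # and SystemExit free to stop the program.
        traceback.print_exc(file=sys.stderr)
        sys.stderr.flush()
        return False


def broken_scrape(url):
    # Stand-in for a scrape that fails, e.g. because a page layout changed.
    raise ValueError(f'could not parse {url}')


ok = run_safely(broken_scrape, 'https://books.toscrape.com/')
print('iteration succeeded:', ok)
```

Inside the while loop, the body for one URL would move into a function and be called through run_safely, so a single bad page is logged rather than stopping the service.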
Regardless of the operating system, you have various options for setting up Python scripts for recurring web scraping tasks. Whether you need the configurability of systemd, the flexibility of Windows services, or the simplicity of NSSM, be sure to follow this tried & true guide as you navigate their features.
If you are interested in more Python automation solutions for web scraping applications or web scraping with Python, take a look at our blog for various tutorials on all things web scraping. We also offer an advanced solution, Web Scraper API, designed to collect public data from most websites automatically and hassle-free. In addition, you can use a Scheduler feature to schedule multiple web scraping jobs at any frequency you like.
Systemd is a powerful tool for managing processes. It also provides logging and dependency management, for example, starting processes only when a network is available or once a previous service has been started.
Systemctl is a utility and is part of systemd. Among other things, it allows you to manage services and check their status.
You can definitely use cron instead if it suits your needs. However, if you need more control over the service, systemd is the way to go. It allows you to start services based on certain conditions and is perfect for dealing with long-running scripts. It even allows you to run scripts in a similar way to crontab by using timers.
The word daemon comes from MIT’s Project MAC, which first coined the term in 1963. It is, by definition, an agent that works tirelessly in the background. Unix-like systems such as Linux later adopted the term, while Windows calls the same kind of background process a service. Therefore, this article uses the words daemon and service interchangeably.
About the author
Augustas Pelakauskas
Senior Copywriter
Augustas Pelakauskas is a Senior Copywriter at Oxylabs. Coming from an artistic background, he is deeply invested in various creative ventures - the most recent one being writing. After testing his abilities in the field of freelance journalism, he transitioned to tech content creation. When at ease, he enjoys sunny outdoors and active recreation. As it turns out, his bicycle is his fourth best friend.
All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.