
How to Scrape IMDb Data: Step-by-Step Guide

Enrika Pavlovskytė

2023-10-10 | 5 min read

Want to check where you’ve seen that actor before? Or perhaps want to rate a movie you really enjoyed watching? We all know the first place you’ll look. Or at least the first place Google will take you. 

This blog post provides insights into IMDb and its relevance for scraping, and offers a guide to efficient movie data extraction with Oxylabs' IMDb Scraper.

Why scrape IMDb

As one of the best-known entertainment data repositories, IMDb contains tons of data on movies, TV shows, and even video games. Not only is there a lot of data, but it's also extremely varied. For example, you can explore movie descriptions, cast, ratings, trivia, related movies, awards, and more. In addition to that, you’ll find user-generated data, such as reviews.

This wealth of information can be applied for a number of purposes, ranging from market research and movie recommender systems, to strategic marketing initiatives. Furthermore, user reviews present a goldmine for sentiment analysis, which can deepen insights into movie audiences.

1. Setting up for scraping IMDb

As you'll be writing a Python script, make sure you have Python 3.8 or newer installed on your machine.

Creating a virtual environment

A virtual environment is an isolated space where you can install libraries and dependencies without affecting your global Python setup. It's a good practice to create one for each project. Here's how to set it up on different operating systems:

python -m venv imdb_env #Windows
python3 -m venv imdb_env #Mac and Linux

Replace imdb_env with the name you'd like to give to your virtual environment.

Activating the virtual environment

Once the virtual environment is created, you'll need to activate it:

.\imdb_env\Scripts\Activate #Windows
source imdb_env/bin/activate #Mac and Linux

You should see the name of your virtual environment in the terminal, indicating that it's active.

Installing required libraries

We'll use the requests library to make HTTP requests and the pandas library to export the scraped data to CSV later on. Install both by running the following command:

$ pip install requests pandas

And there you have it! Your project environment is ready for IMDb data scraping. In the following sections, we'll take a closer look at Web Scraper API and the structure of IMDb pages.

2. Overview of Web Scraper API

Oxylabs' Web Scraper API allows you to extract data from many complex websites easily. The following is a basic example that shows how Scraper API works.

# scraper_api_demo.py
import requests

USERNAME = "username"
PASSWORD = "password"

payload = {
    "source": "universal",
    "url": "https://www.imdb.com"
}

response = requests.post(
    url="https://realtime.oxylabs.io/v1/queries",
    json=payload,
    auth=(USERNAME,PASSWORD),
)

print(response.json())

After importing requests, you need to replace the credentials with your own, which you can get by registering for a Web Scraper API subscription or getting a free trial. The payload is where you inform the API what and how you want to scrape.

Save this code in a file scraper_api_demo.py and run it. You’ll see that the entire HTML of the page will be printed, along with some additional information from Scraper API.

In the following section, let's examine various parameters we can send in the payload. 

Scraper API parameters

The most critical parameter is source. For IMDb, set the source as universal, which is general-purpose and can handle most domains.

The url parameter is self-explanatory: a direct link to the IMDb page you want to scrape. The code discussed in the previous section sends only these two parameters, so you get the entire HTML of the page in return.

Instead, what you need is parsed data. This is where the parameter parse comes into the picture. When you send parse as True, you must also send one more parameter — parsing_instructions. Combined, these two allow you to get parsed data in a structure you prefer.

The following allows you to get a JSON of the page title:

"title": {
    "_fns": [
                {
                    "_fn": "xpath_one", 
                    "_args": ["//title/text()"]
                }
            ]
        }
},

If you send this as parsing_instructions, the output would be the following JSON:

{'title': 'IMDb Top 250 Movies'}

The _fns key holds a list of functions; each function is identified by the _fn key, with its arguments supplied in _args.

In this example, the function is xpath_one, which takes an XPath and returns the first matching element. On the other hand, the function xpath returns all matching elements.

The functions css_one and css are similar but use CSS selectors instead of XPath. For a complete list of available functions, see the Scraper API documentation.
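For instance, swapping xpath_one for xpath returns all matching elements instead of only the first one. The fragment below is a small illustrative sketch that would collect the text of every h2 heading on a page; the headings key and the XPath here are assumptions chosen for the example, not something IMDb-specific:

"headings": {
    "_fns": [
        {
            "_fn": "xpath",
            "_args": ["//h2/text()"]
        }
    ]
}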

The following code prints the title of the IMDb page:

# imdb_title.py
import requests

USERNAME = "username"
PASSWORD = "password"

payload = {
    "source": "universal",
    "url": "https://www.imdb.com",
    "parse": True,
    "parsing_instructions": {
        "title": {
            "_fns": [
                        {
                            "_fn": "xpath_one",
                            "_args": [
                                "//title/text()"
                                ]
                        }
                    ]
                }
    },
}


response = requests.post(
    url="https://realtime.oxylabs.io/v1/queries",
    json=payload,
    auth=(USERNAME,PASSWORD),
)


print(response.json()['results'][0]['content'])

Run this file to get the title of the IMDb page. In the next section, you’ll scrape movie data from a list.
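If you saved the script as imdb_title.py, you can run it from the activated environment; the exact title text depends on what IMDb returns at the time:

$ python imdb_title.py
{'title': '...'}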

3. Scraping movie info from a list

Before scraping a page, we need to examine the page structure. Open the IMDb top 250 listing in Chrome, right-click the movie list, and select Inspect.

Move your mouse around until you can precisely select one movie list item and its related data.

Inspecting an element

You can use the following XPath to select a single movie list item:

//li[contains(@class,'ipc-metadata-list-summary-item')]

You can then iterate over these 250 items and extract the movie title, year, and rating from each one using the same selector. Let's see how to do it.

First, create the placeholder for movies as follows:

"movies": {
    "_fns": [
        {
            "_fn": "xpath",
            "_args": [
                "//li[contains(@class,'ipc-metadata-list-summary-item')]"
            ]
        }
    ],

Note the use of the xpath function, which returns all matching elements.

Next, we can use the reserved property _items to indicate that we want to iterate over the list, processing each list item separately.

Within _items, the selectors are relative, so they're effectively concatenated to the path already defined:

import json

payload = {
    "source": "universal",
    "url": "https://www.imdb.com/chart/top/?ref_=nv_mv_250",
    "parse": True,
    "parsing_instructions": {
        "movies": {
            "_fns": [
                {
                    "_fn": "xpath",
                    "_args": [
                        "//li[contains(@class,'ipc-metadata-list-summary-item')]"
                    ]
                }
            ],
            "_items": {
                "movie_name": {
                    "_fns": [
                        {
                            "_fn": "xpath_one",
                            "_args": [
                                ".//h3/text()"
                            ]
                        }
                    ]
                },
                "year":{
                    "_fns": [
                        {
                            "_fn": "xpath_one",
                            "_args": [
                                ".//*[contains(@class,'cli-title-metadata-item')]/text()"
                            ]
                        }
                    ]
                },
                "rating": {
                    "_fns": [
                        {
                            "_fn": "xpath_one",
                            "_args": [
                                ".//*[contains(@aria-label,'IMDb rating')]/text()"
                            ]
                        }
                    ]
                }
            }
        }
    }
}

with open("top_250_payload.json", 'w') as f:
    json.dump(payload, f, indent=4)

Note the use of .// at the beginning of the XPath expressions for movie_name, year, and rating; it makes them relative to each list item. A good way to organize your code is to save the payload as a separate JSON file. It will allow you to keep your Python file short:

# parse_top_250.py
import requests
import json

USERNAME = "username"
PASSWORD = "password"

payload = {}
with open("top_250_payload.json") as f:
    payload = json.load(f)


response = requests.post(
    url="https://realtime.oxylabs.io/v1/queries",
    json=payload,
    auth=(USERNAME, PASSWORD),
)


print(response.status_code)


with open("result.json", "w") as f:
    json.dump(response.json(),f, indent=4)

Code and output
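If you'd like a quick sanity check, you can load result.json and print a few of the parsed movies. This is a minimal sketch that assumes the request succeeded and the structure matches the payload above:

# preview_results.py - a quick look at the parsed data
import json

with open("result.json") as f:
    data = json.load(f)

movies = data["results"][0]["content"]["movies"]
print(len(movies))  # expect 250 entries if every list item matched
for movie in movies[:5]:
    print(movie["movie_name"], movie["year"], movie["rating"])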

That’s how it’s done! In the next section, you’ll explore how to scrape movie reviews from IMDb.

4. Scraping movie reviews

Let's scrape the movie reviews of The Shawshank Redemption. You'll use CSS selectors instead of XPath this time, but the basic idea remains the same. You'll use the css function to create a reviews node and then use _items to extract information about each review.

First, take a look at the selectors:

The container for each review can be selected using .imdb-user-review. After that, we can use the following selectors to get various metadata:

  • .title for selecting the review title

  • .display-name-link a for reviewer name

  • .review-date for the review date

  • .content>.show-more__control for the review body

CSS selectors, unlike XPath, cannot directly select the text inside an element. This is where one more Scraper API function becomes useful: element_text.

The element_text function extracts the text from an element. Scraper API lets you chain as many functions as needed, which means you can combine css_one and element_text to select exactly the data you need:

"reviews": {
    "_fns": [
        {
            "_fn": "css",
            "_args": [
                ".imdb-user-review"
            ]
        }
    ],
    "_items": {
        "review_title": {
            "_fns": [
                {
                    "_fn": "css_one",
                    "_args": [
                        ".title"
                    ]
                },
                {
                    "_fn": "element_text"
                }
            ]
        },
}

Similarly, you can extract the other data points. Here's how the payload should look so far:

{
    "source": "universal",
    "url": "https://www.imdb.com/title/tt0111161/reviews?ref_=tt_urv",
    "parse": true,
    "parsing_instructions": {
        "movie_name": {
            "_fns": [
                {
                    "_fn": "css_one",
                    "_args": [
                        ".parent a"
                    ]
                },
                {
                    "_fn": "element_text"
                }
            ]
        },
        "reviews": {
            "_fns": [
                {
                    "_fn": "css",
                    "_args": [
                        ".imdb-user-review"
                    ]
                }
            ],
            "_items": {
                "review_title": {
                    "_fns": [
                        {
                            "_fn": "css_one",
                            "_args": [
                                ".title"
                            ]
                        },
                        {
                            "_fn": "element_text"
                        }
                    ]
                },
                "review-body": {
                    "_fns": [
                        {
                            "_fn": "css_one",
                            "_args": [
                                ".content>.show-more__control"
                            ]
                        },
                        {
                            "_fn": "element_text"
                        }
                    ]
                },
                "rating": {
                    "_fns": [
                        {
                            "_fn": "css_one",
                            "_args": [
                                ".rating-other-user-rating"
                            ]
                        },
                        {
                            "_fn": "element_text"
                        }
                    ]
                },
                "name": {
                    "_fns": [
                        {
                            "_fn": "css_one",
                            "_args": [
                                ".display-name-link a"
                            ]
                        },
                        {
                            "_fn": "element_text"
                        }
                    ]
                },
                "review_date": {
                    "_fns": [
                        {
                            "_fn": "css_one",
                            "_args": [
                                ".review-date"
                            ]
                        },
                        {
                            "_fn": "element_text"
                        }
                    ]
                }
            }
        }
    }
}

Once your payload file is ready, you can use the same Python code file shown in the previous section, point to this payload, and run the code to get the results.
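As a rough sketch, that modified script could look like the following; the reviews_payload.json file name is an assumption, so use whatever name you gave your payload file:

# a sketch of the request part, reusing the pattern from parse_top_250.py
import json
import requests

USERNAME = "username"
PASSWORD = "password"

# the payload file name is an assumption - adjust it to your own
with open("reviews_payload.json") as f:
    payload = json.load(f)

response = requests.post(
    url="https://realtime.oxylabs.io/v1/queries",
    json=payload,
    auth=(USERNAME, PASSWORD),
)

print(response.status_code)
data = response.json()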

Comparing payload and results

5. Exporting to JSON and CSV

The output of Scraper API is JSON, so you can save the extracted data as a JSON file directly. If you want a CSV file, you can use a library such as pandas. Remember that the parsed data is stored under the content key inside results.

As we collected the reviews under the reviews key, we can use the following snippet to save the extracted data:

# parse_reviews.py
import json
import pandas as pd

# data holds the Scraper API response, i.e. data = response.json()
# from the request shown in the previous section

# save the data as a JSON file
with open("results_reviews.json", "w") as f:
    json.dump(data, f, indent=4)

# save the reviews in a CSV file
df = pd.DataFrame(data['results'][0]['content']['reviews'])
df.to_csv('reviews.csv', index=False)

Conclusion

Web Scraper API simplifies web scraping by taking care of the most common data gathering challenges. You can use any language you like, and all you need to do is send the correct payload.

You might also be interested in reading up about scraping other targets such as YouTube, Google News, or Netflix.

Frequently asked questions

Does IMDb allow scraping?

While web scraping publicly available data from IMDb is generally considered legal, it highly depends on factors such as the specific target, local legislation, and how the data is going to be used. We highly recommend that you seek professional legal advice before starting any scraping operations.

To learn more about the legality of web scraping, check here.

How do I scrape IMDb data?

There are a couple of ways that you can scrape movie data. You can either build a custom scraper or buy a commercial one. While a custom one will be more flexible, you’ll have to dedicate lots of resources to bypassing anti-bot systems and parsing the data. On the other hand, a commercial solution takes care of these aspects for you.

How do I scrape a movie review on IMDb?

To scrape a movie review on IMDb, you'll need to use a programming language like Python along with libraries like requests and Beautiful Soup. Alternatively, you can use Python alongside a Web Scraper API.
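As a rough illustration, a requests and Beautiful Soup sketch could look like the one below. It reuses the CSS selectors shown earlier in this guide, but keep in mind that IMDb may block plain requests or change its markup, so treat it as a starting point rather than a finished scraper:

# reviews_bs4_sketch.py - a minimal sketch, not a production-ready scraper
import requests
from bs4 import BeautifulSoup

url = "https://www.imdb.com/title/tt0111161/reviews"
# a browser-like User-Agent header; it may not be enough to avoid blocking
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.get(url, headers=headers, timeout=30)
soup = BeautifulSoup(response.text, "html.parser")

# iterate over review containers using the same selectors as above
for review in soup.select(".imdb-user-review"):
    title = review.select_one(".title")
    body = review.select_one(".content > .show-more__control")
    if title and body:
        print(title.get_text(strip=True))
        print(body.get_text(strip=True)[:100], "\n")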

About the author

Enrika Pavlovskytė

Copywriter

Enrika Pavlovskytė is a Copywriter at Oxylabs. With a background in digital heritage research, she became increasingly fascinated with innovative technologies and started transitioning into the tech world. On her days off, you might find her camping in the wilderness and, perhaps, trying to befriend a fox! Even so, she would never pass up a chance to binge-watch old horror movies on the couch.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
