Pandas, Python, and Popcorn - Analyzing My Netflix History

Sep 18, 2024

A person with curly hair sits on a couch, holding a remote control, watching a nature scene on a large wall-mounted TV. A fireplace is visible below the TV, and houseplants add to the cozy atmosphere. — Immersed in the world of streaming, a person enjoys content on their TV.

We’ve all been there: one minute, you’re just “checking out” a new show, and the next, you’re three seasons deep, wondering when you last blinked or saw sunlight. With that in mind, I thought it would be fun to start a new project and document it along the way in this article.

We can call it a projumentary! No, wait, docuproject? Let's go with projumentary!

It’s time to be vulnerable and share my Netflix history with the world. :)

Turns out I watch a lot of TV…

Infographic showing Netflix viewing statistics. It displays 4,739 total titles watched, Bones as the most-watched show with 165 episodes, Sunday as the biggest binge day with 826 episodes, a pie chart of most-watched genres dominated by Drama at 62.8%, and a list of top 5 most-watched shows including Bones, Supernatural, Gilmore Girls, Grey's Anatomy, and Parenthood. — Netflix viewing habits revealed, a breakdown of most-watched genres and shows.

The numbers are in, and yes, I probably watch too much TV.

My ideal evening involves spending quality time with Lorelai and Rory in Gilmore Girls, followed by a dose of hospital drama in Grey’s Anatomy, and wrapping up with the familial bonds in Parenthood.

But enough about my TV habits. Let's talk about how you can dive into your own Netflix watch history! In this next section, I’m going to guide you through the steps I took to extract, clean, and analyze my Netflix viewing data so you can better understand how many hours you’ve actually spent watching Stranger Things, Welcome to Wrexham, or whatever favorite show has your attention right now.

First things first!

How the heck do you even get your Netflix history?

The gateway to your Netflix viewing history data.

I had no idea.

But, after a quick Google search, I was locked and loaded with instructions:

Login to Netflix
Profile Settings
Viewing Activity
You’ll find your watch history and ratings
Scroll to the bottom, and there will be an option to download your viewing data

Saved you the Google search :)
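If you want a quick peek before any real analysis, the download is a plain CSV. The sketch below assumes it has a Title column and a Date column (the two columns the code later in this post reads); the sample rows and the preview_rows helper are made up for illustration.

```python
import csv
import io

# Made-up sample in the shape the later code assumes: a Title and a Date column.
SAMPLE_EXPORT = """Title,Date
"Bones: Season 1: Pilot",9/15/24
"Gilmore Girls: Season 2: Sadie, Sadie",9/14/24
"""

def preview_rows(csv_text, limit=5):
    """Return up to `limit` rows of the export as dictionaries."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for _, row in zip(range(limit), reader)]

rows = preview_rows(SAMPLE_EXPORT)
for row in rows:
    print(row["Date"], "|", row["Title"])
```

To try it on a real export, you could read the file's text with open(...).read() and pass that in. Titles containing commas are quoted in the sample, which is why the csv module is a safer bet than splitting lines by hand.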

Now that we have our history, let’s get started with the rest of the process.

It’s time to code!

The adorable dog above is totally unrelated to this section. I just really like that dog.

First, you need to set up your environment

A pair of black and white checkered slippers on a patterned rug in front of a bed with white bedding. The image suggests a cozy, home-like setting. — Cozy slippers for a comfortable start.

I kicked things off with Visual Studio Code, but feel free to court any IDE that catches your fancy. The first rule of Coding Club? There are no rules about your coding gear.

My favorite part - we get to start writing the code!

A cozy living room with a white couch, blue throw pillow, and gray blanket. A fireplace is visible in the background, creating a warm and inviting atmosphere for coding. — Find a cozy spot to start writing code.

Now meet the three musketeers: cli.py, file_handler.py, and analysis.py. These scripts are the backbone of our little operation, handling everything from command-line inputs to data crunching.

cli.py starts the party. It uses the click library to read the CSV path from the command line, hands that path to file_handler.py to load the data, and passes the result to analysis.py. Basically, you run it and get results directly in the terminal.

Running the script would look like this in the terminal.

python cli.py --file path/to/NetflixViewingHistory.csv

Here is the final version of cli.py that ties everything together.

import click
from file_handler import load_csv
from analysis import analyze_viewing_history

@click.command()
@click.option('--file', '-f', required=True, type=click.Path(), help='Specify the path to your Netflix viewing history CSV file.')
def main(file: str):
    try:
        data = load_csv(file)
        analyze_viewing_history(data)
        print("Results have been successfully analyzed.")
    except FileNotFoundError:
        print(f"Error: The file '{file}' was not found. Please check the file path and try again.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

if __name__ == '__main__':
    main()

Not fancy, but like a well-oiled door hinge, it’s functional.

Next, we need file_handler.py to break down the CSV file. I decided to keep it simple and use the pandas library. Pandas is a Python library that helps with data manipulation and the interpretation of raw data to make it useful. Thank you, Mr. Wes McKinney (the creator of pandas).

Here is the final version of file_handler.py.

import pandas as pd

def load_csv(filepath):
    """Loads Netflix history from a CSV file into a DataFrame."""
    try:
        return pd.read_csv(filepath)
    except FileNotFoundError:
        raise FileNotFoundError(f"File '{filepath}' not found.")
    except Exception as e:
        raise Exception(f"An error occurred: {e}")

cli.py passes the file path to file_handler.py, where load_csv() reads the file and returns the data as a pandas DataFrame, a data structure for organizing and analyzing tabular data. The DataFrame is then passed to analysis.py.

analysis.py is the script that handles cleaning, tagging, and getting insights from the Netflix data. It is a troublemaker. The testing and debugging section below goes into more detail on analysis.py.

Next, we have to install dependencies

Close-up of a succulent plant in a dark pot. The plant has spiky, green leaves with white spots, symbolizing the growth and development in coding projects. — Close-up of a vibrant succulent.

Before we dive into the code sea, let’s make sure we’ve got all our gear. This means installing all the required libraries to keep the boat afloat.

To avoid installing each dependency manually, I created a requirements.txt file by compiling a list of dependencies for each script as they were built.

pip install -r requirements.txt

Running this will install all the necessary dependencies automatically. Now we can start testing it out.

Building the tool was smooth... until it wasn’t. Time to test and debug :(

A modern kitchen counter with a bowl of popcorn and a glass of orange juice. Lemons are visible on a wooden tray in the background. The scene is bathed in warm, sunset light from a nearby window. — Time to fuel up with some snacks.

Let's be real, bugs are part of the journey. Run your code, watch it fail (it probably will), and then whip out your detective hat because it’s time to debug. This phase can feel like solving a murder mystery, where the victim is your sanity, and the murderer is usually a missing colon or a typo.

When it came to testing analysis.py, I had a lot of issues. Here are some of the highlights.

Issue #1 - Squeaky clean titles

My first “oh fudge” moment hit when I analyzed the CSV. My initial regular expressions (regex) were too broad. They cleaned the titles too well, leaving them generic or completely blank, which meant most of them got tagged as "Unknown" when I tried matching them against my genre_dict in the analysis file. After some regex tweaking, poof, it was fixed.

import re

def clean_title(title):
    """Remove generic episode and season labels."""
    title_cleaned = re.sub(r'Season\s\d+:.*|:\sEpisode\s\d+', '', title).strip()
    return title_cleaned

Now, instead of Bones: Episode 1, we just have good ol’ Bones.
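To see what that pattern does on a few title shapes, here is a small sketch. The sample titles are made up, and clean_title_no_colon is a hypothetical variant of mine, not something from the project repo. It is there because titles shaped like "Show: Season N: Episode name" come out with a dangling colon, which may matter when matching against genre_dict.

```python
import re

# Pattern copied from the snippet above.
LABEL_PATTERN = r'Season\s\d+:.*|:\sEpisode\s\d+'

def clean_title(title):
    """Remove generic episode and season labels (same regex as the post)."""
    return re.sub(LABEL_PATTERN, '', title).strip()

def clean_title_no_colon(title):
    """Hypothetical variant: also drop a colon left dangling at the end."""
    return clean_title(title).rstrip(':').strip()

# Made-up sample titles covering three shapes.
samples = ["Bones: Episode 1", "Bones: Season 1: Pilot", "The Irishman"]
for sample in samples:
    print(repr(clean_title(sample)), repr(clean_title_no_colon(sample)))
```

With these samples, the original pattern turns "Bones: Season 1: Pilot" into "Bones:" while the variant gives "Bones". Whether that extra step is needed depends on how the titles in your own export are shaped.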

Issue #2 - Date format shenanigans

Netflix gave me dates in a mm/dd/yy format with two-digit years, so I still had to tweak things to get useful results. I used explicit date parsing to clean things up.

df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%y')

Using pandas, I converted the ‘Date’ column from strings to proper datetime values with an explicit format, avoiding the century cutoff drama. The century cutoff refers to how Python's time.strptime() interprets two-digit years with the %y directive: values 00 to 68 map to 2000 to 2068, and values 69 to 99 map to 1969 to 1999. So 24 lands in 2024 instead of mysteriously jumping back to 1924. For more details, you can check out the Python docs here.
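Here is a tiny standard-library sketch of that cutoff, using datetime.strptime rather than pandas so it runs anywhere. The parse_netflix_date helper and the sample dates are made up for illustration.

```python
from datetime import datetime

def parse_netflix_date(text):
    """Parse a month/day/two-digit-year string such as '9/18/24'."""
    return datetime.strptime(text, "%m/%d/%y")

print(parse_netflix_date("9/18/24").year)   # 24 falls in the 00-68 range
print(parse_netflix_date("12/31/68").year)  # last value mapped to the 2000s
print(parse_netflix_date("1/1/69").year)    # first value mapped to the 1900s
```

The three lines print 2024, 2068, and 1969, which is the pivot described in the Python docs.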

Issue #3 - Fuzzy wuzzy was a bear

At this point most of my data was still marked as “Unknown” for the genre despite cleaning up the titles. I brought in genre_dict, which helps map the shows to their genres.

genre_dict = {
    'Breaking Bad': 'Action/Thriller',
    'Naruto': 'Anime',
    'Too Hot to Handle': 'Reality',
    # More shows...
}

“A” for effort, but a lot of the genres were still “Unknown” due to slight title variations. This is where Fuzzywuzzy (a nursery rhyme) can solve our problems. Just kidding: FuzzyWuzzy is a string-matching library. It scores how similar two strings are on a scale of 0 to 100, so a title can be matched to its closest entry in genre_dict even when the two aren't identical. For more info, you can check out the FuzzyWuzzy PyPI page.

from fuzzywuzzy import process

def match_genre(title, genre_dict):
    """Fuzzy match the genre."""
    best_match, score = process.extractOne(title, genre_dict.keys())
    return genre_dict[best_match] if score >= 65 else "Unknown"

df['Genre'] = df['Title'].apply(lambda title: match_genre(title, genre_dict))

Initially, many shows and movies were categorized as “Unknown” because the cutoff score was set too high. Dropping the cutoff from 80 to 65 struck a better balance between flexibility and accuracy. This way, the script doesn't get stuck on little things like typos or extra words.

The basic insights come from analyze_viewing_history() in analysis.py. Here is an excerpt:

import pandas as pd
import re
from fuzzywuzzy import process

def analyze_viewing_history(df):
    """Analyzes Netflix viewing data and prints results."""
    
    # Clean up titles
    df['CleanedTitle'] = df['Title'].apply(clean_title)
    
    # Basic insights
    total_watched = len(df)
    most_watched_show = df['CleanedTitle'].replace('', pd.NA).dropna().value_counts().idxmax()
    
    print(f"Total shows/movies watched: {total_watched}")
    print(f"Most watched show/movie: {most_watched_show}")

To sum it up, Fuzzywuzzy made sure that even if the title wasn’t an exact match, we could still categorize it properly. Thanks, fuzzy bear. The final version of analysis.py can be found in the project repo.
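If you're curious why the cutoff matters so much, here is a rough sketch of the 80 versus 65 trade-off. It uses the standard library's difflib as a stand-in for FuzzyWuzzy, so the scores are not guaranteed to match what process.extractOne returns. The similarity helper, the cutoff parameter, and the sample title are all made up for illustration.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Score two strings from 0 to 100, ignoring case (difflib stand-in)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_genre(title, genre_dict, cutoff):
    """Return the genre of the closest key, or 'Unknown' below the cutoff."""
    best = max(genre_dict, key=lambda known: similarity(title, known))
    return genre_dict[best] if similarity(title, best) >= cutoff else "Unknown"

genre_dict = {"Breaking Bad": "Action/Thriller", "Naruto": "Anime"}
title = "Breaking Bad (Trailer)"  # made-up variation of a known title

print(round(similarity(title, "Breaking Bad"), 1))
print(match_genre(title, genre_dict, cutoff=80))
print(match_genre(title, genre_dict, cutoff=65))
```

In this sketch the extra "(Trailer)" drags the score down to about 70, so the title is "Unknown" at a cutoff of 80 and "Action/Thriller" at 65. A lower cutoff rescues near-misses like this one, at the cost of occasionally accepting a wrong match.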

Refactoring to make the data a bit more useful

A bright, clean bathroom with white tile walls. A white bathrobe hangs on a hook, and there's a bathtub visible. The image conveys a sense of freshness and organization. — Sunlit bathroom with plush robe.

It’s pajama time!

I love getting into my PJs and bundling up with a blanket when catching up on the latest show. The Netflix history CSV tells us what I watched and when, but what day is my ultimate binge-fest day?

df['DayOfWeek'] = df['Date'].dt.day_name()
most_watched_day = df['DayOfWeek'].value_counts().idxmax()
day_breakdown = df['DayOfWeek'].value_counts()

print(f"\nMost binge-watched day of the week: {most_watched_day}")
print(day_breakdown)

This snippet takes the date from each entry, converts it to the day of the week, and counts how often I watched something on each day. Sunday is my top binge day.

How much time have I sacrificed to the great streaming overlord?

The CSV didn’t include the runtime for any of the shows or movies, so I made a rough estimate of 45 minutes per episode for TV shows. I know movies are usually longer, but for simplicity, I stuck with that estimate across the board.

avg_episode_duration = 45  # minutes
total_time_watched = total_watched * avg_episode_duration
print(f"\nTotal Time in Hours: {total_time_watched / 60:.2f} hours")

This snippet multiplies the total number of shows and movies I’ve watched by 45 minutes (a rough average for episodes). Then, it converts that into hours to show how much time I’ve spent in Netflix-land.

Brace yourself, this is how much life Netflix has consumed from my soul.

3,554.25 hours of pure Netflix juiciness.

No regrets. And before you freak out, no, that’s not in just one year!
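As a sanity check, that figure falls straight out of the 4,739 titles from the infographic and the flat 45-minute estimate. The constant names and the estimate_hours helper below are mine, just to show the arithmetic.

```python
TOTAL_TITLES = 4739       # total titles watched, from the infographic
MINUTES_PER_TITLE = 45    # the flat estimate used for every title

def estimate_hours(title_count, minutes_each):
    """Convert a count of titles into estimated viewing hours."""
    return title_count * minutes_each / 60

hours = estimate_hours(TOTAL_TITLES, MINUTES_PER_TITLE)
print(f"{hours:,.2f} hours")             # 3,554.25 hours
print(f"{hours / 24:.1f} days nonstop")  # 148.1 days nonstop
```

Since movies run longer than 45 minutes and plenty of episodes run shorter, treat the total as a ballpark rather than a measurement.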

What did we learn today in class?

Analyzing viewing patterns helps target content better or tweak marketing strategies based on what people actually watch.

For big companies like Netflix, this data is invaluable. It allows them to recommend content more accurately, retain users by keeping them engaged, and even decide which shows to invest in for future production. Real-world examples of this are the live-action adaptations of the famous anime “Cowboy Bebop” and the famous manga “One Piece”.

Although "Cowboy Bebop" made it into Netflix's Top 10 and racked up almost 74 million viewing hours worldwide right after its release, the number of viewers plummeted by 59 percent in the very next week. The series got axed after just one season, ouch.

On the flip side, “One Piece” has been a true juggernaut, shattering records by becoming the No. 1 title globally on Netflix with over 37.8 million views in less than two weeks after its release. This live-action adaptation was viewed 71.6 million times and amassed a whopping 541.9 million viewing hours! With all that success, it's no wonder they've already started shooting the second season.

Case in point.

This project was a quick win and revealed some things about my viewing habits that I didn’t even know. Want to dig into your data? Here is the repo. Tweak the code and see what shocking stats you can uncover.

Syntax + Glitter
