Pandas, Python, and Popcorn - Analyzing My Netflix History
We’ve all been there: one minute, you’re just “checking out” a new show, and the next, you’re three seasons deep, wondering when you last blinked or saw sunlight. With that in mind, I thought it would be fun to create a new project and document it along the way in this article.
We can call it a projumentary! No, wait—docuproject? We can call it a projumentary!
It’s time to be vulnerable and share my Netflix history with the world. :)
Turns out I watch a lot of TV…
The numbers are in, and yes, I probably watch too much TV.
My ideal evening involves spending quality time with Lorelai and Rory in Gilmore Girls, followed by a dose of hospital drama in Grey’s Anatomy, and wrapping up with the familial bonds in Parenthood.
But enough about my TV habits. Let's talk about how you can dive into your own Netflix watch history! In this next section I’m going to guide you through the steps I took to extract, clean, and analyze my Netflix viewing data so you can better understand how many hours you’ve actually spent watching Stranger Things, Welcome to Wrexham, or whatever favorite show has your attention right now.
First things first!
How the heck do you even get your Netflix history?
I had no idea.
But, after a quick Google search, I was locked and loaded with instructions:
Log in to Netflix
Profile Settings
Viewing Activity
You’ll find your watch history and ratings
Scroll to the bottom, and there will be an option to download your viewing data
Saved you the Google search :)
Now that we have our history, let’s get started with the rest of the process.
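Before writing any real code, it can help to peek at what you downloaded. Here is a minimal sketch, assuming the export has a Title column and a Date column (the two columns the rest of this article relies on). The rows below are made up for illustration and stand in for your own file.

```python
import io

import pandas as pd

# Made-up rows standing in for NetflixViewingHistory.csv (illustrative only).
sample_csv = io.StringIO(
    'Title,Date\n'
    '"Bones: Episode 1","1/7/24"\n'
    '"Gilmore Girls: Season 1: Pilot","1/9/24"\n'
)

# With a real export, pass the file path instead of the StringIO object.
df = pd.read_csv(sample_csv)

print(df.columns.tolist())   # which columns did we get?
print(df.head())             # first few rows
print(f"{len(df)} rows loaded")
```

If your export uses different column names, adjust the later snippets to match.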
It’s time to code!
First, you need to set up your environment
I kicked things off with Visual Studio Code, but feel free to court any IDE that catches your fancy. The first rule of Coding Club? There are no rules about your coding gear.
My favorite part - we get to start writing the code!
Now meet the three musketeers: cli.py, file_handler.py, and analysis.py. These scripts are the backbone of our little operation, handling everything from command-line inputs to data crunching.
cli.py starts the party. It uses the click library to read the CSV file path from the command line, then hands off to the two supporting scripts: file_handler.py loads the file, and analysis.py crunches the data. Basically, you run it and get results directly in the terminal.
Running the script would look like this in the terminal.
python cli.py --file path/to/NetflixViewingHistory.csv
Here is the final version of cli.py that ties everything together.
import click
from file_handler import load_csv
from analysis import analyze_viewing_history


@click.command()
@click.option('--file', '-f', required=True, type=click.Path(), help='Specify the path to your Netflix viewing history CSV file.')
def main(file: str):
    try:
        # Load the CSV into a DataFrame, then run the analysis on it.
        data = load_csv(file)
        analyze_viewing_history(data)
        print("Results have been successfully analyzed.")
    except FileNotFoundError:
        print(f"Error: The file '{file}' was not found. Please check the file path and try again.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")


if __name__ == '__main__':
    main()
Not fancy, but like a well-oiled door hinge, it’s functional.
Next, we need file_handler.py to break down the CSV file. I decided to keep it simple and use the pandas library. Pandas is a Python library that helps with data manipulation and the interpretation of raw data to make it useful. Thank you, Mr. Wes McKinney (the creator of pandas).
Here is the final version of file_handler.py.
import pandas as pd


def load_csv(filepath):
    """Loads Netflix history from a CSV file into a DataFrame."""
    try:
        return pd.read_csv(filepath)
    except FileNotFoundError:
        raise FileNotFoundError(f"File '{filepath}' not found.")
    except Exception as e:
        raise Exception(f"An error occurred: {e}")
cli.py passes the file path to file_handler.py, where load_csv() reads the file and returns the data as a pandas DataFrame, a data structure used for organizing and analyzing tabular data. The DataFrame is then passed to analysis.py.
analysis.py is the script that handles cleaning, tagging, and getting insights from the Netflix data. It is a troublemaker. The testing and debugging section goes into more detail on analysis.py.
Next, we have to install dependencies
Before we dive into the code sea, let’s make sure we’ve got all our gear. This means installing all the required libraries to keep the boat afloat.
To avoid installing each dependency manually, I created a requirements.txt file by compiling a list of dependencies for each script as they were built.
pip install -r requirements.txt
Running this will install all the necessary dependencies automatically. Now we can start testing it out.
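If you want a quick sanity check that the install worked, something like the sketch below can help. It only assumes the three third-party packages imported by the scripts in this article (click, pandas, and fuzzywuzzy); the real requirements.txt may list more. The helper name missing_packages is made up for this example.

```python
from importlib.util import find_spec

# Third-party packages imported by cli.py, file_handler.py, and analysis.py.
REQUIRED = ["click", "pandas", "fuzzywuzzy"]


def missing_packages(names):
    """Return the package names that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```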
Building the tool was smooth... until it wasn’t, time to test and debug :(
Let's be real, bugs are part of the journey. Run your code, watch it fail (it probably will), and then whip out your detective hat because it’s time to debug. This phase can feel like solving a murder mystery, where the victim is your sanity, and the murderer is usually a missing semicolon or a typo.
When it came to testing analysis.py, I had a lot of issues. Here are some of the highlights.
Issue #1 - Squeaky clean titles
My first “oh fudge” moment hit when I analyzed the CSV. My initial regular expressions (regex) were too broad. They cleaned the titles too well, leaving them generic or completely blank, which meant most of them got tagged as "Unknown" when I tried matching them with my genre_dict in the analysis file. After some regex tweaking, poof, it’s fixed.
import re

def clean_title(title):
    """Remove generic episode and season labels."""
    title_cleaned = re.sub(r'Season\s\d+:.*|:\sEpisode\s\d+', '', title).strip()
    return title_cleaned
Now, instead of Bones: Episode 1, we just have good ol’ Bones.
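To see what that pattern actually does, here is a small sketch that runs it on a few made-up titles. The sample titles are illustrative guesses at what rows can look like, not lines from my real history, and clean_title_no_colon is a hypothetical variant that is not part of the project: it just trims the colon the pattern can leave behind on "Season" titles.

```python
import re

# Same pattern as in the article.
PATTERN = r'Season\s\d+:.*|:\sEpisode\s\d+'


def clean_title(title):
    """Remove generic episode and season labels."""
    return re.sub(PATTERN, '', title).strip()


def clean_title_no_colon(title):
    """Hypothetical variant: also drop a trailing colon left after cleaning."""
    return clean_title(title).rstrip(':').strip()


samples = ["Bones: Episode 1", "Gilmore Girls: Season 1: Pilot", "The Irishman"]
for raw in samples:
    print(f"{raw!r} -> {clean_title(raw)!r} | {clean_title_no_colon(raw)!r}")
```

Because every episode of a show ends up with the same cleaned string either way, the counts still group correctly; the variant only tidies the label.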
Issue #2 - Date format shenanigans
Netflix gave me dates in mm/dd/yy format, with two-digit years, so I still had to tweak things to get useful results. I used explicit date parsing to clean things up.
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%y')
Using pandas, I converted the ‘Date’ column from strings to proper datetime values, avoiding the century cutoff drama. The century cutoff refers to how Python's time.strptime() function interprets two-digit years with the %y format: values 69 to 99 become 1969 to 1999, and values 0 to 68 become 2000 to 2068. Now, 2024 stays in 2024, instead of mysteriously jumping back to 1924. For more details, you can check out the Python docs here.
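Here is a tiny sketch of the cutoff in action, using the standard library's datetime.strptime with the same %m/%d/%y format. The dates are made up to land on either side of the cutoff, and parse_year is just a helper name for this example.

```python
from datetime import datetime


def parse_year(raw):
    """Parse an mm/dd/yy string and return the four-digit year Python picks."""
    return datetime.strptime(raw, "%m/%d/%y").year


# Two-digit years 00-68 land in the 2000s, 69-99 land in the 1900s.
for raw in ["01/15/24", "01/15/68", "01/15/69"]:
    print(raw, "->", parse_year(raw))
```

So a viewing date ending in 24 comes out as 2024, which is what we want for Netflix history.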
Issue #3 - Fuzzy wuzzy was a bear
At this point most of my data was still marked as “Unknown” for the genre despite cleaning up the titles. I brought in genre_dict, which helps map the shows to their genres.
genre_dict = {
    'Breaking Bad': 'Action/Thriller',
    'Naruto': 'Anime',
    'Too Hot to Handle': 'Reality',
    # More shows...
}
“A” for effort, but a lot of the genres were still “Unknown” due to slight title variations. This is where Fuzzy Wuzzy (a nursery rhyme) can solve our problems. Just kidding: FuzzyWuzzy, the library the script uses, is a string-matching library. It scores how similar two strings are on a scale of 0 to 100, so a title that only partially matches a key in genre_dict can still be paired with it. For more info, you can check out the FuzzyWuzzy PyPI page.
from fuzzywuzzy import process

def match_genre(title, genre_dict):
    """Fuzzy match the genre."""
    best_match, score = process.extractOne(title, genre_dict.keys())
    return genre_dict[best_match] if score >= 65 else "Unknown"

df['Genre'] = df['Title'].apply(lambda title: match_genre(title, genre_dict))
Initially, many shows and movies were categorized as “Unknown” genres because the cutoff score was set too high. By dropping the cutoff score from 80 to 65, we get a better balance between flexibility and accuracy. This way, the script doesn't get stuck on little things like typos or extra words.
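To get a feel for how a cutoff changes the outcome, here is a rough sketch that uses the standard library's difflib as a stand-in, so it runs without installing anything. This is not FuzzyWuzzy: difflib scores on a 0 to 1 scale and computes similarity differently, so the numbers will not line up with the 80 and 65 above. The function name match_genre_difflib and the sample title are made up for this example.

```python
from difflib import get_close_matches

genre_dict = {
    'Breaking Bad': 'Action/Thriller',
    'Naruto': 'Anime',
    'Too Hot to Handle': 'Reality',
}


def match_genre_difflib(title, genre_dict, cutoff):
    """Return the genre of the closest key, or "Unknown" if nothing clears the cutoff."""
    matches = get_close_matches(title, list(genre_dict), n=1, cutoff=cutoff)
    return genre_dict[matches[0]] if matches else "Unknown"


title = "Breaking Bad (Trailer)"  # made-up title variation
for cutoff in (0.80, 0.65):
    print(cutoff, "->", match_genre_difflib(title, genre_dict, cutoff))
```

The same title falls through to "Unknown" at the stricter cutoff and gets its genre at the looser one, which is the trade-off in a nutshell: too strict and variations are missed, too loose and unrelated titles start matching.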
import pandas as pd
import re
from fuzzywuzzy import process

def analyze_viewing_history(df):
    """Analyzes Netflix viewing data and prints results."""
    # Clean up titles
    df['CleanedTitle'] = df['Title'].apply(clean_title)

    # Basic insights
    total_watched = len(df)
    most_watched_show = df['CleanedTitle'].replace('', pd.NA).dropna().value_counts().idxmax()

    print(f"Total shows/movies watched: {total_watched}")
    print(f"Most watched show/movie: {most_watched_show}")
To sum it up, FuzzyWuzzy made sure that even if the title wasn’t an exact match, we could still categorize it properly. Thanks, fuzzy bear. The final version of analysis.py can be found on the project repo.
Refactoring to make the data a bit more useful
It’s pajama time!
I love getting into my PJs and bundling up with a blanket when catching up on the latest show. The Netflix history CSV tells us what I watched and when, but which day is my ultimate binge-fest day?
df['DayOfWeek'] = df['Date'].dt.day_name()
most_watched_day = df['DayOfWeek'].value_counts().idxmax()
day_breakdown = df['DayOfWeek'].value_counts()
print(f"\nMost binge-watched day of the week: {most_watched_day}")
print(day_breakdown)
This snippet takes the date from each entry, converts it to the day of the week, and counts how often I watched something on each day. Sunday is my top binge day.
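One small tweak you might like: value_counts() sorts by count, so the days come out in whatever order the numbers dictate. The sketch below, run on three made-up dates (two Sundays and a Tuesday), reorders the breakdown Monday through Sunday and fills in zeros for days with no viewing. The calendar_order list is my own addition, not part of the project.

```python
import pandas as pd

# Made-up viewing dates (illustrative only): two Sundays and one Tuesday.
df = pd.DataFrame({"Date": ["1/7/24", "1/14/24", "1/9/24"]})
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%y")
df["DayOfWeek"] = df["Date"].dt.day_name()

calendar_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                  "Friday", "Saturday", "Sunday"]

# Reorder the counts by weekday and show 0 for days that never appear.
day_breakdown = df["DayOfWeek"].value_counts().reindex(calendar_order, fill_value=0)

print(day_breakdown)
print("Top day:", day_breakdown.idxmax())
```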
How much time have I sacrificed to the great streaming overlord?
The CSV didn’t include the runtime for any of the shows or movies, so I made a rough estimate of 45 minutes per episode for TV shows. I know movies are usually longer, but for simplicity, I stuck with that estimate across the board.
avg_episode_duration = 45 # minutes
total_time_watched = total_watched * avg_episode_duration
print(f"\nTotal Time in Hours: {total_time_watched / 60:.2f} hours")
This snippet multiplies the total number of shows and movies I’ve watched by 45 minutes (a rough average for episodes). Then, it converts that into hours to show how much time I’ve spent in Netflix-land.
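If you want a slightly less rough number, one option is to count movies separately. The sketch below is not what the project does. It assumes that a title containing ": Season N" or ": Episode N" is a TV episode and everything else is a movie, and the 100-minute movie length is a placeholder I picked for illustration, so swap in whatever feels right. The titles are made up.

```python
import re

EPISODE_MINUTES = 45   # the article's estimate for TV episodes
MOVIE_MINUTES = 100    # hypothetical placeholder, not from the article

# Assumption: episode rows mention a season or episode number after a colon.
EPISODE_HINT = re.compile(r":\s(?:Season|Episode)\s\d+")


def estimate_minutes(titles, episode_minutes=EPISODE_MINUTES, movie_minutes=MOVIE_MINUTES):
    """Sum an estimated runtime per title, using the episode/movie heuristic."""
    return sum(
        episode_minutes if EPISODE_HINT.search(title) else movie_minutes
        for title in titles
    )


titles = ["Bones: Episode 1", "Gilmore Girls: Season 1: Pilot", "The Irishman"]
total_minutes = estimate_minutes(titles)
print(f"Estimated total: {total_minutes / 60:.2f} hours")
```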
Brace yourself, this is how much life Netflix has consumed from my soul.
3,554.25 hours of pure Netflix juiciness.
No regrets. And before you freak out, no, that’s not in just one year!
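For the curious, the total can be run backwards. Since the estimate is just entries times 45 minutes, the sketch below backs out how many entries that implies and converts the hours into days. The entry count is derived from the total above, not read from the CSV.

```python
AVG_EPISODE_MINUTES = 45   # same estimate used for the total
total_hours = 3554.25      # the total reported above

# Undo the estimate: hours -> minutes -> number of 45-minute entries.
implied_entries = total_hours * 60 / AVG_EPISODE_MINUTES

# The same total expressed as full 24-hour days.
total_days = total_hours / 24

print(f"Implied entries: {implied_entries:.0f}")
print(f"Equivalent days: {total_days:.1f}")
```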
What did we learn today in class?
Analyzing viewing patterns helps target content better or tweak marketing strategies based on what people actually watch.
For big companies like Netflix, this data is invaluable. It allows them to recommend content more accurately, retain users by keeping them engaged, and even decide which shows to invest in for future production. A real-world example of this is the live-action adaptations of the famous anime “Cowboy Bebop” and the famous manga “One Piece”.
Although "Cowboy Bebop" made it into Netflix's Top 10 and racked up almost 74 million viewing hours worldwide right after its release, the number of viewers plummeted by 59 percent in the very next week. The series got axed after just one season, ouch.
On the flip side, “One Piece” has been a true juggernaut, shattering records by becoming the No. 1 title globally on Netflix with over 37.8 million views in less than two weeks of its release. This live-action adaptation was viewed 71.6 million times and amassed a whopping 541.9 million viewing hours! With all that success, it's no wonder they've already started shooting the second season.
Case in point.
This project was a quick win and revealed some things about my viewing habits that I didn’t even know. Want to dig into your data? Here is the repo. Tweak the code and see what shocking stats you can uncover.