Netflix Content Strategy Analysis

data analysis

eng

case study

insights on audience engagement

Author

Ana Luisa Bodevan

Published

July 16, 2025

From Statso:

The goal is to analyze Netflix’s content strategy to understand how various factors like content type, language, release season, and timing affect viewership patterns. By identifying the best-performing content and the timing of its release, the aim is to uncover insights into how Netflix maximizes audience engagement throughout the year.

Objectives

The primary goal is to analyze Netflix’s content strategy across multiple dimensions:

Content Performance: Compare viewership patterns between movies and TV shows
Language Impact: Understand how language affects global viewership
Temporal Analysis: Identify optimal release timing (seasons, months, days)
Strategic Insights: Uncover patterns that maximize audience engagement

Methodology

Packages and Setup

Code

library(pacman)
pacman::p_load(tidyverse, dplyr, lubridate, plotly, cluster,
               corrplot, randomForest, forecast, scales, ggtext,
               showtext, kableExtra)

font_add_google("Lato", "lato", regular.wt = 400, bold.wt = 700)
showtext_auto()
showtext_opts(dpi = 300)

theme_set(
  theme_minimal() +
    theme(
      plot.title = element_text(family = 'lato', face = 'bold', color = '#c1071e', size = 14),
      plot.subtitle = element_text(family = 'lato', face = 'bold', color = '#131834', size = 10),
      plot.caption = element_text(family = 'lato', color = '#43465e', size = 8),
      
      panel.background = element_rect(color = '#dedede'),
      panel.grid.minor = element_blank(),
      panel.grid.major = element_blank(),
      
      axis.title.x = element_markdown(family = "lato", hjust = .5, size = 8, color = "grey40"),
      axis.title.y = element_markdown(family = "lato", hjust = .5, size = 8, color = "grey40"),
      axis.text = element_text(family = "lato", hjust = .5, size = 8, color = "grey40")
    )
)

Data Loading and Initial Assessment

Code

netflix_data <- read.csv("netflix_content.csv")

head(netflix_data)

                                     Title Available.Globally. Release.Date
1                The Night Agent: Season 1                 Yes   2023-03-23
2                Ginny & Georgia: Season 2                 Yes   2023-01-05
3 The Glory: Season 1 // 더 글로리: 시즌 1                 Yes   2022-12-30
4                      Wednesday: Season 1                 Yes   2022-11-23
5      Queen Charlotte: A Bridgerton Story                 Yes   2023-05-04
6                            You: Season 4                 Yes   2023-02-09
  Hours.Viewed Language.Indicator Content.Type
1 81,21,00,000            English         Show
2 66,51,00,000            English         Show
3 62,28,00,000             Korean         Show
4 50,77,00,000            English         Show
5 50,30,00,000            English        Movie
6 44,06,00,000            English         Show

Data Processing Pipeline

Upon initial assessment, there are some data cleaning steps necessary.

Code

netflix_data$Hours.Viewed <- as.numeric(gsub(",", "", netflix_data$Hours.Viewed)) # convert to numeric 

netflix_data$Release.Date <- as.Date(netflix_data$Release.Date, format = "%Y-%m-%d") # convert date format 

netflix_data$Available.Globally. <- netflix_data$Available.Globally. == "Yes" # convert to boolean 

netflix_data$Language.Indicator <- as.factor(netflix_data$Language.Indicator)
netflix_data$Content.Type <- as.factor(netflix_data$Content.Type) # convert to factor 

# Extract release month, quarter, day of week
netflix_data$Release.Month <- month(netflix_data$Release.Date, label = TRUE)
netflix_data$Release.Quarter <- quarter(netflix_data$Release.Date)
netflix_data$Release.Day <- wday(netflix_data$Release.Date, label = TRUE)

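# month() and wday() return labels in the session locale (Portuguese here),
# so the abbreviations are recoded to full English names below
netflix_data <- netflix_data %>%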
  mutate(
    Release.Month = case_when(
      Release.Month == "jan" ~ "January",
      Release.Month == "fev" ~ "February", 
      Release.Month == "mar" ~ "March",
      Release.Month == "abr" ~ "April",
      Release.Month == "mai" ~ "May",
      Release.Month == "jun" ~ "June",
      Release.Month == "jul" ~ "July",
      Release.Month == "ago" ~ "August",
      Release.Month == "set" ~ "September",
      Release.Month == "out" ~ "October",
      Release.Month == "nov" ~ "November",
      Release.Month == "dez" ~ "December",
      TRUE ~ Release.Month
    )
  )

netflix_data <- netflix_data %>%
 mutate(
   Release.Day = case_when(
     Release.Day == "seg" ~ "Monday",
     Release.Day == "ter" ~ "Tuesday",
     Release.Day == "qua" ~ "Wednesday",
     Release.Day == "qui" ~ "Thursday",
     Release.Day == "sex" ~ "Friday",
     Release.Day == "sáb" ~ "Saturday",
     Release.Day == "dom" ~ "Sunday",
     TRUE ~ Release.Day
   )
 )
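The recoding above depends on the session locale, because `month(label = TRUE)` and `wday(label = TRUE)` return locale-specific abbreviations. Below is a minimal sketch of a locale-independent alternative using only base R. The helper names `parse_hours()` and `english_labels()` are hypothetical and not part of the original pipeline.

```r
# Hypothetical helper: strip thousands separators (any grouping style) and convert to numeric
parse_hours <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

# Hypothetical helper: English month and weekday names, independent of the session locale
english_labels <- function(dates) {
  lt <- as.POSIXlt(dates)
  day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
                 "Thursday", "Friday", "Saturday")
  data.frame(
    Release.Month = month.name[lt$mon + 1],  # lt$mon runs 0-11
    Release.Day   = day_names[lt$wday + 1]   # lt$wday runs 0-6, with Sunday = 0
  )
}

# Two release dates and two hour strings taken from the head() output above
dates  <- as.Date(c("2023-03-23", "2022-11-23"))
labels <- english_labels(dates)
hours  <- parse_hours(c("81,21,00,000", "62,28,00,000"))
```

For the first row of the dataset (2023-03-23), this yields March and Thursday, and the hours string becomes 812100000, which matches the 0.8B shown in the top-titles table.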

Exploratory Data Analysis

Top releases in 2023

Code

netflix_clean <- netflix_data %>% 
  na.omit()

top_titles <- netflix_data %>%
  filter(!is.na(Hours.Viewed)) %>%  
  arrange(desc(Hours.Viewed)) %>%   
  head(10)                          

top_titles %>%
  select(Title, Content.Type, Language.Indicator, Hours.Viewed) %>%
  mutate(
    Hours_Formatted = paste0(round(Hours.Viewed / 1e9, 1), "B")
  ) %>%
  select(Title, Content.Type, Language.Indicator, Hours_Formatted) %>% 
  kbl(
    caption = "Top 10 Netflix Titles by Hours Viewed",
    col.names = c("Title", "Content Type", "Language", "Hours (Formatted)"),
    align = c("l", "l", "l", "r") 
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  ) %>%
  row_spec(0, background = "#c1071e", color = "white", bold = TRUE) %>%
  column_spec(1, bold = TRUE, border_right = TRUE)

Top 10 Netflix Titles by Hours Viewed
Title	Content Type	Language	Hours (Formatted)
The Night Agent: Season 1	Show	English	0.8B
Ginny & Georgia: Season 2	Show	English	0.7B
King the Land: Limited Series // 킹더랜드: 리미티드 시리즈	Movie	Korean	0.6B
The Glory: Season 1 // 더 글로리: 시즌 1	Show	Korean	0.6B
ONE PIECE: Season 1	Show	English	0.5B
Wednesday: Season 1	Show	English	0.5B
Queen Charlotte: A Bridgerton Story	Movie	English	0.5B
You: Season 4	Show	English	0.4B
La Reina del Sur: Season 3	Show	English	0.4B
Outer Banks: Season 3	Show	English	0.4B

There are some interesting things to notice, namely the predominance of English-language TV shows, with the exception of two very highly placed Korean titles. Next, I will look at the language distribution of Netflix releases.

Language Analysis

Code

language_summary <- netflix_clean %>%
  group_by(Language.Indicator) %>%
  summarise(
    Count = n(),
    Total_Hours = sum(Hours.Viewed),
    Avg_Hours = mean(Hours.Viewed),
    Median_Hours = median(Hours.Viewed),
    .groups = 'drop'
  ) %>%
  arrange(desc(Count)) %>%  
  mutate(
    Language_Label = case_when(
      Language.Indicator == "English" ~ "English",
      Language.Indicator == "Non-English" ~ "Non-English",
      TRUE ~ Language.Indicator
    ),
    # Calculate percentage of total releases
    Percentage = round(Count / sum(Count) * 100, 1)
  )

# Graph showing number of releases by language
ggplot(language_summary, aes(x = reorder(Language_Label, Count), y = Count)) +
  geom_col(fill = "#c1071e", alpha = 0.8, width = 0.6) +
  coord_flip() +
  scale_y_continuous(labels = comma_format(), 
                     expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Netflix Content Releases by Language",
    subtitle = "Number of titles released",
    x = NULL,
    y = "Number of Releases"
  )

So, we can see that English-language releases dominate the catalog. Among non-English languages, Japanese and Korean releases are the most common. The success of Korean releases (two of the five most-watched titles) suggests strong audience demand for Korean-made movies and TV shows.

Content Type Analysis

Statistical Summary

Code

content_summary <- netflix_data %>%
  group_by(Content.Type) %>%
  summarise(
    Count = n(),
    Total_Hours = sum(Hours.Viewed),
    Avg_Hours = mean(Hours.Viewed),
    Median_Hours = median(Hours.Viewed),
    .groups = 'drop'
  ) %>%
  mutate(
    Count_Formatted = format(Count, big.mark = ","),
    
    Total_Hours_Formatted = paste0(round(Total_Hours / 1e9, 0), "B"),
    
    Avg_Hours_Formatted = paste0(round(Avg_Hours / 1e6, 1), "M"),
    
    Median_Hours_Formatted = paste0(round(Median_Hours / 1e6, 1), "M")
  ) %>%
  select(Content.Type, Count_Formatted, Total_Hours_Formatted, 
         Avg_Hours_Formatted, Median_Hours_Formatted)

content_summary %>%
  kbl(
    caption = "Netflix Content Performance by Type",
    col.names = c("Content Type", "Count", "Total Hours", "Average Hours", "Median Hours"),
    align = c("l", "r", "r", "r", "r")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    position = "center"
  ) %>%
  row_spec(0, background = "#c1071e", color = "white", bold = TRUE) %>%
  row_spec(1, background = "#f8f9fa") %>%
  row_spec(2, background = "#e9ecef") %>%
  column_spec(1, bold = TRUE, border_right = TRUE, width = "2cm") %>%
  column_spec(2:5, width = "2.5cm")

Netflix Content Performance by Type
Content Type	Count	Total Hours	Average Hours	Median Hours
Movie	14,104	51B	3.6M	0.5M
Show	10,708	108B	10.1M	2.5M

Code

ggplot(netflix_data, aes(x = Content.Type, y = Hours.Viewed, fill = Content.Type)) +
  geom_col(alpha = 0.9) +
  scale_fill_manual(values = c("#c1071e", "#131834")) +
  scale_y_continuous(labels = label_number(scale = 1e-9)) +
  labs(
    title = "Viewership Performance by Content Type",
    subtitle = "Movies vs TV Shows viewership comparison",
    x = "Content Type",
    y = "Hours Viewed (Billions)"
  ) +
  theme(legend.position = "none")

Thus:

Movies account for more titles (14,104), but lower performance per title
TV shows account for fewer titles (10,708), but much higher performance per title
- Average viewership: Shows average ~2.8x the hours of movies (10.1M vs 3.6M hours)
- Median viewership: Shows have ~5x the median hours of movies (2.5M vs 0.5M hours)
- Total viewership: Despite fewer titles, shows generate ~2.1x the total hours of movies (108B vs 51B)
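The show-to-movie ratios can be recomputed directly from the "Netflix Content Performance by Type" table. The sketch below is a quick check, not part of the original analysis. It uses the rounded values as displayed in the table, so the ratios are approximate, especially for the median. The name `by_type` is introduced here for illustration only.

```r
# Values copied from the "Netflix Content Performance by Type" table (rounded as displayed)
by_type <- data.frame(
  type   = c("Movie", "Show"),
  total  = c(51e9, 108e9),
  avg    = c(3.6e6, 10.1e6),
  median = c(0.5e6, 2.5e6)
)

# Ratio of shows to movies for each metric
ratios <- sapply(
  by_type[c("total", "avg", "median")],
  function(v) v[by_type$type == "Show"] / v[by_type$type == "Movie"]
)

round(ratios, 1)
```

The mean sits far above the median for both content types, which suggests a right-skewed distribution where a small number of titles account for a large share of the hours.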

So, I infer the following strategic implications:

Content Investment Strategy: TV shows deliver markedly higher viewer engagement per title (used here as a proxy for ROI, since the dataset contains no cost data)

Portfolio Balance: While movies provide content volume, TV shows drive sustained engagement

Resource Allocation: The data suggests prioritizing TV show development for maximum viewership impact

Temporal Analysis

Using the variables created in the data processing step, I will look at how hours viewed are distributed over time, to see whether the month, quarter, or day of the week of a release is related to the engagement it receives. This part uses netflix_clean, which keeps only the rows with no missing values (8,166 of the 24,812 titles).

Monthly Releases Analysis

Code

monthly_summary <- netflix_clean %>%
  group_by(Release.Month) %>%
  summarise(
    Count = n(),
    Total_Hours = sum(Hours.Viewed),
    Avg_Hours = mean(Hours.Viewed),
    Median_Hours = median(Hours.Viewed),
    .groups = 'drop'
  ) %>%
  arrange(desc(Avg_Hours)) 

monthly_summary %>%
  mutate(
    Count_Formatted = format(Count, big.mark = ","),
    Total_Hours_Formatted = paste0(round(Total_Hours / 1e9, 1), "B"),
    Avg_Hours_Formatted = paste0(round(Avg_Hours / 1e6, 1), "M"),
    Median_Hours_Formatted = paste0(round(Median_Hours / 1e6, 1), "M")
  ) %>%
  select(Release.Month, Count_Formatted, Total_Hours_Formatted, 
         Avg_Hours_Formatted, Median_Hours_Formatted) %>%
  kbl(
    caption = "Netflix Content Performance by Release Month",
    col.names = c("Release Month", "Count", "Total Hours", "Average Hours", "Median Hours"),
    align = c("l", "r", "r", "r", "r")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  ) %>%
  row_spec(0, background = "#c1071e", color = "white", bold = TRUE) %>%
  column_spec(1, bold = TRUE, border_right = TRUE)

Netflix Content Performance by Release Month
Release Month	Count	Total Hours	Average Hours	Median Hours
December	787	10.1B	12.8M	2.4M
June	670	8.5B	12.7M	3.1M
February	560	7.1B	12.7M	3.5M
January	608	7.3B	12M	3.5M
May	624	7.1B	11.4M	3.4M
March	690	7.4B	10.8M	3M
April	647	6.9B	10.6M	2.9M
November	734	7.7B	10.6M	2.5M
July	631	6.5B	10.3M	3M
October	802	8.1B	10.1M	2.9M
August	674	6.8B	10.1M	2.8M
September	739	7.3B	9.8M	3.1M

Code

monthly_summary$Release.Month <- factor(monthly_summary$Release.Month, 
                                       levels = month.name)

monthly_summary %>%
  arrange(match(Release.Month, month.name)) %>%
  ggplot(aes(y = Total_Hours, x = Release.Month)) +
  geom_line(group = 1, alpha = 0.9, linewidth = 1.2) + 
  geom_point(size = 2) +
  scale_y_continuous(labels = label_number(scale = 1e-9)) + 
  labs(
    title = "Total Hours Viewed",
    subtitle = "by Month, 2023",
    x = NULL,
    y = "Hours Viewed (Billions)"
  ) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

Quarterly Releases Analysis

Code

quarterly_summary <- netflix_clean %>%
  group_by(Release.Quarter) %>%
  summarise(
    Count = n(),
    Total_Hours = sum(Hours.Viewed),
    Avg_Hours = mean(Hours.Viewed),
    Median_Hours = median(Hours.Viewed),
    .groups = 'drop'
  ) %>%
  mutate(
    Quarter_Label = case_when(
      Release.Quarter == 1 ~ "Q1 (Jan-Mar)",
      Release.Quarter == 2 ~ "Q2 (Apr-Jun)",
      Release.Quarter == 3 ~ "Q3 (Jul-Sep)",
      Release.Quarter == 4 ~ "Q4 (Oct-Dec)"
    )
  )

quarterly_summary %>%
  mutate(
    Count_Formatted = format(Count, big.mark = ","),
    Total_Hours_Formatted = paste0(round(Total_Hours / 1e9, 1), "B"),
    Avg_Hours_Formatted = paste0(round(Avg_Hours / 1e6, 1), "M"),
    Median_Hours_Formatted = paste0(round(Median_Hours / 1e6, 1), "M")
  ) %>%
  select(Quarter_Label, Count_Formatted, Total_Hours_Formatted, 
         Avg_Hours_Formatted, Median_Hours_Formatted) %>%
  kbl(
    caption = "Netflix Content Performance by Quarter",
    col.names = c("Quarter", "Count", "Total Hours", "Average Hours", "Median Hours"),
    align = c("l", "r", "r", "r", "r")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  ) %>%
  row_spec(0, background = "#c1071e", color = "white", bold = TRUE) %>%
  column_spec(1, bold = TRUE, border_right = TRUE)

Netflix Content Performance by Quarter
Quarter	Count	Total Hours	Average Hours	Median Hours
Q1 (Jan-Mar)	1,858	21.8B	11.7M	3.2M
Q2 (Apr-Jun)	1,941	22.5B	11.6M	3.2M
Q3 (Jul-Sep)	2,044	20.6B	10.1M	3M
Q4 (Oct-Dec)	2,323	25.9B	11.2M	2.6M

Day of the Week Releases Analysis

Code

daily_summary <- netflix_clean %>%
  group_by(Release.Day) %>%
  summarise(
    Count = n(),
    Total_Hours = sum(Hours.Viewed),
    Avg_Hours = mean(Hours.Viewed),
    Median_Hours = median(Hours.Viewed),
    .groups = 'drop'
  ) %>%
  arrange(desc(Avg_Hours))

daily_summary %>%
  mutate(
    Count_Formatted = format(Count, big.mark = ","),
    Total_Hours_Formatted = paste0(round(Total_Hours / 1e9, 1), "B"),
    Avg_Hours_Formatted = paste0(round(Avg_Hours / 1e6, 1), "M"),
    Median_Hours_Formatted = paste0(round(Median_Hours / 1e6, 1), "M")
  ) %>%
  select(Release.Day, Count_Formatted, Total_Hours_Formatted, 
         Avg_Hours_Formatted, Median_Hours_Formatted) %>%
  kbl(
    caption = "Netflix Content Daily Performance",
    col.names = c("Release Day", "Count", "Total Hours", "Average Hours", "Median Hours"),
    align = c("l", "r", "r", "r", "r")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  ) %>%
  row_spec(0, background = "#c1071e", color = "white", bold = TRUE) %>%
  column_spec(1, bold = TRUE, border_right = TRUE)

Netflix Content Daily Performance
Release Day	Count	Total Hours	Average Hours	Median Hours
Saturday	238	5.1B	21.5M	4.4M
Thursday	1,145	20.3B	17.7M	3.4M
Wednesday	1,310	15.7B	12M	3.5M
Sunday	179	1.9B	10.8M	2.1M
Friday	3,863	38.2B	9.9M	3.4M
Monday	436	4B	9.1M	3.1M
Tuesday	995	5.6B	5.6M	1.2M

Thus,

Quarter 4 (Oct-Dec) emerges as the strongest release window by total viewership (25.9B hours), although it also has the most releases (2,323) and its per-title average (11.2M) is slightly below Q1 and Q2

Quarter 1 (Jan-Mar) shows strong performance, with the highest per-title average (11.7M), likely still capitalizing on the holidays

Quarter 3 (Jul-Sep) shows the lowest engagement (20.6B total, 10.1M average), which may reflect seasonal viewing patterns where audiences are more occupied with summer (northern hemisphere) and mid-year vacations (southern hemisphere); Quarter 2 (Apr-Jun) holds up close to Q1

Strategic Timing: The data is consistent with a seasonal strategy in which Netflix concentrates releases in periods of high audience availability, particularly the northern-hemisphere winter months
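Because quarters differ in how many titles they release, total hours and per-title hours can rank them differently. The sketch below separates the two using the counts and totals from the "Netflix Content Performance by Quarter" table. The data frame `quarters` and its column names are introduced here for illustration, and the inputs are the rounded figures as displayed.

```r
# Values copied from the "Netflix Content Performance by Quarter" table
quarters <- data.frame(
  quarter = c("Q1", "Q2", "Q3", "Q4"),
  count   = c(1858, 1941, 2044, 2323),
  total   = c(21.8e9, 22.5e9, 20.6e9, 25.9e9)
)

# Per-title average implied by the table, and each quarter's share of releases
quarters$avg_per_title <- quarters$total / quarters$count
quarters$release_share <- quarters$count / sum(quarters$count)

# Rank quarters by per-title average rather than by total hours
quarters[order(-quarters$avg_per_title), ]
```

Ranked this way, Q4 leads on total hours while Q1 leads on hours per title, so the choice of metric matters when calling a window "optimal".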

Content vs. Temporal Analysis

Code

monthly_type <- netflix_clean %>%
  group_by(Release.Month, Content.Type) %>%
  summarise(
    Count = n(),
    Total_Hours = sum(Hours.Viewed),
    Avg_Hours = mean(Hours.Viewed),
    Median_Hours = median(Hours.Viewed),
    .groups = "drop"
  )

monthly_type$Release.Month <- factor(
  monthly_type$Release.Month, 
  levels = month.name
)

ggplot(monthly_type, aes(x = Release.Month, y = Avg_Hours, fill = Content.Type)) +
  geom_col(position = "dodge", alpha = 0.9) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
  scale_fill_manual(values = c("#c1071e", "#131834")) +
  labs(
    title = "Monthly Average Viewership",
    subtitle = "By content type",
    x = NULL,
    y = "Avg. Hours Viewed (Millions)",
    fill = "Content Type"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

TV shows outperform movies in every month
The performance gap widens in Q4, suggesting strong demand for binge-worthy series during holiday periods
The best months for releasing high-impact shows are October and November
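One way to quantify the gap between shows and movies is the ratio of their monthly averages. The sketch below shows the calculation on a toy stand-in for `monthly_type`. The `Avg_Hours` values are made up purely to illustrate the mechanics and are not results from the dataset. The names `monthly_type_toy`, `wide` and `gap` are hypothetical.

```r
# Toy stand-in for `monthly_type`; the Avg_Hours values are placeholders, not real results
monthly_type_toy <- data.frame(
  Release.Month = rep(c("October", "November"), each = 2),
  Content.Type  = rep(c("Movie", "Show"), times = 2),
  Avg_Hours     = c(4e6, 10e6, 5e6, 15e6)
)

# Reshape to one row per month and one column per content type
wide <- tapply(
  monthly_type_toy$Avg_Hours,
  list(monthly_type_toy$Release.Month, monthly_type_toy$Content.Type),
  mean
)

# Show-to-movie ratio per month; values above 1 mean shows outperform movies
gap <- wide[, "Show"] / wide[, "Movie"]
gap
```

Running the same reshaping on the real `monthly_type` table would give one ratio per month, which could then be compared across quarters.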

Conclusion

This analysis demonstrates how language, content type, and release timing influence Netflix viewership. Based on the findings, I recommend the following:

Invest Heavily in TV Shows: Given their higher engagement per title, prioritize show development over films.
Capitalize on Q4 & Q1 Windows: Maximize releases during Oct–Mar when engagement peaks.
Expand Successful Non-English Offerings: Korean content, despite lower volume, has outsized performance.