Netflix Content Strategy Analysis

data analysis
eng
case study
R
insights on audience engagement
Author

Ana Luisa Bodevan

Published

July 16, 2025

From Statso:

The goal is to analyze Netflix’s content strategy to understand how various factors like content type, language, release season, and timing affect viewership patterns. By identifying the best-performing content and the timing of its release, the aim is to uncover insights into how Netflix maximizes audience engagement throughout the year.

Objectives

The primary goal is to analyze Netflix’s content strategy across multiple dimensions:

  • Content Performance: Compare viewership patterns between movies and TV shows

  • Language Impact: Understand how language affects global viewership

  • Temporal Analysis: Identify optimal release timing (seasons, months, days)

  • Strategic Insights: Uncover patterns that maximize audience engagement

Methodology

Packages and Setup

Code
# pacman installs (if needed) and attaches every listed package in one call
library(pacman)
p_load(tidyverse, dplyr, lubridate, plotly, cluster,
       corrplot, randomForest, forecast, scales, ggtext,
       showtext, kableExtra)

font_add_google("Lato", "lato", regular.wt = 400, bold.wt = 700)
showtext_auto()
showtext_opts(dpi = 300)

theme_set(
  theme_minimal() +
    theme(
      plot.title = element_text(family = 'lato', face = 'bold', color = '#c1071e', size = 14),
      plot.subtitle = element_text(family = 'lato', face = 'bold', color = '#131834', size = 10),
      plot.caption = element_text(family = 'lato', color = '#43465e', size = 8),
      
      panel.background = element_rect(color = '#dedede'),
      panel.grid.minor = element_blank(),
      panel.grid.major = element_blank(),
      
      axis.title.x = element_markdown(family = "lato", hjust = .5, size = 8, color = "grey40"),
      axis.title.y = element_markdown(family = "lato", hjust = .5, size = 8, color = "grey40"),
      axis.text = element_text(family = "lato", hjust = .5, size = 8, color = "grey40")
    )
)

Data Loading and Initial Assessment

Code
netflix_data <- read.csv("netflix_content.csv")

head(netflix_data)
                                     Title Available.Globally. Release.Date
1                The Night Agent: Season 1                 Yes   2023-03-23
2                Ginny & Georgia: Season 2                 Yes   2023-01-05
3 The Glory: Season 1 // 더 글로리: 시즌 1                 Yes   2022-12-30
4                      Wednesday: Season 1                 Yes   2022-11-23
5      Queen Charlotte: A Bridgerton Story                 Yes   2023-05-04
6                            You: Season 4                 Yes   2023-02-09
  Hours.Viewed Language.Indicator Content.Type
1 81,21,00,000            English         Show
2 66,51,00,000            English         Show
3 62,28,00,000             Korean         Show
4 50,77,00,000            English         Show
5 50,30,00,000            English        Movie
6 44,06,00,000            English         Show

Data Processing Pipeline

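The initial assessment shows that several cleaning steps are necessary: Hours.Viewed is stored as comma-separated text, Release.Date as a string, and Available.Globally. as Yes/No.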

Code
netflix_data$Hours.Viewed <- as.numeric(gsub(",", "", netflix_data$Hours.Viewed)) # convert to numeric 

netflix_data$Release.Date <- as.Date(netflix_data$Release.Date, format = "%Y-%m-%d") # convert date format 

netflix_data$Available.Globally. <- netflix_data$Available.Globally. == "Yes" # convert to boolean 

netflix_data$Language.Indicator <- as.factor(netflix_data$Language.Indicator)
netflix_data$Content.Type <- as.factor(netflix_data$Content.Type) # convert to factor 

# Extract release month, quarter, and day of week.
# month(label = TRUE) and wday(label = TRUE) return locale-dependent labels
# (e.g. "fev" or "sáb" under a Portuguese locale), so index fixed English
# names by the numeric month and weekday instead.
netflix_data$Release.Month <- month.name[month(netflix_data$Release.Date)]
netflix_data$Release.Quarter <- quarter(netflix_data$Release.Date)

day_names <- c("Sunday", "Monday", "Tuesday", "Wednesday",
               "Thursday", "Friday", "Saturday")
# week_start = 7 pins Sunday to 1 regardless of the lubridate.week.start option
netflix_data$Release.Day <- day_names[wday(netflix_data$Release.Date, week_start = 7)]
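The Hours.Viewed strings use Indian-style digit grouping (81,21,00,000), so the conversion only works because every comma is stripped before `as.numeric()` is called. A minimal base-R sketch on the six values printed by `head()` above; the helper name `parse_hours` is hypothetical and not part of the original pipeline:

```r
# Hours.Viewed values copied from the head() output above
raw_hours <- c("81,21,00,000", "66,51,00,000", "62,28,00,000",
               "50,77,00,000", "50,30,00,000", "44,06,00,000")

# Hypothetical helper: drop all commas, then convert to numeric.
# The grouping pattern (2-2-3 or 3-3-3) stops mattering once commas are gone.
parse_hours <- function(x) {
  as.numeric(gsub(",", "", x, fixed = TRUE))
}

hours <- parse_hours(raw_hours)

# Guard against silent coercion failures: as.numeric() returns NA
# (with a warning) for anything that is still not a number.
stopifnot(!anyNA(hours))

format(hours[1], big.mark = ",", scientific = FALSE)  # "812,100,000"
```

A check like `anyNA()` right after the conversion is cheap and catches values such as empty strings before they propagate into the summaries.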

Exploratory Data Analysis

Top releases in 2023

Code
netflix_clean <- netflix_data %>% 
  na.omit()

top_titles <- netflix_data %>%
  filter(!is.na(Hours.Viewed)) %>%  
  arrange(desc(Hours.Viewed)) %>%   
  head(10)                          

top_titles %>%
  select(Title, Content.Type, Language.Indicator, Hours.Viewed) %>%
  mutate(
    Hours_Formatted = paste0(round(Hours.Viewed / 1e9, 1), "B")
  ) %>%
  select(Title, Content.Type, Language.Indicator, Hours_Formatted) %>% 
  kbl(
    caption = "Top 10 Netflix Titles by Hours Viewed",
    col.names = c("Title", "Content Type", "Language", "Hours (Formatted)"),
    align = c("l", "l", "l", "r") 
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  ) %>%
  row_spec(0, background = "#c1071e", color = "white", bold = TRUE) %>%
  column_spec(1, bold = TRUE, border_right = TRUE)
Top 10 Netflix Titles by Hours Viewed

Title                                                 Content Type  Language  Hours (Formatted)
The Night Agent: Season 1                             Show          English                0.8B
Ginny & Georgia: Season 2                             Show          English                0.7B
King the Land: Limited Series // 킹더랜드: 리미티드 시리즈  Movie         Korean                 0.6B
The Glory: Season 1 // 더 글로리: 시즌 1                  Show          Korean                 0.6B
ONE PIECE: Season 1                                   Show          English                0.5B
Wednesday: Season 1                                   Show          English                0.5B
Queen Charlotte: A Bridgerton Story                   Movie         English                0.5B
You: Season 4                                         Show          English                0.4B
La Reina del Sur: Season 3                            Show          English                0.4B
Outer Banks: Season 3                                 Show          English                0.4B

There are some interesting things to notice, namely the predominance of English-language TV shows, with the exception of two very highly placed Korean titles. Next, I will look at the language distribution of Netflix releases.

Language Analysis

Code
language_summary <- netflix_clean %>%
  group_by(Language.Indicator) %>%
  summarise(
    Count = n(),
    Total_Hours = sum(Hours.Viewed),
    Avg_Hours = mean(Hours.Viewed),
    Median_Hours = median(Hours.Viewed),
    .groups = 'drop'
  ) %>%
  arrange(desc(Count)) %>%  
  mutate(
    # Plain-text label for plotting (the factor levels are already readable)
    Language_Label = as.character(Language.Indicator),
    # Calculate percentage of total releases
    Percentage = round(Count / sum(Count) * 100, 1)
  )

# Graph showing number of releases by language
ggplot(language_summary, aes(x = reorder(Language_Label, Count), y = Count)) +
  geom_col(fill = "#c1071e", alpha = 0.8, width = 0.6) +
  coord_flip() +
  scale_y_continuous(labels = comma_format(), 
                     expand = expansion(mult = c(0, 0.15))) +
  labs(
    title = "Netflix Content Releases by Language",
    subtitle = "Number of titles released",
    x = NULL,
    y = "Number of Releases"
  )

English-language releases dominate the catalog. Among non-English languages, Japanese and Korean releases are the most common. The success of Korean releases (two of the five most-watched titles) suggests strong audience demand for Korean-made movies and TV shows.
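One way to check whether a language performs above its volume is to compare its share of titles with its share of hours viewed. The sketch below uses only the six rows printed by `head()` earlier, so it illustrates the mechanics and says nothing about the full catalog; the `Over_Index` column is a hypothetical name introduced here:

```r
# Six rows copied from the head() output above (hours already numeric)
sample_rows <- data.frame(
  Language = c("English", "English", "Korean", "English", "English", "English"),
  Hours    = c(812100000, 665100000, 622800000, 507700000, 503000000, 440600000)
)

# Total hours and number of titles per language
by_lang <- aggregate(Hours ~ Language, data = sample_rows, FUN = sum)
title_counts <- table(sample_rows$Language)
by_lang$Titles <- as.vector(title_counts[as.character(by_lang$Language)])

# Share of titles and share of viewing time
by_lang$Title_Share <- by_lang$Titles / sum(by_lang$Titles)
by_lang$Hours_Share <- by_lang$Hours / sum(by_lang$Hours)

# Hypothetical "over-index" ratio: above 1 means the language earns more
# viewing time than its share of titles alone would predict
by_lang$Over_Index <- by_lang$Hours_Share / by_lang$Title_Share

by_lang
```

Run against `language_summary`, the same ratio could be built from the existing Count and Total_Hours columns.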

Content Type Analysis

Statistical Summary

Code
content_summary <- netflix_data %>%
  group_by(Content.Type) %>%
  summarise(
    Count = n(),
    Total_Hours = sum(Hours.Viewed),
    Avg_Hours = mean(Hours.Viewed),
    Median_Hours = median(Hours.Viewed),
    .groups = 'drop'
  ) %>%
  mutate(
    Count_Formatted = format(Count, big.mark = ","),
    
    Total_Hours_Formatted = paste0(round(Total_Hours / 1e9, 0), "B"),
    
    Avg_Hours_Formatted = paste0(round(Avg_Hours / 1e6, 1), "M"),
    
    Median_Hours_Formatted = paste0(round(Median_Hours / 1e6, 1), "M")
  ) %>%
  select(Content.Type, Count_Formatted, Total_Hours_Formatted, 
         Avg_Hours_Formatted, Median_Hours_Formatted)

content_summary %>%
  kbl(
    caption = "Netflix Content Performance by Type",
    col.names = c("Content Type", "Count", "Total Hours", "Average Hours", "Median Hours"),
    align = c("l", "r", "r", "r", "r")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    position = "center"
  ) %>%
  row_spec(0, background = "#c1071e", color = "white", bold = TRUE) %>%
  row_spec(1, background = "#f8f9fa") %>%
  row_spec(2, background = "#e9ecef") %>%
  column_spec(1, bold = TRUE, border_right = TRUE, width = "2cm") %>%
  column_spec(2:5, width = "2.5cm")
Netflix Content Performance by Type

Content Type   Count  Total Hours  Average Hours  Median Hours
Movie         14,104          51B           3.6M          0.5M
Show          10,708         108B          10.1M          2.5M
Code
ggplot(netflix_data, aes(x = Content.Type, y = Hours.Viewed, fill = Content.Type)) +
  geom_col(alpha = 0.9) +
  scale_fill_manual(values = c("#c1071e", "#131834")) +
  scale_y_continuous(labels = label_number(scale = 1e-9)) +
  labs(
    title = "Viewership Performance by Content Type",
    subtitle = "Movies vs TV Shows viewership comparison",
    x = "Content Type",
    y = "Hours Viewed (Billions)"
  ) +
  theme(legend.position = "none")

Thus:

  • Movies had more titles (14,104), but lower individual performance

  • TV shows had fewer titles (10,708), but much higher individual performance

    • Average viewership: Shows get ~2.8x the hours of movies (10.1M vs 3.6M hours)

    • Median viewership: Shows get ~5x the hours of movies (2.5M vs 0.5M hours)

    • Total viewership: Despite fewer titles, shows generate ~2.1x more total hours (108B vs 51B)
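The show-to-movie ratios can be recomputed directly from the rounded figures in the "Netflix Content Performance by Type" table. This is a small arithmetic sketch, not part of the original pipeline; the data frame and the `ratio` helper are hypothetical names:

```r
# Values copied from the "Netflix Content Performance by Type" table
type_summary <- data.frame(
  Type         = c("Movie", "Show"),
  Count        = c(14104, 10708),
  Total_Hours  = c(51e9, 108e9),    # rounded to the nearest billion
  Avg_Hours    = c(3.6e6, 10.1e6),  # rounded to 0.1M
  Median_Hours = c(0.5e6, 2.5e6)    # rounded to 0.1M
)

# Consistency check: total / count should reproduce the reported averages
round(type_summary$Total_Hours / type_summary$Count / 1e6, 1)  # 3.6 10.1

# Show-to-movie ratio for a given column (row 2 is Show, row 1 is Movie)
ratio <- function(col) type_summary[[col]][2] / type_summary[[col]][1]

ratios <- round(c(avg    = ratio("Avg_Hours"),
                  median = ratio("Median_Hours"),
                  total  = ratio("Total_Hours")), 1)
ratios
```

Because the inputs are rounded, the median ratio in particular is only approximate: a median of 0.5M could be anything from 0.45M to 0.55M before rounding.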

So, I infer the following strategic implications:

  • Content Investment Strategy: TV shows deliver more viewer engagement per title; the dataset has no cost figures, so this is a proxy for ROI rather than a measurement of it
  • Portfolio Balance: While movies provide content volume, TV shows drive sustained engagement
  • Resource Allocation: The data suggests prioritizing TV show development for maximum viewership impact

Temporal Analysis

Using the variables created in the data processing step, I will examine how hours viewed are distributed over time to assess whether the month, quarter, or day of the week of a release is related to the engagement it receives.
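Each of the three temporal cuts below applies the same count, total, mean, and median summary to a different grouping column. As a sketch of how that repetition could be factored out, here is a base-R helper demonstrated on the six rows printed by `head()` earlier; `summarise_hours` is a hypothetical name, and the original analysis uses `group_by()` and `summarise()` instead:

```r
# Six rows copied from the head() output above, hours already numeric
sample_rows <- data.frame(
  Release.Date = as.Date(c("2023-03-23", "2023-01-05", "2022-12-30",
                           "2022-11-23", "2023-05-04", "2023-02-09")),
  Hours.Viewed = c(812100000, 665100000, 622800000,
                   507700000, 503000000, 440600000)
)

# Quarter from the month number (1-3 -> 1, 4-6 -> 2, ...)
month_num <- as.integer(format(sample_rows$Release.Date, "%m"))
sample_rows$Release.Quarter <- (month_num - 1) %/% 3 + 1

# Hypothetical helper: count, total, mean and median of Hours.Viewed
# for each level of the grouping column named in `by`
summarise_hours <- function(df, by) {
  groups <- split(df$Hours.Viewed, df[[by]])
  data.frame(
    Group        = names(groups),
    Count        = vapply(groups, length, integer(1)),
    Total_Hours  = vapply(groups, sum, numeric(1)),
    Avg_Hours    = vapply(groups, mean, numeric(1)),
    Median_Hours = vapply(groups, median, numeric(1)),
    row.names    = NULL
  )
}

quarter_summary <- summarise_hours(sample_rows, "Release.Quarter")
quarter_summary
```

The helper assumes rows with missing hours or dates have already been dropped, as `netflix_clean` does with `na.omit()`.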

Monthly Releases Analysis

Code
monthly_summary <- netflix_clean %>%
  group_by(Release.Month) %>%
  summarise(
    Count = n(),
    Total_Hours = sum(Hours.Viewed),
    Avg_Hours = mean(Hours.Viewed),
    Median_Hours = median(Hours.Viewed),
    .groups = 'drop'
  ) %>%
  arrange(desc(Avg_Hours)) 

monthly_summary %>%
  mutate(
    Count_Formatted = format(Count, big.mark = ","),
    Total_Hours_Formatted = paste0(round(Total_Hours / 1e9, 1), "B"),
    Avg_Hours_Formatted = paste0(round(Avg_Hours / 1e6, 1), "M"),
    Median_Hours_Formatted = paste0(round(Median_Hours / 1e6, 1), "M")
  ) %>%
  select(Release.Month, Count_Formatted, Total_Hours_Formatted, 
         Avg_Hours_Formatted, Median_Hours_Formatted) %>%
  kbl(
    caption = "Netflix Content Performance by Release Month",
    col.names = c("Release Month", "Count", "Total Hours", "Average Hours", "Median Hours"),
    align = c("l", "r", "r", "r", "r")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  ) %>%
  row_spec(0, background = "#c1071e", color = "white", bold = TRUE) %>%
  column_spec(1, bold = TRUE, border_right = TRUE)
Netflix Content Performance by Release Month

Release Month  Count  Total Hours  Average Hours  Median Hours
December         787        10.1B          12.8M          2.4M
June             670         8.5B          12.7M          3.1M
February         560         7.1B          12.7M          3.5M
January          608         7.3B            12M          3.5M
May              624         7.1B          11.4M          3.4M
March            690         7.4B          10.8M            3M
April            647         6.9B          10.6M          2.9M
November         734         7.7B          10.6M          2.5M
July             631         6.5B          10.3M            3M
October          802         8.1B          10.1M          2.9M
August           674         6.8B          10.1M          2.8M
September        739         7.3B           9.8M          3.1M
Code
monthly_summary$Release.Month <- factor(monthly_summary$Release.Month, 
                                       levels = month.name)

monthly_summary %>%
  arrange(match(Release.Month, month.name)) %>%
  ggplot(aes(y = Total_Hours, x = Release.Month)) +
  geom_line(group = 1, alpha = 0.9, linewidth = 1.2) + 
  geom_point(size = 2) +
  scale_y_continuous(labels = label_number(scale = 1e-9)) + 
  labs(
    title = "Total Hours Viewed",
    subtitle = "by Month, 2023",
    x = NULL,
    y = "Hours Viewed (Billions)"
  ) +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 45, hjust = 1))

Quarterly Releases Analysis

Code
quarterly_summary <- netflix_clean %>%
  group_by(Release.Quarter) %>%
  summarise(
    Count = n(),
    Total_Hours = sum(Hours.Viewed),
    Avg_Hours = mean(Hours.Viewed),
    Median_Hours = median(Hours.Viewed),
    .groups = 'drop'
  ) %>%
  mutate(
    Quarter_Label = case_when(
      Release.Quarter == 1 ~ "Q1 (Jan-Mar)",
      Release.Quarter == 2 ~ "Q2 (Apr-Jun)",
      Release.Quarter == 3 ~ "Q3 (Jul-Sep)",
      Release.Quarter == 4 ~ "Q4 (Oct-Dec)"
    )
  )

quarterly_summary %>%
  mutate(
    Count_Formatted = format(Count, big.mark = ","),
    Total_Hours_Formatted = paste0(round(Total_Hours / 1e9, 1), "B"),
    Avg_Hours_Formatted = paste0(round(Avg_Hours / 1e6, 1), "M"),
    Median_Hours_Formatted = paste0(round(Median_Hours / 1e6, 1), "M")
  ) %>%
  select(Quarter_Label, Count_Formatted, Total_Hours_Formatted, 
         Avg_Hours_Formatted, Median_Hours_Formatted) %>%
  kbl(
    caption = "Netflix Content Performance by Quarter",
    col.names = c("Quarter", "Count", "Total Hours", "Average Hours", "Median Hours"),
    align = c("l", "r", "r", "r", "r")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  ) %>%
  row_spec(0, background = "#c1071e", color = "white", bold = TRUE) %>%
  column_spec(1, bold = TRUE, border_right = TRUE)
Netflix Content Performance by Quarter

Quarter       Count  Total Hours  Average Hours  Median Hours
Q1 (Jan-Mar)  1,858        21.8B          11.7M          3.2M
Q2 (Apr-Jun)  1,941        22.5B          11.6M          3.2M
Q3 (Jul-Sep)  2,044        20.6B          10.1M            3M
Q4 (Oct-Dec)  2,323        25.9B          11.2M          2.6M

Day of the Week Releases Analysis

Code
daily_summary <- netflix_clean %>%
  group_by(Release.Day) %>%
  summarise(
    Count = n(),
    Total_Hours = sum(Hours.Viewed),
    Avg_Hours = mean(Hours.Viewed),
    Median_Hours = median(Hours.Viewed),
    .groups = 'drop'
  ) %>%
  arrange(desc(Avg_Hours))

daily_summary %>%
  mutate(
    Count_Formatted = format(Count, big.mark = ","),
    Total_Hours_Formatted = paste0(round(Total_Hours / 1e9, 1), "B"),
    Avg_Hours_Formatted = paste0(round(Avg_Hours / 1e6, 1), "M"),
    Median_Hours_Formatted = paste0(round(Median_Hours / 1e6, 1), "M")
  ) %>%
  select(Release.Day, Count_Formatted, Total_Hours_Formatted, 
         Avg_Hours_Formatted, Median_Hours_Formatted) %>%
  kbl(
    caption = "Netflix Content Daily Performance",
    col.names = c("Release Day", "Count", "Total Hours", "Average Hours", "Median Hours"),
    align = c("l", "r", "r", "r", "r")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE
  ) %>%
  row_spec(0, background = "#c1071e", color = "white", bold = TRUE) %>%
  column_spec(1, bold = TRUE, border_right = TRUE)

Netflix Content Daily Performance

Release Day  Count  Total Hours  Average Hours  Median Hours
Saturday       238         5.1B          21.5M          4.4M
Thursday     1,145        20.3B          17.7M          3.4M
Wednesday    1,310        15.7B            12M          3.5M
Sunday         179         1.9B          10.8M          2.1M
Friday       3,863        38.2B           9.9M          3.4M
Monday         436           4B           9.1M          3.1M
Tuesday        995         5.6B           5.6M          1.2M
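In every row of the daily table the mean is several times the median, which suggests that a small number of hit titles pull the averages up. The sketch below, built only from the table values above, computes that mean-to-median ratio and re-ranks the days by median; the column name `Skew_Ratio` is hypothetical:

```r
# Values copied from the "Netflix Content Daily Performance" table
# (averages and medians in millions of hours)
daily <- data.frame(
  Day      = c("Saturday", "Thursday", "Wednesday", "Sunday",
               "Friday", "Monday", "Tuesday"),
  Count    = c(238, 1145, 1310, 179, 3863, 436, 995),
  Avg_M    = c(21.5, 17.7, 12.0, 10.8, 9.9, 9.1, 5.6),
  Median_M = c(4.4, 3.4, 3.5, 2.1, 3.4, 3.1, 1.2)
)

# Mean-to-median ratio: values well above 1 point to a right-skewed
# distribution, where a few titles account for much of the viewing
daily$Skew_Ratio <- round(daily$Avg_M / daily$Median_M, 1)

# Re-rank by median, breaking ties by the number of releases
by_median <- daily[order(-daily$Median_M, -daily$Count),
                   c("Day", "Count", "Median_M", "Skew_Ratio")]
by_median
```

Ranking by median keeps Saturday first and Tuesday last, but Thursday's lead over Wednesday and Friday disappears, so its high average may be driven by a few large titles.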

Thus,

  • Quarter 4 (Oct-Dec) generates the highest total viewership (25.9B hours), although it also has the most releases (2,323) and the lowest median (2.6M)
  • Quarter 1 (Jan-Mar) has the highest average per title (11.7M), likely still capitalizing on the holidays
  • Quarter 3 (Jul-Sep) shows the lowest total (20.6B) and average (10.1M) viewership, suggesting a seasonal dip when audiences may be more occupied with summer (northern hemisphere) and mid-year vacations (southern hemisphere); Quarter 2 (Apr-Jun) stays close to Quarter 1
  • Strategic Timing: The data suggests a seasonal pattern in which release volume and total viewership concentrate in the northern-hemisphere winter months, consistent with Netflix launching high-impact content when audiences are most available
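Total hours and hours per title can rank the quarters differently, because Q4 also has the most releases. A short sketch using only the counts and totals from the quarterly table above (the totals are rounded to 0.1B, so the recomputed averages match the table only to within rounding):

```r
# Counts and totals copied from the "Netflix Content Performance by Quarter" table
quarterly <- data.frame(
  Quarter = c("Q1", "Q2", "Q3", "Q4"),
  Count   = c(1858, 1941, 2044, 2323),
  Total_B = c(21.8, 22.5, 20.6, 25.9)   # billions of hours
)

# Hours per title in millions: (billions * 1000) / number of titles
quarterly$Avg_M <- round(quarterly$Total_B * 1e3 / quarterly$Count, 1)

# Quarter with the largest total vs. the largest per-title average
best_total <- quarterly$Quarter[which.max(quarterly$Total_B)]
best_avg   <- quarterly$Quarter[which.max(quarterly$Avg_M)]

c(best_total = best_total, best_avg = best_avg)
```

On these figures Q4 leads on total hours while Q1 leads per title, so which quarter counts as optimal depends on which metric the release strategy targets.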

Content vs. Temporal Analysis

Code
monthly_type <- netflix_clean %>%
  group_by(Release.Month, Content.Type) %>%
  summarise(
    Count = n(),
    Total_Hours = sum(Hours.Viewed),
    Avg_Hours = mean(Hours.Viewed),
    Median_Hours = median(Hours.Viewed),
    .groups = "drop"
  )

monthly_type$Release.Month <- factor(
  monthly_type$Release.Month, 
  levels = month.name
)

ggplot(monthly_type, aes(x = Release.Month, y = Avg_Hours, fill = Content.Type)) +
  geom_col(position = "dodge", alpha = 0.9) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M")) +
  scale_fill_manual(values = c("#c1071e", "#131834")) +
  labs(
    title = "Monthly Average Viewership",
    subtitle = "By content type",
    x = NULL,
    y = "Avg. Hours Viewed (Millions)",
    fill = "Content Type"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

  • TV shows outperform movies in every month

  • The performance gap widens in Q4, suggesting strong demand for binge-worthy series during holiday periods

  • The best months for releasing high-impact shows are October and November

Conclusion

This analysis demonstrates how language, content type, and release timing influence Netflix viewership. Based on the findings, I recommend the following:

  1. Invest Heavily in TV Shows: Given their higher engagement per title, prioritize show development over films.
  2. Capitalize on Q4 & Q1 Windows: Maximize releases during Oct–Mar when engagement peaks.
  3. Expand Successful Non-English Offerings: Korean content, despite lower volume, has outsized performance.