An Analysis of Lichess Chess Matches

Author

Ana Luisa Bodevan

Published

November 5, 2025

This is a project of data cleaning and analysis of data of +20.000 Lichess chess matches. The objective is to explore common matches patterns and identify variables that may influence in the match outcome.

Technology and Data

Python
- Pandas, NumPy
- Matplotlib, Plotly

Data is from the Chess Game Dataset (Lichess) on Kaggle.

Data Description

Game ID;

Rated (T/F);

Start Time; End Time;

Number of Turns;

Game Status; Winner;

Time Increment;

White Player ID; White Player Rating;

Black Player ID; Black Player Rating;

All Moves in Standard Chess Notation;

Opening Eco (Standardised Code for any given opening); Opening Name; Opening Ply (Number of moves in the opening phase)

0. Setup and Cleaning Data

The data_cleaning.py script cleaned the original games.csv dataset and saved it as games_clean.csv. The process was quick and easy, given that the data was already very well structured. There were no missing values, and the only necessary changes were cleaning out duplicate data and creating a column for openings without their variants, for simplicity.

1. Descriptive Analysis

The eda.py script generated the insights presented. The following code load the necessary libraries, data and calculates the rating_difference variable.

Code

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px

# Load Data
df = pd.read_csv('data/games_clean.csv')

# Feature Engineering
df['rating_difference'] = df['white_rating'] - df['black_rating']

1.1 Summary Statistics

The table below provides a summary of the core numerical features, highlighting the central tendencies and spread of the data.

Code

numerical_cols = ['turns', 'white_rating', 'black_rating', 'opening_ply', 'rating_difference']
df[numerical_cols].describe().round(2)

Summary Statistics of Numerical Features
	turns	white_rating	black_rating	opening_ply	rating_difference
count	19429.00	19429.00	19429.00	19429.00	19429.00
mean	61.06	1597.70	1589.91	4.84	7.79
std	33.14	288.14	288.83	2.79	246.12
min	3.00	784.00	789.00	1.00	-1605.00
25%	38.00	1401.00	1395.00	3.00	-106.00
50%	56.00	1568.00	1563.00	4.00	3.00
75%	79.00	1792.00	1784.00	6.00	121.00
max	349.00	2700.00	2621.00	28.00	1499.00

Ratings: The average rating for both players is high (around 1600), but the range is vast 288 points).
Game Length: The mean game length is 61.06 turns (median 56), showing a slight positive skew (more short games than very long ones).
Rating Difference: The mean difference is 7.79 (White - Black), suggesting White is, on average, the slightly higher-rated player. However, the large standard deviation 246.12 confirms significant rating disparities exist in many individual matchups.

1.2 Feature Correlation

The correlation matrix reveals the linear relationships between the numerical variables.

Player Matching: The strongest correlation is between white_rating and black_rating ($\mathbf{0.64}$), confirming that players are typically matched against opponents of a similar skill level.
Rating vs. Outcome Potential: The rating_difference is strongly correlated with white_rating (0.42) and negatively correlated with black_rating (-0.43). This confirms the engineered feature is highly predictive of the rating skew towards one color.
Turns: Game length has only a weak correlation with ratings (0.12 to 0.15).

2. Game Outcome Analysis

2.1 Winner and Victory Stats Distribution

Winner: White wins the largest share of games, consistent with the first-move advantage in chess.
Victory Status: Resignation is the most common end-game status, followed by Checkmate, suggesting players know they are rarely able to revert a losing position.

2.2 Rating Influence on Outcome

The histogram below shows the distribution of the rating difference, separated by the winner.

The distributions are distinctly separated. The mass of the “white” wins is centered where White had a positive rating advantage, and the mass of the “black” wins is centered where Black had a positive rating advantage (negative rating difference). This visually confirms that the higher-rated player is more likely to win, regardless of color.

2.3 Game Lenght by Victory Status

The box plot illustrates the typical number of turns required for each victory status.

Observations:

Longest Games: Games ending in a Draw tend to be the longest (highest median and upper quartile), as balanced positions require extensive play.
Resignation/Mate: These outcomes occur at a similar median game length, suggesting that decisive mistakes or tactical breakthroughs happen mid-game.

3. Opening Analysis

3.1 Top 10 Openings Winning Rate

Code

top_openings = df['opening_name'].value_counts().nlargest(10).index.tolist()
df_top_openings = df[df['opening_name'].isin(top_openings)]
opening_win_counts = df_top_openings.groupby('opening_name')['winner'].value_counts().unstack(fill_value=0)
total_games = opening_win_counts.sum(axis=1)
opening_win_rates = opening_win_counts.divide(total_games, axis=0) * 100
opening_win_rates = opening_win_rates.sort_values(by='white', ascending=False)
opening_win_rates[['white', 'black', 'draw']].round(2)

Winning Rates for Top 10 Most Frequent Openings (%)
winner	white	black	draw
opening_name
Scandinavian Defense: Mieses-Kotroc Variation	62.85	34.78	2.37
Italian Game: Two Knights Defense	54.81	41.84	3.35
Scotch Game	53.41	42.42	4.17
Horwitz Defense	51.98	45.54	2.48
Queen's Pawn Game: Mason Attack	50.43	44.35	5.22
French Defense: Knight Variation	50.19	44.53	5.28
Queen's Pawn Game: Chigorin Variation	48.66	47.77	3.57
Sicilian Defense	41.86	53.78	4.36
Sicilian Defense: Bowdler Attack	40.48	55.10	4.42
Van't Kruijs Opening	33.23	62.02	4.75

White’s Best: The Scandinavian Defense: Mieses-Kotroc Variation heavily favors White, who wins 62.85 of the time.
Black’s Best: The Van’t Kruijs Opening heavily favors Black, who wins 62% of the time. This suggests that White’s non-standard opening choice is often punished.

4. Conclusions

The primary factor influencing the game outcome is the rating difference between the players, confirming that the higher-rated participant is statistically favored to win, regardless of playing color. White maintains a slight overall winning edge, supporting the notion of the first-move advantage.

Game endings are predominantly characterized by resignation or checkmate rather than time or draw-related outcomes, indicating a decisive result is achieved in the majority of contests. Furthermore, the analysis of popular openings revealed substantial imbalances, with specific lines, such as the Scandinavian Defense: Mieses-Kotroc Variation and the Van’t Kruijs Opening, showing stark bias toward one color.

--- title: "An Analysis of Lichess Chess Matches" author: "Ana Luisa Bodevan" date: "11-05-2025" execute: warning: false message: false eval: true format: html: code-tools: true code-fold: true toc: true --- This is a project of data cleaning and analysis of data of +20.000 Lichess chess matches. The objective is to explore common matches patterns and identify variables that may influence in the match outcome. ## Technology and Data - Python - Pandas, NumPy - Matplotlib, Plotly Data is from the Chess Game Dataset (Lichess) on [Kaggle](https://www.kaggle.com/datasets/datasnaek/chess/data). ### Data Description `Game ID`; `Rated` (T/F); `Start Time`; `End Time`; `Number of Turns`; `Game Status`; `Winner`; `Time Increment`; `White Player ID`; `White Player Rating`; `Black Player ID`; `Black Player Rating`; `All Moves in Standard Chess Notation`; `Opening Eco` (Standardised Code for any given opening); `Opening Name`; `Opening Ply` (Number of moves in the opening phase) ## 0. Setup and Cleaning Data The `data_cleaning.py` script cleaned the original `games.csv` dataset and saved it as `games_clean.csv`. The process was quick and easy, given that the data was already very well structured. There were no missing values, and the only necessary changes were cleaning out duplicate data and creating a column for openings without their variants, for simplicity. ## 1. Descriptive Analysis The `eda.py` script generated the insights presented. The following code load the necessary libraries, data and calculates the `rating_difference` variable. ```{python} #| label: setup #| echo: true import pandas as pd import numpy as np import seaborn as sns import matplotlib.pyplot as plt import plotly.express as px # Load Data df = pd.read_csv('data/games_clean.csv') # Feature Engineering df['rating_difference'] = df['white_rating'] - df['black_rating'] ``` ### 1.1 Summary Statistics The table below provides a summary of the core numerical features, highlighting the central tendencies and spread of the data. ```{python} #| label: descriptive-stats #| tbl-cap: "Summary Statistics of Numerical Features" numerical_cols = ['turns', 'white_rating', 'black_rating', 'opening_ply', 'rating_difference'] df[numerical_cols].describe().round(2) ``` - **Ratings**: The average rating for both players is high (around 1600), but the range is vast **288** points). - **Game Length**: The mean game length is **61.06** turns (median 56), showing a slight positive skew (more short games than very long ones). - **Rating Difference**: The mean difference is **7.79** (White - Black), suggesting White is, on average, the slightly higher-rated player. However, the large standard deviation **246.12** confirms significant rating disparities exist in many individual matchups. ### 1.2 Feature Correlation The correlation matrix reveals the linear relationships between the numerical variables. ![](output/correlation_matrix_heatmap.png){fig-align="center" width="687"} - **Player Matching:** The strongest correlation is between `white_rating` and `black_rating` (\$\\mathbf{0.64}\$), confirming that players are typically matched against opponents of a similar skill level. - **Rating vs. Outcome Potential:** The `rating_difference` is strongly correlated with `white_rating` (0.42) and negatively correlated with `black_rating` (-0.43). This confirms the engineered feature is highly predictive of the rating skew towards one color. - **Turns:** Game length has only a weak correlation with ratings (0.12 to 0.15). ## 2. Game Outcome Analysis ### 2.1 Winner and Victory Stats Distribution ![](output/victory_winner_distributions.png) - **Winner:** White wins the largest share of games, consistent with the first-move advantage in chess. - **Victory Status:** **Resignation** is the most common end-game status, followed by **Checkmate**, suggesting players know they are rarely able to revert a losing position. ### 2.2 Rating Influence on Outcome The histogram below shows the distribution of the rating difference, separated by the winner. ![](output/rating_difference_by_winner_hist.png){fig-align="center" width="683"} The distributions are distinctly separated. The mass of the "white" wins is centered where White had a positive rating advantage, and the mass of the "black" wins is centered where Black had a positive rating advantage (negative rating difference). This visually confirms that the **higher-rated player is more likely to win**, regardless of color. ### 2.3 Game Lenght by Victory Status The box plot illustrates the typical number of turns required for each victory status. ![](output/turns_by_victory_status_boxplot.png){fig-align="center" width="798"} **Observations:** - **Longest Games:** Games ending in a **Draw** tend to be the longest (highest median and upper quartile), as balanced positions require extensive play. - **Resignation/Mate:** These outcomes occur at a similar median game length, suggesting that decisive mistakes or tactical breakthroughs happen mid-game. ## 3. Opening Analysis ### 3.1 Top 10 Openings Winning Rate ```{python} #| label: top-10-openings-win-rates-table #| tbl-cap: "Winning Rates for Top 10 Most Frequent Openings (%)" top_openings = df['opening_name'].value_counts().nlargest(10).index.tolist() df_top_openings = df[df['opening_name'].isin(top_openings)] opening_win_counts = df_top_openings.groupby('opening_name')['winner'].value_counts().unstack(fill_value=0) total_games = opening_win_counts.sum(axis=1) opening_win_rates = opening_win_counts.divide(total_games, axis=0) * 100 opening_win_rates = opening_win_rates.sort_values(by='white', ascending=False) opening_win_rates[['white', 'black', 'draw']].round(2) ``` - **White's Best:** The **Scandinavian Defense: Mieses-Kotroc Variation** heavily favors White, who wins **62.85** of the time. - **Black's Best:** The **Van't Kruijs Opening** heavily favors Black, who wins **62%** of the time. This suggests that White's non-standard opening choice is often punished. ## 4. Conclusions The primary factor influencing the game outcome is the **rating difference** between the players, confirming that the higher-rated participant is statistically favored to win, regardless of playing color. White maintains a slight overall winning edge, supporting the notion of the first-move advantage. Game endings are predominantly characterized by **resignation** or **checkmate** rather than time or draw-related outcomes, indicating a decisive result is achieved in the majority of contests. Furthermore, the analysis of popular openings revealed substantial imbalances, with specific lines, such as the Scandinavian Defense: Mieses-Kotroc Variation and the Van't Kruijs Opening, showing stark bias toward one color.