DEV Community

Ekemini Samuel
Ekemini Samuel

Posted on

Analysing European Soccer Data with Deepnote in Windsurf IDE

Word on the street? Deepnote is now open-source! Jakub Jurových, the Founder of Deepnote, announced on LinkedIn this week.

During my research, I found out that the European Soccer Database from Kaggle has 25K+ matches, 10K+ players, and 2008–2016 seasons across 11 leagues. In this guide will use Deepnote in Windsurf to analyse the European sports data from Kaggle for three use cases:

  1. To find out which team improved most post-2012 from the data?
  2. To predict "star players" and their value, with the Top 10 players by goal impact
  3. To analyse win rates, goal differences, and home/away advantages over seasons for the teams

Let's get started!

Prerequisites

You will need the following to proceed

  • A Kaggle account
  • Windsurf account and IDE installed on your computer
  • A Deepnote account

What is Deepnote?

Deepnote is a cloud-based, collaborative data science notebook platform that enhances Jupyter notebooks with real-time co-editing, integrated data connections, and simplified sharing, acting as a central workspace for teams to develop and deploy data-driven products. It supports Python, R, and SQL

Deepnote

To get started, we will download the European soccer database from Kaggle. Click on the Download button like so

Kaggle

Then click on Download dataset as zip

dataset

When the dataset is downloaded on your computer, extract it to your preferred directory. For this project, I created a directory on my computer called windsurf_deepnote. You can do the same and extract the dataset to that specific folder.

directory

After the dataset is extracted, you get a file named database.sqlite in the directory.

Now that we have the dataset, lets setup Windsurf.

Setting up Windsurf IDE

Windsurf is an integrated development environment (IDE) for coding. It is similar to VS Code, with AI-first features.

Go to the Windsurf website, and download the application suited for your operating system (OS).

Windsurf

After it has downloaded, sign up with your preferred method, and open the Windsurf IDE with the directory you want to work with.

Project

You can also view your Windsurf account activity in the profile section like so

Code

Using Deepnote in Windsurf

Now we have the dataset and Windsurf setup, let's proceed to working with Deepnote in Windsurf.

Sign up here to create your Deepnote account

In the Windsurf IDE, go to the extensions tab at the left sidebar, search for Deepnote and click on the Install button like so

Deepnote

After it has installed, this tab loads. It has information on how to get started using Deepnote in Windsurf

Deepnote

Next, use the Ctrl+Shift+P command to create a new Deepnote project, and name it soccer.
soccer

It then creates a new project called soccer and loads a new Deepnote block, just as in the Deepnote web app UI

Deepnote

We are using Kaggle to get the football/soccer data that we will be analysing.

You can enter python code in blocks and run like so

Deepnote

If you want to convert a notebook to a deepnote notebook, open your terminal and run this command:

npx @deepnote/convert notebook.ipynb # This will convert the notebook and create notebook.deepnote 
Enter fullscreen mode Exit fullscreen mode

Enter yes (y) for the prompts that come up during installation

After it has been installed, Then open your .deepnote file in Windsurf IDE.

Implementing the data analysis of the three use Cases

From the soccer dataset, we will run each of this Python codes to analyse the data.

Find Teams That Improved Most Post-2012

We will find which teams improved most after 2012, by comparing average team performance (win rate, goals) before and after 2012

Enter this code:

import sqlite3 import pandas as pd import matplotlib.pyplot as plt # Connect to the SQLite database conn = sqlite3.connect("database.sqlite") # Load match data matches = pd.read_sql_query(""" SELECT m.id, m.season, m.home_team_api_id, m.away_team_api_id, m.home_team_goal, m.away_team_goal, t.team_long_name AS home_team FROM Match m JOIN Team t ON m.home_team_api_id = t.team_api_id """, conn) # Compute win rate per team per season matches["result"] = matches.apply( lambda x: "win" if x["home_team_goal"] > x["away_team_goal"] else ("loss" if x["home_team_goal"] < x["away_team_goal"] else "draw"), axis=1) team_stats = matches.groupby(["home_team", "season"])["result"].value_counts().unstack().fillna(0) team_stats["win_rate"] = team_stats["win"] / (team_stats["win"] + team_stats["loss"] + team_stats["draw"]) # Split pre- and post-2012 averages pre_2012 = team_stats[team_stats.index.get_level_values("season") < "2012/2013"].groupby("home_team")["win_rate"].mean() post_2012 = team_stats[team_stats.index.get_level_values("season") >= "2012/2013"].groupby("home_team")["win_rate"].mean() improvement = (post_2012 - pre_2012).dropna().sort_values(ascending=False).head(10) # Visualize plt.figure(figsize=(10,5)) improvement.plot(kind="bar", color="green") plt.title("Top 10 Most Improved Teams Post-2012") plt.ylabel("Win Rate Improvement") plt.tight_layout() plt.show() 
Enter fullscreen mode Exit fullscreen mode

After we run the code block above, we get this visualization:

dataset

From the data, we confirm that Juventus showed the biggest leap after 2012, improving their win rate by +30%.

Borussia Mönchengladbach and AS Saint-Étienne followed closely, both gaining around 25%, proving that the post-2012 era rewarded better management, new ownership money, and tactics.

Predict Star Players by Goal Impact

Next we predict star players and rank the Top 10 by goal impact score, by combining player attributes and match stats, then using regression to find impact.

Enter this python code block and run it:

import sqlite3 import pandas as pd from sklearn.linear_model import LinearRegression import matplotlib.pyplot as plt conn = sqlite3.connect("database.sqlite") # Load player data with names players = pd.read_sql_query(""" SELECT p.player_api_id, p.player_name, pa.overall_rating, pa.potential FROM Player p JOIN Player_Attributes pa ON p.player_api_id = pa.player_api_id """, conn) # Load match data goals = pd.read_sql_query(""" SELECT home_player_1, home_player_2, home_player_3, home_team_goal FROM Match """, conn) # Aggregate goals by player player_goals = goals.melt( value_vars=['home_player_1', 'home_player_2', 'home_player_3'], var_name='position', value_name='player_api_id' ) player_goals = player_goals.groupby('player_api_id').size().reset_index(name='goal_count') # Merge with player data and calculate impact score merged = players.groupby(['player_api_id', 'player_name']).mean().reset_index() merged = merged.merge(player_goals, on='player_api_id', how='inner').dropna() # Model goal impact X = merged[['overall_rating', 'potential']] y = merged['goal_count'] model = LinearRegression().fit(X, y) merged['impact_score'] = model.predict(X) # Get top 10 players top_players = merged.sort_values('impact_score', ascending=False).head(10) # Plot with player names plt.figure(figsize=(12, 6)) plt.barh(top_players['player_name'], top_players['impact_score'], color='orange') plt.title("Top 10 Star Players by Goal Impact Score", pad=20) plt.xlabel("Predicted Impact Score") plt.ylabel("Player Name") plt.gca().invert_yaxis() plt.tight_layout() plt.show() 
Enter fullscreen mode Exit fullscreen mode

After running the code, we get this visualisation:

data

From the data, we confirm that Iker Casillas tops the list as the star player with the highest predicted goal-impact score (~41), followed closely by Sergio Ramos, Manuel Neuer, and Petr Čech. John Terry also rounds out the top spot.

Win Rate & Home Advantage Analysis

Then we analyse win rates, goal differences, and home/away advantages across seasons. We do this by comparing home vs away wins by season and visualize trends.

Run this python code block:

import sqlite3 import pandas as pd import matplotlib.pyplot as plt # Connect to the database conn = sqlite3.connect("database.sqlite") # Load match data with team names matches = pd.read_sql_query(""" SELECT m.season, t1.team_long_name as home_team, t2.team_long_name as away_team, m.home_team_goal, m.away_team_goal FROM Match m JOIN Team t1 ON m.home_team_api_id = t1.team_api_id JOIN Team t2 ON m.away_team_api_id = t2.team_api_id WHERE m.season IS NOT NULL """, conn) # Calculate match results matches['home_win'] = matches['home_team_goal'] > matches['away_team_goal'] matches['draw'] = matches['home_team_goal'] == matches['away_team_goal'] matches['away_win'] = matches['home_team_goal'] < matches['away_team_goal'] # Calculate win rates per team home_stats = matches.groupby('home_team').agg({ 'home_win': 'sum', 'home_team_goal': 'sum', 'away_team_goal': 'sum' }).rename(columns={ 'home_win': 'wins', 'home_team_goal': 'goals_for', 'away_team_goal': 'goals_against' }) away_stats = matches.groupby('away_team').agg({ 'away_win': 'sum', 'away_team_goal': 'sum', 'home_team_goal': 'sum' }).rename(columns={ 'away_win': 'wins', 'away_team_goal': 'goals_for', 'home_team_goal': 'goals_against' }) # Combine home and away stats team_stats = home_stats.add(away_stats, fill_value=0) team_stats['total_matches'] = matches['home_team'].value_counts().add( matches['away_team'].value_counts(), fill_value=0) team_stats['win_rate'] = (team_stats['wins'] / team_stats['total_matches']) * 100 team_stats['goal_difference'] = team_stats['goals_for'] - team_stats['goals_against'] # Get top 10 teams by win rate top_teams = team_stats.nlargest(10, 'win_rate').sort_values('win_rate') # Plotting plt.figure(figsize=(12, 8)) # Create bar plot bars = plt.barh(top_teams.index, top_teams['win_rate'], color='skyblue') plt.title('Top 10 Teams by Win Rate', fontsize=16, pad=20) plt.xlabel('Win Rate (%)', fontsize=12) plt.ylabel('Team', fontsize=12) # Add value labels for bar in bars: width = bar.get_width() plt.text(width + 0.5, bar.get_y() + bar.get_height()/2, f'{width:.1f}%', va='center', fontsize=10) plt.grid(True, linestyle='--', alpha=0.6) plt.tight_layout() # Save the plot plt.savefig('team_win_rates.png', bbox_inches='tight', dpi=100) plt.close() print("Top 10 Teams by Win Rate:") print(top_teams[['win_rate', 'total_matches', 'goal_difference']].sort_values('win_rate', ascending=False).to_string()) print("\nPlot saved as 'team_win_rates.png'") 
Enter fullscreen mode Exit fullscreen mode

After running the code, we get the visualization of the dataset like so:

data

We confirm that FC Barcelona dominates with a 77% overall win rate across the entire period, followed by Real Madrid (75%) and SL Benfica (74.6%). The data shows the classic home advantage of roughly 55–60% win rate for home teams throughout 2008–2016.

And that's it! We have a great analysis of the soccer data!

Next steps

Show the Deepnote some support and star the GitHub repository !😎

Deepnote

Thank you for reading!

Top comments (0)