This project aims to predict the outcomes of Premier League football matches using machine learning models. It explores various features to determine their importance in predicting match results—whether it’s a home win, draw, or away win.
- Midterm Project: Premier League Football Prediction
Predicting the outcomes of football matches has always been a challenging yet fascinating task for sports analysts and enthusiasts. This project focuses on the Premier League, aiming to build robust machine learning models to forecast match results—whether it’s a home win, draw, or away win. By leveraging historical match data and team statistics, the project seeks to identify key features that influence match outcomes. Despite the inherent unpredictability and dynamic nature of sports betting markets, this project aspires to provide valuable insights and potentially profitable predictions.
The raw data for this project is sourced from Football Data. The focus is exclusively on the Premier League, covering seasons from 2005/2006 to 2024/2025. The raw data files can be found here.
Over the last twenty seasons, a home win has been the most common result. This is typical for football, where playing at home is a well-known advantage.
During data gathering, the following key steps are performed:
The data_gathering function in the 01_data_gathering.py script encapsulates these steps. It ensures the necessary directories exist, downloads the CSV files for the specified seasons, checks the columns in the files, concatenates the data, and saves the processed data.
For more details, see the 01_data_gathering.py script.
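The download-and-concatenate flow described above could be sketched roughly as follows. This is not the code from 01_data_gathering.py: the function names, the E0_<season>.csv file naming, and the URL pattern are assumptions made for illustration, and the sketch uses only the standard library.

```python
import csv
import urllib.request
from pathlib import Path

# Assumed URL pattern for the season files; verify it against the data source.
BASE_URL = "https://www.football-data.co.uk/mmz4281/{season}/E0.csv"


def season_codes(first_start_year, last_start_year):
    """Return codes such as '0506' (for 2005/2006), one per season."""
    codes = []
    for year in range(first_start_year, last_start_year + 1):
        codes.append(f"{year % 100:02d}{(year + 1) % 100:02d}")
    return codes


def download_season(season, raw_dir):
    """Download one season's CSV into raw_dir and return the local path."""
    raw_dir.mkdir(parents=True, exist_ok=True)
    target = raw_dir / f"E0_{season}.csv"
    urllib.request.urlretrieve(BASE_URL.format(season=season), target)
    return target


def concatenate_csvs(paths, out_path, columns):
    """Write one combined CSV that keeps only `columns` from each input file.

    Columns missing from a file are written as empty strings.
    """
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", newline="", encoding="utf-8") as out_file:
        writer = csv.DictWriter(out_file, fieldnames=columns)
        writer.writeheader()
        for path in paths:
            with open(path, newline="", encoding="utf-8-sig", errors="replace") as in_file:
                for row in csv.DictReader(in_file):
                    writer.writerow({name: row.get(name, "") for name in columns})
```

Calling season_codes(2005, 2024) yields the twenty season codes covered by this project, which can then be passed one by one to download_season before the files are combined.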
The 02_data_preparation.py script performs the following key steps:
- Data Cleaning: Fix column names, handle missing values, and ensure data integrity.
- Feature Engineering: Create new features such as goal difference, total shots, shot accuracy, and time-based features.
- Rolling Averages: Calculate rolling averages for various statistics over 3 and 5 game windows.
- Cumulative Points: Compute cumulative points for home and away teams.
- Normalize Betting Odds: Convert betting odds to implied probabilities.
- Save Processed Data: Save the processed data for the current season (2024/2025) and the final prepared dataset to CSV files.
For more details, see the 02_data_preparation.py script.
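Two of the steps above, normalizing betting odds and computing rolling averages, can be illustrated with a small standard-library sketch. The function names are hypothetical, decimal odds are assumed, and the rolling average here deliberately uses only matches played before the current one so that a row never sees its own result. Whether 02_data_preparation.py handles the window the same way is not stated here.

```python
def implied_probabilities(home_odds, draw_odds, away_odds):
    """Convert decimal odds to probabilities that sum to 1."""
    raw = [1.0 / home_odds, 1.0 / draw_odds, 1.0 / away_odds]
    total = sum(raw)  # usually above 1 because of the bookmaker's margin
    return [value / total for value in raw]


def rolling_mean_prior(values, window):
    """Mean of up to `window` previous values; None when there is no history."""
    means = []
    for i in range(len(values)):
        history = values[max(0, i - window):i]
        means.append(sum(history) / len(history) if history else None)
    return means
```

For a team's goals per match, rolling_mean_prior(goals, 3) and rolling_mean_prior(goals, 5) would give the 3-game and 5-game versions of the feature.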
The 03_data_eda.py script is dedicated to Exploratory Data Analysis (EDA). It includes the following key steps:
- Data Checking: Check data types, missing values, unique values, duplicates, and outliers.
- Correlation Analysis: Identify highly correlated features using a correlation matrix.
- Variance Inflation Factor (VIF): Calculate VIF to check for multicollinearity and remove features with high VIF values.
- Cluster Maps: Plot clustered heatmaps to visualize feature correlations.
- Target Distribution: Visualize the distribution of the target variable.
- Saving Data: Save the cleaned and processed data for modeling and backtesting.
For more details, see the 03_data_eda.py script.
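As a rough illustration of the correlation analysis step, the sketch below flags feature pairs whose absolute Pearson correlation exceeds a threshold. It is written in plain Python to stay self-contained; the function names, the 0.9 threshold, and the feature names in the usage note are assumptions, not values taken from 03_data_eda.py.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists.

    Raises ZeroDivisionError if either list is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def highly_correlated_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for every pair with |r| above threshold."""
    pairs = []
    for (name_a, col_a), (name_b, col_b) in combinations(features.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) > threshold:
            pairs.append((name_a, name_b, r))
    return pairs
```

Given a dictionary such as {"home_shots": [...], "home_shots_on_target": [...]}, the returned pairs are candidates for dropping one of the two features before modeling.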
The 04_train_model.py script covers the following key steps:
- Data Preprocessing: Prepare the data for modeling.
- Feature Selection: Use Recursive Feature Elimination with Cross-Validation (RFECV) to select important features.
See the scikit-learn documentation for RFECV for details.
- Model Evaluation: Evaluate models using RandomForest and XGBoost classifiers.
For example, the initial model overfit the training data.
- Hyperparameter Tuning: Tune hyperparameters to reduce overfitting.
After hyperparameter tuning, we were able to reduce the overfitting.
- Model Finalization: Finalize the best model using a pipeline and save it for future predictions.
For more details, see the 04_train_model.py script.
The 05_back_testing_market.py script includes the following key steps:
- Loading the Model and Data: Load the trained model and test datasets.
- Making Predictions: Generate predictions and prediction probabilities using the model.
- Preparing Data for Analysis: Combine predictions with actual results and market probabilities.
- Calculating Brier Scores: Compute Brier scores for both the model's predictions and the market probabilities.
- Comparing Performance: Compare the average Brier scores of the model and the market.
For more details, see the 05_back_testing_market.py script.
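The Brier score comparison can be sketched as below. This uses the standard multi-class definition (squared error summed over the three outcomes, averaged over matches); the function name and the outcome encoding are assumptions, and 05_back_testing_market.py may use a different but equivalent convention.

```python
def brier_score(probabilities, outcomes):
    """Mean multi-class Brier score; lower is better.

    probabilities: one [p_home, p_draw, p_away] list per match.
    outcomes: index of the actual result per match (0=home, 1=draw, 2=away).
    """
    total = 0.0
    for probs, outcome in zip(probabilities, outcomes):
        for index, p in enumerate(probs):
            actual = 1.0 if index == outcome else 0.0
            total += (p - actual) ** 2
    return total / len(outcomes)
```

Computing brier_score once with the model's probabilities and once with the market's implied probabilities on the same matches gives the two averages to compare. The lower score indicates the better-calibrated forecaster on that sample.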
- Python 3.8 or higher
- Docker
- Pipenv
Use git clone to copy the repository to your local machine and navigate into the project directory.
git clone <repository-url>
cd repository
Replace <repository-url> with the actual URL of the repository (for example, from GitHub or GitLab), for instance:
git clone https://github.com/username/repository.git
cd repository
First, open a terminal and change to the directory where your Pipfile and Pipfile.lock are located.
cd /path/to/your/project
In the project directory, use pipenv install to create the virtual environment and install all dependencies specified in the Pipfile.lock.
pipenv install
This command will:
- Create a virtual environment if one doesn’t already exist.
- Install the dependencies exactly as specified in the Pipfile.lock.
To activate the virtual environment, use:
pipenv shell
Now you're in an isolated environment where the dependencies specified in the Pipfile.lock are installed.
Build the Docker image:
docker build -t <docker_image_name> .
Run the Docker container:
docker run -it --rm -p 9696:9696 <docker_image_name>
Note: If you get an error at the step [ 5/11] RUN 'pipenv install --system --deploy', try turning off your VPN.
To run Elastic Beanstalk locally, follow these steps:
- Install the AWS Elastic Beanstalk CLI: Ensure you have the AWS CLI and Elastic Beanstalk CLI installed. You can install the Elastic Beanstalk CLI using pip:
  pip install awsebcli
- Initialize Elastic Beanstalk: Navigate to your project directory and initialize Elastic Beanstalk:
  eb init
  Follow the prompts to set up your application. Choose the appropriate region and select the platform (e.g., Python).
- Create an Environment and Deploy: Create a new environment and deploy your application:
  eb create <environment-name>
  eb deploy
  Replace <environment-name> with your desired environment name.
- Access Your Application: After deployment, you can access your application using the URL provided by Elastic Beanstalk.
- Update Your Application: To deploy updates, use the eb deploy command again:
  eb deploy
- Terminate the Environment: When you are done, you can terminate the environment to stop incurring charges:
  eb terminate <environment-name>
Note: Ensure your Dockerrun.aws.json or Dockerfile is correctly configured for Elastic Beanstalk.
To deploy the project on AWS Elastic Beanstalk remotely and resolve common errors, follow these steps:
- Create a Launch Template in the AWS EC2 Console:
  - Go to the EC2 Console in your AWS account.
  - In the left-hand menu, choose Launch Templates (under Instances).
  - Select Create launch template and fill in the required fields:
    - Launch Template Name: Use something identifiable, like eb-launch-template.
    - AMI ID: Select a compatible Amazon Linux 2 AMI.
    - Instance Type: Choose an instance type that suits your application needs (e.g., t2.micro).
  - Click Create launch template and note the Launch Template ID (e.g., lt-0abcdef1234567890).
- Configure Elastic Beanstalk to Use the Launch Template:
  - In your Elastic Beanstalk project directory, create or open the .ebextensions folder.
  - In .ebextensions, create a configuration file named 00_launch_template.config with the following YAML code:
      Resources:
        AWSEBAutoScalingGroup:
          Type: AWS::AutoScaling::AutoScalingGroup
          Properties:
            MixedInstancesPolicy:
              InstancesDistribution:
                OnDemandPercentageAboveBaseCapacity: 100
              LaunchTemplate:
                LaunchTemplateSpecification:
                  LaunchTemplateId: "lt-XXXXXXXXX"  # Replace with your Launch Template ID
                  Version: "1"  # Use the appropriate version of your template
  - Replace lt-XXXXXXXXX with your actual Launch Template ID from the first step.
  - Save the file, ensuring the .ebextensions folder is in the project root folder.
- Deploy or Re-create Your Elastic Beanstalk Environment:
  - If the environment already exists, you can update it with this new configuration:
      eb deploy
  - Alternatively, if you need to re-create the environment, terminate the current one (if any) and create a new one:
      eb terminate <environment-name>  # Only if you need to delete the existing environment
      eb create <new-environment-name>
This configuration should allow Elastic Beanstalk to deploy without attempting to use the deprecated Launch Configuration, solving the Auto Scaling Launch Configuration failed errors.
Open a new terminal and run the test script:
python tests/test_predict.py
To use the prediction service, send a POST request to the /predict endpoint with the following JSON payload:
curl -X POST http://127.0.0.1:9696/predict \
-H "Content-Type: application/json" \
-d '{
"home_team": "arsenal",
"away_team": "liverpool",
"date": "2024-12-16"
}'
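The same request can be sent from Python. The sketch below mirrors the curl call using only the standard library; the function names are hypothetical, and it assumes the service replies with a JSON body, which is not documented here.

```python
import json
import urllib.request


def build_predict_request(home_team, away_team, date,
                          url="http://127.0.0.1:9696/predict"):
    """Build a POST request carrying the same JSON payload as the curl example."""
    payload = {"home_team": home_team, "away_team": away_team, "date": date}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(home_team, away_team, date):
    """Send the request and parse the reply (assumed to be JSON)."""
    request = build_predict_request(home_team, away_team, date)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, predict("arsenal", "liverpool", "2024-12-16") sends the request shown above and returns the parsed response.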
To run the Streamlit app locally, follow these steps:
- Ensure you have all dependencies installed and the virtual environment activated as described in the Installing Dependencies section.
- Navigate to the project directory where app.py is located.
- Run the Streamlit app using the following command:
  streamlit run app.py
Contributions are welcome! Please open an issue or submit a pull request for any improvements or bug fixes.