- Based on Transfermarkt Player Values & APIFootball.com Player Stats, inspired by the methodology of the movie Moneyball
This repo contains the files to replicate one approach to real football scouting: it collects Transfermarkt player values and player statistics from APIFootball.com.
The methodology used in this project is as follows:
- I built a web-scraping script to extract Transfermarkt player values for the main football leagues and their second divisions.
- After that, I used the APIFootball.com Football API, through RapidAPI, to extract the current-season stats for every player and save them to a .csv file. This API also provides its own internal rating.
- We fit a linear regression using PySpark (we could have used scikit-learn or any other ML library, but this is a university exercise meant to run on several servers).
- Once we get the model's rating based on current-season stats, we compare it with the historical rating from APIFootball.com. We calculate the difference to determine whether, based on stats, the player is currently overvalued or undervalued.
- After that, we sort in descending order by player value to find good players who are currently overperforming and are cheap in the market.
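The last two steps can be sketched in plain Python. This is only an illustration, not the code in the notebooks: the `Player` fields, the sample players and ratings, the `max_value_eur` cap used to express "cheap", and the reading of a positive gap as "overperforming" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Player:
    name: str
    market_value_eur: float   # Transfermarkt value (hypothetical numbers below)
    predicted_rating: float   # rating estimated from current-season stats
    historical_rating: float  # rating reported by the API


def rating_gap(player: Player) -> float:
    """Positive gap: current stats suggest a higher rating than the historical one."""
    return player.predicted_rating - player.historical_rating


def rank_candidates(players: list[Player], max_value_eur: float) -> list[Player]:
    """Keep overperforming players under the value cap, sorted by value (descending)."""
    candidates = [
        p for p in players
        if rating_gap(p) > 0 and p.market_value_eur <= max_value_eur
    ]
    return sorted(candidates, key=lambda p: p.market_value_eur, reverse=True)


# Invented sample data, purely for illustration.
players = [
    Player("Player A", 2_000_000, 7.4, 6.9),
    Player("Player B", 40_000_000, 7.8, 7.1),  # overperforming but not cheap
    Player("Player C", 1_500_000, 6.5, 6.8),   # cheap but underperforming
    Player("Player D", 3_000_000, 7.0, 6.8),
]

ranked = rank_candidates(players, max_value_eur=5_000_000)
for p in ranked:
    print(p.name, p.market_value_eur, round(rating_gap(p), 2))
```

In the actual project the same logic would run on PySpark DataFrames; the cap and the sign convention are the two knobs you would most likely want to adjust.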
- data/: Contains datasets used in the project.
- notebooks/: Jupyter notebooks with detailed analysis and model development.
- PDF: Documentation on methodology and usage.
Btw, to initialize a PySpark session, you will need to install a Java Runtime first. If you want to make this easier, you might want to check out: https://github.com/raulmarinperez/osbdet
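Since PySpark starts a JVM under the hood, it can help to check that Java is reachable before creating the session. Below is a minimal sketch using only the standard library; the helper name `java_available` is made up for this example, and it assumes Java is found either through `JAVA_HOME` or on the `PATH`.

```python
import os
import shutil


def java_available() -> bool:
    """Return True if a `java` executable can be located (hypothetical helper)."""
    java_home = os.environ.get("JAVA_HOME")
    if java_home:
        # Look for java inside $JAVA_HOME/bin first.
        if shutil.which("java", path=os.path.join(java_home, "bin")):
            return True
    # Fall back to whatever is on the PATH.
    return shutil.which("java") is not None


if not java_available():
    print("Java not found: install a Java Runtime before starting PySpark.")
```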
BAM! I recommend referring to the PDF on MDA (Modern Data Architectures) I uploaded. Everything is explained much better there.