This project applies machine learning and deep learning techniques to analyse and predict air quality in Beijing. The task is to predict the concentration of the air pollutant PM2.5 one hour into the future.
For the machine learning models we used a lag of 2 hours, which we deduced from the partial autocorrelation function (PACF). For the deep learning models we opted for a 48-hour lag, because longer sequences gave better predictions.
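To make the two lag settings concrete, here is a minimal sketch of how a series can be turned into (input window, next-hour target) pairs. The helper name `make_windows` and the toy series are illustrative assumptions, not code from the notebooks, and a single variable is used for simplicity where the real inputs are multivariate.

```python
import numpy as np

def make_windows(values, lag, horizon=1):
    """Build sliding windows of length `lag` and the target `horizon` steps ahead.

    Hypothetical helper for illustration only.
    """
    values = np.asarray(values, dtype=float)
    n = len(values) - lag - horizon + 1  # number of complete (window, target) pairs
    X = np.stack([values[i:i + lag] for i in range(n)])
    y = values[lag + horizon - 1: lag + horizon - 1 + n]
    return X, y

# Toy hourly series standing in for one pollutant column.
series = np.arange(100)

X2, y2 = make_windows(series, lag=2)     # 2-hour lag, as used for the ML models
X48, y48 = make_windows(series, lag=48)  # 48-hour lag, as used for the DL models
print(X2.shape, X48.shape)
```

For a recurrent model the 48-hour windows would typically be reshaped to (samples, time steps, features) before training.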
This dataset contains hourly air pollutant data from 12 nationally controlled air-quality monitoring sites. The air-quality data come from the Beijing Municipal Environmental Monitoring Center. The meteorological data for each air-quality site are matched with the nearest weather station of the China Meteorological Administration. The time period runs from March 1st, 2013 to February 28th, 2017. Missing data are denoted as NA. Link of the dataset
NW :
- We merged this data into one CSV file.
- Outliers were detected and removed using box plots.
- Missing values were imputed with KNN imputation.
- The preprocessed data can be found here: Link of the dataset
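The preprocessing steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from `DataPreprocessing.ipynb`: the box-plot rule is assumed to be the usual 1.5 × IQR whisker criterion, outliers are assumed to be set to missing and then imputed, and the helper names and the small data frames are made up. It uses scikit-learn's `KNNImputer`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def mask_iqr_outliers(df, columns, k=1.5):
    """Set values outside the box-plot whiskers (Q1 - k*IQR, Q3 + k*IQR) to NaN."""
    out = df.copy()
    for col in columns:
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        outside = (out[col] < q1 - k * iqr) | (out[col] > q3 + k * iqr)
        out.loc[outside, col] = np.nan
    return out

def impute_knn(df, columns, n_neighbors=3):
    """Fill missing values in `columns` from the nearest rows (KNN imputation)."""
    out = df.copy()
    out[columns] = KNNImputer(n_neighbors=n_neighbors).fit_transform(out[columns])
    return out

# Made-up frames standing in for two per-site CSV files
# (in practice each would come from pd.read_csv).
site_a = pd.DataFrame({"station": "A",
                       "PM2.5": [10.0, 12.0, np.nan, 11.0, 500.0, 13.0],
                       "PM10": [20.0, 22.0, 21.0, 23.0, 24.0, 25.0]})
site_b = pd.DataFrame({"station": "B",
                       "PM2.5": [9.0, 14.0, 12.0, 10.0, 11.0, 13.0],
                       "PM10": [19.0, 26.0, 22.0, 20.0, 21.0, 24.0]})

merged = pd.concat([site_a, site_b], ignore_index=True)  # one table for all sites
numeric = ["PM2.5", "PM10"]
cleaned = impute_knn(mask_iqr_outliers(merged, numeric), numeric)
print(cleaned)
```

Masking outliers instead of dropping rows keeps the hourly index intact, which matters once lagged features are built; dropping the rows would be an equally plausible reading of "removal".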
We used the partial autocorrelation function (PACF) to determine the appropriate lag p in an AR(p) model or in an extended ARIMA(p,d,q) model.
As an example, we take the explanatory variable PM10 and examine how it is correlated with its own past values.
All variables show the same pattern in their PACF plots, which indicates that the best lag is two.
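As a rough illustration of this kind of lag selection, the sketch below estimates the PACF by ordinary least squares and checks which lags fall outside the approximate 95% band ±1.96/√n. The function `pacf_ols` is a hypothetical helper written for this example, and the series is a simulated AR(2) process, not the Beijing data; in practice `statsmodels.tsa.stattools.pacf` provides the same quantity.

```python
import numpy as np

def pacf_ols(x, max_lag):
    """Partial autocorrelation at lags 1..max_lag, estimated by OLS.

    The PACF at lag k is the coefficient on x[t-k] when x[t] is regressed
    on x[t-1], ..., x[t-k] plus an intercept.
    """
    x = np.asarray(x, dtype=float)
    out = []
    for k in range(1, max_lag + 1):
        y = x[k:]
        # Columns: intercept, x[t-1], ..., x[t-k]
        cols = [np.ones(len(y))] + [x[k - j:len(x) - j] for j in range(1, k + 1)]
        coef, *_ = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)
        out.append(coef[-1])
    return np.array(out)

# Simulated AR(2) series standing in for a pollutant such as PM10.
rng = np.random.default_rng(0)
n = 5000
noise = rng.normal(size=n)
series = np.zeros(n)
for t in range(2, n):
    series[t] = 0.5 * series[t - 1] + 0.3 * series[t - 2] + noise[t]

values = pacf_ols(series, max_lag=6)
band = 1.96 / np.sqrt(n)            # approximate 95% significance band
significant = np.abs(values) > band
print(np.round(values, 3), significant)
```

For an AR(2) process the first two partial autocorrelations stand clearly outside the band while the later ones stay close to zero, which is the pattern that suggests a lag of two.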
The main project implementation files are in the directory named 'src'. Its structure and contents are as follows:
- src:
  - AirQualityData:
    - The preprocessed data.
  - DataPreprocessing.ipynb
    - The notebook mainly for data cleaning and data preprocessing.
  - Deep Learning
    - Pytorch LSTM Baseline .ipynb
    - Pytorch Attention LSTM Baseline.ipynb
    - Tabnet baseline.ipynb
  - Machine Learning
    - Catboost baseline.ipynb
    - Lightgbm-baseline.ipynb
    - Linear models baseline.ipynb
    - XGBOOST-Baseline.ipynb

Model | RMSE | Kaggle | code |
---|---|---|---|
Catboost | 10.29049 | our work | this repo |
Lightgbm | 9.43424 | our work | this repo |
XGBOOST | 9.23511 | our work | this repo |
Linear models | 12.29697 | our work | this repo |
LSTM | 15.45468 | our work | this repo |
Attention LSTM | 14.51535 | our work | this repo |
Tabnet | 10.38852 | our work | this repo |
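
For reference, RMSE (root mean squared error) is presumably computed in the usual way, as the square root of the mean squared difference between observed and predicted PM2.5. A minimal sketch with made-up numbers (the `rmse` helper is illustrative and not taken from the notebooks):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up observed and predicted PM2.5 values.
example = rmse([30.0, 42.0, 55.0], [28.0, 45.0, 55.0])
print(example)
```

Lower values are better, and the metric is expressed in the same units as the target.
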
Mohamed Ali Bouchhioua 💻 |
Nour Hadrich 💻 |