Combatting electricity and gas fraud in Tunisia 🇹🇳 for the Tunisian Company of Electricity and Gas (STEG). With STEG's losses from fraudulent meter manipulation reaching 200 million Tunisian Dinars, the goal is to detect and curb fraudulent activity by analyzing clients' billing histories, safeguarding STEG's revenue and minimizing losses. My XGBoost model achieved an AUC of 0.86, placing in the top 25% of the leaderboard.
Detect and prevent fraudulent activities in electricity and gas consumption to enhance revenue and reduce losses.
All data is provided by the Tunisian Company of Electricity and Gas (STEG). You can access it in the data section of the Zindi challenge page.
File Descriptions:
- train.csv - Contains the target. This is the dataset used for model training.
- Fraud_Detection_Starter.ipynb - This notebook helps you make your first submission for this challenge.
- Test.csv - Resembles train.csv but without the target-related columns. This is the dataset on which you will apply your model.
- SampleSubmission.csv - Shows the submission format for this competition, with the 'ID' column mirroring that of Test.csv and the 'target' column containing your predictions. The order of the rows does not matter, but the ID values must match those in Test.csv.
- tunisia-energy-fraud-detection-steg.ipynb - My full solution notebook, also available on Kaggle.
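As a minimal sketch, a submission in the SampleSubmission.csv format can be built from model predictions like this (the `client_id` column name and the dummy IDs/probabilities are illustrative; the actual ID column name should be copied from SampleSubmission.csv):

```python
import pandas as pd

# Hypothetical predicted fraud probabilities for three test clients
test_ids = ["Client_0", "Client_1", "Client_2"]
preds = [0.02, 0.87, 0.15]

# One row per test client: ID column plus the predicted probability
submission = pd.DataFrame({"client_id": test_ids, "target": preds})
submission.to_csv("submission.csv", index=False)
```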
Approach:
- Exploratory Data Analysis (EDA) on client and invoice data.
- Correlation analysis, feature engineering, and aggregation to improve model performance.
- Utilized an XGBoost classifier with tuning for optimal AUC.
- Checked for NaN values.
- Transformed data types.
- Applied label and one-hot encoding to categorical columns.
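The aggregation step above can be sketched as follows. The column names (`client_id`, `consommation_level_1`, `counter_statue`, `region`) are illustrative stand-ins for the STEG schema, not the exact dataset columns:

```python
import pandas as pd

# Toy invoice-level data: several invoices per client (illustrative schema)
invoices = pd.DataFrame({
    "client_id": [1, 1, 2, 2, 2],
    "consommation_level_1": [100, 150, 80, 90, 85],
    "counter_statue": [0, 0, 1, 0, 5],
})

# Collapse each client's invoice history into one row of summary features
agg = invoices.groupby("client_id").agg(
    invoice_count=("consommation_level_1", "size"),
    mean_consumption=("consommation_level_1", "mean"),
    max_counter_statue=("counter_statue", "max"),
).reset_index()

# Merge the aggregates into the client-level table used for training
clients = pd.DataFrame({"client_id": [1, 2], "region": ["A", "B"]})
features = clients.merge(agg, on="client_id", how="left")
```

Aggregating to one row per client is what lets a tabular classifier consume a variable-length invoice history.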
```python
from xgboost import XGBClassifier

model = XGBClassifier(
    n_estimators=4000,
    learning_rate=0.01,
    max_depth=3,
    objective='binary:logistic',
    random_state=42,
    # Weight the positive (fraud) class to compensate for imbalance
    scale_pos_weight=sum(y_train == 0) / sum(y_train == 1),
    gamma=0.1,
    reg_lambda=1,
    reg_alpha=0,
)
```
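The `scale_pos_weight` setting above is simply the ratio of negative to positive samples, so that each fraud case carries proportionally more weight in the loss. A small illustration with made-up labels:

```python
import numpy as np

# Illustrative imbalanced labels: 90 non-fraud (0) vs 10 fraud (1)
y_train = np.array([0] * 90 + [1] * 10)

# Ratio of negatives to positives, as passed to scale_pos_weight
spw = (y_train == 0).sum() / (y_train == 1).sum()
print(spw)  # 9.0 - each fraud sample is weighted 9x in the loss
```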
- Optimization: XGBoost's gradient boosting adds trees sequentially, each new tree correcting the residual errors of the ensemble built so far.
- Loss Function: Binary logistic.
- Estimators and Early Stopping: Used 4000 estimators without early stopping for the final model.
- Evaluation Metrics: Focused on AUC (the competition metric), supplemented by F1 score.
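A minimal sketch of computing those metrics with scikit-learn, using toy labels and predictions (the 0.5 threshold for F1 is an assumption, not a tuned choice):

```python
from sklearn.metrics import roc_auc_score, f1_score

# Toy ground truth and predicted fraud probabilities
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_prob)        # threshold-free ranking quality
y_pred = [int(p >= 0.5) for p in y_prob]   # hard labels at a 0.5 threshold
f1 = f1_score(y_true, y_pred)              # balances precision and recall
```

AUC is the natural leaderboard metric here because it scores the ranking of fraud probabilities without committing to a threshold.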
Key Findings:
- Identified significant features, including the number of counters used, counter state, counter coefficient, tarif type, and reading remarque.
- Achieved a top-performing model with an AUC of 0.8641.
Next Steps:
- Fine-tune the model for better performance.
- Explore anomaly detection approaches and more robust anomaly-detection models.
- Address misclassified labels to improve accuracy.
Feel free to reach out for any project-related inquiries, collaboration opportunities, or discussions. You can connect with me on LinkedIn, explore more of my projects on GitHub, and check out my portfolio here.
I'd like to express my gratitude to Zindi, the organizers of this challenge.
Thank you for visiting my project repository, and I'm excited to share more data-driven insights in the future!