[ Have a look at the related Jupyter Notebook: ML-Corrupted.ipynb ]
[!] This README is only a brief overview of the notebook contents!
Problem | Results | Approach | Resources
// Disclaimer:
I created this out of interest in the summer of 2019, at the age of 15.
Resources, links, and libraries were updated in the spring of 2022.
→ Machine Learning level: Script Kiddie
Solving CTF tasks led me to corrupted files at some point. When they lack distinctive features, such as signatures, most tools identify them as "Text files" or "Data", which is not useful at all, to be honest.
To proceed, it would be great to know at least their types. Some files may be too broken to fix unless you know the exact file format, which parts are missing, and which parts you have before your eyes. Still, I believe that guessing types can be useful in some cases, e.g. criminal investigations, where it might be important to find the remains of what used to be evidence and analyze it thoroughly. Thus, my aim was to build models that guess the types of broken files.
Test set of files + knnRecG8 (./guess)
> Resources
[ Every piece of data and all copyrights belong to the original owners ]
- 🎨 Images (Raster)
  - (Uncompressed) BMP: General-100 Dataset[1] + random files gathered from the Internet
  - (Lossy compression) JPEG: UTKFace: Large Scale Face Dataset[2]
  - (Lossless compression) PNG: random files gathered from the Internet
- 🎬 Videos
  - HMDB: A Large Video Database for Human Motion Recognition[3]
- 🎧 Audios
  - (Uncompressed) WAV: FSDnoisy18k[4]
  - (Lossy compression) MP3: Mozilla Common Voice Dataset
- ⚙️ Executables
  - Portable Executables: Microsoft Malware Classification Challenge[5]
A total of 7 DataFrames were made:

S   = Top 10 bytes in each part of a file (file is split into 3 parts)
G4  = Top 20 byte 4-Grams (stride=1) in a file
G6  = Top 20 byte 6-Grams (stride=2) in a file
G8  = Top 20 byte 8-Grams (stride=4) in a file
G4S = Top 10 byte 4-Grams (stride=1) in each part of a file (file is split into 3 parts)
G6S = Top 10 byte 6-Grams (stride=2) in each part of a file (file is split into 3 parts)
G8S = Top 10 byte 8-Grams (stride=4) in each part of a file (file is split into 3 parts)

S → Split, GN → N-Grams
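The byte and n-gram features above can be sketched as follows. This is a minimal illustration only; the function names and the use of `collections.Counter` are my assumptions, not necessarily the notebook's actual implementation:

```python
from collections import Counter

def top_ngrams(data: bytes, n: int, stride: int, top: int):
    """Count byte n-grams taken every `stride` bytes, return the `top` most common."""
    grams = Counter(
        data[i:i + n] for i in range(0, len(data) - n + 1, stride)
    )
    return grams.most_common(top)

def top_bytes_per_part(data: bytes, parts: int = 3, top: int = 10):
    """Split the file into equal parts and take the `top` most common bytes in each."""
    size = max(1, len(data) // parts)
    return [
        Counter(data[i * size:(i + 1) * size]).most_common(top)
        for i in range(parts)
    ]
```

For instance, `top_ngrams(raw, 8, 4, 20)` would correspond to the G8 features, and `top_bytes_per_part(raw)` to the S features.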
Scikit-Learn + XGBoost + LightGBM + CatBoost:
from sklearn.ensemble import RandomForestClassifier # Random Forest Classifier
from sklearn.neighbors import KNeighborsClassifier # KNN Classifier
from xgboost import XGBClassifier # XGBoost Classifier
from lightgbm import LGBMClassifier # LightGBM Classifier
from catboost import CatBoostClassifier # CatBoost Classifier
from sklearn.model_selection import RandomizedSearchCV # Randomized search on hyperparameters
Models were built using every algorithm listed above and were trained on every DataFrame specified. For each case there are two models: one with default settings and one with hyperparameters recommended by RandomizedSearchCV.
The total number of models is 70 (7 DataFrames * 5 Algorithms * 2 Sets of hyperparameters):
rfcS, rfcRecS, rfcG4, rfcRecG4, rfcG6, rfcRecG6, rfcG8, rfcRecG8, rfcG4S, rfcRecG4S, rfcG6S, rfcRecG6S, rfcG8S, rfcRecG8S,
knnS, knnRecS, knnG4, knnRecG4, knnG6, knnRecG6, knnG8, knnRecG8, knnG4S, knnRecG4S, knnG6S, knnRecG6S, knnG8S, knnRecG8S,
xgbS, xgbRecS, xgbG4, xgbRecG4, xgbG6, xgbRecG6, xgbG8, xgbRecG8, xgbG4S, xgbRecG4S, xgbG6S, xgbRecG6S, xgbG8S, xgbRecG8S,
lgbmS, lgbmRecS, lgbmG4, lgbmRecG4, lgbmG6, lgbmRecG6, lgbmG8, lgbmRecG8, lgbmG4S, lgbmRecG4S, lgbmG6S, lgbmRecG6S, lgbmG8S, lgbmRecG8S,
cbcS, cbcRecS, cbcG4, cbcRecG4, cbcG6, cbcRecG6, cbcG8, cbcRecG8, cbcG4S, cbcRecG4S, cbcG6S, cbcRecG6S, cbcG8S, cbcRecG8S
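As a sketch of how each default/tuned pair might be built, here is the KNN case on a stand-in for the G8 DataFrame. The synthetic data, the parameter ranges, and the search settings are my assumptions, not the notebook's actual configuration:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the G8 features (shape and values are invented)
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(200, 20)).astype(float)
y = rng.integers(0, 4, size=200)  # 4 pretend file types
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Default-hyperparameter model (analogous to knnG8)
knnG8 = KNeighborsClassifier().fit(X_train, y_train)

# Model with hyperparameters recommended by RandomizedSearchCV (analogous to knnRecG8)
search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions={
        "n_neighbors": range(1, 30),
        "weights": ["uniform", "distance"],
        "p": [1, 2],  # Manhattan vs Euclidean distance
    },
    n_iter=10,
    cv=3,
    random_state=0,
).fit(X_train, y_train)
knnRecG8 = search.best_estimator_
```

The same pattern repeats for the other four algorithms and six DataFrames, which is where the 70-model count comes from.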
from sklearn.metrics import accuracy_score # Accuracy score
# [ Accuracy Score = (True Positives + True Negatives) / (True Positives + False Positives + True Negatives + False Negatives) ]
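To make the metric concrete, a tiny worked example (the labels are invented):

```python
from sklearn.metrics import accuracy_score

y_true = ["PNG", "JPEG", "BMP", "WAV"]
y_pred = ["PNG", "JPEG", "WAV", "WAV"]

# 3 of 4 predictions match the true labels
print(accuracy_score(y_true, y_pred))  # → 0.75
```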
Also, heatmaps were used to visually compare the accuracy scores of predictions (on validation sets) of different models. For example, the heatmap below depicts the difference between the K-Nearest Neighbors model (train dataset = G8) with default hyperparameters (knnG8) and the one with hyperparameters recommended by RandomizedSearchCV (knnRecG8):
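A comparison like this can be sketched with seaborn; the per-class accuracy values below are invented placeholders, not the notebook's results:

```python
import pandas as pd
import seaborn as sns
import matplotlib
matplotlib.use("Agg")  # headless backend, save to file instead of displaying
import matplotlib.pyplot as plt

# Hypothetical per-class validation accuracies for the two models
classes = ["BMP", "JPEG", "PNG", "WAV", "MP3", "PE"]
acc_default = pd.Series([0.71, 0.83, 0.64, 0.77, 0.69, 0.90], index=classes, name="knnG8")
acc_tuned = pd.Series([0.78, 0.85, 0.70, 0.81, 0.75, 0.92], index=classes, name="knnRecG8")

# One-row DataFrame of the accuracy difference, rendered as a heatmap
diff = pd.DataFrame([acc_tuned - acc_default], index=["Δ accuracy"])
sns.heatmap(diff, annot=True, cmap="RdYlGn", center=0)
plt.title("knnRecG8 minus knnG8 (hypothetical scores)")
plt.savefig("knn_heatmap.png")
```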
| Data processing | Algorithms & boosting | Visualization | Misc |
|---|---|---|---|
| numpy 1.21.6 | scikit-learn 1.0.2 | matplotlib 3.5.2 | binascii |
| pandas 1.3.5 | xgboost 1.6.1 | seaborn 0.11.2 | collections |
| | lightgbm 3.3.2 | | tqdm 4.35.0 |
| | catboost 1.0.6 | | dill 0.3.5.1 |
Information regarding the hardware used to run the notebook. The time spent on every crucial step can be seen in the notebook, next to each tqdm progress bar.

RAM: 12 GB
GPU: NVIDIA GeForce GTX 1060
Processor: Intel(R) Core(TM) i5-8300H
[1]
Chao Dong, Chen Change Loy, Xiaoou Tang. Accelerating the Super-Resolution Convolutional Neural Network, in Proceedings of European Conference on Computer Vision (ECCV), 2016 arXiv:1608.00367
[2]
Zhang Zhifei, Song Yang, and Qi Hairong. "Age Progression/Regression by Conditional Adversarial Autoencoder". IEEE Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:1702.08423, 2017
[3]
H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. "HMDB: A Large Video Database for Human Motion Recognition". ICCV, 2011
[4]
Eduardo Fonseca, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, and Xavier Serra, “Learning Sound Event Classifiers from Web Audio with Noisy Labels”, arXiv preprint arXiv:1901.01189, 2019
[5]
Royi Ronen, Marian Radu, Corina Feuerstein, Elad Yom-Tov, Mansour Ahmadi. "Microsoft Malware Classification Challenge". arXiv:1802.10135, 2018