> This repository has been archived by the owner on May 23, 2023. It is now read-only.

[ Have a look at the related Jupyter Notebook: ML-Corrupted.ipynb ]
[!] This README is only a brief overview of the notebook contents!

Problem | Results | Approach | Resources


// Disclaimer:
  I created this out of interest in the summer of 2019 at the age of 15
  Resources, links and libraries were updated in the spring of 2022
Machine Learning level: Script Kiddie

📉 Problem

Solving CTF tasks eventually led me to corrupted files. When they lack distinctive features, such as signatures, most tools identify them as "Text files" or "Data", which is not useful at all, to be honest.

To proceed, it would be great to at least know their types. Some files may be too broken to fix unless you know the exact file format, which parts are missing, and which parts you have before your eyes. Still, I believe that guessing types can be useful in some cases, e.g. criminal investigations, where it might be important to find the remains of what used to be evidence and analyze them thoroughly. Thus, my aim was to build models that guess the types of broken files.

📊 Results


[ Figure: test set of files + knnRecG8 (./guess) ]

🔍 Approach

Data

> Resources   [ Every piece of data and all copyrights belong to the original owners ]

Features

A total of 7 DataFrames were made:

  1. S = Top 10 bytes in each part of a file (file is split into 3 parts)
  2. G4 = Top 20 byte 4-Grams (stride=1) in a file
  3. G6 = Top 20 byte 6-Grams (stride=2) in a file
  4. G8 = Top 20 byte 8-Grams (stride=4) in a file
  5. G4S = Top 10 byte 4-Grams (stride=1) in each part of a file (file is split into 3 parts)
  6. G6S = Top 10 byte 6-Grams (stride=2) in each part of a file (file is split into 3 parts)
  7. G8S = Top 10 byte 8-Grams (stride=4) in each part of a file (file is split into 3 parts)

S → Split, GN → N-Grams
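As a sketch of how such features could be extracted (the function names and signatures here are illustrative, not taken from the notebook):

```python
from collections import Counter

def top_ngrams(data: bytes, n: int, stride: int, k: int = 20):
    """The k most common byte n-grams in data, sampled with the given stride."""
    grams = [data[i:i + n] for i in range(0, len(data) - n + 1, stride)]
    return Counter(grams).most_common(k)

def split_top_ngrams(data: bytes, n: int, stride: int, k: int = 10, parts: int = 3):
    """Top-k n-grams per part, with the file split into equal parts (the *S variants)."""
    size = len(data) // parts
    return [top_ngrams(data[i * size:(i + 1) * size], n, stride, k)
            for i in range(parts)]
```

For example, `top_ngrams(data, 8, 4)` would correspond to the G8 features, and `split_top_ngrams(data, 4, 1)` to G4S.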

Algorithms & boosts

Scikit-Learn + XGBoost + LightGBM + CatBoost:

```python
from sklearn.ensemble import RandomForestClassifier      # Random Forest Classifier
from sklearn.neighbors import KNeighborsClassifier       # KNN Classifier
from xgboost import XGBClassifier                        # XGBoost Classifier
from lightgbm import LGBMClassifier                      # LightGBM Classifier
from catboost import CatBoostClassifier                  # CatBoost Classifier

from sklearn.model_selection import RandomizedSearchCV   # Randomized search on hyperparameters
```

Models were built using every algorithm listed above and trained on every DataFrame specified. For each combination there are two models: one with default settings and one with hyperparameters recommended by RandomizedSearchCV.
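A minimal sketch of this default-vs-tuned pairing, using KNN and synthetic stand-in data (the search space and data here are illustrative, not the notebook's actual values):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.random((200, 20))        # stand-in for an n-gram frequency DataFrame
y = rng.integers(0, 4, 200)      # stand-in for file-type labels

# Model with default settings (e.g. knnG8) ...
knn = KNeighborsClassifier().fit(X, y)

# ... and its counterpart with recommended hyperparameters (e.g. knnRecG8).
search = RandomizedSearchCV(
    KNeighborsClassifier(),
    param_distributions={
        "n_neighbors": list(range(1, 30)),
        "weights": ["uniform", "distance"],
    },
    n_iter=10, cv=3, random_state=0,
)
knnRec = search.fit(X, y).best_estimator_
```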

The total number of models is 70 (7 DataFrames * 5 Algorithms * 2 Sets of hyperparameters):

rfcS, rfcRecS, rfcG4, rfcRecG4, rfcG6, rfcRecG6, rfcG8, rfcRecG8, rfcG4S, rfcRecG4S, rfcG6S, rfcRecG6S, rfcG8S, rfcRecG8S,
knnS, knnRecS, knnG4, knnRecG4, knnG6, knnRecG6, knnG8, knnRecG8, knnG4S, knnRecG4S, knnG6S, knnRecG6S, knnG8S, knnRecG8S,
xgbS, xgbRecS, xgbG4, xgbRecG4, xgbG6, xgbRecG6, xgbG8, xgbRecG8, xgbG4S, xgbRecG4S, xgbG6S, xgbRecG6S, xgbG8S, xgbRecG8S,
lgbmS, lgbmRecS, lgbmG4, lgbmRecG4, lgbmG6, lgbmRecG6, lgbmG8, lgbmRecG8, lgbmG4S, lgbmRecG4S, lgbmG6S, lgbmRecG6S, lgbmG8S, lgbmRecG8S,
cbcS, cbcRecS, cbcG4, cbcRecG4, cbcG6, cbcRecG6, cbcG8, cbcRecG8, cbcG4S, cbcRecG4S, cbcG6S, cbcRecG6S, cbcG8S, cbcRecG8S

Accuracy calculation & comparison

```python
from sklearn.metrics import accuracy_score               # Accuracy score

# [ Accuracy Score = correct predictions / total predictions;
#   binary case: (TP + TN) / (TP + FP + TN + FN) ]
```
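A toy illustration of the metric (labels here are made up for the example):

```python
from sklearn.metrics import accuracy_score

# 4 of the 5 predicted file types match the ground truth -> accuracy 0.8.
y_true = ["png", "jpg", "png", "mp3", "avi"]
y_pred = ["png", "jpg", "png", "mp3", "mp3"]
acc = accuracy_score(y_true, y_pred)  # -> 0.8
```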

Also, heatmaps were used to visually compare the accuracy scores of predictions (on validation sets) of different models. For example, the heatmap below depicts the difference between the K-Nearest Neighbors model (train dataset = G8) with default hyperparameters (knnG8) and the one with hyperparameters recommended by RandomizedSearchCV (knnRecG8):
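Such a comparison can be sketched with seaborn; the per-class accuracies below are made-up placeholders, not the notebook's actual results:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical validation accuracy per file type for the two models.
classes = ["png", "jpg", "mp3", "avi"]
acc_default = np.array([0.81, 0.74, 0.69, 0.77])  # e.g. knnG8
acc_tuned = np.array([0.88, 0.79, 0.75, 0.80])    # e.g. knnRecG8

# Heatmap of the difference: positive cells mean tuning helped for that class.
diff = (acc_tuned - acc_default).reshape(1, -1)
sns.heatmap(diff, annot=True, xticklabels=classes, yticklabels=["tuned - default"])
plt.savefig("knn_comparison.png")
```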

Libraries

| Data processing | Algorithms & boosting | Visualization | Misc |
|---|---|---|---|
| numpy 1.21.6 | scikit-learn 1.0.2 | matplotlib 3.5.2 | binascii |
| pandas 1.3.5 | xgboost 1.6.1 | seaborn 0.11.2 | collections |
| | lightgbm 3.3.2 | | tqdm 4.35.0 |
| | catboost 1.0.6 | | dill 0.3.5.1 |

Hardware

Information about the hardware that was used to run the notebook. The time spent on every crucial step can be seen in the notebook, next to each tqdm progress bar.

RAM: 12 GB
GPU: NVIDIA GeForce GTX 1060
Processor: Intel(R) Core(TM) i5-8300H

📚 Resources

Datasets:

[1] Chao Dong, Chen Change Loy, Xiaoou Tang. "Accelerating the Super-Resolution Convolutional Neural Network". European Conference on Computer Vision (ECCV), 2016. arXiv:1608.00367

[2] Zhifei Zhang, Yang Song, Hairong Qi. "Age Progression/Regression by Conditional Adversarial Autoencoder". IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. arXiv:1702.08423

[3] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, T. Serre. "HMDB: A Large Video Database for Human Motion Recognition". ICCV, 2011.

[4] Eduardo Fonseca, Manoj Plakal, Daniel P. W. Ellis, Frederic Font, Xavier Favory, Xavier Serra. "Learning Sound Event Classifiers from Web Audio with Noisy Labels". arXiv:1901.01189, 2019

[5] Royi Ronen, Marian Radu, Corina Feuerstein, Elad Yom-Tov, Mansour Ahmadi. "Microsoft Malware Classification Challenge". arXiv:1802.10135, 2018