This is a fork of the Petrobrass 3W dataset.
First semester course 2022/2023
Classify the undesirable events that occur in offshore oil wells reported in the 3W dataset (Petrobras) using the techniques and algorithms of machine Learning.
See EDA
The dataset is composed of 8 folders, one for every situation of the process (one normal and 8 abnormal). Every folder has multiple files containing a time series of meter readings and a class label.
A data manager Python class (d3w) has been written to facilitate access to the files and split into training, validation and test datasets.
Two main assumptions has been made prior to fitting model to the data:
- Minute averages instead of instantaneous (per second) data are used. It seems that the dynamics of the process are not so fast.
- The situations can only be assessed looking at a time window of around 30 minutes (2D time x feature).
A Data Generator class (CustomDataGen) has been developed for every type of model.
A Python class has been developed to fit, evaluate and score different neural networks. The most promising are CNN's (Conv1D layers) and RNN's (LSTM layers). See keras_model_complex.
Some models that accept incremental learning has been tried not very succesfully. See multiclass_model.
A better solution has been first preprocess the dataset through a ipca model (incremental PCA) and compress the data into a much smaller datasets that fits in memory and allow the use of a greater variety of models. See ipca_model for a Random Forest.
River is a library to build online machine learning models.
Preprocessing is carried out using River routines:
- Trying to emulate a SCADA that send to the model one row every second. Data is preprocessed (scaled), incremental rolling time window statistics are calculated and fed to an incremental learning model. What's not very realistic is using an abnormal class label for each row as this classification has been assessed by an expert in batch. The score of the test dataset is poor and the fitting process is very slow.
- Using minute averages computed with pandas is much faster but less realistic. A python class for rolling window values has been developed that is used along with river stats calculations.
- Only real data sorted by date has been used. The only abnormal situation with enough real data is Flow Instability (class 4).
An Hoeffding Tree Classifier has been used. Is a multiclass classifier.
A drift detector has been introduced in the stream learning process trying to detect concept drift in the flow of data that advices to reinitialize the model been trained. See river_window_model (html).
The final TFM report in PDF format (Spanish).
The final TFM slide presentation in PDF format (Spanish).
*** The tfm folder contains the ypnb files for training various models with the dataset.