TensorFlow-based training, inference and feature engineering pipelines used in the OSIC Kaggle competition

TensorFlow-based Quantile Regression solution - OSIC Pulmonary Fibrosis

A complete TensorFlow pipeline of training, inference and feature extraction notebooks used in the Kaggle competition OSIC Pulmonary Fibrosis (July-Oct 2020)

Brief overview of the competition data

The data consisted of DICOM files (images + metadata) from chest CT scans of patients, along with tabular data such as smoking status, age and Forced Vital Capacity (FVC) values.
Preview of the scan slices for one patient:
Lung mask segmentation process used (3rd image - final mask):

3D plot of a lung formed by stacking the 2D segmented masks:

Apart from the DICOM data, the tabular data was as follows:

Notebooks description

A brief description of each notebook is provided here; for details, check the notebooks themselves.

Feature Engineering notebook

A major task was engineering and extracting features from the dcm slices.
In total I engineered 5 features, as follows:

  1. Chest Volume:
    - Calculated through numpy.trapz() integration over all 2D slices, using the pixel count and the sliceThickness and pixelSpacing (voxel spacing) metadata in the dcm file
    - Inconsistencies in the data were handled first; the final distplot produced was

  2. Chest Area:
    - Maximum chest area, calculated as the average over the 3 middle-most slices in the same fashion as Chest Volume
    - distplot

  3. Lung - Tissue ratio:
    - Ratio of the pixel area of the segmented lung mask to the total tissue pixel area in the original dcm file
    - The idea behind this feature was to detect lung shrinkage inside the chest
    - distplot

  4. Chest Height:
    - Chest height calculated from sliceThickness and the number of slices forming the lung
    - distplot

  5. Height of the Patient:
    - Approximate height calculated from the FVC values and age of a patient, according to formulae and observations taken from external medical research data
    - distplot
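
The Chest Volume and Chest Height features above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the function names, the mask layout (n_slices, rows, cols) and the assumption of a constant slice spacing are all hypothetical, and the inconsistency handling mentioned above is not reproduced.

```python
import numpy as np

# np.trapezoid is the NumPy 2.x name; np.trapz is the older name used in this README.
_trapezoid = getattr(np, "trapezoid", None) or np.trapz


def chest_volume_ml(masks, pixel_spacing_mm, slice_thickness_mm):
    """Integrate the per-slice mask area along the scan axis (hypothetical helper).

    masks: array of shape (n_slices, rows, cols), non-zero inside the segmented region.
    pixel_spacing_mm: (row_spacing, col_spacing) in millimetres.
    slice_thickness_mm: distance between slices, assumed constant here.
    """
    masks = np.asarray(masks) > 0
    pixel_area_mm2 = float(pixel_spacing_mm[0]) * float(pixel_spacing_mm[1])
    # Area of the segmented region on each slice, in mm^2.
    slice_areas_mm2 = masks.sum(axis=(1, 2)) * pixel_area_mm2
    # Trapezoidal rule over equally spaced slices gives a volume in mm^3.
    volume_mm3 = _trapezoid(slice_areas_mm2, dx=float(slice_thickness_mm))
    return float(volume_mm3) / 1000.0  # 1 mL = 1000 mm^3


def chest_height_mm(masks, slice_thickness_mm):
    """Number of slices containing any mask pixels, scaled by slice thickness."""
    masks = np.asarray(masks) > 0
    n_slices_with_lung = int(masks.any(axis=(1, 2)).sum())
    return n_slices_with_lung * float(slice_thickness_mm)
```

For example, five fully filled 10 x 10 masks with 1 mm pixel spacing and 2 mm slice thickness give four trapezoids of 200 mm^3 each, i.e. 0.8 mL, and a height of 10 mm.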

Plots of Features vs FVC / Percent

[TRAIN] notebook

The EffNet training notebook is described below; the custom tf model that uses tabular data only is covered in the [INFERENCE] notebook itself.

  1. Pre-Processing:
    - Handled the varying image sizes and missing-slice issues
    - Stratified 5-fold split based on PatientID

  2. Augmentations:
    - Albumentations - RandomSizedCrop, Flips, Gaussian Blur, CoarseDropout, Rotate (0-90)

  3. Configurations:
    - Optimizer - NAdam
    - LR Scheduler - ReduceLROnPlateau (initial LR = 0.0005, patience = 5, factor = 0.5)
    - Model - EfficientNet B5
    - Input Size - 512 * 512
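
To make the scheduler settings above concrete, here is a simplified plain-Python sketch of a reduce-on-plateau rule using the listed values (initial LR = 0.0005, patience = 5, factor = 0.5). It is an illustration of the idea only, not the Keras callback and not the notebook's code; the function name and the min_delta parameter are assumptions.

```python
def plateau_schedule(val_losses, initial_lr=0.0005, patience=5, factor=0.5, min_delta=0.0):
    """Return the learning rate in effect during each epoch (hypothetical helper).

    The rate is multiplied by `factor` once the validation loss has failed to
    improve for `patience` consecutive epochs; the counter then starts over.
    """
    lr = initial_lr
    best = float("inf")
    wait = 0
    lrs = []
    for loss in val_losses:
        lrs.append(lr)  # rate used while training this epoch
        if loss < best - min_delta:
            best = loss  # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor  # plateau detected: shrink the rate
                wait = 0
    return lrs
```

With losses that improve for two epochs and then stall, the rate stays at 0.0005 through the fifth stalled epoch and drops to 0.00025 for the epoch after it.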

[INFERENCE] Submission notebook

Also contains the training and inference code for the custom tabular data model

  1. Custom Net:
    - A tiny net of swish-activated dense layers, fed with the given tabular data and the engineered features
    - A pinball loss over multiple quantiles was used; the difference between the first and last predicted quantiles served as the uncertainty measure

  2. Ensemble:
    - Final submission made using an ensemble of the EffNet image model and the custom model
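
The pinball loss and the quantile-spread uncertainty described above can be sketched in NumPy as below. This is a hedged illustration rather than the notebook's TensorFlow implementation: the quantile levels (0.2, 0.5, 0.8) and the function names are assumptions chosen for the example.

```python
import numpy as np

# Quantile levels are an assumption for illustration; the README does not list them.
QUANTILES = (0.2, 0.5, 0.8)


def pinball_loss(y_true, y_pred, quantiles=QUANTILES):
    """Mean pinball (quantile) loss over all samples and quantiles.

    y_true: shape (n,), observed values (e.g. FVC).
    y_pred: shape (n, len(quantiles)), one predicted value per quantile.
    """
    y_true = np.asarray(y_true, dtype=float).reshape(-1, 1)
    y_pred = np.asarray(y_pred, dtype=float)
    q = np.asarray(quantiles, dtype=float).reshape(1, -1)
    err = y_true - y_pred
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return float(np.mean(np.maximum(q * err, (q - 1.0) * err)))


def quantile_spread(y_pred):
    """Uncertainty proxy: last predicted quantile minus the first."""
    y_pred = np.asarray(y_pred, dtype=float)
    return y_pred[:, -1] - y_pred[:, 0]
```

For a true value of 3000 and predictions (2900, 3000, 3100), the per-quantile losses are 20, 0 and 20, and the spread used as the uncertainty is 200.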

How to use

Just change the directories according to your environment.

Google Colab versions are available for:
[TRAIN] Effnet
[TRAIN] Base Custom Net