Commit
Merge branch 'master' of github.com:YosefLab/epitome into test
akmorrow13 committed Mar 5, 2021
2 parents 5f8311a + 58931c8 commit 3ff0e8a
Showing 13 changed files with 359 additions and 92 deletions.
39 changes: 39 additions & 0 deletions .github/workflows/main.yml
@@ -0,0 +1,39 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: epitome

on:
  push:
    branches: [master]
  pull_request:
    branches: [master]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [3.6, 3.7, 3.8]

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}
    - name: Cache pip
      uses: actions/cache@v2
      with:
        path: ~/.cache/pip
        key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
        restore-keys: |
          ${{ runner.os }}-pip-
    - name: Install dependencies
      run: |
        pip install pytest-cov
        make develop
    - name: Test with pytest
      run: |
        make test
12 changes: 10 additions & 2 deletions README.md
@@ -1,3 +1,9 @@
[![pypi](https://img.shields.io/pypi/v/epitome.svg)](https://pypi.org/project/epitome/)
[![docs](https://readthedocs.org/projects/epitome/badge/?version=latest)](https://epitome.readthedocs.io/en/latest/)
![Build status](https://github.com/YosefLab/epitome/workflows/epitome/badge.svg)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/6c2cef0a2eae45399c9caed2d8c81965)](https://app.codacy.com/gh/YosefLab/epitome?utm_source=github.com&utm_medium=referral&utm_content=YosefLab/epitome&utm_campaign=Badge_Grade)


# Epitome

Pipeline for predicting ChIP-seq peaks in novel cell types using chromatin accessibility.
@@ -6,6 +12,10 @@ Pipeline for predicting ChIP-seq peaks in novel cell types using chromatin acces

Epitome leverages chromatin accessibility (either DNase-seq or ATAC-seq) to predict epigenetic events in a novel cell type of interest. Such epigenetic events include transcription factor binding sites and histone modifications. Epitome computes chromatin accessibility similarity between ENCODE cell types and the novel cell type, and uses this information to transfer known epigenetic signal to the novel cell type of interest.

# Documentation

Epitome documentation is hosted at [readthedocs](https://epitome.readthedocs.io/en/latest/). Documentation for Epitome includes tutorials for creating Epitome datasets, and for training, testing, and evaluating models.


## Requirements
* [conda](https://docs.conda.io/en/latest/miniconda.html)
Expand All @@ -25,8 +35,6 @@ pip install epitome

## Training a Model

TODO: link to documentation

First, create an Epitome dataset that defines the cell types and ChIP-seq
targets you want to train on,
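for example, a minimal sketch (the target and cell type names below are hypothetical placeholders, not required values):

```python
from epitome.dataset import *
from epitome.models import *

# hypothetical example targets and cell types, for illustration only
targets = ['CTCF', 'RAD21']
celltypes = ['K562', 'A549', 'GM12878']

dataset = EpitomeDataset(targets=targets, cells=celltypes)
```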

4 changes: 2 additions & 2 deletions docs/requirements.txt
@@ -1,4 +1,4 @@
-sphinx==1.7.7
+sphinx==2.1.1
+sphinx_rtd_theme==0.4.3
 nbsphinx==0.3.4
-sphinx_rtd_theme==0.4.2
 mock
69 changes: 61 additions & 8 deletions docs/usage/train.rst
@@ -3,7 +3,7 @@ Training an Epitome Model

Once you have `installed Epitome <../installation/source.html>`__, you are ready to train a model.

-Training a Model
+Create a Dataset
----------------

First, import Epitome:
@@ -13,10 +13,7 @@ First, import Epitome:
from epitome.dataset import *
from epitome.models import *
-Create an Epitome Dataset
--------------------------
-
-First, create an Epitome Dataset. In the dataset, you will define the
+Next, create an Epitome Dataset. In the dataset, you will need to define the
ChIP-seq targets you want to predict, the cell types you want to train from,
and the assays you want to use to compute cell type similarity. For more information
on creating an Epitome dataset, see `Configuring data <./dataset.html>`__.
@@ -28,24 +25,80 @@ on creating an Epitome dataset, see `Configuring data <./dataset.html>`__.
dataset = EpitomeDataset(targets=targets, cells=celltypes)
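Here, ``targets`` and ``celltypes`` are assumed to be defined beforehand; a purely hypothetical example:

.. code:: python

    # hypothetical example values; use any supported ChIP-seq targets and cell types
    targets = ['CTCF', 'RAD21']
    celltypes = ['K562', 'A549', 'GM12878']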
Train a Model
----------------
Now, you can create a model:

.. code:: python

    model = EpitomeModel(dataset, test_celltypes = ["K562"]) # cell line reserved for testing
-Next, train the model. Here, we train the model for 5000 iterations:
+Next, train the model. Here, we train the model for 5000 batches:

.. code:: python

    model.train(5000)

-You can then evaluate model performance on held out test cell lines specified in the model declaration. In this case, we will evaluate on K562 on the first 10,000 points.
Train a Model that Stops Early
-------------------------------
If you are not sure how many batches your model should train for, or are concerned
about your model overfitting, you can specify the max_valid_batches parameter when
initializing the model. This creates a train-validation dataset of size
max_valid_batches, on which the model validates every 200 training batches to
compute a train-validation loss. The model may stop training early (before
max_train_batches) if its train-validation loss stops improving; otherwise, it
will continue to train until max_train_batches.

First, create a model with a train-validation set size of 1000:

.. code:: python

    model = EpitomeModel(dataset,
                         test_celltypes = ["K562"],   # cell line reserved for testing
                         max_valid_batches = 1000)    # train-validation set size reserved while training
Next, we train the model for a maximum of 5000 batches. If the train-validation
loss stops improving, the model will stop training early:

.. code:: python

    best_model_batches, total_trained_batches, train_valid_losses = model.train(5000)
If you are concerned that the model above is overtraining because it keeps
improving by minuscule amounts, you can specify min_delta, the minimum change
in the train-validation loss required to qualify as an improvement. For the
model below, an improvement of at least 0.1 is required to count as improving.

If you are concerned about the model above under-fitting (stopping training too
early because the train-validation loss might worsen slightly before reaching its
best value), you can specify patience. In the model below, a patience of 3 allows
the model to train for up to 3 train-validation iterations (200 batches each)
with no improvement before stopping.

You can read the in-depth explanation of these hyper-parameters in
`this section <https://www.overleaf.com/project/5cd315cb8028bd409596bdff>`__ of the
paper. Detailed documentation of the train() function can also
be found in the `Github repo <https://github.com/YosefLab/epitome>`__.

.. code:: python

    best_model_batches, total_trained_batches, train_valid_losses = model.train(5000,
                                                                                patience = 3,
                                                                                min_delta = 0.1)
Test the Model
----------------
Finally, you can evaluate model performance on the held-out test cell lines specified
in the model declaration. In this case, we will evaluate on K562 using the first 10,000 points.

.. code:: python

    results = model.test(10000,
                         mode = Dataset.TEST,
                         calculate_metrics=True)
-The output of `results` will contain the predictions and truth values, a dictionary of assay specific performance metrics, and the average auROC and auPRC across all evaluated assays.
+The output of `results` will contain the predictions and truth values, a dictionary
+of assay specific performance metrics, and the average auROC and auPRC across all
+evaluated assays.
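As a sketch of reading those summary metrics (the ``auROC`` and ``auPRC`` key names are assumptions for illustration, not confirmed API; inspect ``results.keys()`` to verify):

.. code:: python

    # assumed key names, for illustration only
    print(results['auROC'])
    print(results['auPRC'])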
29 changes: 0 additions & 29 deletions epitome/__init__.py
@@ -22,35 +22,6 @@

__path__ = __import__('pkgutil').extend_path(__path__, __name__)

S3_DATA_PATH = 'https://epitome-data.s3-us-west-1.amazonaws.com/hg19.zip'

# os env that should be set by user to explicitly set the data path
EPITOME_DATA_PATH_ENV="EPITOME_DATA_PATH"

# data files required by epitome
# data.h5 contains data, row information (celltypes and targets) and
# column information (chr, start, binSize)
EPITOME_H5_FILE = "data.h5"
REQUIRED_FILES = [EPITOME_H5_FILE]
# required keys in h5 file
REQUIRED_KEYS = ['/',
                 '/columns',
                 '/columns/binSize',
                 '/columns/chr',
                 '/columns/index',
                 '/columns/index/TEST',
                 '/columns/index/TRAIN',
                 '/columns/index/VALID',
                 '/columns/index/test_chrs',
                 '/columns/index/valid_chrs',
                 '/columns/start',
                 '/data',
                 '/meta',
                 '/meta/assembly',
                 '/meta/source',
                 '/rows',
                 '/rows/celltypes',
                 '/rows/targets']

def GET_EPITOME_USER_PATH():
    return os.path.join(os.path.expanduser('~'), '.epitome')
5 changes: 5 additions & 0 deletions epitome/constants.py
@@ -60,3 +60,8 @@ class Dataset(Enum):
r"""
All mode: Specifies that data should not be divided by train, valid, and test.
"""

TRAIN_VALID = 6 # For early stopping criteria. pulls train/valid chromsome.
r"""
TRAIN_VALID mode: Specifies that only a validation chr from train should be used.
"""
63 changes: 61 additions & 2 deletions epitome/dataset.py
@@ -22,10 +22,42 @@

# local imports
from epitome import *
-from .constants import *
+from .constants import Dataset
from .functions import download_and_unzip
from .viz import plot_assay_heatmap

################### File accession constants #######################
S3_DATA_PATH = 'https://epitome-data.s3-us-west-1.amazonaws.com/hg19.zip'

# os env that should be set by user to explicitly set the data path
EPITOME_DATA_PATH_ENV="EPITOME_DATA_PATH"

# data files required by epitome
# data.h5 contains data, row information (celltypes and targets) and
# column information (chr, start, binSize)
EPITOME_H5_FILE = "data.h5"
REQUIRED_FILES = [EPITOME_H5_FILE]
# required keys in h5 file
REQUIRED_KEYS = ['/',
                 '/columns',
                 '/columns/binSize',
                 '/columns/chr',
                 '/columns/index',
                 '/columns/index/TEST',
                 '/columns/index/TRAIN',
                 '/columns/index/VALID',
                 '/columns/index/test_chrs',
                 '/columns/index/valid_chrs',
                 '/columns/start',
                 '/data',
                 '/meta',
                 '/meta/assembly',
                 '/meta/source',
                 '/rows',
                 '/rows/celltypes',
                 '/rows/targets']
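
As an aside, a minimal sketch of how one might use REQUIRED_KEYS to sanity-check a downloaded data.h5 file (`validate_h5` is a hypothetical helper, not part of the epitome API):

```python
import h5py

def validate_h5(h5_path):
    # hypothetical helper: confirm every required key is present in the h5 file
    with h5py.File(h5_path, 'r') as f:
        missing = [k for k in REQUIRED_KEYS if f.get(k) is None]
    if missing:
        raise ValueError("data.h5 is missing required keys: %s" % missing)
```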


class EpitomeDataset:
'''
Dataset for holding Epitome data.
@@ -118,6 +150,7 @@ def __init__(self,
        self.indices[Dataset.TRAIN] = dataset['columns']['index'][Dataset.TRAIN.name][:]
        self.indices[Dataset.VALID] = dataset['columns']['index'][Dataset.VALID.name][:]
        self.indices[Dataset.TEST] = dataset['columns']['index'][Dataset.TEST.name][:]
        self.indices[Dataset.TRAIN_VALID] = [] # placeholder in case early stopping is used
        self.valid_chrs = [i.decode() for i in dataset['columns']['index']['valid_chrs'][:]]
        self.test_chrs = [i.decode() for i in dataset['columns']['index']['test_chrs'][:]]

@@ -127,8 +160,34 @@ def __init__(self,

        dataset.close()

    def set_train_validation_indices(self, chrom):
        '''
        Removes a given chromosome from the TRAIN dataset and reserves it as
        its own TRAIN_VALID dataset.

        :param str chrom: string representation of chromosome in 'chr{int}' format (Ex: 'chr22').
        '''
        assert chrom in self.regions.chromosomes, "%s must be part of the genome assembly. Not found in regions." % chrom
        assert chrom not in self.valid_chrs and chrom not in self.test_chrs, "%s cannot be a valid or test chromosome." % chrom

        # load in original training indices
        dataset = h5py.File(self.h5_path, 'r')
        train_indices = dataset['columns']['index'][Dataset.TRAIN.name][:]
        dataset.close()

        chr_indices = self.regions[self.regions.Chromosome == chrom].idx

        # make sure this chromosome is in the train set
        assert len(np.setdiff1d(chr_indices, train_indices)) == 0, "chr_indices must be a subset of train_indices"

        # move the chromosome's indices from TRAIN to TRAIN_VALID
        self.indices[Dataset.TRAIN] = np.setdiff1d(train_indices, chr_indices)
        self.indices[Dataset.TRAIN_VALID] = chr_indices
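
A usage sketch of this method, assuming chr7 is one of the training chromosomes in the assembly (the choice of chromosome is arbitrary):

```python
# `dataset` is assumed to be an EpitomeDataset built as in the training tutorial
dataset.set_train_validation_indices('chr7')

# the chromosome's regions are moved from TRAIN into TRAIN_VALID
print(len(dataset.indices[Dataset.TRAIN_VALID]))
```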


    def get_parameter_dict(self):
-        ''' Returns dict of all parameters required to reconstruct this dataset
+        '''
+        Returns dict of all parameters required to reconstruct this dataset
+        :return: dict containing all parameters to reconstruct dataset.
+        :rtype: dict
2 changes: 0 additions & 2 deletions epitome/generators.py
@@ -122,7 +122,6 @@ def load_data(data,

# sites where TF is bound in at least 2 cell lines
positive_indices = np.where(np.sum(data[TF_indices,:], axis=0) > 1)[0]

indices_probs = np.ones([data.shape[1]])
indices_probs[positive_indices] = 0
indices_probs = indices_probs/np.sum(indices_probs, keepdims=1)
@@ -143,7 +142,6 @@ def load_data(data,
else:
indices = range(0, data.shape[-1]) # not training mode, set to all points


if (mode == Dataset.RUNTIME):
label_cell_types = ["PLACEHOLDER_CELL"]
if similarity_matrix is None:
