Commit
Merge branch 'master' of github.com:YosefLab/epitome into test
akmorrow13 committed Mar 5, 2021
2 parents 5f8311a + 58931c8 commit 3ff0e8a
Showing 13 changed files with 359 additions and 92 deletions.
39 changes: 39 additions & 0 deletions .github/workflows/main.yml
@@ -0,0 +1,39 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: epitome

on:
  push:
    branches: [master]
  pull_request:
    branches: [master]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [3.6, 3.7, 3.8]

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}
    - name: Cache pip
      uses: actions/cache@v2
      with:
        path: ~/.cache/pip
        key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
        restore-keys: |
          ${{ runner.os }}-pip-
    - name: Install dependencies
      run: |
        pip install pytest-cov
        make develop
    - name: Test with pytest
      run: |
        make test
12 changes: 10 additions & 2 deletions README.md
@@ -1,3 +1,9 @@
[![pypi](https://img.shields.io/pypi/v/epitome.svg)](https://pypi.org/project/epitome/)
[![docs](https://readthedocs.org/projects/epitome/badge/?version=latest)](https://epitome.readthedocs.io/en/latest/)
![Build status](https://github.com/YosefLab/epitome/workflows/epitome/badge.svg)
[![Codacy Badge](https://api.codacy.com/project/badge/Grade/6c2cef0a2eae45399c9caed2d8c81965)](https://app.codacy.com/gh/YosefLab/epitome?utm_source=github.com&utm_medium=referral&utm_content=YosefLab/epitome&utm_campaign=Badge_Grade)


# Epitome

Pipeline for predicting ChIP-seq peaks in novel cell types using chromatin accessibility.
@@ -6,6 +12,10 @@ Pipeline for predicting ChIP-seq peaks in novel cell types using chromatin acces

Epitome leverages chromatin accessibility (either DNase-seq or ATAC-seq) to predict epigenetic events in a novel cell type of interest. Such epigenetic events include transcription factor binding sites and histone modifications. Epitome computes chromatin accessibility similarity between ENCODE cell types and the novel cell type, and uses this information to transfer known epigenetic signal to the novel cell type of interest.

# Documentation

Epitome documentation is hosted at [readthedocs](https://epitome.readthedocs.io/en/latest/). Documentation for Epitome includes tutorials for creating Epitome datasets, and for training, testing, and evaluating models.


## Requirements
* [conda](https://docs.conda.io/en/latest/miniconda.html)
Expand All @@ -25,8 +35,6 @@ pip install epitome

## Training a Model

TODO: link to documentation

First, create an Epitome dataset that defines the cell types and ChIP-seq
targets you want to train on,
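for example, a minimal sketch (the target and cell type names below are hypothetical placeholders, not required values):

```python
from epitome.dataset import *
from epitome.models import *

# hypothetical example targets and cell types, for illustration only
targets = ['CTCF', 'RAD21']
celltypes = ['K562', 'A549', 'GM12878']

dataset = EpitomeDataset(targets=targets, cells=celltypes)
```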

4 changes: 2 additions & 2 deletions docs/requirements.txt
@@ -1,4 +1,4 @@
-sphinx==1.7.7
+sphinx==2.1.1
+sphinx_rtd_theme==0.4.3
 nbsphinx==0.3.4
-sphinx_rtd_theme==0.4.2
 mock
69 changes: 61 additions & 8 deletions docs/usage/train.rst
@@ -3,7 +3,7 @@ Training an Epitome Model

Once you have `installed Epitome <../installation/source.html>`__, you are ready to train a model.

-Training a Model
+Create a Dataset
----------------

First, import Epitome:
@@ -13,10 +13,7 @@ First, import Epitome:
from epitome.dataset import *
from epitome.models import *
-Create an Epitome Dataset
--------------------------
-
-First, create an Epitome Dataset. In the dataset, you will define the
+Next, create an Epitome Dataset. In the dataset, you will need to define the
ChIP-seq targets you want to predict, the cell types you want to train from,
and the assays you want to use to compute cell type similarity. For more information
on creating an Epitome dataset, see `Configuring data <./dataset.html>`__.
@@ -28,24 +25,80 @@ on creating an Epitome dataset, see `Configuring data <./dataset.html>`__.
dataset = EpitomeDataset(targets=targets, cells=celltypes)
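Here, ``targets`` and ``celltypes`` are assumed to be defined beforehand; a purely hypothetical example:

.. code:: python

    # hypothetical example values; use any supported ChIP-seq targets and cell types
    targets = ['CTCF', 'RAD21']
    celltypes = ['K562', 'A549', 'GM12878']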
Train a Model
----------------
Now, you can create a model:

.. code:: python

    model = EpitomeModel(dataset, test_celltypes = ["K562"]) # cell line reserved for testing
-Next, train the model. Here, we train the model for 5000 iterations:
+Next, train the model. Here, we train the model for 5000 batches:

.. code:: python

    model.train(5000)

-You can then evaluate model performance on held out test cell lines specified in the model declaration. In this case, we will evaluate on K562 on the first 10,000 points.
Train a Model that Stops Early
-------------------------------
If you are not sure how many batches your model should train for, or are concerned
about your model overfitting, you can specify the max_valid_batches parameter when
initializing the model. This creates a train-validation dataset of size
max_valid_batches, on which the model validates every 200 training batches to
compute a train-validation loss. The model may stop training early (before
max_train_batches) if its train-validation loss stops improving; otherwise, it
will continue to train until max_train_batches.

First, create a model with a train-validation set size of 1000:

.. code:: python

    model = EpitomeModel(dataset,
                         test_celltypes = ["K562"],   # cell line reserved for testing
                         max_valid_batches = 1000)    # train-validation set size reserved while training
Next, we train the model for a maximum of 5000 batches. If the train-validation
loss stops improving, the model will stop training early:

.. code:: python

    best_model_batches, total_trained_batches, train_valid_losses = model.train(5000)
If you are concerned that the model above is overtraining because it keeps
improving by minuscule amounts, you can specify min_delta, the minimum change
in the train-validation loss required to qualify as an improvement. For the
model below, an improvement of at least 0.1 is required to count as improving.

If you are concerned about the model above under-fitting (stopping training too
early because the train-validation loss might worsen slightly before reaching its
best value), you can specify patience. In the model below, a patience of 3 allows
the model to train for up to 3 train-validation iterations (200 batches each)
with no improvement before stopping.

You can read the in-depth explanation of these hyper-parameters in
`this section <https://www.overleaf.com/project/5cd315cb8028bd409596bdff>`__ of the
paper. Detailed documentation of the train() function can also
be found in the `Github repo <https://github.com/YosefLab/epitome>`__.

.. code:: python

    best_model_batches, total_trained_batches, train_valid_losses = model.train(5000,
                                                                                patience = 3,
                                                                                min_delta = 0.1)
Test the Model
----------------
Finally, you can evaluate model performance on the held-out test cell lines specified
in the model declaration. In this case, we will evaluate on K562 using the first 10,000 points.

.. code:: python

    results = model.test(10000,
                         mode = Dataset.TEST,
                         calculate_metrics=True)
-The output of `results` will contain the predictions and truth values, a dictionary of assay specific performance metrics, and the average auROC and auPRC across all evaluated assays.
+The output of `results` will contain the predictions and truth values, a dictionary
+of assay specific performance metrics, and the average auROC and auPRC across all
+evaluated assays.
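As a sketch of reading those summary metrics (the ``auROC`` and ``auPRC`` key names are assumptions for illustration, not confirmed API; inspect ``results.keys()`` to verify):

.. code:: python

    # assumed key names, for illustration only
    print(results['auROC'])
    print(results['auPRC'])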
29 changes: 0 additions & 29 deletions epitome/__init__.py
@@ -22,35 +22,6 @@

__path__ = __import__('pkgutil').extend_path(__path__, __name__)

S3_DATA_PATH = 'https://epitome-data.s3-us-west-1.amazonaws.com/hg19.zip'

# os env that should be set by user to explicitly set the data path
EPITOME_DATA_PATH_ENV="EPITOME_DATA_PATH"

# data files required by epitome
# data.h5 contains data, row information (celltypes and targets) and
# column information (chr, start, binSize)
EPITOME_H5_FILE = "data.h5"
REQUIRED_FILES = [EPITOME_H5_FILE]
# required keys in h5 file
REQUIRED_KEYS = ['/',
                 '/columns',
                 '/columns/binSize',
                 '/columns/chr',
                 '/columns/index',
                 '/columns/index/TEST',
                 '/columns/index/TRAIN',
                 '/columns/index/VALID',
                 '/columns/index/test_chrs',
                 '/columns/index/valid_chrs',
                 '/columns/start',
                 '/data',
                 '/meta',
                 '/meta/assembly',
                 '/meta/source',
                 '/rows',
                 '/rows/celltypes',
                 '/rows/targets']

def GET_EPITOME_USER_PATH():
    return os.path.join(os.path.expanduser('~'), '.epitome')
5 changes: 5 additions & 0 deletions epitome/constants.py
@@ -60,3 +60,8 @@ class Dataset(Enum):
r"""
All mode: Specifies that data should not be divided by train, valid, and test.
"""

TRAIN_VALID = 6 # For early stopping criteria. pulls train/valid chromsome.
r"""
TRAIN_VALID mode: Specifies that only a validation chr from train should be used.
"""
63 changes: 61 additions & 2 deletions epitome/dataset.py
@@ -22,10 +22,42 @@

# local imports
from epitome import *
-from .constants import *
+from .constants import Dataset
from .functions import download_and_unzip
from .viz import plot_assay_heatmap

################### File accession constants #######################
S3_DATA_PATH = 'https://epitome-data.s3-us-west-1.amazonaws.com/hg19.zip'

# os env that should be set by user to explicitly set the data path
EPITOME_DATA_PATH_ENV="EPITOME_DATA_PATH"

# data files required by epitome
# data.h5 contains data, row information (celltypes and targets) and
# column information (chr, start, binSize)
EPITOME_H5_FILE = "data.h5"
REQUIRED_FILES = [EPITOME_H5_FILE]
# required keys in h5 file
REQUIRED_KEYS = ['/',
                 '/columns',
                 '/columns/binSize',
                 '/columns/chr',
                 '/columns/index',
                 '/columns/index/TEST',
                 '/columns/index/TRAIN',
                 '/columns/index/VALID',
                 '/columns/index/test_chrs',
                 '/columns/index/valid_chrs',
                 '/columns/start',
                 '/data',
                 '/meta',
                 '/meta/assembly',
                 '/meta/source',
                 '/rows',
                 '/rows/celltypes',
                 '/rows/targets']
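
As an aside, a minimal sketch of how one might use REQUIRED_KEYS to sanity-check a downloaded data.h5 file (`validate_h5` is a hypothetical helper, not part of the epitome API):

```python
import h5py

def validate_h5(h5_path):
    # hypothetical helper: confirm every required key is present in the h5 file
    with h5py.File(h5_path, 'r') as f:
        missing = [k for k in REQUIRED_KEYS if f.get(k) is None]
    if missing:
        raise ValueError("data.h5 is missing required keys: %s" % missing)
```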


class EpitomeDataset:
'''
Dataset for holding Epitome data.
@@ -118,6 +150,7 @@ def __init__(self,
        self.indices[Dataset.TRAIN] = dataset['columns']['index'][Dataset.TRAIN.name][:]
        self.indices[Dataset.VALID] = dataset['columns']['index'][Dataset.VALID.name][:]
        self.indices[Dataset.TEST] = dataset['columns']['index'][Dataset.TEST.name][:]
        self.indices[Dataset.TRAIN_VALID] = [] # placeholder in case early stopping is used
        self.valid_chrs = [i.decode() for i in dataset['columns']['index']['valid_chrs'][:]]
        self.test_chrs = [i.decode() for i in dataset['columns']['index']['test_chrs'][:]]

@@ -127,8 +160,34 @@ def __init__(self,

        dataset.close()

    def set_train_validation_indices(self, chrom):
        '''
        Removes a given chromosome from the TRAIN dataset and reserves it as
        its own TRAIN_VALID dataset.

        :param str chrom: string representation of chromosome in 'chr{int}' format (Ex: 'chr22').
        '''
        assert chrom in self.regions.chromosomes, "%s must be part of the genome assembly. Not found in regions." % chrom
        assert chrom not in self.valid_chrs and chrom not in self.test_chrs, "%s cannot be a valid or test chromosome." % chrom

        # load in original training indices
        dataset = h5py.File(self.h5_path, 'r')
        train_indices = dataset['columns']['index'][Dataset.TRAIN.name][:]
        dataset.close()

        chr_indices = self.regions[self.regions.Chromosome == chrom].idx

        # make sure this chromosome is in the train set
        assert len(np.setdiff1d(chr_indices, train_indices)) == 0, "chr_indices must be a subset of train_indices"

        # move the chromosome's indices from TRAIN to TRAIN_VALID
        self.indices[Dataset.TRAIN] = np.setdiff1d(train_indices, chr_indices)
        self.indices[Dataset.TRAIN_VALID] = chr_indices
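
A usage sketch of this method, assuming chr7 is one of the training chromosomes in the assembly (the choice of chromosome is arbitrary):

```python
# `dataset` is assumed to be an EpitomeDataset built as in the training tutorial
dataset.set_train_validation_indices('chr7')

# the chromosome's regions are moved from TRAIN into TRAIN_VALID
print(len(dataset.indices[Dataset.TRAIN_VALID]))
```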


    def get_parameter_dict(self):
-        ''' Returns dict of all parameters required to reconstruct this dataset
+        '''
+        Returns dict of all parameters required to reconstruct this dataset
+        :return: dict containing all parameters to reconstruct dataset.
+        :rtype: dict
2 changes: 0 additions & 2 deletions epitome/generators.py
@@ -122,7 +122,6 @@ def load_data(data,

# sites where TF is bound in at least 2 cell lines
positive_indices = np.where(np.sum(data[TF_indices,:], axis=0) > 1)[0]

indices_probs = np.ones([data.shape[1]])
indices_probs[positive_indices] = 0
indices_probs = indices_probs/np.sum(indices_probs, keepdims=1)
@@ -143,7 +142,6 @@ def load_data(data,
else:
indices = range(0, data.shape[-1]) # not training mode, set to all points


if (mode == Dataset.RUNTIME):
label_cell_types = ["PLACEHOLDER_CELL"]
if similarity_matrix is None:
