GitHub - bio-it-station/DeepCircle: Predict eccDNA based on deep learning models

Predict eccDNAs based on CNN and DNABERT

Install

git clone --recursive https://github.com/KaiLiChang/DeepCircle.git
cd DeepCircle
conda env create -f environment.yml
conda activate DeepCircle

Download and unzip reference genome, pre-trained and fine-tuned DNABERT

bash download_genome_and_models.sh

If you want to infer eccDNA-related motifs, please download and install the MEME suite (see https://meme-suite.org/meme/doc/install.html?man_type=web)

Quick start

Pre-processing

Pre-process the dataset of both models (CNN and DNABERT)

python3 ./bin/DeepCircle.py Pre-processing -type PC-3 -bound hg38_noalt.fa.genome -g hg38_noalt_gap.bed -gen hg38_noalt.fa -ecc PC-3_circleseq_eccdna_filt_uniq.bed

CNN

Train a CNN based on provided eccDNAs

python3 ./bin/DeepCircle.py CNN-train -bed ./DNABERT/eccdna/db/human/PC-3_circleseq_eccdna_filt_uniq.bed

Predict eccDNAs based on given bed files using pre-trained CNN models

python3 ./bin/DeepCircle.py CNN-predict -bed ./DNABERT/eccdna/db/human/PC-3_circleseq_eccdna_filt_uniq.bed -model PC-3

Infer eccDNA-related motifs based on prediction results of CNN

python3 ./bin/DeepCircle.py infer-motif -method CNN -data PC-3 -model PC-3

DNABERT

Fine-tune a DNABERT model based on provided eccDNAs

python3 ./bin/DeepCircle.py DNABERT-train -bed PC-3_circleseq_eccdna_filt_uniq.bed

Predict eccDNAs based on given bed files using fine-tuned DNABERT models

python3 ./bin/DeepCircle.py DNABERT-predict -bed PC-3_circleseq_eccdna_filt_uniq.bed -model PC-3

Infer eccDNA-related motifs based on prediction results of DNABERT

python3 ./bin/DeepCircle.py infer-motif -method DNABERT -data PC-3 -model PC-3

Advanced usage

usage: DeepCircle [-h]
                  {CNN-train,CNN-predict,DNABERT-train,DNABERT-predict,infer-motif}
                  ...

Wrapper function for training and predicting using CNN and DNABERT within DeepCircle

positional arguments:
  {CNN-train,CNN-predict,DNABERT-train,DNABERT-predict,infer-motif}

optional arguments:
  -h, --help            show this help message and exit

Train a CNN model

usage: DeepCircle CNN-train [-h] [-c NUMBER_OF_CPUS] [-fa GENOME_FASTA]
                            [-gap GENOME_GAP] [-g GENOME] -bed ECCDNA_BED
                            [-r RATIO] [-epo EPOCHS] [-batch BATCH_SIZE]
                            [-dir OUTPUT_PREDICTION]

optional arguments:
  -h, --help            show this help message and exit
  -c NUMBER_OF_CPUS, --number-of-cpus NUMBER_OF_CPUS
                        Number of cpus [default: 8]
  -fa GENOME_FASTA, --genome-fasta GENOME_FASTA
                        Directory to the fasta file of hg38 genome [default:
                        ./DNABERT/eccdna/genome/human/hg38_noalt.fa]
  -gap GENOME_GAP, --genome-gap GENOME_GAP
                        Directory to the bed file of gaps of hg38 genome
                        [default:
                        ./DNABERT/eccdna/genome/human/hg38_noalt_gap.bed]
  -g GENOME, --genome GENOME
                        Directory to the file containing lengths of each
                        chromosome in hg38 genome [default:
                        ./DNABERT/eccdna/genome/human/hg38_noalt.fa.genome]
  -bed ECCDNA_BED, --eccDNA-bed ECCDNA_BED
                        Directory to the bed file containing eccDNAs [default:
                        ./DNABERT/eccdna/db/human/PC-3_circleseq_eccdna_filt_u
                        niq.bed]
  -r RATIO, --ratio RATIO
                        Ratio of data used for testing [default: 0.2]
  -epo EPOCHS, --epochs EPOCHS
                        Number of epochs for training [default: 5]
  -batch BATCH_SIZE, --batch-size BATCH_SIZE
                        Number of batch size for training [default: 32]
  -dir OUTPUT_PREDICTION, --output-prediction OUTPUT_PREDICTION
                        Directory to output prediction results of the model
                        [default: ./output/CNN]

Predict eccDNAs using a pre-trained CNN model

usage: DeepCircle CNN-predict [-h] [-c NUMBER_OF_CPUS] [-fa GENOME_FASTA]
                              [-gap GENOME_GAP] [-g GENOME] -bed ECCDNA_BED
                              [-dir OUTPUT_PREDICTION] -model
                              {C4-2,ES2,HeLaS3,LnCap,OVCAR8,PC-3,U937,leukocytes,muscle,pool_LCN,pool_LCT}

optional arguments:
  -h, --help            show this help message and exit
  -c NUMBER_OF_CPUS, --number-of-cpus NUMBER_OF_CPUS
                        Number of cpus [default: 8]
  -fa GENOME_FASTA, --genome-fasta GENOME_FASTA
                        Directory to the fasta file of hg38 genome [default:
                        ./DNABERT/eccdna/genome/human/hg38_noalt.fa]
  -gap GENOME_GAP, --genome-gap GENOME_GAP
                        Directory to the bed file of gaps of hg38 genome
                        [default:
                        ./DNABERT/eccdna/genome/human/hg38_noalt_gap.bed]
  -g GENOME, --genome GENOME
                        Directory to the file containing lengths of each
                        chromosome in hg38 genome [default:
                        ./DNABERT/eccdna/genome/human/hg38_noalt.fa.genome]
  -bed ECCDNA_BED, --eccDNA-bed ECCDNA_BED
                        Directory to the bed file containing eccDNAs [default:
                        ./DNABERT/eccdna/db/human/PC-3_circleseq_eccdna_filt_u
                        niq.bed]
  -dir OUTPUT_PREDICTION, --output-prediction OUTPUT_PREDICTION
                        Directory to output prediction results of the model
                        [default: ./output/CNN]
  -model {C4-2,ES2,HeLaS3,LnCap,OVCAR8,PC-3,U937,leukocytes,muscle,pool_LCN,pool_LCT}, --chosen-model {C4-2,ES2,HeLaS3,LnCap,OVCAR8,PC-3,U937,leukocytes,muscle,pool_LCN,pool_LCT}
                        Choose pre-trained models for predicting eccDNAs
                        [default: ./pre_trained_models/]

Fine-tune a DNABERT model

usage: DeepCircle DNABERT-train [-h] -bed ECCDNA_BED [-sp SPECIES] [-g GENOME]
                                [-fa GENOME_FASTA] [-gap GENOME_GAP] [-k KMER]
                                [-model_path MODEL_PATH] [-batch BATCH_SIZE]
                                [-epo EPOCHS] [-c NUMBER_OF_CPUS]

optional arguments:
  -h, --help            show this help message and exit
  -bed ECCDNA_BED, --eccDNA-bed ECCDNA_BED
                        File name of the bed file containing eccDNAs, put this
                        file in DeepCircle/DNABERT/eccdna/db/{species}/
  -sp SPECIES, --species SPECIES
                        Species from which input data derived, [default:
                        human]
  -g GENOME, --genome GENOME
                        Name of the file containing lengths of each chromosome
                        in hg38 genome, put this file in
                        DeepCircle/DNABERT/eccdna/genome/{species}/ [default:
                        hg38_noalt.fa.genome]
  -fa GENOME_FASTA, --genome-fasta GENOME_FASTA
                        Name of the fasta file of the reference genome, put
                        this file in
                        DeepCircle/DNABERT/eccdna/genome/{species}/ [default:
                        hg38_noalt_gap.fa]
  -gap GENOME_GAP, --genome-gap GENOME_GAP
                        Name of the bed file containing gaps of the reference
                        genome, put this file in
                        DeepCircle/DNABERT/eccdna/genome/{species}/ [default:
                        hg38_noalt_gap.bed]
  -k KMER, --kmer KMER  Specify k-mer used to train DNABERT [default: 6]
  -model_path MODEL_PATH, --model_path MODEL_PATH
                        Directory to the pre-trained DNABERT, [default:
                        ./DNABERT/6-new-12w-0]
  -batch BATCH_SIZE, --batch-size BATCH_SIZE
                        Number of batch size for training
  -epo EPOCHS, --epochs EPOCHS
                        Number of epochs for training
  -c NUMBER_OF_CPUS, --number-of-cpus NUMBER_OF_CPUS
                        Number of cpus

Predict eccDNAs using a fine-tuned DNABERT model

usage: DeepCircle DNABERT-predict [-h] -bed ECCDNA_BED [-sp SPECIES]
                                  [-g GENOME] [-fa GENOME_FASTA]
                                  [-gap GENOME_GAP] [-k KMER]
                                  [-batch BATCH_SIZE] [-epo EPOCHS]
                                  [-c NUMBER_OF_CPUS]
                                  [-model {C4-2,ES2,HeLaS3,LnCap,OVCAR8,PC-3,U937,leukocytes,muscle,pool_LCN,pool_LCT}]

optional arguments:
  -h, --help            show this help message and exit
  -bed ECCDNA_BED, --eccDNA-bed ECCDNA_BED
                        File name of the bed file containing eccDNAs, put this
                        file in DeepCircle/DNABERT/eccdna/db/{species}/
  -sp SPECIES, --species SPECIES
                        Species from which input data derived, [default:
                        human]
  -g GENOME, --genome GENOME
                        Name of the file containing lengths of each chromosome
                        in hg38 genome, put this file in
                        DeepCircle/DNABERT/eccdna/genome/{species}/ [default:
                        hg38_noalt.fa.genome]
  -fa GENOME_FASTA, --genome-fasta GENOME_FASTA
                        Name of the fasta file of the reference genome, put
                        this file in
                        DeepCircle/DNABERT/eccdna/genome/{species}/ [default:
                        hg38_noalt_gap.fa]
  -gap GENOME_GAP, --genome-gap GENOME_GAP
                        Name of the bed file containing gaps of the reference
                        genome, put this file in
                        DeepCircle/DNABERT/eccdna/genome/{species}/ [default:
                        hg38_noalt_gap.bed]
  -k KMER, --kmer KMER  Specify k-mer used to train DNABERT [default: 6]
  -batch BATCH_SIZE, --batch-size BATCH_SIZE
                        Number of batch size for training
  -epo EPOCHS, --epochs EPOCHS
                        Number of epochs for training
  -c NUMBER_OF_CPUS, --number-of-cpus NUMBER_OF_CPUS
                        Number of cpus
  -model {C4-2,ES2,HeLaS3,LnCap,OVCAR8,PC-3,U937,leukocytes,muscle,pool_LCN,pool_LCT}, --chosen-model {C4-2,ES2,HeLaS3,LnCap,OVCAR8,PC-3,U937,leukocytes,muscle,pool_LCN,pool_LCT}
                        Choose pre-trained models for predicting eccDNAs
                        [default: DeepCircle/DNABERT/pre_trained_models]

Infer eccDNA-related motifs using prediction results

usage: DeepCircle infer-motif [-h] [-method {CNN,DNABERT}] -data
                              {C4-2,ES2,HeLaS3,LnCap,OVCAR8,PC-3,U937,leukocytes,muscle,pool_LCN,pool_LCT}
                              [-n NUM_MOTIF]
                              [-model {C4-2,ES2,HeLaS3,LnCap,OVCAR8,PC-3,U937,leukocytes,muscle,pool_LCN,pool_LCT}]

optional arguments:
  -h, --help            show this help message and exit
  -method {CNN,DNABERT}
                        Choose type of model to that was used to predict
                        eccDNA [default: CNN]
  -data {C4-2,ES2,HeLaS3,LnCap,OVCAR8,PC-3,U937,leukocytes,muscle,pool_LCN,pool_LCT}, --data-type {C4-2,ES2,HeLaS3,LnCap,OVCAR8,PC-3,U937,leukocytes,muscle,pool_LCN,pool_LCT}
                        Choose type of data that was used to predict eccDNA
  -n NUM_MOTIF, --num-motif NUM_MOTIF
                        Type number of motifs to infer [default: 10]
  -model {C4-2,ES2,HeLaS3,LnCap,OVCAR8,PC-3,U937,leukocytes,muscle,pool_LCN,pool_LCT}, --chosen-model {C4-2,ES2,HeLaS3,LnCap,OVCAR8,PC-3,U937,leukocytes,muscle,pool_LCN,pool_LCT}
                        Choose pre-trained models for predicting eccDNAs

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
DNABERT @ 59599ff		DNABERT @ 59599ff
bin		bin
pre_trained_models		pre_trained_models
.gitignore		.gitignore
.gitmodules		.gitmodules
DeepCircle_logo.gif		DeepCircle_logo.gif
README.md		README.md
download_genome_and_models.sh		download_genome_and_models.sh
environment.yml		environment.yml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Predict eccDNAs based on CNN and DNABERT

Install

Download and unzip reference genome, pre-trained and fine-tuned DNABERT

If you want to infer eccDNA-related motifs, please download and install the MEME suite (see https://meme-suite.org/meme/doc/install.html?man_type=web)

Quick start

Pre-processing

CNN

DNABERT

Advanced usage

Train a CNN model

Predict eccDNAs using a pre-trained CNN model

Fine-tune a DNABERT model

Predict eccDNAs using a fine-tuned DNABERT model

Infer eccDNA-related motifs using prediction results

About

Releases

Packages

Contributors 2

Languages

bio-it-station/DeepCircle

Folders and files

Latest commit

History

Repository files navigation

Predict eccDNAs based on CNN and DNABERT

Install

Download and unzip reference genome, pre-trained and fine-tuned DNABERT

If you want to infer eccDNA-related motifs, please download and install the MEME suite (see https://meme-suite.org/meme/doc/install.html?man_type=web)

Quick start

Pre-processing

CNN

DNABERT

Advanced usage

Train a CNN model

Predict eccDNAs using a pre-trained CNN model

Fine-tune a DNABERT model

Predict eccDNAs using a fine-tuned DNABERT model

Infer eccDNA-related motifs using prediction results

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages