DeepES is a deep learning-based framework for enzyme screening. DeepES can identify candidate proteins for orphan enzymes by focusing on biosynthetic gene clusters and KEGG RClass. DeepES takes protein sequences as input and evaluates whether the input genes contain biosynthetic gene clusters of interest.
- python 3.9.19 (with the following packages)
- numpy 1.26.4
- pandas 2.2.1
- pytorch 1.13.0
- biopython 1.83
- fair-esm 2.0.0
Using environment.yml, you can build an Anaconda environment identical to the one used in this research.
conda env create -f environment.yml
conda activate deepes
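After activating the environment, it can be useful to confirm that the installed packages match the versions listed above. The following is a minimal sketch, not part of DeepES: the `PINNED` table and `check_versions` helper are hypothetical names, and the sketch assumes that PyTorch is registered under the distribution name `torch`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions taken from the dependency list above.
# Assumption: PyTorch is installed under the distribution name "torch".
PINNED = {
    "numpy": "1.26.4",
    "pandas": "2.2.1",
    "torch": "1.13.0",
    "biopython": "1.83",
    "fair-esm": "2.0.0",
}


def check_versions(pinned=PINNED):
    """Return {name: (wanted, found, matches)} for each pinned package."""
    report = {}
    for name, wanted in pinned.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        # startswith tolerates local build tags such as "1.13.0+cu117".
        matches = found is not None and found.startswith(wanted)
        report[name] = (wanted, found, matches)
    return report


if __name__ == "__main__":
    for name, (wanted, found, ok) in check_versions().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: wanted {wanted}, found {found} [{status}]")
```

A mismatch does not necessarily mean DeepES will fail; it only signals that the environment differs from the one described here.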
The model weights are available at Zotero. Please download the files and unzip them.
python embed.py \
--input_dir data \
--output_dir output
- --input_dir: Please specify the path of your directory where the input data is located.
- --output_dir: Please specify the path to save outputs.
- --reuse: Whether to reuse the results of previous runs. default: True
- --batch_size: Batch size for embedding protein sequences. default: 1
- --cuda: Whether to use a GPU. default: False
- --cpu_num: Number of threads. default: 1
python predict.py \
--input_dir data \
--output_dir output \
--model_dir model \
--rclass_list RC01053 RC00004,RC00014 RC01923
- --input_dir: Same as above.
- --output_dir: Same as above.
- --model_dir: Please specify the path of your directory where the pre-trained model weights are located.
- --rclass_list: Please specify the list of RClasses corresponding to the sequential enzyme reactions of interest.
- --batch_size: Batch size for calculating probabilities. default: 16
- --cuda: Same as above. default: False
- --cpu_num: Same as above. default: 1
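The example value `RC01053 RC00004,RC00014 RC01923` suggests that each space-separated token is one reaction step and that commas join several RClasses belonging to the same step. The sketch below parses the argument under that assumption; `parse_rclass_list` and `unique_rclasses` are hypothetical helpers, not functions from DeepES.

```python
def parse_rclass_list(tokens):
    """Turn ["RC01053", "RC00004,RC00014", ...] into one list of RClasses per step.

    Assumption: each token is one reaction step, and a comma separates
    several RClasses assigned to that same step.
    """
    steps = []
    for token in tokens:
        rclasses = [rc for rc in token.split(",") if rc]
        if not rclasses:
            raise ValueError(f"empty reaction step in token {token!r}")
        steps.append(rclasses)
    return steps


def unique_rclasses(steps):
    """Collect every distinct RClass across all steps, sorted by ID."""
    return sorted({rc for step in steps for rc in step})


if __name__ == "__main__":
    steps = parse_rclass_list("RC01053 RC00004,RC00014 RC01923".split())
    print(steps)
    print(unique_rclasses(steps))
```

Under this reading, the example describes three sequential steps involving four distinct RClasses.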
python evaluate.py \
--input_dir data \
--output_dir output \
--rclass_list RC01053 RC00004,RC00014 RC01923
- --input_dir: Same as above.
- --output_dir: Same as above.
- --rclass_list: Same as above.
- --window_size: Range of contiguous genes to be evaluated at a time. default: 10
- --threshold: Threshold to obtain candidate genes. default: 0.99
- --duplication: Whether to allow a single gene to be associated with multiple enzyme reactions. default: False
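To illustrate how the three evaluation options interact, here is a rough sketch of a sliding-window scan. It is not the DeepES implementation. It assumes a window score equal to the product of per-step probabilities and a brute-force search over gene assignments, neither of which is stated in this document. `scan_windows` and the `probs` layout are hypothetical.

```python
from itertools import permutations, product
from math import prod


def scan_windows(probs, window_size=10, threshold=0.99, duplication=False):
    """Scan contiguous gene windows for candidate gene sets.

    probs[s][g] is the predicted probability that gene g catalyses step s.
    Returns a list of (window_start, score, genes_per_step) tuples.
    Assumption: the window score is the product of the chosen probabilities.
    """
    n_steps = len(probs)
    n_genes = len(probs[0])
    hits = []
    last_start = max(n_genes - window_size, 0)
    for start in range(last_start + 1):
        genes = range(start, min(start + window_size, n_genes))
        if duplication:
            # One gene may be assigned to several reaction steps.
            assignments = product(genes, repeat=n_steps)
        else:
            # Each reaction step needs its own distinct gene.
            assignments = permutations(genes, n_steps)
        best = max(
            ((prod(probs[s][g] for s, g in enumerate(a)), a) for a in assignments),
            default=None,
        )
        if best is not None and best[0] >= threshold:
            hits.append((start, best[0], best[1]))
    return hits


if __name__ == "__main__":
    toy = [
        [0.0, 0.999, 0.0, 0.0, 0.0],  # step 1 favours gene 1
        [0.999, 0.0, 0.0, 0.0, 0.0],  # step 2 favours gene 0
        [0.0, 0.0, 0.999, 0.0, 0.0],  # step 3 favours gene 2
    ]
    print(scan_windows(toy, window_size=3))
```

In this sketch, a larger window allows more distant genes to be combined, a higher threshold keeps fewer windows, and enabling duplication lets one strongly scoring gene fill several steps.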
You can test DeepES by running the above three commands in sequence. After the test run, the following directory structure is created.
output
├── candidate_genes
│   └── sample_data.tsv
├── embedding_vector
│   └── sample_data.pt
├── gene_table
│   └── sample_data.tsv
├── inference
│   ├── sample_data_RC00004.npy
│   ├── sample_data_RC00014.npy
│   ├── sample_data_RC01053.npy
│   └── sample_data_RC01923.npy
└── mapping_result
    └── sample_data.pkl
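A quick way to confirm that a test run completed is to check for the files in the tree above. The following sketch derives the expected paths from a sample name and the RClasses used. The helpers `expected_outputs` and `missing_outputs` are hypothetical, and the naming pattern is inferred only from this example tree.

```python
from pathlib import Path


def expected_outputs(output_dir, sample, rclasses):
    """List the files a complete run should leave behind for one sample.

    Assumption: file names follow the pattern seen in the example tree,
    with one .npy file per distinct RClass under inference/.
    """
    root = Path(output_dir)
    paths = [
        root / "candidate_genes" / f"{sample}.tsv",
        root / "embedding_vector" / f"{sample}.pt",
        root / "gene_table" / f"{sample}.tsv",
        root / "mapping_result" / f"{sample}.pkl",
    ]
    paths += [root / "inference" / f"{sample}_{rc}.npy" for rc in sorted(rclasses)]
    return paths


def missing_outputs(output_dir, sample, rclasses):
    """Return the expected files that do not exist on disk."""
    return [p for p in expected_outputs(output_dir, sample, rclasses) if not p.exists()]


if __name__ == "__main__":
    rclasses = ["RC01053", "RC00004", "RC00014", "RC01923"]
    for path in missing_outputs("output", "sample_data", rclasses):
        print(f"missing: {path}")
```

An empty result means every file from the example tree is present.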
The result file is output/candidate_genes/sample_data.tsv:
score | window_idx | RC01053 | RC00004,RC00014 | RC01923 |
---|---|---|---|---|
0.9999939003858481 | 0 | eco:b2261 | eco:b2260 | eco:b2262 |
0.9999939003858481 | 1 | eco:b2261 | eco:b2260 | eco:b2262 |
0.9999939003858481 | 2 | eco:b2261 | eco:b2260 | eco:b2262 |
0.9999939003858481 | 3 | eco:b2261 | eco:b2260 | eco:b2262 |
0.9999939003858481 | 4 | eco:b2261 | eco:b2260 | eco:b2262 |
0.9999939003858481 | 5 | eco:b2261 | eco:b2260 | eco:b2262 |
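In the table above, the same gene combination is reported for several overlapping windows. The sketch below collapses such rows into one entry per combination. It assumes the `.tsv` file is tab-separated with the header row shown above, and `unique_candidates` is a hypothetical helper rather than part of DeepES.

```python
import csv
import io


def unique_candidates(tsv_handle):
    """Merge rows that report the same genes for different windows.

    Assumption: tab-separated input with "score" and "window_idx" columns,
    followed by one column per reaction step.
    """
    reader = csv.DictReader(tsv_handle, delimiter="\t")
    step_columns = [c for c in reader.fieldnames if c not in ("score", "window_idx")]
    merged = {}
    for row in reader:
        key = tuple(row[c] for c in step_columns)
        score = float(row["score"])
        entry = merged.setdefault(key, {"score": score, "windows": []})
        entry["score"] = max(entry["score"], score)
        entry["windows"].append(int(row["window_idx"]))
    return step_columns, merged


if __name__ == "__main__":
    # Two rows copied from the example table, rewritten as tab-separated text.
    sample = (
        "score\twindow_idx\tRC01053\tRC00004,RC00014\tRC01923\n"
        "0.9999939003858481\t0\teco:b2261\teco:b2260\teco:b2262\n"
        "0.9999939003858481\t1\teco:b2261\teco:b2260\teco:b2262\n"
    )
    columns, merged = unique_candidates(io.StringIO(sample))
    for genes, info in merged.items():
        print(dict(zip(columns, genes)), info)
```

For a real run, pass a handle opened with `open("output/candidate_genes/sample_data.tsv", newline="")` instead of the in-memory sample.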
DeepES is released under the MIT License.