prtm

Protein Models (prtm) is an inference-only library for deep learning protein models.

Background

This library started out as a learning project to catch up on the deep learning models being used in protein science. After cloning a few repos, it became clear that a nascent ecosystem was forming and that there was a need for a common interface to accelerate the creation of new workflows. The goal of prtm is to provide a (hopefully) enjoyable and interactive API for running, comparing, and chaining together protein DL models. Currently covered use cases include:

  • Folding
  • Inverse folding
  • Structure design
  • Sequence language modeling
  • Ligand docking

With many more to come!

Motivating Example

A very common workflow is to design a protein structure, apply inverse folding to generate plausible sequences, and then fold those sequences to see if they match the designed structure.

In prtm, we accomplish this with a few lines of code:

from prtm import models
from prtm import visual

# Define models for structure design, inverse folding and folding
designer = models.RFDiffusionForStructureDesign(model_name="auto")
inverse_folder = models.ProteinMPNNForInverseFolding(model_name="ca_only_model-20")
folder = models.OmegaFoldForFolding()

# Tell RFDiffusion to create a structure with exactly 128 residues
designed_structure, _ = designer(
    models.rfdiffusion_config.UnconditionalSamplerConfig(
        contigmap_params=models.rfdiffusion_config.ContigMap(contigs=["128-128"]),
    )
)

# Design a sequence and fold it!
designed_sequence, _ = inverse_folder(designed_structure)
predicted_designed_structure, _ = folder(designed_sequence)

# Visualize the designed structure and the predicted structure overlaid in a notebook
visual.view_superimposed_structures(designed_structure, predicted_designed_structure)

# Convert to PDB
pdb_str = predicted_designed_structure.to_pdb()

# Try docking a ligand (methotrexate) to the designed structure
ligand = "CN(CC1=CN=C2C(=N1)C(=NC(=N2)N)N)C3=CC=C(C=C3)C(=O)NC(CCC(=O)O)C(=O)O"
docker = models.DiffDockForLigandDocking()
poses, aux_output = docker(predicted_designed_structure, ligand)

# Visualize the predicted ligand poses
visual.view_structure_with_ligand(predicted_designed_structure, poses)
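
One rough way to check numerically whether the folded sequence matches the design is to compare C-alpha distance matrices, which needs no superposition. The sketch below is not part of prtm: `ca_coords_from_pdb` and `distance_rmsd` are hypothetical helpers, written in plain Python, that assume standard fixed-column PDB ATOM records and two structures with the same number of residues.

```python
import math


def ca_coords_from_pdb(pdb_str):
    """Return (x, y, z) tuples for every C-alpha atom in a PDB-format string."""
    coords = []
    for line in pdb_str.splitlines():
        # ATOM records are fixed-width: atom name in columns 13-16,
        # x, y, z in columns 31-38, 39-46, and 47-54.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    return coords


def distance_rmsd(coords_a, coords_b):
    """RMS difference between the pairwise distance matrices of two structures."""
    if len(coords_a) != len(coords_b):
        raise ValueError("structures must have the same number of residues")
    n = len(coords_a)
    if n < 2:
        raise ValueError("need at least two residues")
    total = 0.0
    pairs = 0
    for i in range(n):
        for j in range(i + 1, n):
            d_a = math.dist(coords_a[i], coords_a[j])
            d_b = math.dist(coords_b[i], coords_b[j])
            total += (d_a - d_b) ** 2
            pairs += 1
    return math.sqrt(total / pairs)
```

Assuming both structure objects expose `to_pdb()` as shown above for the predicted structure, the two PDB strings could be passed through `ca_coords_from_pdb` and the results compared with `distance_rmsd`. Values near zero mean the C-alpha traces agree; note that this metric cannot tell a structure from its mirror image.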

Installation

At this early stage, prtm has only been tested on a Linux system with a CUDA-enabled GPU. There are no guarantees that it will work on other systems.

These steps assume that conda or mamba (preferred) is already installed. Clone this repo and create a prtm environment:

git clone https://github.com/conradry/prtm.git
cd prtm
mamba env create -f environment.yaml
mamba activate prtm
pip install -e .

To make prtm more accessible, custom CUDA kernels were removed from all models that previously used them, so that's it for most cases!

PyRosetta is an optional (soft) dependency of prtm and is only required for the protein_seq_des model. A license is required to use PyRosetta and can be obtained for free for academic use. For installation instructions, see here.
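
To confirm what ended up in the environment after installation, a quick check like the following can help. This is a minimal sketch using only the standard library; `has_module` is a hypothetical helper, and the module names `prtm` and `pyrosetta` are assumed to be the import names of the two packages.

```python
import importlib.util


def has_module(name):
    """Return True if the top-level module `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # "pyrosetta" is optional: it is only needed for the protein_seq_des model.
    for module in ("prtm", "pyrosetta"):
        status = "found" if has_module(module) else "missing"
        print(f"{module}: {status}")
```

Because `find_spec` only locates the module without importing it, the check is fast and does not trigger any GPU initialization.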

What's implemented

Note: Most, but not all, models allow commercial use. Please check the license of each model.

AlphaFold is written in JAX, but all other models are written in PyTorch, so we chose not to integrate the AlphaFold inference code directly into this repo. Both OpenFold and Uni-Fold allow for the conversion of the AlphaFold JAX weights into PyTorch. We adopted the Uni-Fold implementation because it is designed to work with MMseqs2 and supports multimers. Eventually, we may decide to subsume the OpenFold models under Uni-Fold.

| Model Name | Function | Notebook | Source Code | License |
| --- | --- | --- | --- | --- |
| AlphaFold/Uni-Fold | Folding | Notebook | https://github.com/dptech-corp/Uni-Fold | Apache 2.0 |
| AlphaFold/UniFold-Multimer | Folding | Notebook | https://github.com/dptech-corp/Uni-Fold | Apache 2.0 |
| OpenFold | Folding | Notebook | https://github.com/aqlaboratory/openfold | Apache 2.0 |
| ESMFold | Folding | Notebook | https://github.com/facebookresearch/esm | MIT License |
| RoseTTAFold | Folding | Notebook | https://github.com/RosettaCommons/RoseTTAFold | MIT License |
| OmegaFold | Folding | Notebook | https://github.com/HeliXonProtein/OmegaFold | Apache 2.0 |
| DMPfold2 | Folding | Notebook | https://github.com/psipred/DMPfold2 | GPL v3.0 |
| Uni-Fold Symmetry | Folding | Notebook | https://github.com/dptech-corp/Uni-Fold | GPL v3.0 |
| IgFold | Antibody Folding | Notebook | https://github.com/Graylab/IgFold | JHU License |
| ESM-IF | Inverse Folding | Notebook | https://github.com/facebookresearch/esm | MIT License |
| ProteinMPNN | Inverse Folding | Notebook | https://github.com/dauparas/ProteinMPNN | MIT License |
| PiFold | Inverse Folding | Notebook | https://github.com/A4Bio/PiFold | MIT License |
| ProteinSeqDes | Inverse Folding | Notebook | https://github.com/nanand2/protein_seq_des | BSD-3 |
| ProteinSolver | Inverse Folding | Notebook | https://github.com/ostrokach/proteinsolver | MIT License |
| RFDiffusion | Design | Notebook | https://github.com/RosettaCommons/RFdiffusion | BSD |
| ProteinGenerator | Design | Notebook | https://github.com/RosettaCommons/protein_generator | MIT License |
| Genie | Design | Notebook | https://github.com/aqlaboratory/genie | Apache 2.0 |
| FoldingDiff | Design | Notebook | https://github.com/microsoft/foldingdiff | MIT License |
| SE3-Diffusion | Design | Notebook | https://github.com/jasonkyuyim/se3_diffusion | MIT License |
| EigenFold | Fold sampling | Notebook | https://github.com/bjing2016/EigenFold | MIT License |
| AntiBERTy | Antibody language modeling | Notebook | https://github.com/jeffreyruffolo/AntiBERTy | MIT License |
| DiffDock | Ligand docking | Notebook | https://github.com/gcorso/DiffDock | MIT License |

Links to papers can be found in the GitHub repos for each model.

Documentation

A real docs page is a work in progress, but to get started the provided notebooks should be enough. In addition to minimal usage notebooks for each implemented model, there are also more general notebooks that cover common use cases and some features of the prtm API. A good order to try is:

For more complex design algorithms like RFDiffusion and ProteinGenerator, there are detailed example notebooks to look at:

Roadmap and Contributing

The currently implemented models only scratch the surface of what's available. There's a sketchy model tracking Google sheet for papers and code repos that are being considered for implementation. If you'd like to contribute or suggest priorities, please open an issue or PR and we can discuss!

There's, of course, also a lot of technical debt to pay off that accumulated from duct-taping together code from many different sources. Docstrings, API improvements, bug fixes, and better tests are very welcome!

Acknowledgments

This project is an achievement of copy-paste engineering 😉. It would not have been possible without the hard work of the authors of the models that are implemented here. Please cite their work if you use their model!