This repo lists resources for performing Deep Learning in Python for Life Sciences. Since the end of 2021 I have observed an ever-growing volume of academic work and Open Source initiatives related to topics such as biochemistry, genetics, molecular biology and bioinformatics. With a study background in Biomedical Engineering and Deep Learning, past experience as a Software Engineer, and a current job applying ML/DL to real-life use cases in the pharma industry, I found that these new Open Source efforts caught my interest. That is why I decided to start this repository: to give researchers, developers and practitioners a single place to keep track of the latest developments in this space, with a particular focus on biotech and pharma.
Contributions, suggestions and stars are welcome!
Image generated through DALL-E mini by prompting "A fancy protein folding".
*** 30/11/2022 Update: Coming to this repo in 2023: I will start adding feedback on my attempts to reproduce the code and results of the projects added to this list. Because of managing multiple work projects, attending conferences and personal matters, I have not had a chance so far to share my findings. I will start doing so in a couple of months. ***
*** 28/02/2023 Update: In a month you may start seeing green, yellow or red icons appearing next to project names in the lists below. They refer to the outcome of my attempts to reproduce the associated code. More explanation will be provided later. Thanks for your patience. ***
Molecules
Proteins
Cheminformatics
Drug Discovery
Datasets
Explainable AI
Other
- pysmiles - A lightweight Python library for reading and writing SMILES strings.
- SmilesDrawer - A Colab notebook to draw from SMILES strings.
- PySMILESUtils - Utilities for working with SMILES based encodings of molecules for Deep Learning (PyTorch oriented).
- SELFIES - Robust representation of semantically constrained graphs, in particular for molecules in chemistry.
- ChemProp - Message Passing Neural Networks for molecule property prediction.
- Evidential Deep Learning for Guided Molecular Property Prediction and Discovery - Fast and scalable uncertainty quantification for neural molecular property prediction, accelerated optimization, and guided virtual screening. [Paper]
- mols2grid - Interactive molecule viewer for 2D structures.
- Image to SMILES Generator - Code to generate datasets of "image - sequence" pairs for chemical molecules. [Article]
- Auto3D - Automatic generation of low-energy 3D structures with ANI Neural Network potentials.
- MolDQN - Optimization of molecules via Deep Reinforcement Learning. [Paper]
- Pasithea - Deep Molecular Dreaming: Inverse Machine Learning for de-novo molecular design and interpretability with surjective representations. [Paper]
- fragment-based-dgm - A Deep Generative Model for fragment-based molecule generation. [Paper]
- MAT - Molecule Attention Transformer for molecular prediction tasks. [Paper]
- Specklit - A Streamlit Component for creating Speck molecular structures within a Streamlit Web app.
- molcloud - A package to draw molecules packed together on a large canvas.
- Img2Mol - Inferring molecules from pictures.
- GLAMOUR - Chemistry-informed Macromolecule Graph Representation for Similarity Computation, Unsupervised and Supervised Learning. [Paper]
- MOSES - Molecular Sets: a benchmarking platform for Molecular Generation Models. [Paper]
- Tartarus - A benchmarking platform for realistic and practical inverse molecular design. [Paper]
- Transformer-M - One Transformer that can understand both 2D & 3D molecular data. [Paper]
- GraphINVENT - A platform for graph-based molecular generation using graph neural networks.
- SynNet - An amortized approach to synthetic tree generation using neural networks. This model can serve as both a synthesis planning tool and as a tool for synthesizable molecular design. [Paper]
- SPIB - SPIB (State Predictive Information Bottleneck) is a Deep Learning-based framework that learns the reaction coordinates from high dimensional molecular simulation trajectories. [Paper]
- MolT5 - A self-supervised learning framework for pretraining models on a vast amount of unlabeled natural language text and molecule strings. [Paper]
- DIONYSUS - An extensive study of the calibration and generalizability of probabilistic Machine Learning models on small chemical datasets. [Paper]
- NVIDIA-PCQM4Mv2 - Heterogeneous ensemble of models for Molecular Property Prediction. [Paper]
- JAEGER - JT-VAE Generative Modeling (JAEGER) is a deep generative approach for small-molecule design. It is based on the Junction-Tree Variational Auto-Encoder (JT-VAE) method. [JT-VAE paper]
- Chem Faiss - Vector similarity search functionality from Faiss, in conjunction with chemical fingerprinting to build a scalable similarity search architecture for compounds/molecules.
- DECIMER - Deep lEarning for Chemical ImagE Recognition (DECIMER): it translates a bitmap image of a molecule into a SMILES. [Paper]
- DECIMER Image Transformer - The DECIMER (Deep lEarning for Chemical ImagE Recognition) 2.1 project.
- STOUT - Transformer based SMILES to IUPAC Translator.
- MoLFormer - A large-scale chemical language model trained on small molecules represented as SMILES strings. [Paper]
- Mol-CycleGAN - A generative model for molecular optimization. [Paper]
- CLAMP - CLAMP (Contrastive Language-Assay Molecule Pre-Training) uses natural language to predict the most relevant molecule, given a textual description of a bioassay, without training samples. [Paper]
- molplotly - An add-on to Plotly built on RDKit which allows 2D images of molecules to be shown in Plotly figures when hovering over the data points.
- MolForge - Neural-machine-translation based models that translate a set of various structural fingerprints to conventional text-based molecular representations, such as SMILES and SELFIES. [Paper]
- EDM - Equivariant Diffusion for Molecule Generation in 3D. [Paper]
- SELFormer - Molecular Representation Learning via SELFIES Language Models. [Paper]
- Regression Transformer - Concurrent sequence regression and generation for molecular language modelling. [Paper]
- Bio-Diffusion - A PyTorch hub of denoising diffusion probabilistic models designed to generate novel biological data. [Paper]
- InstructMol - Multi-Modal integration for building a versatile and reliable molecular assistant in Drug Discovery. [Paper]
- AlphaFold Protein Structure Database - An online database that provides open access to 992,316 protein structure predictions for the human proteome and other key proteins of interest, to accelerate scientific research.
- AlphaFold - Open source code for DeepMind's AlphaFold.
- OpenFold - Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2.
- AlphaFold - single sequence input - A Colab notebook to predict the protein structure from a single sequence (for educational purposes only).
- ColabFold - Making Protein folding accessible to all via Google Colab. [Article]
- LocalColabFold - Running ColabFold on your local PC.
- Meaningful Protein Representation - Learning meaningful representations of protein sequences using a VAE. [Article] [Paper]
- TAPE - Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. [Paper]
- FastFold - Optimizing Protein Structure Prediction Model Training and Inference on GPU Clusters.
- ESM - Pretrained language models that enable zero-shot prediction of the effects of mutations on protein function. [Paper]
- Protein Sequence Embedding - Learning protein sequence embeddings using information from structure. [Paper]
- IdpGAN - A GAN to generate different 3D conformations for intrinsically disordered proteins given their sequences. [Article]
- PocketMiner - A tool for predicting the locations of cryptic pockets from single protein structures. [Paper]
- progen2 - A suite of open-sourced projects and models for protein engineering and design. [Paper]
- TransformerCPI - Improving compound–protein interaction prediction by sequence-based Deep Learning with self-attention mechanism and label reversal experiments. [Paper]
- Graphein - A Python package which provides functionality for producing a number of types of graph-based representations of proteins, compatible with standard geometric Deep Learning library formats, as well as graph objects designed for ease of use with popular Deep Learning libraries.
- EvoBind - In silico directed evolution of peptide binders with AlphaFold2. [Paper]
- alphafold_finetune - Python code for fine-tuning AlphaFold to perform protein-peptide binding predictions.
- ProtGPT2 - A deep unsupervised language model for protein design. [Article]
- Bio Embeddings - General purpose Python embedders based on open models trained on biological sequence representations. [Paper]
- Uni-Fold - An Open Source platform for developing protein models beyond AlphaFold. [Paper]
- AF2Rank - State-of-the-art estimation of protein model accuracy using AlphaFold. [Paper]
- ProteinMPNN - Robust Deep Learning based protein sequence design. [Paper]
- LigandMPNN - Atomic context-conditioned protein sequence design. [Paper]
- DPAM - A Domain Parser for AlphaFold Models. [Paper]
- ModelAngelo - Automatic atomic model building program for Electron cryo-microscopy (cryo-EM) maps. [Paper]
- DiffDock - Diffusion steps, twists, and turns for Molecular Docking. [Paper]
- MoLPC - Predicting the structure of large protein complexes using AlphaFold and Monte Carlo tree search. [Article]
- foldingdiff - A diffusion model for generating novel protein backbone structures. [Paper]
- ProGen - Suite of open-sourced projects and models for protein engineering and design. [Paper]
- DeepAb - Antibody structure prediction using interpretable Deep Learning. [Paper]
- cdna-display-proteolysis-pipeline - Mega-scale experimental analysis of protein folding stability in biology and protein design. [Paper]
- PDBench - A dataset and software package for evaluating fixed-backbone sequence design algorithms. [Paper]
- vcMSA - A Python library to run vector clustering Multiple Sequence Alignment. [Paper]
- Ankh - An optimized Protein Language Model. [Paper]
- TorchProtein - A Machine Learning library for protein science, built on top of TorchDrug.
- GearNet - Geometric pretraining methods for Protein Structure Representation Learning. [Paper]
- RFdiffusion - Broadly applicable and accurate protein design by integrating structure prediction networks and diffusion generative models. [Paper]
- TargetGAN - A deep generative model for drug design from protein target sequence. [Paper]
- Iterative_masking - An iterative method that directly employs the masked language modeling objective to generate sequences using a MSA Transformer. [Paper]
- protpardelle - An all-atom protein generative model. [Paper]
- ProteinGLUE - A multi-task benchmark suite for self-supervised protein modeling. [Paper]
- MassiveFold - A tool that massively expands the sampling of structure predictions by improving the computation of AlphaFold-based predictions. [Paper]
- DRFP - An NLP-inspired chemical reaction fingerprint based on basic set arithmetic. [Article]
- DeepChem - A high quality Open Source toolchain that democratizes the use of Deep Learning in drug discovery, materials science, quantum chemistry, and biology.
- CompAugCycleGAN - Augmented CycleGAN used for generating chemical compositions. [Article]
- Chemformer - A pre-trained transformer for computational chemistry.
- RDKit - Open Source toolkit for cheminformatics and Machine Learning.
- Streamlit-app - A Streamlit web app for cheminformatics which also includes an RDKit cheatsheet.
- datamol - A lightweight Python library to work with molecules, built on top of RDKit.
- rxn_yields - Prediction of chemical reaction yields using Deep Learning and data augmentation strategies. [Article]
- gptchem - Using GPT-3 to solve Chemistry problems. [Paper]
- protein_scoring - Computational Scoring and experimental evaluation of enzymes generated by Neural Networks. [Paper]
- Jazzy - A Python library for calculating a set of atomic/molecular descriptors, including the Gibbs free energy of hydration (kJ/mol), its polar/apolar components, and the hydrogen-bond strength of donor and acceptor atoms, using either SMILES or MOL/SDF inputs.
- CRISPRi - Improved prediction of bacterial CRISPRi guide efficiency from depletion screens through mixed-effect Machine Learning and data integration. [Paper]
- TorchDrug - A powerful and flexible PyTorch-based Deep Learning platform for drug discovery.
- COVID-19 Multi-Targeted Drug Repurposing Using Few-Shot - PyTorch implementation of MolGNN Few-shot. [Article] [Paper]
- PaddleHelix - A Bio-Computing Platform featuring Large-Scale Representation Learning and Multi-Task Deep Learning.
- liGAN - Deep generative models of 3D grids for structure-based drug discovery. [Article] [Paper]
- LIMO - Latent Inceptionism for targeted MOlecule Generation: a generative model for drug discovery. [Paper]
- DelFTa - Δ-Quantum Machine Learning for medicinal chemistry. [Paper]
- Fréchet ChemNet Distance - A quality measure for generative models for molecules. [Paper]
- DrugOOD - A systematic OOD (Out-Of-Distribution) dataset curator and benchmark for AI-aided drug discovery. [Paper]
- PIGNet - A physics-informed Deep Learning model toward generalized drug-target interaction predictions. [Paper]
- REINVENT - An AI tool for de novo drug design. [Paper]
- MolScore - An automated scoring function to facilitate and standardize evaluation of goal-directed generative models for de novo molecular design. [Article]
- SMILES-RNN - A SMILES-based recurrent neural network used for de novo molecule generation with several reinforcement learning algorithms available for molecule optimization. [Article]
- DiffLinker - Equivariant 3D-Conditional Diffusion Model for molecular linker design. [Paper]
- SQUID - Equivariant shape-conditioned generation of 3D molecules for Ligand-Based Drug Design. [Paper]
- DiffSBDD - A Euclidean diffusion model for structure-based drug design. [Paper]
- MF-PCBA - Multi-fidelity high-throughput screening benchmarks for drug discovery and Machine Learning. [Paper]
- Deep Surrogate Docking - Accelerating automated Drug Discovery with Graph Neural Networks. [Paper]
- HGAN-DTI - Heterogeneous Graph Attention Network for Drug-Target Interaction Prediction. [Paper]
- MolSkill - Learning chemical intuition from humans in the loop. [Paper]
- AI-Bind - Interpretable AI pipeline improving binding predictions for novel protein targets and ligands. [Paper]
- DECIMER - Hand-drawn molecule images dataset - A standardised, openly available benchmark dataset of 5088 hand-drawn depictions of diversely picked chemical structures. [Article]
- UniProt - The world’s leading high-quality, comprehensive and freely accessible resource of protein sequence and functional information.
- UniLanguage - Homology-reduced UniProt, with train/validation/test sets for language modeling.
- ChEMBL - A manually curated database of bioactive molecules with drug-like properties. It brings together chemical, bioactivity and genomic data to aid the translation of genomic information into effective new drugs.
- Molecule OCR Real images Dataset - Test dataset from paper "Image2SMILES: Transformer-based Molecular Optical Recognition Engine". It contains 296 structures: images and Functional Groups SMILES (FG-SMILES). [Paper]
- FS-Mol - A Few-Shot Learning Dataset of Molecules, containing molecular compounds with measurements of activity against a variety of protein targets.
- ProteinNet - A standardized data set for Machine Learning of protein structure.
- SidechainNet - An all-atom protein structure dataset for Deep Learning. It is an extension of the ProteinNet dataset. [Paper]
- DIPS - Database of Interacting Protein Structures. [Paper]
- Aggregated Views of Proteins - Protein data bank in Europe knowledge base.
- ProtCAD - Protein Common Assembly Database. A comprehensive structural resource of protein complexes. [Paper]
- gget - A free, Open Source command-line tool and Python package that enables efficient querying of genomic databases.
- ESM Atlas - An open atlas of 617 million metagenomic protein structures.
- Progres - A Python package to quickly search structures against pre-embedded structural databases and to pre-embed datasets. [Paper]
- ZINC - A free public resource for ligand discovery. The database contains over twenty million commercially available molecules in biologically relevant representations that may be downloaded in popular ready-to-dock formats and subsets. [Paper]
- Papyrus - A large-scale curated dataset aimed at bioactivity predictions. [Paper]
- MISATO - Machine Learning dataset of protein-ligand complexes for structure-based drug discovery. [Paper]
- Interpretable and Explainable Machine Learning for Materials Science and Chemistry - Interpretable and Explainable Machine Learning applied to materials science and chemistry.
- exmol - Explainer for black box models that predict molecule properties. [Article]
- BERTology - Interpreting Attention in Protein Language Models. [Paper]
- DRPreter - Interpretable anticancer drug response prediction using Knowledge-Guided Graph Neural Networks and Transformer. [Paper]
- nglview - A Jupyter widget to interactively view molecular structures and trajectories.
- Panel-Chemistry - Perform exploratory data analysis and build powerful data and visualization tools within the domain of Chemistry using Python and HoloViz Panel.
- libmolgrid - A comprehensive library for fast, GPU accelerated molecular gridding for Deep Learning workflows. [Paper]
- stmol - A component for building interactive molecular 3D visualizations within Streamlit web applications.
- MolecularNodes - An add-on and set of pre-made nodes for Blender & Blender’s Geometry Nodes, to import, animate and manipulate molecular data.
- Jupyter Dock - Molecular Docking integrated in Jupyter notebooks.
- AugLiChem - A data augmentation library for chemical structures. [Paper]
- Chemiscope - An interactive structure/property explorer for materials and molecules. While the core project is implemented in a different programming language, it has been added to this list because it provides Python extensions that allow using it within a Jupyter or Colab notebook.
- ReMODE - A Deep Learning-based web server for target-specific drug design. [Paper]