- Fixed missing docker images
- Added a contributing guide
- Added the `esm1v` embedder from Meier et al. 2021, which is part of Facebook's ESM. Note that this is an ensemble model, so you need to pass `ensemble_id` with a value from 1 to 5 to select which weights to use.
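  A minimal sketch of picking one set of ensemble weights through the Python API; the class name `ESM1vEmbedder` is an assumption about how the `esm1v` protocol is exposed in `bio_embeddings.embed`:

  ```python
  from bio_embeddings.embed import ESM1vEmbedder

  # ESM-1v is an ensemble of five models; ensemble_id (1-5) selects the weights.
  embedder = ESM1vEmbedder(ensemble_id=1)

  # Per-residue embeddings for one sequence.
  embedding = embedder.embed("SEQWENCE")
  # Mean-pool into a single per-protein vector.
  per_protein = embedder.reduce_per_protein(embedding)
  ```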
- Added the `bindEmbed21DL` extract protocol, an ensemble of 5 convolutional neural networks that predicts 3 different types of binding residues (metal, nucleic acids, small molecules).
- Fixed model download
- Updated jaxlib to fix pip installation
- BETA: in-silico mutagenesis using ProtTransBertBFD. This computes the likelihood that, according to Bert, a residue in a protein can be a certain amino acid, which can be used as an estimate for the effect of a mutation. This adds a new `mutagenesis` stage and a new `plot_mutagenesis` protocol in the visualize stage; the first computes the probabilities and writes them to a CSV file, while the latter visualizes the results as an interactive plotly figure.
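  As a concept sketch only (not the pipeline's implementation), the per-residue probabilities can be obtained from the underlying masked language model with Hugging Face transformers; `Rostlab/prot_bert_bfd` is assumed to be the public checkpoint behind ProtTransBertBFD:

  ```python
  import torch
  from transformers import BertForMaskedLM, BertTokenizer

  tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert_bfd", do_lower_case=False)
  model = BertForMaskedLM.from_pretrained("Rostlab/prot_bert_bfd").eval()

  sequence = list("SEQWENCE")
  position = 3  # residue to mutate in silico (0-based)
  sequence[position] = tokenizer.mask_token

  # ProtTrans tokenizers expect space-separated residues.
  inputs = tokenizer(" ".join(sequence), return_tensors="pt")
  with torch.no_grad():
      logits = model(**inputs).logits

  # Offset by 1 to skip the [CLS] token prepended by the tokenizer.
  probabilities = logits[0, position + 1].softmax(dim=-1)
  for amino_acid in "ACDEFGHIKLMNPQRSTVWY":
      print(amino_acid, float(probabilities[tokenizer.convert_tokens_to_ids(amino_acid)]))
  ```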
- Added `half_precision_model` support for `prottrans_bert_bfd` and `prottrans_albert_bfd`
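  A sketch of using it from Python, assuming the constructor accepts the same keyword as the pipeline parameter:

  ```python
  from bio_embeddings.embed import ProtTransBertBFDEmbedder

  # Load the Bert-BFD weights in half precision to reduce GPU memory use.
  embedder = ProtTransBertBFDEmbedder(half_precision_model=True)
  embedding = embedder.embed("SEQWENCE")
  ```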
- Fixed `n_components: 2` in the plotly protocol
- Added the `prottrans_t5_xl_u50`/`ProtTransT5XLU50Embedder` embedder from the latest ProtTrans revision. You should use this over `prottrans_t5_bfd` and `prottrans_t5_uniref50`.
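  A usage sketch with the class named in this entry:

  ```python
  from bio_embeddings.embed import ProtTransT5XLU50Embedder

  embedder = ProtTransT5XLU50Embedder()

  # Per-residue embeddings: one 1024-dimensional vector per amino acid.
  embedding = embedder.embed("SEQWENCE")
  # Mean-pool into a fixed-size per-protein vector.
  per_protein = embedder.reduce_per_protein(embedding)
  ```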
- The `projected_embeddings_file.csv` of project stages has been renamed to `projected_reduced_embeddings_file.h5`. For backwards compatibility, `projected_embeddings_file.csv` is still written.
- The `projected_embeddings_file` parameter of visualize stages has been renamed to `projected_reduced_embeddings_file` and takes an h5 file. For backwards compatibility, `projected_embeddings_file` and csv files are still accepted.
- Added the pb_tucker model as a project stage. Tucker is a contrastive learning model trained to distinguish CATH superfamilies. It consumes `prottrans_bert_bfd` embeddings and reduces the embedding dimensionality from 1024 to 128. See https://www.biorxiv.org/content/10.1101/2021.01.21.427551v1
- Renamed `half_model` to `half_precision_model`
- Added `prottrans_t5_uniref50`/`ProtTransT5UniRef50Embedder`. This version improves over T5 BFD by being finetuned on UniRef50.
- Added a `half_model` option to both T5 models (`prottrans_t5_uniref50` and `prottrans_t5_bfd`). On the tested GPU (Quadro RTX 3000), `half_model: True` reduces memory consumption from 12GB to 7GB, while the effect in benchmarks is negligible (±0.1 percentage points across different sets, generally below the standard error). We therefore recommend switching to `half_model: True` for T5.
- Added DeepBLAST from "Protein Structural Alignments From Sequence" (see example/deepblast for an example)
- Dropped Python 3.6 support and added Python 3.9 support
- Updated the docker example to cache weights
- Updated to pytorch 1.7
- Published the ghcr.io/bioembeddings/bio_embeddings docker image
- Integrated Evolutionary Scale Modeling (ESM) from "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences" (Rives et al., 2019)
- Included an example to transfer GO annotations (à la goPredSim). We also make the reference annotations and embeddings available!
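  The transfer itself boils down to a nearest-neighbor lookup in embedding space; a minimal sketch with hypothetical file names and annotations (see the linked example for the real data):

  ```python
  import numpy as np

  # Hypothetical reference set: per-protein embeddings with known GO terms.
  reference_embeddings = np.load("reference_embeddings.npy")  # (n_proteins, dim)
  reference_annotations = ["GO:0005524", "GO:0003677", "GO:0016020"]  # one per protein

  def transfer_annotation(query_embedding: np.ndarray) -> str:
      """Transfer the GO term of the closest reference protein, a la goPredSim."""
      distances = np.linalg.norm(reference_embeddings - query_embedding, axis=1)
      return reference_annotations[int(np.argmin(distances))]
  ```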
- New language models: ESM, PLUS, CPCProt, bepler and T5 from ProtTrans
- The documentation got a new home: https://docs.bioembeddings.com. This includes documentation for the Python API.
- Additional pipeline and notebook examples
- Added an `original_id` attribute to the embeddings in the h5 files, which contains the sequence header from the fasta file (see the h5py sketch below)
- Changed SeqVec to run a warmup by default so that the first embeddings don't have a random error
- Added an `fp16` option to save embeddings with half precision
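  For instance, reading such a file with h5py (a sketch; the file name is a placeholder):

  ```python
  import h5py

  # Embedding files map an internal sequence id to one dataset per protein.
  with h5py.File("embeddings_file.h5", "r") as handle:
      for sequence_id, dataset in handle.items():
          # original_id holds the header from the input fasta file.
          # With fp16 enabled, the dtype is float16.
          print(dataset.attrs["original_id"], dataset.shape, dataset.dtype)
  ```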