Awesome papers and codes list of analytical chemistry-related deep learning methods

JosieHong/awesome-mass-spectrometry-ml

Awesome Machine Learning in Small Molecules Mass Spectrometry

Mass spectrometry, also called mass spec, is an analytical technique that is used to measure the mass-to-charge ratio of ions. The results are presented as a mass spectrum, a plot of intensity as a function of the mass-to-charge ratio.

from Wikipedia

This list is continually updated with notable machine-learning papers and code related to small-molecule mass spectrometry. Please note that awesome lists are curations of the best, not everything. Contributions are always welcome!

Contents

Databases

Molecular properties:

  • OC20 & OC22: The Open Catalyst Project focuses on using AI to find new renewable energy storage catalysts, releasing the OC20 and OC22 datasets with 1.3 million molecular relaxations from 260 million DFT calculations for research support.
  • QM9: This dataset includes the computed geometric, energetic, electronic, and thermodynamic properties of 134,000 stable small organic molecules composed of CHONF.
  • GEOM: This dataset features 37 million molecular conformations for over 450,000 molecules, generated using advanced sampling and semi-empirical density functional theory (DFT).
  • MD17 & MD22: The MD22 benchmark dataset includes molecular dynamics trajectories of seven biomolecular and supramolecular systems, with atom counts ranging from 42 to 370, sampled at 400-500 K with 1 fs resolution, and energy and forces calculated using PBE+MBD theory.
  • PCQM4Mv2: PCQM4Mv2 is a quantum chemistry dataset derived from the PubChemQC project, focusing on the ML task of predicting DFT-calculated HOMO-LUMO energy gaps of molecules using their 2D graphs, a significant task due to the expense of obtaining 3D equilibrium structures.
  • MoleculeNet: MoleculeNet is a benchmark for testing machine learning methods on molecular properties, featuring over 700,000 compounds from multiple databases. It is integrated into the DeepChem package and evaluates model performance using metrics such as AUC-ROC, AUC-PRC, RMSE, and MAE.
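As a rough illustration of the regression metrics named in the MoleculeNet entry, the sketch below computes MAE and RMSE in plain Python. The function names `mae` and `rmse` and the sample values are assumptions made up for this example; they do not come from MoleculeNet or DeepChem.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Made-up property values, for illustration only.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]

print(f"MAE:  {mae(y_true, y_pred):.4f}")
print(f"RMSE: {rmse(y_true, y_pred):.4f}")
```

Because RMSE squares each error before averaging, a single large miss (the third value here) raises it more than it raises MAE.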

MS/MS:

  • NIST23: The NIST MS/MS Library 2023 is a collection of MS/MS spectra and search software. It contains 2,374,064 MS/MS spectra from 399,267 small molecules.
  • MoNA: MoNA currently contains 2,061,612 mass spectral records from experimental and in-silico libraries, as well as from user contributions.
  • GNPS: GNPS is a web-based mass spectrometry ecosystem that aims to be an open-access knowledge base for the community-wide organization and sharing of raw, processed, or annotated fragmentation mass spectrometry data (MS/MS).
  • HMDB 5.0: The Human Metabolome Database (HMDB) Version 5.0 is an extensive and freely accessible electronic resource that contains 220,945 metabolite entries present in the human body and their experimental MS/MS spectra.

Retention time:

  • SMRT: SMRT is an experimentally acquired reverse-phase chromatography retention time dataset covering up to 80,038 small molecules.
  • RepoRT: RepoRT currently contains 373 datasets, 8,809 unique compounds, and 88,325 retention time entries measured on 49 different chromatographic columns using various eluents, flow rates, and temperatures.

Collision cross section:

  • AllCCS: This collection includes more than 5,000 experimental CCS records and approximately 12 million calculated CCS values for over 1.6 million small molecules.
  • AllCCS2: Compared to AllCCS, AllCCS2 incorporates newly available experimental CCS data, including 10,384 records from 4,326 compounds. After standardization, 7,713 unified CCS values with confidence scores were added.
  • METLIN-CCS: The METLIN-CCS database includes collision cross section (CCS) values derived from IMS data for more than 27,000 molecular standards across 79 chemical classes.
  • CCSbase: CCSbase is an integrated platform consisting of a comprehensive database of CCS measurements taken from a variety of sources and a high-quality, high-throughput CCS prediction model trained on this database using machine learning.

Papers

Survey papers

Discussions in database

Discussions in pre-train models

Small molecular representation learning

Molecular representation learning models are categorized by the information they encode: point-based (or quantum-based) methods, graph-based methods, and sequence-based methods. Because graph-based methods are so numerous, they are further divided into self-supervised and supervised learning approaches. Note that point-based (or quantum-based) methods differ from graph-based methods in whether bonds (i.e., edges) are included in the encoding.
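The three input views can be sketched with plain Python data structures, using the heavy atoms of ethanol (SMILES `CCO`) as a toy molecule. This is only an illustration of the distinction drawn above, not how any listed model encodes its input. The class names `PointCloud` and `MolGraph`, the helper `tokenize_smiles`, and the coordinates are all assumptions invented for this example.

```python
from dataclasses import dataclass


@dataclass
class PointCloud:
    """Point-based view: atoms with 3D positions, no explicit bonds."""
    elements: list[str]
    coords: list[tuple[float, float, float]]


@dataclass
class MolGraph:
    """Graph-based view: atoms as nodes, bonds as edges."""
    elements: list[str]
    bonds: list[tuple[int, int, int]]  # (atom_i, atom_j, bond_order)


def tokenize_smiles(smiles: str) -> list[str]:
    """Sequence-based view: naive character-level tokens.

    Only valid for SMILES made of single-letter atoms, as in this toy case.
    """
    return list(smiles)


# Ethanol heavy atoms; coordinates are rough illustrative values, not
# an optimized geometry.
ethanol_points = PointCloud(
    elements=["C", "C", "O"],
    coords=[(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.3, 0.0)],
)

# Same atoms, but connectivity is stored explicitly as edges.
ethanol_graph = MolGraph(
    elements=["C", "C", "O"],
    bonds=[(0, 1, 1), (1, 2, 1)],
)

ethanol_tokens = tokenize_smiles("CCO")

print(ethanol_points)
print(ethanol_graph)
print(ethanol_tokens)
```

The point cloud and the graph share the same atom list. Only the graph carries the `bonds` edge list, which is the difference the paragraph above describes.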

Point-based (or quantum-based) methods

Graph-based methods

Self-Supervised Learning:

Supervised Learning:

Other Related Works

Sequence-based methods

Mass spectrometry-related properties prediction

Tandem mass spectra prediction

Retention time prediction

Collision cross section prediction

Mass spectra representation learning and matching

Chemical formula prediction from mass spectra

Mass spectra peak annotation/assignment

Machine learning in small molecules chromatography

Mass spectrometry is often coupled with chromatographic techniques, such as GC-MS (gas chromatography-mass spectrometry) or LC-MS (liquid chromatography-mass spectrometry). In these combined techniques, the chromatographic method separates the compounds, and then the mass spectrometer analyzes each separated compound for identification and quantification.

Related awesome lists
