Advances in machine learning, the exponential growth of drug-related data, and the widespread availability of user-friendly machine learning frameworks in popular programming languages [1, 2] are making machine learning methods increasingly prevalent at every stage of the drug discovery and development process [3].
Data quality and representation significantly impact the performance of machine learning-based predictive models, as both are crucial for effective training. As a result, there has been a surge of research interest in molecular representation. This research encompasses pre-computed or fixed molecular representations, such as molecular graphs, linear notations (e.g., SMILES), and molecular fingerprints [4, 5], as well as learned molecular representations [6].
This literature review provides an overview of the various molecular representation approaches used in machine learning-based drug development and explores their applications in conjunction with machine learning models for predicting molecular properties and reactions.
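To make the fixed representations named above concrete, the following minimal sketch shows ethanol as a SMILES string, as a molecular graph stored as an adjacency matrix, and as a small hashed bit vector. It uses only the Python standard library. The names `adjacency_matrix` and `toy_fingerprint` and the 16-bit length are illustrative assumptions, and the hashing scheme is a toy stand-in, not MACCS keys, the Daylight fingerprint, or ECFP.

```python
# Toy illustration: three fixed views of ethanol (CH3-CH2-OH), hydrogens implicit.
import hashlib

# 1. Linear notation: the SMILES string for ethanol.
smiles = "CCO"

# 2. Molecular graph: atoms are nodes, bonds are edges (heavy atoms only).
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]  # pairs of atom indices


def adjacency_matrix(n_atoms, bonds):
    """Build a symmetric 0/1 adjacency matrix from a bond list."""
    matrix = [[0] * n_atoms for _ in range(n_atoms)]
    for i, j in bonds:
        matrix[i][j] = 1
        matrix[j][i] = 1
    return matrix


def toy_fingerprint(atoms, bonds, n_bits=16):
    """Hash each atom and each bonded atom pair into a fixed-length bit vector.

    Hypothetical scheme for illustration only; real fingerprints use
    predefined keys or richer substructure enumeration.
    """
    features = list(atoms)
    features += ["-".join(sorted((atoms[i], atoms[j]))) for i, j in bonds]
    bits = [0] * n_bits
    for feature in features:
        digest = hashlib.sha256(feature.encode("utf-8")).digest()
        bits[digest[0] % n_bits] = 1  # set the bit this feature hashes to
    return bits


adjacency = adjacency_matrix(len(atoms), bonds)
fingerprint = toy_fingerprint(atoms, bonds)
print(smiles)
print(adjacency)
print(fingerprint)
```

The graph view keeps the full connectivity, while the bit vector compresses it into a fixed-length input that is convenient for standard machine learning models, at the cost of possible hash collisions.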
- Molecular Representations/Descriptors in Machine Learning-Based Drug Development
1.1 Molecular Graph Theory
1.1.1 Introduction to the Molecular Graph Representation
1.1.2 Mathematical Definition of a Graph
1.1.3 Graph Traversal Algorithms
1.1.4 Molecular Graph Representations
1.1.5 Advantages of Molecular Graph Representations
1.1.6 Disadvantages of Molecular Graph Representations
1.1.7 Molecular Graphs in AI-Driven Small Molecule Drug Discovery
1.1.8 References
1.2 Molecular Descriptors
1.2.1 Introduction to Molecular Descriptors
1.2.2 Molecular Fingerprints
1.2.3 Key-Based Molecular Fingerprints - MACCS Keys
1.2.4 Hash-Based Molecular Fingerprints - Daylight Fingerprint & ECFPs
1.2.5 Advantages & Applications of Molecular Fingerprints
1.2.6 Molecular Fingerprints in Machine Learning
1.2.7 References
- Machine Learning-Based Drug Development
2.1 Introduction to Machine Learning
2.1.1 How does Machine Learning Work?
2.1.2 Machine Learning Methods
2.1.3 Machine Learning Notation
2.1.4 References
2.2 Supervised Learning
2.2.1 Classification Algorithms in Supervised Learning
2.2.2 Regression Algorithms in Supervised Learning
[1] Abadi, M. et al. (2015) ‘TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems’, https://www.tensorflow.org/, Software available from tensorflow.org.
[2] Paszke, A. et al. (2017) ‘Automatic differentiation in PyTorch’, NIPS 2017 Autodiff Workshop.
[3] Kim, J. et al. (2021) ‘Comprehensive survey of recent drug discovery using Deep Learning’, International Journal of Molecular Sciences, 22(18), p. 9983.
[4] Rifaioglu, A.S. et al. (2020) ‘DEEPScreen: High performance drug–target interaction prediction with convolutional neural networks using 2-D structural compound representations’, Chemical Science, 11(9), pp. 2531–2557.
[5] David, L. et al. (2020) ‘Molecular representations in AI-Driven Drug Discovery: A review and practical guide’, Journal of Cheminformatics, 12(1).
[6] Yang, K. et al. (2019) ‘Analyzing learned molecular representations for property prediction’, Journal of Chemical Information and Modeling, 59(8), pp. 3370–3388.