Design of new molecules: countless applications across sectors, e.g. pharmaceuticals and materials.
Pharma: average time from start of discovery to market is ~13 years. Outside pharma: ~25 years
Crucial 1st step: generate pool of promising candidates
Daunting task (chemical space is huge, and molecules are subject to complex structural constraints)
Old way
Soon-to-be-old way: high-throughput virtual screening (HTVS)
Only existing molecules are explored
Much time lost evaluating bad leads
Goal: traverse chemical space more “effectively”: reach optimal molecules with fewer evaluations than brute-force screening
Combinatorial optimization problem
Often stochastic and multi-objective
Black-box objective functions
Black-box constraints
The process of automatically proposing novel chemical structures that optimally satisfy desired properties
Optimally satisfy desired properties:
Predictive models to forecast/approximate properties (objective functions) from chemical structure
Automatically proposing novel chemical structures: automatic generation of molecules that optimize properties (the predictions from the first stage)
Session 1: Predictive (QSAR) Models, with a focus on the low-data regime
Session 2: Generative Models
Session 3: The Tailor’s Drawer (+ Case Study)
Predictive models to forecast properties of molecules given their structure, with a focus on the small-data regime
Computational representations of molecules
An overview of predictive models for molecular properties
Evaluating model performance
Molecules are 3D quantum-mechanical (QM) objects: nuclei with defined positions, surrounded by electrons described by complex wave functions
Digital encoding that serves as input to a model
Uniqueness and invertibility
Trade-off: information lost vs complexity
3D coord. representation (symmetries?)
More compact 2D (graph) representation
1D, 2D and 3D Representations
Simplified Molecular Input Line Entry System (SMILES)
Molecule as a graph (bond-length and conformational info lost)
Traverse the graph
Generate a sequence of ASCII characters
Non-unique! ⇒ use canonical SMILES (see the sketch below)
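A minimal sketch using RDKit (an assumed dependency, not prescribed by these notes): two different SMILES traversals of ethanol collapse to the same canonical string.

```python
# Different SMILES strings can encode the same molecule; RDKit's
# MolToSmiles emits a canonical form by default.
from rdkit import Chem

for smi in ["CCO", "OCC"]:                    # two traversals of ethanol
    mol = Chem.MolFromSmiles(smi)             # parse SMILES into a molecule
    print(smi, "->", Chem.MolToSmiles(mol))   # both print "CCO"
```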
Tabular data:
Morgan fingerprints, Capecci et al. (2020) (see the sketch after this list)
Mordred descriptors, Moriwaki et al. (2018)
More… e.g. molecular embeddings
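A minimal sketch of building such a tabular representation with RDKit (the radius and bit-vector length are illustrative choices, not values from these notes):

```python
# Morgan fingerprint as a fixed-length 0/1 feature vector.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")          # aspirin
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
x = np.array(fp)              # 2048-dim bit vector, ready for tabular models
print(x.shape, x.sum())       # dimensionality and number of set bits
```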
3D point clouds: \(\mathcal{M} = \lbrace x_i, r_i \rbrace_{i=1}^p\), where \(x_i\) are features and \(r_i\) are coordinates
Minimal information loss (conformational preferences, bond lengths, etc. are retained)
Tailored predictive algorithms that respect 3D translational and rotational invariance
Molecular representation \(x\) and property \(y \in \mathbb{R}\)
Given training data \(\mathcal{D} = \lbrace x_i, y_i \rbrace_{i=1}^n\)…
… predictive regression model of \(y\) given \(x\).
Deterministic models - Point Forecasts
Probabilistic (Bayesian) models - Probabilistic Forecasts
Usual deterministic models: linear regression, random forests (RF), XGBoost, SVR…
Low-data regime:
\(p \gg n\) (more features than training molecules): need for regularization
Uncertainty is key \(\Rightarrow\) probabilistic (Bayesian) models
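As an illustration, here is a minimal sketch using a Gaussian process regressor, one possible probabilistic model for the low-data regime (the toy data and kernel choice are assumptions, not part of these notes; \(x_i\) could be, e.g., fingerprint vectors):

```python
# Gaussian process regression: probabilistic forecasts (mean + std).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))    # 20 molecules, 5 descriptors (toy data)
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.standard_normal(20)

kernel = RBF() + WhiteKernel()      # smooth signal + observation noise
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

# Probabilistic forecast: predictive mean and standard deviation
mean, std = gp.predict(X[:3], return_std=True)
print(mean, std)
```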
One-hot encoding of SMILES representations (see the sketch after this list)
Deep Neural Nets: RNN, 1D Conv, Transformers
Bayesian neural nets (BNNs)
Computationally expensive to train
Variational inference tends to underestimate uncertainty, Blei et al. (2018)
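A minimal sketch of the one-hot encoding step (the character vocabulary and maximum length are illustrative assumptions; real pipelines derive the vocabulary from the training set):

```python
# One-hot encoding of a SMILES string into a (max_len, |vocab|) matrix.
import numpy as np

vocab = sorted(set("CCO" + "c1ccccc1" + "=()NO"))    # toy vocabulary
char_to_idx = {ch: i for i, ch in enumerate(vocab)}

def one_hot(smiles: str, max_len: int = 20) -> np.ndarray:
    """Encode a SMILES string as a (max_len, |vocab|) 0/1 matrix."""
    X = np.zeros((max_len, len(vocab)))
    for i, ch in enumerate(smiles[:max_len]):
        X[i, char_to_idx[ch]] = 1.0
    return X

print(one_hot("c1ccccc1O").shape)  # input to an RNN / 1D Conv / Transformer
```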
Graph Neural Networks
Sequence of graph-to-graph blocks + output layer
(Infinitely) many architectures: Graph Networks, Battaglia et al. (2018)
Functions on graph-structured data
GN block (graph-to-graph map): primary computational unit in GNN
A graph with \(N_v\) nodes and \(N_e\) edges is the tuple \(G = (\textbf{u}, V, E)\): a global attribute \(\textbf{u}\), node attributes \(V = \lbrace \textbf{v}_i \rbrace_{i=1:N_v}\), and edge attributes \(E = \lbrace (\textbf{e}_k, r_k, s_k) \rbrace_{k=1:N_e}\), where \(r_k\) and \(s_k\) index the receiver and sender nodes of edge \(k\)
Edge update function \(\phi^e\)
Node update function \(\phi^v\)
Global update function \(\phi^u\).
\(\rho^{e\rightarrow v}\): aggregates edge attributes per node
\(\rho^{e\rightarrow u}\): aggregates edge attributes globally
\(\rho^{v\rightarrow u}\): aggregates node attributes globally.
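In the notation of Battaglia et al. (2018), a full GN block applies these functions in order:

\[
\begin{aligned}
\textbf{e}'_k &= \phi^{e}(\textbf{e}_k, \textbf{v}_{r_k}, \textbf{v}_{s_k}, \textbf{u}) && \forall k, \\
\bar{\textbf{e}}'_i &= \rho^{e\rightarrow v}(\lbrace \textbf{e}'_k \rbrace_{r_k = i}), \qquad
\textbf{v}'_i = \phi^{v}(\bar{\textbf{e}}'_i, \textbf{v}_i, \textbf{u}) && \forall i, \\
\textbf{u}' &= \phi^{u}\big(\rho^{e\rightarrow u}(\lbrace \textbf{e}'_k \rbrace), \; \rho^{v\rightarrow u}(\lbrace \textbf{v}'_i \rbrace), \; \textbf{u}\big).
\end{aligned}
\]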
Various parametric forms for functions
Multilayer perceptrons for the update functions and sums for the aggregate functions
GN blocks can be concatenated
Output layer of GNN depends on the task
The entire architecture can be summarized as follows:
Encode the input graph using independent node and edge update functions to match the internal node and edge feature sizes
Apply multiple GN blocks
Use an output layer to map the updated global features to a property prediction
Once the architecture is defined, the parameters can be optimized using standard optimizers and loss functions.
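A minimal numpy sketch of a single GN block with sum aggregations; tanh-linear maps stand in for the MLP update functions \(\phi^e, \phi^v, \phi^u\) (a simplification for brevity, not the real architectural choice):

```python
import numpy as np

def gn_block(u, V, E, senders, receivers, We, Wv, Wu):
    n_e, n_v = len(E), len(V)
    # Edge update phi^e on [e_k, v_{r_k}, v_{s_k}, u]
    E_new = np.tanh(np.concatenate(
        [E, V[receivers], V[senders], np.repeat(u[None, :], n_e, 0)], axis=1) @ We)
    # rho^{e->v}: sum incoming edge attributes per receiver node
    agg = np.zeros((n_v, E_new.shape[1]))
    np.add.at(agg, receivers, E_new)
    # Node update phi^v on [e_bar_i, v_i, u]
    V_new = np.tanh(np.concatenate(
        [agg, V, np.repeat(u[None, :], n_v, 0)], axis=1) @ Wv)
    # Global update phi^u on [rho^{e->u}, rho^{v->u}, u], sum aggregations
    u_new = np.tanh(np.concatenate([E_new.sum(0), V_new.sum(0), u]) @ Wu)
    return u_new, V_new, E_new

# Toy graph: 3 nodes, 2 directed edges, 4-dim features everywhere
rng = np.random.default_rng(0)
d = 4
u, V, E = rng.standard_normal(d), rng.standard_normal((3, d)), rng.standard_normal((2, d))
senders, receivers = np.array([0, 1]), np.array([1, 2])
We, Wv, Wu = (rng.standard_normal((4 * d, d)), rng.standard_normal((3 * d, d)),
              rng.standard_normal((3 * d, d)))
u_new, V_new, E_new = gn_block(u, V, E, senders, receivers, We, Wv, Wu)
print(u_new.shape, V_new.shape, E_new.shape)  # (4,) (3, 4) (2, 4)
```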
Geometric Neural Networks
(Again) many architectures
In a Geometric Net Block we update:
Node features, s.t. updated features are invariant to 3D translations and rotations
Node coordinates, s.t. updated coordinates are equivariant to 3D translations and rotations
\(E(n)\) equivariant graph neural nets, Satorras et al. (2022)
A refinement of message-passing neural networks (MPNNs)
\(G = (V, E)\)
In addition to node features, coordinates: \(V = \lbrace v_i, x_i \rbrace_{i=1:N_{v}}\).
Vanilla message passing (no geometric information):
\(\forall\) edges \(k\), \(\textbf{e}'_k = \phi^{e} (\textbf{e}_k, \textbf{v}_{r_k}, \textbf{v}_{s_k})\)
\(\forall\) nodes \(i\), \(\textbf{v}'_i = \phi^{v} (\textbf{v}_i, \bar{\textbf{e}}'_i)\), with \(\bar{\textbf{e}}'_i = \rho^{e\rightarrow v}(\lbrace \textbf{e}'_k \rbrace_{r_k = i})\)
EGNN block (the squared distance makes the edge update invariant; the coordinate update is equivariant):
\(\forall\) edges \(k\), \(\textbf{e}'_k = \phi^{e} (\textbf{e}_k, \textbf{v}_{r_k}, \textbf{v}_{s_k}, \color{red}{\Vert x_{r_k} - x_{s_k} \Vert ^2} )\)
\(\forall\) nodes \(i\), \(x'_i = x_i + C \sum_{k : r_k = i} (x_{r_k} - x_{s_k}) \, \phi^{x}(\textbf{e}'_k)\) and \(\textbf{v}'_i = \phi^{v} (\textbf{v}_i, \bar{\textbf{e}}'_i)\), where \(C\) is a normalizing constant
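A minimal numpy sketch in the spirit of Satorras et al. (2022); tanh-linear maps stand in for the learned functions \(\phi^e, \phi^x, \phi^v\), and the constant \(C\) is absorbed into the weights (both simplifying assumptions). It checks that node features come out invariant and coordinates equivariant under a random rigid motion:

```python
import numpy as np

def egnn_block(V, X, senders, receivers, We, Wx, Wv):
    rel = X[receivers] - X[senders]                  # relative positions
    d2 = np.sum(rel ** 2, axis=1, keepdims=True)     # squared distances
    # Edge update: invariant (sees ||x_r - x_s||^2, never raw coordinates)
    E_new = np.tanh(np.concatenate([V[receivers], V[senders], d2], axis=1) @ We)
    # Coordinate update: equivariant (moves x_i along relative positions)
    X_new = X.copy()
    np.add.at(X_new, receivers, rel * (E_new @ Wx))
    # Node update: aggregate incoming edge attributes, then update features
    agg = np.zeros((len(V), E_new.shape[1]))
    np.add.at(agg, receivers, E_new)
    V_new = np.tanh(np.concatenate([V, agg], axis=1) @ Wv)
    return V_new, X_new

rng = np.random.default_rng(0)
n, d = 4, 8
V, X = rng.standard_normal((n, d)), rng.standard_normal((n, 3))
senders, receivers = np.array([0, 1, 2]), np.array([1, 2, 3])
We = rng.standard_normal((2 * d + 1, d)) * 0.1
Wx = rng.standard_normal((d, 1)) * 0.1
Wv = rng.standard_normal((2 * d, d)) * 0.1

V1, X1 = egnn_block(V, X, senders, receivers, We, Wx, Wv)
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))     # random orthogonal map
t = rng.standard_normal(3)
V2, X2 = egnn_block(V, X @ R.T + t, senders, receivers, We, Wx, Wv)
print(np.allclose(V1, V2))                           # True: invariant features
print(np.allclose(X1 @ R.T + t, X2))                 # True: equivariant coords
```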
Usual metrics for regression
RMSE
MAE
MAPE
\(R^2\)
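For reference, with observations \(y_i\), point forecasts \(\hat{y}_i\), mean \(\bar{y}\), and a test set of size \(n\):

\[
\begin{aligned}
\mathrm{RMSE} &= \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \vert y_i - \hat{y}_i \vert, \\
\mathrm{MAPE} &= \frac{100}{n}\sum_{i=1}^{n} \left\vert \frac{y_i - \hat{y}_i}{y_i} \right\vert, \qquad
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}.
\end{aligned}
\]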
Multiple ways; evaluating probabilistic forecasts is a research area in itself! Gneiting and Raftery (2007)
Calibration measures
Idea: create \((100 \cdot q)\)% prediction intervals for the property prediction of every molecule in a test set.
\(C(q)\) is the proportion of molecules in the test set whose property value falls inside the interval calculated for that molecule.
If \(C(q) = q\) we say that the model is well calibrated.
If \(C(q) < q\) we say that the model is overconfident.
If \(C(q) > q\) we say that the model is underconfident.
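A minimal sketch of estimating \(C(q)\) empirically, assuming Gaussian forecasts (a mean and standard deviation per molecule, an assumption for illustration); the toy model deliberately understates its error to show overconfidence:

```python
import numpy as np
from scipy import stats

def calibration(y_true, mean, std, q):
    """Fraction of test molecules whose true value lies inside the
    central (100*q)% Gaussian prediction interval."""
    z = stats.norm.ppf(0.5 + q / 2)        # interval half-width in std units
    return np.mean(np.abs(y_true - mean) <= z * std)

rng = np.random.default_rng(0)
mean = rng.standard_normal(500)            # predictive means for 500 molecules
y_true = mean + rng.standard_normal(500)   # true residuals have std 1
std = np.full(500, 0.7)                    # model claims std 0.7: too small

for q in (0.5, 0.8, 0.95):
    print(q, calibration(y_true, mean, std, q))  # C(q) < q: overconfident
```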