Virtual Screening (VS): Brute-force evaluation of huge libraries of compounds to identify structures that improve desired properties (e.g. drug-likeness)
Main goal: Traverse the chemical space more effectively (better molecules in fewer evaluations)
Generate compounds in a directed manner
Select the model depending on the encoding information
Performance measures (RMSE, \(R^2\), etc.) + assessment of probabilistic predictions
De-novo design is also referred to as generative chemistry
Discriminative models learn decision boundaries
Generative models model the probability distribution of each class
\(\rightarrow\) Can be instantiated to generate new examples (!)
Not the only way to obtain new compounds…
Untargeted vs. targeted (extra metric to optimize, e.g. QED, penalized logP and many more)
Targeted generation depends on having a proper characterization of the property of interest
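For instance, standard property oracles such as QED and logP can be computed with RDKit; a minimal sketch, assuming RDKit is available and using an arbitrary example molecule:

```python
# Minimal sketch of property "oracles" used for targeted generation (assumes RDKit is installed).
from rdkit import Chem
from rdkit.Chem import QED, Crippen

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, used only as an example

print("QED :", QED.qed(mol))           # drug-likeness score in [0, 1]
print("logP:", Crippen.MolLogP(mol))   # Crippen logP (penalized logP additionally includes SA and ring terms)
```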
Gradient-based: Models that use the gradient of the objective function to perform optimization
Gradient-free: Metaheuristic models, based on stochastic population optimization
Chemical representation tailored for each case depending on the data, objective and resources available
| | Atom based | Fragment based | Reaction based |
|---|---|---|---|
| Gradient free | EvoMol*, GB-GA | CReM | AutoGrow4 |
| Gradient based | ChemVAE*, EDM*, PaccMannRL, GraphAF | JT-VAE | DoG |
Many (many) more… VLS3D list of resources
Originally introduced in Gómez-Bombarelli et al. (2018)
Extension of the ideas from Generative Adversarial Networks (GANs) and autoencoders
AE: Hourglass-structured NN that encodes and decodes the input information, consisting of an encoder, \(f_\theta(x)\), a decoder, \(g_\phi(z)\), and the latent space, \(z\)
Attempts to learn the identity function, i.e. \[ \text{AE} = g_\phi \circ f_\theta \quad s.t. \quad \text{AE}^*(x) = g_\phi(f_\theta(x)) = x \]
Minimize the reconstruction error (\(\epsilon\)): \[ \arg \min_{\theta,\phi} \epsilon(x, \hat{x}) \]
\(\hat{x} \simeq x\) \(\Rightarrow\) Model encodes/decodes correctly
AE can be seen as generative models
Latent space difficult to navigate
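A minimal autoencoder sketch in PyTorch; the layer sizes and the MSE reconstruction error are illustrative choices, not those of any specific paper:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Hourglass-shaped network: encoder f_theta -> latent z -> decoder g_phi."""
    def __init__(self, input_dim=128, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, 64), nn.ReLU(),
                                     nn.Linear(64, latent_dim))     # f_theta(x)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(),
                                     nn.Linear(64, input_dim))      # g_phi(z)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)                                      # x_hat

model = AutoEncoder()
x = torch.randn(32, 128)                       # dummy batch
x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)        # reconstruction error epsilon(x, x_hat)
```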
VAE: Adds stochasticity to the encoding \(\rightarrow\) Regularize latent space
\[ Loss = \epsilon(x,\hat{x}) + regularizer \]
The regularization forces the latent encoding to resemble a prior: \[ p(z) = \mathcal{N}(0, I) \]
The encoded data will follow \[ z \sim q(z|x) = \mathcal{N}(\mu_x, \sigma_x) \]
where \(\mu_x\) and \(\sigma_x\) are given by \(f_\theta(x)\), which can be seen as \[ \mu_x = f^1_\theta(x), \quad \sigma_x = f^2_\theta(x) \] where \(f^1\) and \(f^2\) are the first and second halves of the units of the latent layer
KL divergence as regularizer (closed form solution) \[ KL(q(z|x)|p(z)) = \frac{1}{2}\sum_{i=1}^n \left[ (\sigma_{x,i})^2 + (\mu_{x,i})^2 - \log(\sigma_{x,i}^2) - 1 \right] \]
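A sketch of the reparameterization ("adding noise") and of the closed-form KL term above, written with log-sigma for numerical stability; tensor shapes are illustrative:

```python
import torch

def reparameterize(mu, log_sigma):
    """Sample z ~ N(mu, sigma^2) via the reparameterization trick."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(log_sigma) * eps

def kl_to_standard_normal(mu, log_sigma):
    """Closed-form KL( N(mu, sigma^2 I) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * torch.sum(torch.exp(2 * log_sigma) + mu ** 2 - 2 * log_sigma - 1, dim=-1)

mu, log_sigma = torch.zeros(4, 16), torch.zeros(4, 16)   # mu_x = f^1_theta(x), log sigma_x from f^2_theta(x)
z = reparameterize(mu, log_sigma)
print(kl_to_standard_normal(mu, log_sigma))               # zero when q(z|x) already equals p(z)
```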
Adding noise, we sample from the latent space and decode it
ChemVAE: VAE + property predictor
\[ \mathcal{L}_{\text{VAE}} = \epsilon(x, \hat{x}) + KL(q(z|x)|p(z)) + \mathcal{L}_P(x,\hat{x}) \] with \(\mathcal{L}_P(x,\hat{x})\) the property prediction error
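A sketch of how the three terms could be combined during training; the MSE losses and the linear property head are placeholders (the original ChemVAE reconstructs SMILES strings with a sequence decoder), so this only illustrates the structure of the loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

property_head = nn.Linear(16, 1)   # hypothetical predictor acting on the latent code z

def chemvae_loss(x, x_hat, mu, log_sigma, z, y_true):
    recon = F.mse_loss(x_hat, x)                                        # epsilon(x, x_hat)
    kl = 0.5 * torch.sum(torch.exp(2 * log_sigma) + mu ** 2            # KL(q(z|x) | p(z))
                         - 2 * log_sigma - 1, dim=-1).mean()
    prop = F.mse_loss(property_head(z).squeeze(-1), y_true)            # L_P: property prediction error
    return recon + kl + prop
```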
Fig: (a) ChemVAE architecture (b) Property optimization via BO
Local behavior + interpolation between compounds possible
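A sketch of linear interpolation between two compounds in latent space, assuming a trained model exposing `encoder`/`decoder` as in the autoencoder sketch above (with a VAE one would interpolate between the encoded means \(\mu_x\)):

```python
import torch

def interpolate(model, x_a, x_b, steps=5):
    """Decode points along the straight line between the latent codes of two inputs."""
    z_a, z_b = model.encoder(x_a), model.encoder(x_b)
    return [model.decoder((1 - a) * z_a + a * z_b)
            for a in torch.linspace(0.0, 1.0, steps)]
```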
generative_models/
variational_autoencoder/
VAE.ipynb
Only a brief introduction though… Check the original repo for extended functionality
VAE-based models: recent interest in combining them with reinforcement learning (e.g. PaccMannRL)
EDM: Equivariant diffusion model for 3D molecule generation
The same principle behind Stable Diffusion
A diffusion model learns a denoising process (the reverse of a diffusion process)
\(\rightarrow\) progressively add Gaussian noise (\(z_t\)) to the signal (\(x\)) \[ q(z_t|x) = \mathcal{N}(z_t|\alpha_t x, \sigma_t^2I) \] with \(\alpha_0 \approx 1\) and \(\alpha_T \approx 0\) and \(\sigma_t\) the added noise level
The diffusion process is Markovian with transition distribution \[ q(z_t|z_s) = \mathcal{N}(z_t|\alpha_{t|s}z_s, \sigma_{t|s}^2I)\,, \quad \forall t>s \] with \(\alpha_{t|s} = \alpha_t/\alpha_s\) and \(\sigma_{t|s}^2 = \sigma_t^2 - \alpha_{t|s}^2\sigma_s^2\)
The complete process can be given by: \[ \begin{gathered} q(z_0, z_1, \cdots, z_T|x) = q(z_0|x) \textstyle{\prod_{t=1}^T} q(z_t|z_{t-1}) \\ q(z_s|x, z_t) = \mathcal{N}(z_s|\mu_{t \rightarrow s}(x, z_t), \,\sigma_{t \rightarrow s}^2I) \end{gathered} \] with \(\mu_{t \rightarrow s}(x, z_t)\) and \(\sigma_{t \rightarrow s}^2\) in terms of \(\alpha\)’s , \(\sigma\)’s, \(x\) and \(z\)
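A sketch of sampling from the forward process \(q(z_t|x)\) with a simple variance-preserving schedule (\(\alpha_t^2 + \sigma_t^2 = 1\)); the schedule and dummy data are illustrative, not those used by EDM:

```python
import torch

T = 1000
alphas = torch.linspace(1.0, 0.0, T + 1)      # alpha_0 ~ 1 (clean signal), alpha_T ~ 0 (pure noise)
sigmas = torch.sqrt(1.0 - alphas ** 2)        # variance-preserving: alpha_t^2 + sigma_t^2 = 1

def q_sample(x, t):
    """Draw z_t ~ N(alpha_t * x, sigma_t^2 I)."""
    eps = torch.randn_like(x)
    return alphas[t] * x + sigmas[t] * eps

x = torch.randn(8, 3)       # dummy "signal", e.g. 3D coordinates of 8 atoms
z_mid = q_sample(x, 500)    # partially noised version of x
```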
Similar to the VAE approach, but now there is only a decoding step and the latent space is pure noise
EDM performs property optimization with a simple extension of \(\phi\) into \(\phi(z_t, [t, c])\), with \(c\) a property of interest
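A sketch of that conditioning idea: the denoising network simply receives the timestep and the target property value as extra input features (the dense network below is a placeholder, not the actual equivariant architecture):

```python
import torch
import torch.nn as nn

T = 1000
phi = nn.Sequential(nn.Linear(3 + 2, 64), nn.ReLU(), nn.Linear(64, 3))  # placeholder denoiser

def denoise(z_t, t, c):
    """Predict the noise in z_t, conditioning on [t, c] with c the property of interest."""
    cond = torch.tensor([t / T, c]).expand(z_t.shape[0], 2)   # append [t, c] to every atom row
    return phi(torch.cat([z_t, cond], dim=-1))

z_t = torch.randn(8, 3)
eps_hat = denoise(z_t, t=500, c=0.8)    # c: e.g. a desired QED value (illustrative)
```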
generative_models/
diffusion/DIFFUSION.ipynb
Key idea:
Population of individuals (states) in which the fittest (highest valued state) produce offspring (successor states) that populate the next generation in a process of recombination and mutation.
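A minimal, generic sketch of that loop; the fitness, crossover and mutation functions are placeholders to be supplied by the user (molecule-specific GAs such as GB-GA operate on molecular graphs instead):

```python
import random

def genetic_algorithm(population, fitness, crossover, mutate,
                      generations=50, mutation_rate=0.1):
    """Generic GA loop: rank by fitness, recombine the fittest, mutate, repeat."""
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: max(2, len(ranked) // 2)]             # keep the fittest half
        offspring = []
        while len(offspring) < len(population):
            p1, p2 = random.sample(parents, 2)
            child = crossover(p1, p2)
            if random.random() < mutation_rate:                  # occasional random mutation
                child = mutate(child)
            offspring.append(child)
        population = offspring
    return max(population, key=fitness)
```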
Many different evolutionary algorithms exist; they mostly vary in their setup regarding common criteria:
Representation of each individual: strings over a finite alphabet (e.g. ATGC for genes), sequences of real numbers (evolution strategies) or even computer programs (genetic programming)
Example: the population (a) is ranked by fitness levels (b), resulting in pairs (c) that mate and produce offspring (d), which are subject to mutations (e)
Child gets the first three digits from the \(1^{st}\) parent (327) and the remaining five from the \(2^{nd}\) parent (48552)
(no mutation here)
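A sketch of that single-point crossover on digit strings; the full parent strings below are made up for illustration, only the fragments 327 and 48552 come from the example above:

```python
def single_point_crossover(parent_1, parent_2, cut=3):
    """Child inherits parent_1 up to the cut point and parent_2 from the cut point on."""
    return parent_1[:cut] + parent_2[cut:]

parent_1 = "32752411"   # illustrative parent: only its first three digits (327) matter here
parent_2 = "24748552"   # illustrative parent: only its last five digits (48552) matter here
print(single_point_crossover(parent_1, parent_2))   # -> "32748552"
```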
Schema: Structure in which some positions are left unspecified
generative_models/
evolutionary_algorithm/GENETIC.ipynb