An "encyclopedia" of AI small-molecule drug discovery: review by researchers from Cornell, Cambridge, EPFL and others published in a Nature journal

Author | Yuanqi Du, Cornell University

Editor | ScienceAI

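As AI for Science attracts growing attention, people increasingly care about how AI can solve a range of scientific problems and how successful approaches can be carried over to neighboring fields.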

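AI for small-molecule drug discovery is one of the most representative and earliest-explored of these areas. Molecular discovery is a very hard combinatorial optimization problem (because molecular structures are discrete), and the search space is both enormous and rugged. Verifying the properties of the molecules found is also very difficult: it usually requires expensive experiments, or at the very least simulations or quantum chemistry calculations, to provide feedback.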

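With the rapid progress of machine learning, and thanks to early exploration (including the construction of simple, usable optimization objectives and evaluation methods), a large number of algorithms have been developed. These include combinatorial optimization, search and sampling algorithms (genetic algorithms, Monte Carlo tree search, reinforcement learning, generative flow networks/GFlowNets, Markov chain Monte Carlo and others) as well as continuous optimization algorithms such as Bayesian optimization and gradient-based optimization. At the same time, relatively complete benchmarks and reasonably objective and fair comparison methods have opened up broad room for developing machine learning algorithms.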

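Recently, researchers from Cornell University, the University of Cambridge and the École Polytechnique Fédérale de Lausanne (EPFL) published a review article titled "Machine learning-aided generative molecular design" in Nature Machine Intelligence.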

Paper link: https://www.nature.com/articles/s42256-024-00843-5

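This review surveys the application of machine learning to generative molecular design. Drug discovery and development require optimizing molecules to satisfy specific physicochemical properties and biological activities. However, because the search space is huge and the optimization function is discontinuous, traditional methods are both expensive and prone to failure. Machine learning accelerates early-stage drug discovery by combining the molecule generation and screening steps.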

Figure: The generative ML-aided molecular design workflow.

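Generative molecular design tasks

Generative molecular design falls into two main paradigms: distribution learning and goal-directed generation. Goal-directed generation can be further divided into conditional generation and molecule optimization. The suitability of each approach depends on the specific task and the data involved.

Distribution learning

Distribution learning aims to characterize the data distribution by modeling the probability distribution of the molecules in a given dataset, so that new molecules can be sampled from the learned distribution.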
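As a rough illustration of distribution learning, the sketch below fits a character-level bigram model to a handful of SMILES strings and samples new strings from it. The training set, the start/end markers and the function names are assumptions made for this example; real systems use far richer models, and a character-level model like this one does not guarantee that sampled strings are valid molecules.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # hypothetical start/end markers, not SMILES symbols

def fit_bigram(smiles_list):
    """Count character-to-character transitions, including start/end markers."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in smiles_list:
        chars = [START] + list(s) + [END]
        for prev, nxt in zip(chars, chars[1:]):
            counts[prev][nxt] += 1
    return counts

def sample(counts, rng, max_len=40):
    """Draw one string by following the learned transition frequencies."""
    out, prev = [], START
    for _ in range(max_len):
        choices, weights = zip(*counts[prev].items())
        nxt = rng.choices(choices, weights=weights, k=1)[0]
        if nxt == END:
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)

# Toy training set (assumed for illustration): a few small alcohols and amines.
train = ["CCO", "CCCO", "CCN", "CCCN", "CC(C)O"]
model = fit_bigram(train)
rng = random.Random(0)
samples = [sample(model, rng) for _ in range(5)]
print(samples)
```

The same two steps, estimating a distribution from data and then sampling from it, carry over to the deep generative models discussed in the review; only the model class changes.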

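Conditional generation

Property-conditioned generation: generates structures with specified properties; the condition can be a text description or a numerical value of a particular property.
Molecular (sub)structure-conditioned generation: generates molecules under specific structural constraints, for example designing part of a structure, scaffold hopping, linker design, redesigning the whole structure (lead optimization), or conditional generation of the whole molecule (conformation generation).
Target-conditioned generation: aims to generate molecules with high binding affinity for a specific disease-relevant biomolecular target. Unlike property-conditioned generation, target-conditioned generation has explicit access to the target structure and incorporates direct target-ligand interactions to improve the affinity of the ligand for the target.
Phenotype-conditioned generation: learns phenotypic fingerprints from cell-based microscopy or other bioassay readouts (such as transcriptomic data) to provide a conditioning signal that guides generation toward molecules with the desired biological outcome.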

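Molecule optimization

Molecule optimization plays a key role in drug discovery: it refines the properties of drug candidates to improve their safety, efficacy and pharmacokinetics. It involves making small modifications to a candidate's structure to optimize drug properties such as solubility, bioavailability and target affinity, thereby improving therapeutic potential and increasing the chance of success at clinical endpoints.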

Figure: Generation tasks, generation strategies, and molecular representations.

Molecular generation process

Molecular generation is a complex process built from many interchangeable components. We list representative work in the figure below and introduce the representative components of each part.

Molecular representation

When developing neural architectures for molecular generation, the first step is to choose machine-readable input and output representations of the molecular structure. The input representation helps inject appropriate inductive biases into the model, while the output representation determines the search space over which molecules are optimized. The representation type also determines which generation methods apply. For example, discrete search algorithms can only be applied to combinatorial representations such as graphs and strings.

While various input representations have been studied, the trade-offs between representation types and the neural architectures that encode them are not yet clear. Conversions between molecular representations are not necessarily bijective. For example, density maps and fingerprints cannot uniquely identify molecules, and further techniques are needed to solve this non-trivial mapping problem. Common molecular representations include strings, two-dimensional topological graphs, and three-dimensional geometric graphs.

Strings: molecular structures are usually encoded as strings, such as the Simplified Molecular Input Line Entry System (SMILES) or SELF-referencing Embedded Strings (SELFIES). SMILES represents a molecule using syntax rules, but a given string may be invalid; SELFIES modifies these rules so that every string corresponds to a valid molecule. Molecular strings are typically treated as sequence data and encoded with recurrent neural networks or Transformer models.
Topological and geometric graphs: atoms and bonds are usually represented as nodes and edges in a topological graph. Graph neural networks (GNNs) are often used to model graph-structured molecular data, updating node and edge features based on adjacent nodes. When 3D information is available and relevant, geometric GNNs are often used to capture application-relevant symmetries in 3D space, such as invariance or equivariance to translation and rotation.
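To make the two representation families concrete, the sketch below shows one molecule (acetyl chloride) first as a SMILES token sequence and then as a hand-written graph. The tokenizer is a simplified assumption that covers only a common subset of SMILES syntax, and the graph is written out manually rather than parsed; a real pipeline would rely on a cheminformatics toolkit for both.

```python
import re

# Simplified SMILES tokenizer (assumption: bracket atoms, Cl/Br, common
# organic-subset atoms, bond symbols, branches and ring-closure digits only).
TOKEN_RE = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|[=#\-\+\(\)/\\@\.]|\d|%\d{2}"
)

def tokenize(smiles):
    """Split a SMILES string into tokens for a sequence model."""
    tokens = TOKEN_RE.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError("unrecognized characters in SMILES")
    return tokens

# String view: the token sequence an RNN or Transformer would consume.
tokens = tokenize("CC(=O)Cl")

# Graph view of the same molecule, written by hand: heavy atoms are nodes,
# bonds are (i, j, bond_order) edges, hydrogens are implicit.
atoms = ["C", "C", "O", "Cl"]
bonds = [(0, 1, 1), (1, 2, 2), (1, 3, 1)]

def adjacency(n_atoms, bonds):
    """Neighbor lists, the basic structure a GNN aggregates over."""
    adj = {i: [] for i in range(n_atoms)}
    for i, j, _ in bonds:
        adj[i].append(j)
        adj[j].append(i)
    return adj

adj = adjacency(len(atoms), bonds)
print(tokens)
print(adj)
```

Note that "Cl" is kept as a single token; splitting it into "C" and "l" is a common pitfall of naive character-level tokenization.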

Representation granularity is another consideration in generative model design. Typically, methods use atoms or molecular fragments as the basic building blocks during generation. Fragment-based representations coarsen molecular structures into larger units containing groups of atoms. These units carry hierarchical information such as functional group identity, which aligns with traditional fragment-based or pharmacophore-based drug design approaches.

Generative methods

Deep generative models are a class of methods that estimate the probability distribution of data and sample from the learned distribution (also called distribution learning). These include variational autoencoders, generative adversarial networks, normalizing flows, autoregressive models, and diffusion models. Each of these methods has its own use cases, strengths and weaknesses, and the choice depends on the task at hand and the characteristics of the data.

Generation strategy

Generation strategy refers to the way the model outputs the molecular structure, which can generally be divided into one-shot generation, sequential generation, or iterative improvement.

One-shot generation: One-shot generation generates the complete molecular structure in a single forward pass of the model. This approach often struggles to generate realistic and reasonable molecular structures with high accuracy. Furthermore, one-shot generation often cannot satisfy explicit constraints, such as valence constraints, which are crucial to ensure the accuracy and validity of the generated structure.

Sequential generation: Sequential generation builds a molecular structure through a series of steps, usually atom by atom or fragment by fragment. Valence constraints can be easily injected into sequential generation, thereby improving the quality of the generated molecules. However, its main limitations are that the order of the generation trajectory needs to be defined during training and that inference is slower.
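The following sketch illustrates how a valence constraint can be enforced during sequential generation. It grows a molecule one atom at a time with single bonds only, and masks out attachment points that have no spare valence. The valence table, the single-bond restriction and the random choices (which stand in for a learned policy) are all simplifying assumptions for illustration.

```python
import random

# Simplified maximum valences (assumption: neutral atoms, no charges).
MAX_VALENCE = {"C": 4, "N": 3, "O": 2, "F": 1}

def generate(n_atoms, rng):
    """Grow a molecule atom by atom, attaching each new atom with a single bond."""
    atoms = [rng.choice(["C", "N", "O"])]  # seed atom
    bonds = []                             # (i, j) single bonds
    used = [0]                             # valence already consumed per atom
    while len(atoms) < n_atoms:
        # Valence mask: only atoms with spare valence may receive a new bond.
        open_sites = [i for i, a in enumerate(atoms) if used[i] < MAX_VALENCE[a]]
        if not open_sites:
            break                          # every atom is saturated, stop early
        site = rng.choice(open_sites)      # a learned model would score these
        atoms.append(rng.choice(list(MAX_VALENCE)))
        used.append(1)                     # the new atom spends one valence
        used[site] += 1
        bonds.append((site, len(atoms) - 1))
    return atoms, bonds, used

atoms, bonds, used = generate(6, random.Random(0))
print(atoms, bonds)
```

Because the mask is applied before every step, no atom can exceed its valence, which is the kind of explicit constraint that is hard to guarantee in one-shot generation.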

Iterative improvement: Iterative improvement adjusts the prediction through a series of predicted updates, circumventing the difficulties of one-shot generation. For example, the recycling structure module in AlphaFold2 successfully refined backbone frames, an approach that inspired related molecule generation strategies. Diffusion models are a common technique of this kind: they generate new data through a series of denoising steps. Diffusion models have been applied to a variety of molecule generation problems, including conformer generation, structure-based drug design, and linker design.

Optimization strategy

Combinatorial optimization: For combinatorial encodings of molecules (graphs or strings), techniques from the field of combinatorial optimization can be applied directly.

Continuous optimization: Molecules can also be represented or encoded in continuous domains, such as point clouds and geometric graphs in Euclidean space, or the continuous latent space in which a deep generative model encodes discrete data. Continuous optimization algorithms, such as Bayesian optimization and gradient-based optimization, can then be applied.
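As a small example of combinatorial optimization over a string encoding, the sketch below runs a basic genetic algorithm with crossover, mutation and elitist selection. The three-letter alphabet and the scoring function are toy assumptions with no chemical meaning; in practice the strings would be molecular encodings and the score would come from a property predictor or simulation.

```python
import random

ALPHABET = "CNO"  # toy alphabet (assumption), not full SMILES

def toy_oracle(s):
    """Stand-in score (assumption): fraction of non-carbon characters."""
    return sum(ch != "C" for ch in s) / len(s)

def mutate(s, rng):
    """Replace one randomly chosen position with a random symbol."""
    i = rng.randrange(len(s))
    return s[:i] + rng.choice(ALPHABET) + s[i + 1:]

def crossover(a, b, rng):
    """Single-point crossover of two equal-length strings."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def genetic_search(rng, length=8, pop_size=20, generations=30):
    pop = ["C" * length for _ in range(pop_size)]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            parent_a, parent_b = rng.sample(pop, 2)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        # Elitist selection: keep the best pop_size of parents and children.
        pop = sorted(pop + children, key=toy_oracle, reverse=True)[:pop_size]
    return pop[0]

best = genetic_search(random.Random(0))
print(best, toy_oracle(best))
```

Because parents compete with their children for survival, the best score in the population never decreases from one generation to the next.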

Evaluation of generative machine learning models

Evaluating generative models requires both computational evaluation and experimental verification. Standard metrics include validity, uniqueness, and novelty, among others. Multiple metrics should be considered together to fully assess generation performance.
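The sketch below computes the three standard metrics under one common convention: validity over all generated strings, uniqueness over the valid ones, and novelty over the unique valid ones relative to the training set. The validity check here is a toy assumption (balanced branch parentheses); a real evaluation would parse each string with a cheminformatics toolkit and compare canonical forms.

```python
def balanced_parentheses(s):
    """Toy validity check (assumption): branches must open and close properly."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0 and len(s) > 0

def evaluate(generated, training_set, is_valid=balanced_parentheses):
    """Return validity, uniqueness and novelty as fractions in [0, 1]."""
    valid = [s for s in generated if is_valid(s)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }

# Hypothetical model output and training set, for illustration only.
generated = ["CCO", "CCO", "CC(C)N", "CC(C", "CCCN"]
training = ["CCO", "CCN"]
metrics = evaluate(generated, training)
print(metrics)
```

In this example four of five strings pass the check, three of those four are distinct, and two of the three distinct strings are absent from the training set. A model can score well on one metric and poorly on another, which is why the metrics are reported together.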

Experimental verification

Generated molecules must ultimately be verified in wet-lab experiments, yet existing research focuses primarily on computational contributions. Although generative models are not without weaknesses, the disconnect between predictions and experiments also stems from the expertise, expense, and lengthy testing cycles that such validation requires.

Patterns in generative models

Most studies that report experimental validation use RNNs and/or VAEs operating on SMILES. We summarize four main observations:

SMILES, although it captures limited 3D information, serves as an efficient representation and is suitable for distribution learning and fine-tuning on small datasets.
Many experimentally validated studies target kinases, which are common targets in popular open-source datasets such as ChEMBL.
The vast majority of goal-directed methods use reinforcement learning (alone or as a component) as the optimization algorithm, in both ligand-based and structure-based drug design.
AlphaFold-predicted structures can be successfully used for structure-based generative drug design.

Future directions

Although machine learning algorithms have brought hope to small-molecule drug discovery, many challenges and opportunities remain.

Challenges

Out-of-distribution generation: Known chemicals occupy only a small portion of chemical space. Although deep generative models can propose molecules outside the training distribution, it is necessary to ensure that those molecules are reasonable.
Unrealistic problem formulation: Precise problem formulation is critical to developing models applicable to real-world drug discovery. Fundamental aspects that are often overlooked include conformational dynamics, the role of water, and entropic contributions, while assumptions such as unlimited access to oracle calls are often wrongly taken for granted. This encompasses the issue of sample efficiency, and recent research has made progress in efficient goal-directed generation under limited oracle budgets.
Low-fidelity oracle: Designing effective scoring functions along the dimensions relevant to drug discovery remains difficult and is a bottleneck for deploying generative models in industrial settings. For example, high-throughput binding affinity predictions are often inaccurate in both data-driven and physics-based workflows. While alternative high-accuracy oracles exist, their computational requirements limit scalability. In addition, the inaccessibility of high-quality annotated data is an obstacle to developing AI oracles that are both accurate and tractable.
Lack of Uniform Evaluation Protocols: The evaluation protocols used to evaluate the quality of drug candidates are closely tied to our criteria for defining what a good drug is. The easy-to-compute physicochemical descriptors commonly used by the ML community are questionable and certainly do not fully reflect performance. Rigorous comparisons between generative molecular design and virtual screening are also less common.
Lack of large-scale research and benchmarking: Many ML methods have been developed, but fair benchmarking results across model types are missing for many critical tasks. For example, only a fraction of the available data has been used for training, which limits understanding of the models' scalability. Recent benchmarks are an important contribution to standardizing computational evaluation protocols.
Lack of interpretability: Interpretability is an important but underexplored area in molecular generative models. For example, insights into how a generation or optimization process builds molecules can yield chemical rules that are interpretable to medicinal chemists. This is particularly important in the field of small molecules, because generative models are often used to suggest ideas to medicinal chemists, and synthesis barriers make it impossible to test every generated design.
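The point above about limited oracle budgets can be made concrete with a small wrapper that counts and caps oracle calls. This is a minimal sketch: the class name `BudgetedOracle` and the use of `len` as a placeholder scoring function are assumptions for illustration, standing in for an expensive experiment or simulation.

```python
class BudgetedOracle:
    """Wraps a scoring function, caches results, and enforces a call budget."""

    def __init__(self, score_fn, budget):
        self.score_fn = score_fn
        self.budget = budget
        self.cache = {}

    @property
    def calls(self):
        """Number of distinct molecules actually sent to the scoring function."""
        return len(self.cache)

    def __call__(self, molecule):
        if molecule in self.cache:
            return self.cache[molecule]      # repeated queries cost nothing
        if self.calls >= self.budget:
            raise RuntimeError("oracle budget exhausted")
        self.cache[molecule] = self.score_fn(molecule)
        return self.cache[molecule]

# `len` is a placeholder score (assumption), not a real molecular property.
oracle = BudgetedOracle(score_fn=len, budget=3)
scores = [oracle(m) for m in ["CCO", "CCN", "CCO", "CCCO"]]
print(scores, oracle.calls)
```

Comparing optimizers under the same fixed budget, rather than letting each query the oracle freely, is one way to measure sample efficiency on equal terms.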

Opportunities

Applications Beyond Small Molecule Design: The methods discussed here may have wider applications in the design of other complex structural materials such as polysaccharides, proteins (especially antibodies), nucleic acids, crystal structures, and polymers.
Large language models: Large language models demonstrate the potential to revolutionize molecular design through text-guided discovery and decision-making as agents, enabled by the vast amount of available training data, including the scientific literature. Furthermore, models that are customized or fine-tuned for molecular structures provide researchers with additional opportunities to take advantage of established advances in natural language processing.
Later Stages of Drug Development: Molecular design/optimization occupies the early stages of drug discovery. However, late failures due to limited efficacy, poor ADME/T (absorption, distribution, metabolism, excretion and toxicity) properties and safety concerns are pain points in the drug development pipeline. Although limited, integrating clinical data into design pipelines is a promising direction to improve downstream success rates.
Focused model purpose: Drug discovery pipelines are the result of years of experience and hard lessons learned by pharmaceutical companies. ML researchers should go beyond designing pure ab initio models (especially when deep representation capabilities are lacking) and instead design models that focus on improving specific steps of the multi-year process, consistent with real-world constraints.
Automated labs: The increasing need for high-throughput experiments to provide feedback on ML-designed molecules is drawing more and more attention to automated labs that speed up the design-make-test-analyze cycle.

Author: Yuanqi Du is a second-year doctoral student in the Department of Computer Science at Cornell University. His main research interests include geometric deep learning, probabilistic models, sampling, search, optimization problems, interpretability, and applications in molecular discovery. For more information, see https://yuanqidu.github.io/.