A molecule is the smallest unit of a substance that retains its chemical properties. The study of molecules is fundamental to many scientific fields, such as pharmacy, materials science, biology, and chemistry.
Molecular representation learning has been a very popular research direction in recent years, and current methods can be divided into several schools, such as sequence-based, graph-based, and fingerprint-based representations.
However, current representation methods still have limitations. For example, sequence representations lack explicit structural information about molecules, and the expressive power of existing graph neural networks is still limited (Shen Huawei of the Institute of Computing Technology, Chinese Academy of Sciences, discusses this in his report "The Expression Ability of Graph Neural Networks").
Interestingly, when we study molecules in high school chemistry, we look at images of molecules, and chemists also observe and reason from molecular images when they design molecules. A natural idea follows: "Why not use molecular images directly to represent molecules?" If images can represent molecules directly, couldn't the whole toolbox of computer vision (CV) be applied to the study of molecules?
So why not just do it? CV has plenty of models that could be used to learn from molecules. But wait, there is another important issue: data, and labeled data in particular. In CV, data annotation is not especially difficult. For classic CV and NLP problems such as image recognition or sentiment classification, a person can annotate an average of 800 pieces of data. In the molecular field, however, molecular properties can only be assessed through wet-lab experiments and clinical trials, so labeled data are very scarce.
To address this, researchers from Hunan University proposed ImageMol, described as the world's first unsupervised learning framework for molecular images, which uses large-scale unlabeled molecular image data for unsupervised pre-training. It offers a new paradigm for understanding molecular properties and drug targets, and shows that molecular images have great potential in intelligent drug research and development. The work was published in the top international journal "Nature Machine Intelligence" under the title "Accurate prediction of molecular properties and drug targets using a self-supervised image representation learning framework". Its success at the intersection of computer vision and molecular science provides new opportunities for research in the molecular field.
Paper link: https://www.nature.com/articles/s42256-022-00557-6.pdf
The overall structure of ImageMol, shown in the figure below, consists of three parts:
(1) A molecular encoder, ResNet18 (light blue), extracts latent features from about 10 million molecular images (a).
(2) Taking into account the chemical knowledge and structural information in molecular images, five pre-training strategies (MG3C, MRD, JPP, MCL, MIR) are used to optimize the latent representation of the molecular encoder (b). Specifically:
① MG3C (Multi-granularity chemical clusters classification): the structure classifier (dark blue) predicts the chemical structure information in molecular images;
② MRD (Molecular rationality discrimination): the rationality classifier (green) distinguishes between rational and irrational molecules;
③ JPP (Jigsaw puzzle prediction): the jigsaw classifier (light gray) predicts the reasonable arrangement of a molecule;
④ MCL (MASK-based contrastive learning): the contrastive classifier (dark gray) maximizes the similarity between the original image and the masked image;
⑤ MIR (Molecular image reconstruction): the generator (yellow) restores latent features to a molecular image, and the discriminator (purple) distinguishes real molecular images from fake ones produced by the generator.
(3) The pre-trained molecular encoder is fine-tuned on downstream tasks to further improve model performance (c).
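As a rough illustration of how inputs for two of the pre-training tasks above might be built, the sketch below prepares a jigsaw sample (JPP) and a masked view (MCL) from an image stored as a nested list. This is not the authors' implementation: the 3×3 grid, the permutation table, the square mask, and the function names `split_into_patches`, `assemble`, `make_jigsaw_sample`, and `mask_square` are all assumptions made for the example, and the image side is assumed to be divisible by the grid size.

```python
import random


def split_into_patches(image, grid=3):
    """Cut a square image (list of rows) into grid*grid patches, row-major."""
    step = len(image) // grid
    patches = []
    for pr in range(grid):
        for pc in range(grid):
            rows = image[pr * step:(pr + 1) * step]
            patches.append([row[pc * step:(pc + 1) * step] for row in rows])
    return patches


def assemble(patches, grid=3):
    """Inverse of split_into_patches: stitch patches back into one image."""
    step = len(patches[0])
    image = []
    for pr in range(grid):
        for r in range(step):
            row = []
            for pc in range(grid):
                row.extend(patches[pr * grid + pc][r])
            image.append(row)
    return image


def make_jigsaw_sample(image, permutations, rng, grid=3):
    """Shuffle patches with a randomly chosen permutation.

    Returns the shuffled image and the index of the permutation, which a
    jigsaw classifier would be trained to predict.
    """
    label = rng.randrange(len(permutations))
    patches = split_into_patches(image, grid)
    shuffled = [patches[i] for i in permutations[label]]
    return assemble(shuffled, grid), label


def mask_square(image, top, left, size, fill=0):
    """Return a copy of the image with a size*size square set to `fill`."""
    masked = [list(row) for row in image]
    for r in range(top, min(top + size, len(masked))):
        for c in range(left, min(left + size, len(masked[r]))):
            masked[r][c] = fill
    return masked


# Toy 6x6 "image" whose pixel values encode their own position.
image = [[r * 6 + c for c in range(6)] for r in range(6)]
# Hypothetical permutation table: identity plus one reversed ordering.
permutations = [list(range(9)), list(reversed(range(9)))]
shuffled, label = make_jigsaw_sample(image, permutations, random.Random(0))
masked = mask_square(image, top=2, left=2, size=2)
```

In a real pipeline the encoder would receive `shuffled` with `label` as the jigsaw target, while `image` and `masked` would form a pair whose embeddings the contrastive objective pulls together.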
The authors first evaluated ImageMol on 8 drug discovery benchmark datasets, using two of the most popular splitting strategies (scaffold split and random scaffold split) on all of them. For classification tasks, performance was measured with the receiver operating characteristic (ROC) curve and the area under the curve (AUC). The experimental results show that ImageMol obtains higher AUC values (Figure a).
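To make this evaluation protocol concrete, here is a minimal sketch of a scaffold-grouped split and a pairwise AUC computation. It assumes each molecule's scaffold string has already been computed elsewhere (for example with a cheminformatics toolkit). The 80/10/10 fractions, the largest-group-first ordering, and the function names `scaffold_split` and `roc_auc` are assumptions that follow common practice, not details taken from the paper.

```python
import random
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1, seed=None):
    """Split molecule indices so that no scaffold spans two subsets.

    seed=None: deterministic split, largest scaffold groups first.
    seed=int:  groups are shuffled first ("random scaffold split").
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    group_list = list(groups.values())
    if seed is None:
        group_list.sort(key=lambda g: (-len(g), g[0]))
    else:
        random.Random(seed).shuffle(group_list)

    n = len(scaffolds)
    train, valid, test = [], [], []
    for group in group_list:
        if len(train) + len(group) <= frac_train * n:
            train.extend(group)
        elif len(valid) + len(group) <= frac_valid * n:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: ten molecules with hypothetical scaffold identifiers.
scaffolds = ["A"] * 6 + ["B"] * 2 + ["C", "D"]
train, valid, test = scaffold_split(scaffolds)
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Grouping by scaffold keeps structurally similar molecules out of the test set, which makes the split harder than a purely random one and is why it is often preferred for judging generalization.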
On the HIV and Tox21 datasets, ImageMol achieves higher AUC values than Chemception, a classic convolutional neural network framework for predicting from molecular images (Figure b). The paper further evaluates ImageMol's performance in predicting drug metabolism by five major metabolizing enzymes: CYP1A2, CYP2C9, CYP2C19, CYP2D6 and CYP3A4. Figure c shows that, in predicting inhibitors versus non-inhibitors of these five enzymes, ImageMol achieves higher AUC values (ranging from 0.799 to 0.893) than three state-of-the-art molecular image-based representation models (Chemception, ADMET-CNN and QSAR-CNN).
The paper further compares ImageMol with three types of state-of-the-art molecular representation models, as shown in Figures d and e. Under random scaffold split, ImageMol outperforms fingerprint-based models (such as AttentiveFP), sequence-based models (such as TF_Robust), and graph-based models (such as N-GRAM, GROVER, and MPG). Furthermore, ImageMol achieves higher AUC values on CYP1A2, CYP2C9, CYP2C19, CYP2D6 and CYP3A4 than traditional MACCS-based and FP4-based methods (Figure f).
ImageMol is also compared with sequence-based models (including RNN_LR, TRFM_LR, RNN_MLP, TRFM_MLP, RNN_RF, TRFM_RF, and CHEM-BERT) and graph-based models (including MolCLRGIN, MolCLRGCN, and GROVER). As Figure g shows, ImageMol achieves better AUC performance on CYP1A2, CYP2C9, CYP2C19, CYP2D6, and CYP3A4.
These comparisons with other advanced models show the superiority of ImageMol.
Since the outbreak of COVID-19, there has been an urgent need to develop effective treatment strategies for it. The authors therefore also evaluated ImageMol in this area.
ImageMol was used for prediction experiments on 13 SARS-CoV-2 targets of current concern. On the SARS-CoV-2 bioassay datasets, ImageMol achieved high AUC values of 72.6% to 83.7%. Panel a shows the latent features identified by ImageMol, which cluster well into active and inactive anti-SARS-CoV-2 compounds across the 13 targets or endpoints. Its AUC values are more than 12% higher than those of another model, Jure's GNN, reflecting the model's high accuracy and strong generalization.
The experiment most directly related to drug molecule research is using ImageMol to identify inhibitor molecules directly. Using ImageMol's molecular image representations of inhibitors and non-inhibitors of 3CL protease (which has been shown to be a promising therapeutic target for COVID-19), the study found that 3CL inhibitors and non-inhibitors are well separated in the t-SNE plot, as shown in Figure b below.
In addition, ImageMol identified 10 of the 16 known 3CL protease inhibitors (a success rate of 62.5%) and visualized these 10 drugs in the embedding space in the figure, indicating high generalization ability in anti-SARS-CoV-2 drug discovery. When the HEK293 assay was used to predict anti-SARS-CoV-2 repurposed drugs, ImageMol successfully predicted 42 out of 70 drugs (a 60% success rate), indicating that it also generalizes well when inferring potential drug candidates in the HEK293 assay. Figure c below shows drugs in the DrugBank dataset that ImageMol identified as potential 3CL inhibitors. Panel d shows the molecular structures of the 3CL inhibitors discovered by ImageMol.
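The two success rates quoted above are simple ratios of hits to candidates. A short check, with `success_rate` as an illustrative helper name:

```python
def success_rate(hits, total):
    """Fraction of candidates that were correctly identified."""
    if total <= 0:
        raise ValueError("total must be positive")
    return hits / total


# Counts taken from the text above.
known_3cl = success_rate(10, 16)   # known 3CL protease inhibitors
repurposed = success_rate(42, 70)  # repurposed drugs in the second assay
print(f"{known_3cl:.1%}, {repurposed:.1%}")  # prints 62.5%, 60.0%
```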
ImageMol can obtain prior knowledge of chemical information from molecular image representations, including =O bonds, -OH bonds, -NH3 bonds and benzene rings. Panels b and c show 12 example molecules visualized with Grad-CAM on ImageMol. They indicate that ImageMol accurately attends to both global (b) and local (c) structural information. These results let researchers see visually how molecular structure affects properties and targets.
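Grad-CAM, the visualization method mentioned above, weights each feature map of a convolutional layer by the spatial average of the class score's gradient with respect to that map, sums the weighted maps, and applies a ReLU. Below is a minimal sketch of that combination step. It assumes the activations and gradients have already been extracted from the network; the function name `grad_cam` and the nested-list layout are illustrative, and the final rescaling to [0, 1] is a common display convention rather than part of the method.

```python
def grad_cam(activations, gradients):
    """Combine feature maps into a Grad-CAM heatmap.

    activations, gradients: lists of K channels, each an H x W list of
    floats, taken from the same convolutional layer.
    """
    height = len(activations[0])
    width = len(activations[0][0])
    cam = [[0.0] * width for _ in range(height)]

    for act, grad in zip(activations, gradients):
        # Channel weight: global average of the gradient over all positions.
        weight = sum(sum(row) for row in grad) / (height * width)
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * act[i][j]

    # ReLU keeps only regions that push the class score up.
    cam = [[max(v, 0.0) for v in row] for row in cam]

    # Rescale to [0, 1] for display (skipped if the map is all zeros).
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy example: two 2x2 channels with opposite-sign gradients.
activations = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
gradients = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
heatmap = grad_cam(activations, gradients)
```

In the toy example only the first channel has a positive weight, so the heatmap highlights the top-left position and suppresses the region where the second channel fires.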