
Llama molecular embeddings beat GPT's: can LLMs understand molecules? Meta defeats OpenAI in this round

WBOY
Release: 2024-07-16 13:33:18


Editor | Radish Skin

Large language models (LLMs) such as OpenAI's GPT and Meta AI's Llama are increasingly recognized for their potential in cheminformatics, especially for understanding the Simplified Molecular Input Line Entry System (SMILES). These LLMs can also encode SMILES strings into vector representations.
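
To make this concrete, here is a minimal Python sketch, not the authors' exact pipeline, of turning a single SMILES string into a vector with each family of models: the OpenAI embeddings endpoint for a GPT-style embedding, and mean-pooled hidden states from a Hugging Face Llama checkpoint for a Llama-style embedding. The model names, the aspirin example, and the mean-pooling choice are illustrative assumptions, not details taken from the paper.

# Sketch: one SMILES string -> two fixed-size vectors.
# Assumes the `openai` and `transformers` packages are installed, an
# OPENAI_API_KEY is set, and access to a Llama checkpoint on the Hub.

import torch
from openai import OpenAI
from transformers import AutoModel, AutoTokenizer

smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"  # aspirin, as an example molecule

# GPT-style embedding via the OpenAI embeddings endpoint (illustrative model name)
client = OpenAI()
resp = client.embeddings.create(model="text-embedding-ada-002", input=smiles)
gpt_vec = resp.data[0].embedding          # a plain Python list of floats

# Llama-style embedding: mean-pool the last hidden states over the token axis
model_name = "meta-llama/Llama-2-7b-hf"   # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

with torch.no_grad():
    tokens = tokenizer(smiles, return_tensors="pt")
    hidden = model(**tokens).last_hidden_state   # shape (1, seq_len, hidden_dim)
    llama_vec = hidden.mean(dim=1).squeeze(0)    # one vector per molecule

print(len(gpt_vec), llama_vec.shape)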

Researchers at the University of Windsor in Canada compared GPT- and Llama-generated embeddings of SMILES strings against models pre-trained on SMILES, evaluating them on downstream tasks with a focus on two key applications: molecular property prediction and drug-drug interaction (DDI) prediction.
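
The downstream evaluation idea can be sketched in a few lines: treat precomputed SMILES embeddings as fixed feature vectors, fit an ordinary classifier on top, and score it with a metric such as ROC-AUC. The random arrays below are placeholders standing in for real embeddings and labels; the paper's actual datasets and protocol are described in the publication.

# Sketch: embeddings as features for a binary molecular-property classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4096))    # placeholder for 1000 molecule embeddings
y = rng.integers(0, 2, size=1000)    # placeholder for a binary property label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("ROC-AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))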

The study was titled "Can large language models understand molecules?" and was published in "BMC Bioinformatics" on June 25, 2024.


1. Application of molecular embedding in drug discovery

Molecular embedding is a crucial task in drug discovery, widely used in molecular property prediction, drug-target interaction (DTI) prediction, drug-drug interaction (DDI) prediction, and other related tasks.

2. Molecular embedding technology

Molecular embedding techniques learn features either from molecular graphs, which encode a molecule's structural connectivity, or from line notations of its structure, such as the popular SMILES representation.
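
As a quick illustration of what a SMILES line notation encodes (using RDKit, a cheminformatics toolkit the article itself does not mention), the short string below carries the same connectivity information as the corresponding molecular graph:

# Sketch: a SMILES string parsed into a molecular graph and back.
from rdkit import Chem

mol = Chem.MolFromSmiles("c1ccccc1O")   # phenol written as SMILES
print(mol.GetNumAtoms())                # 7 heavy atoms in the underlying graph
print(Chem.MolToSmiles(mol))            # RDKit's canonical SMILES for the same molecule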

3. Molecular embeddings from SMILES strings

Molecular embeddings via SMILES strings have evolved in tandem with advances in language modeling, from static word embeddings to contextualized pre-trained models. These embedding techniques aim to capture relevant structural and chemical information in a compact numerical representation.


Illustration: Medicinal chemistry representation. (Source: Paper)

The basic assumption is that molecules with similar structures behave in similar ways. This enables machine learning algorithms to process and analyze molecular structures for property prediction and drug discovery tasks.
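
A toy sketch of that assumption: if an embedding captures structure, then structurally related molecules should receive nearby vectors, for example under cosine similarity. The four-dimensional vectors below are invented purely for illustration; real LLM embeddings have hundreds or thousands of dimensions.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented toy "embeddings" for three molecules.
ethanol  = np.array([0.9, 0.1, 0.3, 0.2])
propanol = np.array([0.8, 0.2, 0.3, 0.1])   # structurally similar alcohol
benzene  = np.array([0.1, 0.9, 0.0, 0.7])   # structurally different aromatic

print(cosine_similarity(ethanol, propanol))  # high: similar structures
print(cosine_similarity(ethanol, benzene))   # lower: different structures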

With breakthroughs in LLMs, a prominent question is whether an LLM can understand molecules and make inferences based on molecular data.

More specifically, can LLMs produce high-quality semantic representations?

Shaghayegh Sadeghi, Alioune Ngom, Jianguo Lu, and colleagues at the University of Windsor further explored the ability of these models to embed SMILES effectively. This capability is currently underexplored, perhaps in part because of the cost of API calls.

Researchers found that SMILES embeddings generated using Llama performed better than SMILES embeddings generated using GPT in both molecular property and DDI prediction tasks.


Illustration: Results of classification and regression tasks. (Source: Paper)

Notably, Llama-based SMILES embeddings achieve results comparable to models pre-trained on SMILES in the molecular property prediction tasks, and outperform them in the DDI prediction task.

From this, the team drew the following conclusions:

(1) LLMs do perform better than traditional methods.
(2) Performance depends on the task, and sometimes on the data.
(3) Even when trained on a more general task, newer versions of an LLM improve over older versions.
(4) Llama embeddings are generally better than GPT embeddings.
(5) Llama and Llama2 are observed to be very close in embedding performance.


Illustration: Llama and Llama2 performance comparison. (Source: Paper)

Overall, this study highlights the potential of LLMs such as GPT and Llama for molecular embedding.
The team specifically recommends Llama models over GPT due to their superior performance in generating molecular embeddings from SMILES strings. These findings suggest that Llama may be particularly effective at predicting molecular properties and drug interactions.
While models like Llama and GPT are not specifically designed for SMILES string embedding (unlike specialized models like ChemBERTa and MolFormer-XL), they still demonstrate competitiveness. This work lays the foundation for future improvements in LLM molecular embedding.
In the future, the team will focus on improving the quality of LLM molecular embeddings inspired by natural language sentence embedding techniques, such as fine-tuning and modifications to Llama tokenization.
GitHub: https://github.com/sshaghayeghs/LLaMA-VS-GPT
Paper link: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-024-05847-x
