SOTA performance, University of Washington developed Transformer model to convert mass spectra into peptide sequences, published in Nature sub-journal

王林

Aug 12, 2024 pm 04:06 PM

AI protein theory DNA

Editor | Radish Skin

A fundamental challenge in mass spectrometry-based proteomics is identifying the peptide that generated each tandem mass spectrum (MS/MS). Methods that rely on databases of known peptide sequences cannot detect unexpected peptides and may be impractical or inapplicable in some cases.

Thus, the ability to assign peptide sequences to MS/MS spectra without prior information (i.e., de novo peptide sequencing) is extremely valuable for tasks such as antibody sequencing, immunopeptidomics, and metaproteomics.

Although many methods have been developed to solve this problem, it remains an open challenge, partly because the irregular data structure of MS/MS spectra is difficult to model.

Here, researchers at the University of Washington describe Casanovo, a machine learning model that uses the Transformer neural network architecture to translate the sequence of peaks in an MS/MS spectrum into the sequence of amino acids of the peptide that generated it.

The team trained the Casanovo model on 30 million labeled spectra and demonstrated that the model outperformed several state-of-the-art methods on cross-species benchmark datasets.

The team also developed a version of Casanovo fine-tuned for non-enzymatic peptides. This tool improves the analysis of immunopeptidomics and metaproteomics experiments and enables scientists to delve deeper into the dark proteome.

The study was titled "Sequence-to-sequence translation from mass spectra to peptides with a transformer model" and was published in "Nature Communications" on July 31, 2024.

Mass spectrometry is the mainstream technology for proteome analysis, used to identify and quantify proteins in complex biological systems.

Tandem mass spectrometry (MS/MS) produces complex data, and converting these spectra into amino acid sequences is challenging.

Deep learning has become the method of choice for de novo peptide sequencing, but existing approaches face several limitations: a small number of annotated MS/MS spectra, difficulty in encoding high-resolution MS/MS data, and complex neural networks and post-processing steps.

Casanovo reframes de novo peptide sequencing as a machine translation problem: a Transformer takes the m/z and intensity value pairs of an MS/MS spectrum as input and directly outputs the predicted peptide sequence.

In the latest research, Casanovo has been improved in several ways:
- An expanded training set drawn from the 669 million spectra in the MassIVE-KB spectral library.
- Strict FDR control: the data were searched at 1% FDR and no more than 100 PSMs were retained for each unique precursor, for a total of 30 million high-quality PSMs.
- A beam search decoder that predicts the best peptide for each MS/MS spectrum.
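To illustrate why a beam search decoder can beat greedy decoding, here is a minimal generic sketch. It is not Casanovo's implementation: the `STOP` token, the `next_log_probs` callback and the toy model are all hypothetical, and any peptide-specific handling (such as precursor-mass checks) is left out.

```python
import math
from typing import Callable, Dict, List, Tuple

STOP = "$"  # hypothetical end-of-peptide token

Hypothesis = Tuple[Tuple[str, ...], float]  # (tokens, total log-probability)


def beam_search(
    next_log_probs: Callable[[Tuple[str, ...]], Dict[str, float]],
    beam_width: int = 3,
    max_len: int = 10,
) -> List[Hypothesis]:
    """Return finished hypotheses, best (highest log-probability) first."""
    beams: List[Hypothesis] = [((), 0.0)]
    finished: List[Hypothesis] = []
    for _ in range(max_len):
        # Extend every live hypothesis by every possible next token.
        candidates: List[Hypothesis] = []
        for prefix, score in beams:
            for token, logp in next_log_probs(prefix).items():
                candidates.append((prefix + (token,), score + logp))
        # Keep only the beam_width best extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == STOP:
                finished.append((seq[:-1], score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished


def toy_model(prefix: Tuple[str, ...]) -> Dict[str, float]:
    """Hypothetical next-token distribution standing in for a trained decoder."""
    if len(prefix) == 0:
        return {"A": math.log(0.6), "B": math.log(0.4)}
    if len(prefix) == 1:
        if prefix[0] == "A":
            return {"C": math.log(0.5), "D": math.log(0.5)}
        return {"C": math.log(0.9), "D": math.log(0.1)}
    return {STOP: 0.0}


greedy_best = beam_search(toy_model, beam_width=1)[0]
beam_best = beam_search(toy_model, beam_width=2)[0]
```

With a beam width of 1 the search is greedy and commits to the locally best first token, while a width of 2 keeps the alternative alive long enough to find a sequence with higher overall probability.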
## Casanovo: De novo peptide sequencing using the Transformer architecture

Figure 1: Casanovo performs de novo peptide sequencing using the Transformer architecture. (Source: Paper)

Casanovo's strong performance is attributed to two factors:

- A large amount of high-quality training data
- The Transformer architecture
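The training-data filtering described earlier (1% FDR, a cap on PSMs per unique precursor) can be sketched as follows. This is only an illustration under stated assumptions: the `PSM` record, the definition of a precursor as a (peptide, charge) pair, and ranking by a search-engine score are all hypothetical and not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class PSM(NamedTuple):
    """Hypothetical peptide-spectrum match record."""
    spectrum_id: str
    peptide: str
    charge: int
    q_value: float  # FDR level at which this PSM is accepted
    score: float    # search-engine score, higher is better (assumption)


def select_training_psms(
    psms: List[PSM], max_q: float = 0.01, max_per_precursor: int = 100
) -> List[PSM]:
    """Keep PSMs passing the FDR threshold, capped per unique precursor."""
    by_precursor: Dict[Tuple[str, int], List[PSM]] = defaultdict(list)
    for psm in psms:
        if psm.q_value <= max_q:
            # Assumption: a precursor is identified by peptide and charge.
            by_precursor[(psm.peptide, psm.charge)].append(psm)
    selected: List[PSM] = []
    for group in by_precursor.values():
        # Assumption: keep the highest-scoring PSMs when over the cap.
        group.sort(key=lambda p: p.score, reverse=True)
        selected.extend(group[:max_per_precursor])
    return selected


example = [
    PSM("s1", "PEPTIDEK", 2, 0.001, 50.0),
    PSM("s2", "PEPTIDEK", 2, 0.002, 40.0),
    PSM("s3", "PEPTIDEK", 2, 0.005, 30.0),
    PSM("s4", "PEPTIDEK", 2, 0.050, 10.0),  # fails the 1% FDR threshold
    PSM("s5", "ACDEFGHK", 3, 0.004, 35.0),
]
kept = select_training_psms(example, max_q=0.01, max_per_precursor=2)
```

Capping the number of PSMs per precursor keeps very abundant peptides from dominating the training set.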

Transformer architecture

The Transformer architecture is particularly well suited to placing the elements of variable-length sequences in context, and has therefore been highly successful in natural language modeling. Compared with recurrent neural networks, the Transformer architecture can learn long-distance dependencies between sequence elements and can be parallelized for efficient training.
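The mechanism behind both properties is self-attention. The sketch below shows scaled dot-product attention in plain Python; it is a simplification that omits the learned query, key and value projections and the multiple heads of a real Transformer layer.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(
    queries: List[Vector], keys: List[Vector], values: List[Vector]
) -> List[Vector]:
    """Scaled dot-product attention over a whole sequence."""
    d = len(keys[0])
    outputs: List[Vector] = []
    # Each query is processed independently of the others, which is
    # what allows the computation to be parallelized across positions.
    for q in queries:
        # Every position is compared with every other position in one step,
        # so distant elements interact as directly as neighbouring ones.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        outputs.append(
            [
                sum(w * v[j] for w, v in zip(weights, values))
                for j in range(len(values[0]))
            ]
        )
    return outputs


# Identical keys give uniform attention weights, so every output
# is the mean of the value vectors.
x = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
vals = [[0.0, 0.0], [3.0, 0.0], [6.0, 3.0]]
out = self_attention(x, x, vals)
```

A recurrent network would have to pass information through every intermediate step to relate the first and last elements; here the relation is a single weighted sum.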

Applications of Casanovo

Casanovo encodes the peaks of a mass spectrum as a sequence, taking advantage of the Transformer architecture and the rapid progress in large language models to improve de novo peptide sequencing of MS/MS spectra.
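One way to turn peaks into a sequence of vectors is to encode each m/z value with sinusoids of different wavelengths and append a normalized intensity. The following is a minimal sketch of that idea, not Casanovo's actual encoder: the function names, the embedding size, the wavelength range and the base-peak normalization are all assumptions chosen for illustration.

```python
import math
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def encode_mz(
    mz: float,
    dim: int = 8,
    min_wavelength: float = 0.001,   # assumed range, in m/z units
    max_wavelength: float = 10000.0,
) -> List[float]:
    """Sinusoidal encoding of one m/z value: dim/2 sines then dim/2 cosines."""
    if dim < 2 or dim % 2 != 0:
        raise ValueError("dim must be a positive even number")
    n = dim // 2
    ratio = max_wavelength / min_wavelength
    # Wavelengths are spaced geometrically between the two bounds.
    wavelengths = [
        min_wavelength * ratio ** (i / (n - 1) if n > 1 else 0.0)
        for i in range(n)
    ]
    sines = [math.sin(2 * math.pi * mz / w) for w in wavelengths]
    cosines = [math.cos(2 * math.pi * mz / w) for w in wavelengths]
    return sines + cosines


def encode_spectrum(peaks: List[Peak], dim: int = 8) -> List[List[float]]:
    """Encode a spectrum as one vector per peak, ordered by m/z."""
    base = max(intensity for _, intensity in peaks)
    # Assumption: intensities are scaled relative to the most intense peak.
    return [
        encode_mz(mz, dim) + [intensity / base]
        for mz, intensity in sorted(peaks)
    ]


spectrum = [(300.2, 50.0), (147.1, 200.0), (500.3, 100.0)]
encoded = encode_spectrum(spectrum, dim=8)
```

Because each peak becomes a fixed-length vector, spectra with different numbers of peaks simply become sequences of different lengths, which a Transformer handles natively.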

Application scenarios:

- Paleoproteomics
- Forensic medicine
- Astrobiology
- Detection of peptides not present in the database
- Post-processing for standard database searches

Antibody sequencing

The Casanovo team has not yet explored antibody sequencing. However, a study by Denis Beslic's group at BAM in Germany systematically compared six de novo sequencing tools, including Casanovo, on the antibody sequencing problem.

Figure: Overall recall and precision of Novor, pNovo 3, DeepNovo, SMSNet, PointNovo and Casanovo for different enzymes on IgG1-Human-HC.

Related links:
https://academic.oup.com/bib/article/24/1/bbac542/6955273?login=false
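For readers unfamiliar with the recall and precision reported in such comparisons, here is a simplified sketch of amino-acid-level metrics. It is an assumption-laden illustration: residues are matched by exact identity at the same position, whereas published benchmarks typically match residues by mass within a tolerance, and the function names are hypothetical.

```python
from typing import List, Tuple


def aa_matches(predicted: str, truth: str) -> int:
    """Count positions where the predicted and true residues agree."""
    return sum(p == t for p, t in zip(predicted, truth))


def aa_precision_recall(pairs: List[Tuple[str, str]]) -> Tuple[float, float]:
    """Amino-acid precision and recall over (predicted, truth) pairs.

    precision = matched residues / predicted residues
    recall    = matched residues / true residues
    """
    matched = sum(aa_matches(p, t) for p, t in pairs)
    n_pred = sum(len(p) for p, _ in pairs)
    n_true = sum(len(t) for _, t in pairs)
    precision = matched / n_pred if n_pred else 0.0
    recall = matched / n_true if n_true else 0.0
    return precision, recall


pairs = [
    ("PEPTIDE", "PEPTIDE"),  # fully correct
    ("PEPTLDK", "PEPTIDK"),  # one wrong residue
    ("", "ACDK"),            # no prediction made for this spectrum
]
precision, recall = aa_precision_recall(pairs)
```

Spectra without a prediction lower recall but not precision, which is why the two numbers are reported together.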

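Results:

Casanovo clearly outperformed the competing methods on all metrics considered. Notably, this comparison used the greedy-decoding version of Casanovo, which was trained on only 2 million spectra.

Evaluation:

The Casanovo team evaluated Casanovo on the nine-species benchmark. The figure below shows that the updated version of Casanovo, trained on 30 million spectra, yields better antibody sequencing performance.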

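Figure: Casanovo outperforms models such as PointNovo, DeepNovo and Novor on the nine-species benchmark. (Source: Paper)

In the future, there will be many opportunities to fine-tune the Casanovo model for specific applications. The researchers' analysis of the non-enzymatic model shows that Casanovo's enzymatic bias can be adjusted using a relatively small amount of training data.

Therefore, in the short term, the team plans to train Casanovo variants for a variety of cleavage enzymes. The Casanovo software makes this kind of fine-tuning straightforward, so any user interested in adapting the model to a specific experimental setup should be able to do so.

In the long run, the ideal model would take the spectrum together with associated metadata (such as digestion enzyme, collision energy and instrument type) as input and make accurate predictions across many different types of experimental setups.

The potential of deep learning methods to improve de novo sequencing is now widely recognized. While the paper was under review, at least six other deep learning de novo sequencing methods were published, including GraphNovo, PepNet, Denovo-GCN, Spectralis, π-HelixNovo and NovoB. Clearly, the field would benefit from a comprehensive and rigorous benchmark comparison of this growing collection of tools.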

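Relatedly, one of the main bottlenecks in the field at this stage is the lack of a rigorous method for assessing the confidence of de novo sequencing results.

In the metaproteomics analysis, the researchers matched Casanovo predictions against a target peptide database and a corresponding decoy database, but this approach ignores the ability of de novo sequencing to assign peptides to foreign spectra, that is, spectra generated by peptides absent from the database.

Therefore, an open question is whether, for a given data-dependent acquisition (DDA) dataset, Casanovo has greater statistical power to detect peptides than a standard database search procedure.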
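The target-decoy matching mentioned above is commonly used to estimate a false discovery rate. The sketch below shows the basic idea under simple assumptions; it does not reproduce the paper's procedure. Each match is a hypothetical `(score, is_decoy)` pair, the FDR among accepted matches is estimated as the number of decoys divided by the number of targets, and ties in score are ignored.

```python
from typing import List, Tuple

Match = Tuple[float, bool]  # (score, is_decoy); higher score is better


def targets_accepted_at_fdr(matches: List[Match], max_fdr: float = 0.01) -> int:
    """Largest number of target matches whose estimated FDR is <= max_fdr."""
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    targets = 0
    decoys = 0
    accepted = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among everything ranked at or above this match.
        if targets and decoys / targets <= max_fdr:
            accepted = targets
    return accepted


matches = [
    (10.0, False),
    (9.0, False),
    (8.0, False),
    (7.0, True),   # a prediction that matched the decoy database
    (6.0, False),
    (5.0, False),
]
strict = targets_accepted_at_fdr(matches, max_fdr=0.01)
lenient = targets_accepted_at_fdr(matches, max_fdr=0.25)
```

A prediction that matches neither database is simply discarded by this procedure, which is why it cannot credit de novo sequencing for peptides that lie outside the database.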

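The researchers suggest that, with a sufficiently large training set, it may be possible to end the dominance of database search in the analysis of DDA tandem mass spectrometry data.

Paper link: https://www.nature.com/articles/s41467-024-49731-x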
