The application of artificial intelligence has greatly accelerated research on protein engineering.
Recently, a fledgling startup in Berkeley, California, once again made amazing progress.
Scientists used ProGen, a deep learning language model for protein engineering that is similar to ChatGPT, to have an AI design new proteins from scratch for the first time.
These proteins are not only very different from known ones, with sequence similarity as low as 31.4%, but also as effective as natural proteins.
Now, this work has been officially published in a Nature sub-journal.
Paper address: https://www.nature.com/articles/s41587-022-01618-2
This experiment also shows that although natural language processing was developed for reading and writing language text, it can also learn some basic principles of biology.
The researchers said that this new technology may become more powerful than directed evolution, the Nobel Prize-winning protein design technology.
"It will revitalize the 50-year-old field of protein engineering by accelerating the development of new proteins that can be used in virtually everything from therapeutics to degrading plastics."
The company is called Profluent. It was founded by the former head of Salesforce AI research and has received US$9 million in start-up funding, which is being used to set up an integrated wet lab and recruit machine learning scientists and biologists.
In the past, it was very laborious to mine proteins from nature or to adapt proteins to a required function. Profluent's goal is to make this process effortless.
They did it.
Profluent founder and CEO Ali Madani
Madani said in the interview that Profluent has designed multiple families of proteins. These proteins function like exemplar proteins and are therefore highly active enzymes.
The task is very difficult and was done in a zero-shot manner, meaning that there were no multiple rounds of optimization and no wet-lab data was provided to the model at all.
The result is highly active proteins of a kind that would usually take hundreds of years to evolve.
As a kind of deep neural network, a conditional language model can not only generate semantically and grammatically correct, novel, and diverse natural language text, but can also use input control tags to guide style, topic, and more.
Similarly, the researchers developed the protagonist of today's story: ProGen, a conditional protein language model with 1.2 billion parameters.
Specifically, ProGen is based on the Transformer architecture, models the interactions between residues through a self-attention mechanism, and can generate diverse artificial protein sequences across protein families based on the input control tags.
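To make the idea of control tags concrete, here is a minimal sketch of how a conditional protein language model could frame its input. It is not ProGen's actual tokenizer: the tag names, the vocabulary layout, and the helper functions are all hypothetical and chosen only for illustration. The point is that the tag sits at the front of the sequence, so every residue is predicted given the tag and all previous residues.

```python
# Illustrative sketch only: tag names and vocabulary are assumptions,
# not ProGen's real tokenizer.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
CONTROL_TAGS = ["<lysozyme_family_A>", "<lysozyme_family_B>"]  # made-up tags

VOCAB = CONTROL_TAGS + list(AMINO_ACIDS) + ["<eos>"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def encode(control_tag, sequence):
    """Prepend the control tag so every residue is conditioned on it."""
    tokens = [control_tag] + list(sequence) + ["<eos>"]
    return [TOKEN_TO_ID[t] for t in tokens]


def training_pairs(token_ids):
    """Next-token prediction: each prefix is the context for the token after it."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


ids = encode("<lysozyme_family_A>", "MKAL")
pairs = training_pairs(ids)
for context, target in pairs:
    print([VOCAB[i] for i in context], "->", VOCAB[target])
```

At generation time, a model trained this way would be given only the tag (and optionally a few starting residues) and would sample the rest of the sequence one residue at a time.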
Generating artificial proteins using conditional language models
To create this model, the researchers fed it the amino acid sequences of 280 million different proteins and let it "digest" them for several weeks.
They then fine-tuned the model using 56,000 sequences from five lysozyme families and information about these proteins.
ProGen's algorithm is similar to GPT-3.5, the model behind ChatGPT. It learns the ordering rules of amino acids in proteins and their relationship with protein structure and function.
Soon, the model generated a million sequences.
The researchers selected 100 for testing based on their similarity to natural protein sequences and the naturalness of their amino acid "syntax" and "semantics."
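The article does not spell out how "naturalness" was scored, so the following is only a toy stand-in for the general idea of ranking generated sequences by how likely they look under a model of natural sequences. Here the "model" is a simple residue-frequency table with add-one smoothing, which is far cruder than a language model; all function names are hypothetical.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def residue_log_probs(natural_seqs, alphabet=ALPHABET):
    """Estimate per-residue log-probabilities from natural sequences (add-one smoothing)."""
    counts = Counter(ch for seq in natural_seqs for ch in seq)
    total = sum(counts[a] for a in alphabet) + len(alphabet)
    return {a: math.log((counts[a] + 1) / total) for a in alphabet}


def mean_log_likelihood(seq, log_probs):
    """Average log-probability per residue; higher means 'more natural' under the toy model."""
    return sum(log_probs[ch] for ch in seq) / len(seq)


def top_k(candidates, log_probs, k):
    """Keep the k candidates with the highest average log-likelihood."""
    ranked = sorted(candidates, key=lambda s: mean_log_likelihood(s, log_probs), reverse=True)
    return ranked[:k]


natural = ["AAAK", "AAKK"]              # toy "natural" set
log_probs = residue_log_probs(natural)
selected = top_k(["AAAA", "WWWW", "AKAK"], log_probs, k=2)
print(selected)
```

In a realistic pipeline the frequency table would be replaced by the language model's own per-residue likelihoods, and the score would be combined with a similarity filter against known sequences.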
Of these, 66 produced chemical reactions similar to those of the natural proteins that destroy bacteria in egg whites and saliva.
In other words, these new proteins generated by AI can also kill bacteria.
The artificial proteins generated are diverse and well expressed in experimental systems
Going a step further, the researchers selected the five proteins that reacted most strongly and added them to samples of E. coli.
Two of these artificial enzymes were able to break down bacterial cell walls.
Compared with hen egg white lysozyme (HEWL), their activity was equivalent.
The researchers then used X-rays for imaging.
Although the amino acid sequences of the artificial enzymes differ from existing proteins by up to 30%, and the two share only 18% of their sequence with each other, their shapes are not that different from those of natural proteins, and their functions are comparable.
Applicability of conditional language modeling to other protein systems
Besides, for a highly evolved natural protein, it may only take a small mutation to stop it from working.
But the researchers found in another round of screening that even when only 31.4% of the sequence of an AI-generated enzyme was identical to known proteins, it still showed considerable activity and a similar structure.
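Figures such as "31.4% identical" are typically computed from a pairwise alignment. As a rough sketch, assuming the two sequences have already been aligned (with "-" marking gaps), percent identity can be computed as matching columns divided by aligned columns. Note that conventions differ on the denominator, and the article does not say which one was used; the function name and example sequences below are made up.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned columns where both sequences carry the same residue.

    Assumes the inputs are already aligned to equal length. Columns where
    both sequences have a gap are ignored; other gap columns count as mismatches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == gap and b == gap)]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)


# Toy aligned pair: 4 matching columns out of 6.
identity = percent_identity("MKAL-G", "MKVLSG")
print(f"{identity:.1f}% identity")
```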
As you can see, the way ProGen works is very similar to ChatGPT.
ChatGPT can take MBA and bar exams and write college papers by studying massive data.
ProGen, in turn, learned to generate new proteins by learning the grammar of how amino acids are combined in 280 million existing proteins.
In the interview, Madani said, "Just like ChatGPT learns human languages such as English, we are learning the language of biology and proteins."
"Artificially designed proteins perform much better than proteins inspired by evolutionary processes," said James Fraser, a co-author of the paper and professor of bioengineering and therapeutic sciences at the UCSF School of Pharmacy.
"Language models are learning aspects of evolution, but it is different from the normal evolutionary process. We now have the ability to tune the generation of these properties for specific effects. For example, an enzyme that is incredibly thermally stable, or prefers acidic environments, or doesn't interact with other proteins."
Back in 2020, Salesforce Research developed ProGen. It is based on natural language processing techniques that were originally developed to generate English text.
From previous work, researchers know that artificial intelligence systems can teach themselves grammar and word meanings, as well as other basic rules that make writing organized.
"When you train sequence-based models with large amounts of data, they are very powerful at learning structures and rules," said Dr. Nikhil Naik, director of artificial intelligence research at Salesforce Research and senior author of the paper. "They will understand which words can appear together and how to combine them."
"Now, we have demonstrated ProGen's ability to generate new proteins and released it publicly, so everyone can build on our research."
Lysozyme is a very small protein, yet it contains up to about 300 amino acids.
But with 20 possible amino acids, there are 20^300 possible combinations.
This is more than all human beings throughout the ages multiplied by the number of grains of sand on the earth, multiplied by the number of atoms in the universe.
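The size of this number is easy to check. The sketch below computes 20^300 exactly and compares its order of magnitude with the product mentioned above. The three estimates (people who have ever lived, grains of sand, atoms in the observable universe) are rough, commonly quoted order-of-magnitude figures used here as assumptions, not values from the paper.

```python
import math

# Number of possible sequences of length 300 over 20 amino acids.
n_sequences = 20 ** 300
digits = len(str(n_sequences))
log10_sequences = 300 * math.log10(20)

# Rough order-of-magnitude assumptions (not from the paper):
humans_ever = 1e11      # people who have ever lived
sand_grains = 1e19      # grains of sand on Earth
atoms_universe = 1e80   # atoms in the observable universe
log10_product = (math.log10(humans_ever)
                 + math.log10(sand_grains)
                 + math.log10(atoms_universe))

print(f"20^300 has {digits} digits (about 10^{log10_sequences:.1f})")
print(f"humans x sand x atoms is about 10^{log10_product:.0f}")
```

Even with generous estimates, the product is around 10^110, while 20^300 is roughly 2 x 10^390, so the comparison holds with hundreds of orders of magnitude to spare.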
Given the near-infinite possibilities, it's truly remarkable that ProGen was able to design effective enzymes so easily.
"The ability to generate functional proteins from scratch, right out of the box, shows that we are entering a new era of protein design," said Dr. Ali Madani, founder of Profluent Bio and former research scientist at Salesforce Research.
"This is a versatile new tool available to all protein engineers, and we look forward to seeing it applied to therapeutics."
At the same time, researchers continue to improve ProGen, trying to break through more limitations and challenges.
One of them is that it relies heavily on data.
"We have explored ways to improve sequence design by adding structure-based information," Naik said. "We are also looking at how to improve the model's generation capabilities when you don't have much data for a particular protein family or domain."
It is worth noting that some other startups are also trying similar technologies, such as Cradle and Generate Biomedicines, which comes from the biotechnology incubator Flagship Pioneering, but their studies have not yet been peer-reviewed.