Produced by Huxiu Technology Group
Author|Qi Jian
Editor|Chen Yifan
Header picture|FlagStudio
"Will OpenAI open source large models again?"
When Zhang Hongjiang, Chairman of the Zhiyuan Research Institute, asked OpenAI CEO Sam Altman, who attended the 2023 Zhiyuan Conference online, about open source issues, Sam Altman smiled and said, OpenAI will open more codes in the future. But there is no specific open source timetable.
This discussion comes from one of the topics of this Zhiyuan conference-Open source large model.
On June 9, the 2023 Zhiyuan Conference was held in Beijing, and all seats were packed. At the conference site, AI-related words such as "computing power", "large model" and "ecology" appeared from time to time in the chats of participants, as well as various companies in this industry chain.
At this conference, Zhiyuan Research Institute released the comprehensive open source Wu Dao 3.0. Including visual large model series "Vision", language large model series "Sky Eagle", and original large model evaluation system "Libra" .
Open source for large models means making the model code public for AI developers to study. The "Sky Eagle" basic layer language model in Wu Dao 3.0 is still a commercially available model, and everyone can use this large model for free.
Currently, Microsoft’s in-depth partners OpenAI, Google and BAAI are the three institutions at the forefront of the field of artificial intelligence. "In a recent interview, Microsoft President Brad Smith mentioned BAAI, the "strongest" AI research institution in China, which is as famous as OpenAI and Google. This institution is the Beijing Zhiyuan Artificial Intelligence Research Institute. Many people in the industry believe that , the artificial intelligence conference hosted by this institute is a benchmark for industry trends.
Zhiyuan Research Institute, which is highly recognized by the president of Microsoft, has launched the AI large model "Enlightenment" project as early as October 2020, and has successively released two versions of the Enlightenment model 1.0 and 2.0. Among them, the officially announced parameter scale of Enlightenment 2.0 reaches 1.7 trillion. At that time, it had only been a year since OpenAI released the 175 billion-parameter GPT-3 model.
However, such a pioneer of large AI models has been extremely low-key during the AI large model craze in the past six months.
While big models are emerging one after another among large manufacturers and start-up companies, Zhiyuan has remained "silent" to the outside world for more than three months, except for the "SAM" that collided with Meta's cutout AI "SAM" in early April. SegGPT”, revealing almost nothing about the large AI model to the public.
In this regard, many people inside and outside the AI industry have questions. Why does Zhiyuan Research Institute, a leader in the field of AI large models, seem to be late in the climax of large models?
Will the open source model dismantle OpenAI’s moat?
"Although the competition for large models is fierce now, neither OpenAI nor Google has a moat, because "open source" is rising in the field of AI large models."
In a document leaked by Google, internal Google researchers believe that open source models may lead the future of large model development. The document mentions that "Open source models have faster iterations and are more customizable. Stronger and more private, people will not pay for restricted models when the free, unrestricted alternatives are of equal quality. " This may also be one of the reasons why Zhiyuan chose to develop open source large models one.
At present, there are not many open source commercial large models. Zhiyuan Research Institute conducted a survey on some of the AI large models that have been released. Among the 39 open source language large models released abroad, the commercially available large models include 16. Among the 28 major language models released in China, a total of 11 are open source models, but only one of them is an open source and commercially available model.
The large language model released by Zhiyuan this time is open source and commercially available. It is also one of the few open source large language models currently available for commercial use. This also determines that such a model needs to be more cautious before releasing it.
"As far as Zhiyuan is concerned, we definitely don't want the open source model to be too ugly, so we will release it with caution." An AI researcher at the Zhiyuan conference said that open source models will inevitably have to be repeatedly verified and have bugs picked up by a large number of developers. In order to ensure the quality of the open source model, Zhiyuan's research and development progress may have been slowed down by "open source".
Huang Tiejun, President of Zhiyuan Research Institute, believes that the current open source and openness of large models in our country's market is far from enough. "We should further strengthen open source and open source. Open source and open source are also competitions. There are really good standards and good algorithms. Only by evaluating and comparing can we prove our technical level.”
There is a lack of transparency when domestic manufacturers release large models, and many people doubt whether these manufacturers have truly conducted independent research and development. Some people say that they call ChatGPT via API, while others say that they are trained using the answer data of ChatGPT, the LLaMA model leaked by Meta. The open source model cuts off these doubts from the source.
However, the open source model and improving technical transparency are not to prove one's innocence, but to really "concentrate efforts to do big things." According to Zhiyuan data, the daily training cost of Tianying Big Language Model is more than 100,000 yuan. Under the general trend of the domestic "War of One Hundred Models" or even "War of Thousands of Models", many industries are The repeated expenses caused by a large number of unnecessary repeated trainings may be astronomical.
And open source models can reduce repeated training. For companies with model needs, directly using large open source commercial AI models and combining their own data for training may be the best solution for AI implementation and industry applications. .
Another consideration of open source is to accumulate users and developers in the early stage in order to build a good ecosystem and achieve future commercialization. A founder of a large domestic model company told Huxiu, “OpenAI’s GPT-1 and GPT-2 are both open source large models. This is to accumulate users and improve the recognition of the model. Once the model capabilities of GPT-3 are fully Display, commercialization will become the focus of consideration, and this model will gradually become closed. Therefore, open source models generally will not allow commercial use, which is also due to subsequent commercialization considerations."
But obviously, as a non-profit research institution, Zhiyuan has no commercial considerations when it comes to open source issues. For Zhiyuan, in terms of model open source, on the one hand, it hopes to promote scientific research and innovation in the AI large model industry and accelerate industrial implementation by opening up open source such as underlying models. On the other hand, perhaps we also want to accumulate more user feedback based on open source models and improve the usability of large models in engineering.
However, model open source is not "perfect".
An AI technical director of a major factory told Huxiu that the current commercialization market for large AI models can be divided into three tiers. The first tier is the leading players who are fully capable of self-developed models, and the second tier is those who need to For enterprises that train proprietary models based on specific scenarios, the third layer is for small and medium-sized customers who only need general model capabilities and can use API calls to meet their needs.
In this context, open source models can help leading players with self-research capabilities save a lot of time and cost in developing models. But for second- and third-tier companies, they need to set up their own technical teams to train and tune the models. For many companies with less technical strength, this will make the implementation process more difficult. It gets more complicated, and open source seems to have a bit of a "free stuff is the most expensive" feel to them.
This "enlightenment" is no longer that "enlightenment"
Chiyuan’s Enlightenment 3.0 is a completely redeveloped large-scale model series. This is also one of the reasons for its “late release”.
Now that we have the foundation of Enlightenment 2.0, why does Zhiyuan need to develop a new model system?On the one hand, it is the adjustment of the technical direction of the model, and on the other hand, it is due to the "replacement" of the underlying training data of the model.
"The development of Wu Dao 2.0 will be in 2021, so whether it is a language model (such as GLM) or a Vincentian graph model (such as CogView), the algorithm architecture it is based on is relatively early from now on. In the past year Many, the model architecture in related fields has been more verified or evolved. For example, the decoder only architecture used in the language model has been confirmed that with higher quality data, it can be obtained in the basic model of large-scale parameters. Better generation performance. In the text graph model, we switched to difussion-based for further innovation. So in Wu Dao 3.0, weadopted these updates for the large language model, the large text graph generation model, etc. Lin Yonghua, vice president and chief engineer of Zhiyuan Research Institute, said that based on the research of past models, Wu Dao 3.0 has been reconstructed in many directions. In addition, Wudao 3.0 has also comprehensively optimized and upgraded the training data of the underlying model. The training data uses the updated Wudao Chinese data, including from 2021 to the present, and has undergone more stringent quality cleaning; on the other hand, , a large amount of high-quality Chinese has been added, including Chinese books, literature, etc.;
In addition, high-quality code data sets have been added, so the basic model has also undergone great changes.The underlying model training data is not native Chinese, causing many domestic models to have problems with Chinese understanding capabilities. Many large-scale AI models at home and abroad use massive open source data from abroad for training. Major sources include the famous open source dataset Common Crawl.
Zhiyuan analyzed 1 million Common Crawl web page data and
can extract 39,052 Chinese web pages. From the perspective of website sources, there are 25,842 websites that can extract Chinese, of which only 4,522 have IPs in mainland China, accounting for only 17%.This not only greatly reduces the accuracy of Chinese data, but also reduces security. "The corpus used to train the basic model will largely affect the compliance, security and values generated by AIGC applications, fine-tuned models and other content." Lin Yonghua said that the Chinese ability of the Tianying basic model is not a simple translation; By "pressing enough Chinese knowledge into this model", 99% of its Chinese Internet data comes from domestic websites, and companies can safely conduct continuous training based on it. At the same time, through a large amount of refined processing and cleaning of data and numbers, a model with the same or even better performance can be trained with a small amount of data. This data can even be as low as 30% or 40% of the data amount. It can catch up with or exceed the existing open source model. Now it seems that this path may be a better solution for Zhiyuan. Because in terms of training data, Zhiyuan has shortcomings compared with Internet manufacturers. Large Internet companies have rich user interaction data and a large amount of copyright data for training. Not long ago, Alibaba Damo Academy just released a video language data set Youku-mPLUG. All the content in it comes from Youku, a video platform owned by Alibaba. Since Zhiyuan does not have a deep user base, in terms of training data, it can only obtain authorization through negotiation with the copyright owner, and collect and accumulate bit by bit through some public welfare data projects. However, at present, Zhiyuan’s Chinese data set can only be partially open source. The main reason is that the copyright of Chinese data is scattered in the hands of various institutions. At present, Zhiyuan’s training data is obtained through the coordination of multiple parties. The open source model studies open access. Most of the data can only be applied to Zhiyuan's models and does not have the right to be used for secondary use. "It is very necessary in China to establish an industrial alliance for data sets, unite copyright holders, and conduct unified planning of training data for artificial intelligence, but this requires the wisdom of top-level design." Lin Yonghua told Huxiu. Whampoa Military Academy in the domestic large model industry Wudao 3.0 is telling a different story from Wudao 2.0, and the changes in the R&D team are one of them. As a pioneer in the AI large model industry, Zhiyuan Research Institute is like the Whampoa Military Academy of domestic AI large models. From Zhiyuan scholars to grassroots engineers, they have all become popular in the industry in today's large-scale model craze. The original team of Zhiyuan has also incubated several large-scale entrepreneurial teams. Before Wu Dao 3.0, a large model series was a combination of research results jointly released by multiple external laboratories, but this time Wu Dao 3.0 is a series of models completely self-developed by the Zhiyuan team. Wudao 2.0 model was released in 2021, including Wenyuan, Wenlan, Wenhui and Wensu. Among them, the two core models were completed by two laboratories of Tsinghua University. Today, the two teams have founded their own companies and developed their own independent products in the research and development direction of CPM and GLM. Among them, Tsinghua University’s Knowledge Engineering Laboratory (KEG), the main R&D team of GLM, launched the open source model ChatGLM-6B together with Zhipu AI, and has been widely recognized by the industry; Tsinghua University, the main R&D team of CPM Shenyan Technology, which is composed of some members of the Natural Language Processing and Social Humanities Computing Laboratory (THUNLP) of the Department of Computer Science, has been favored by various capitals since its establishment one year ago. Tencent Investment and Sequoia Investment appeared in the two rounds of financing this year. China, Qiji Chuangtan and other funds are present. Someone close to Zhiyuan Research Institute told Huxiu that Since the rise of large-scale AI models in China, the Zhiyuan team has become the “hunting target” of the talent war. “The entire R&D team is being targeted by other companies or headhunters. superior". At present, in the domestic AI large model industry, what is most in short supply is money, and what is most in short supply is people. Search ChatGPT on the three platforms of Liepin, Maimai, and BOSS Zhipin. The monthly salary for positions with master's and doctoral degrees is generally higher than 30,000, and the highest is 90,000. "In terms of salary, big IT companies don't take much advantage. The research and development of large AI models are all done at a high level. The salary offered by startups may be more competitive." Yu Jia, COO of Xihu Xinchen, told Huxiu that talent The war will become increasingly fierce in the AI industry. "Double salary is not competitive at all in the eyes of many employees at Zhiyuan. Because now they are poaching people with five or even ten times the salary. No matter how ideal you are and how you plan for the future, It is difficult to resist the temptation of an annual salary of more than one million." A person close to Zhiyuan told Huxiu, Since Zhiyuan is a non-profit research institution, the salary level is difficult to match that of major Internet companies or companies with a large amount of capital behind it. compared to supported startups. Through headhunting, Huxiu learned that the starting salary of natural language processing experts currently exceeds 1 million. For some employees with long working years and low wages, it is difficult not to waver when faced with several times their salary. However, judging from the current public data of Zhiyuan Research Institute, most of the core project team leaders of Zhiyuan Research Institute are still responsible for the research and development projects of Zhiyuan Research Institute full-time. "The models of Enlightenment 3.0 are all developed by Zhiyuan's own researchers, including Tianying, Libra, and Vision. " Lin Yonghua said that Zhiyuan Research Institute's current R&D strength has been among the best in the industry. It's top notch. People who are changing and want to change the world are on Huxiu APP
The above is the detailed content of Why is the large model of China's most powerful AI research institute late?. For more information, please follow other related articles on the PHP Chinese website!