The secret behind the domestic ChatGPT 'shells' has now been found
"iFlytek is a cover-up for ChatGPT!" "Baidu Wenxin is a cover-up for Stable Diffusion!" "SenseTime's big model is actually plagiarism!"...
This is far from the first time the outside world has questioned domestically produced large models.
Industry insiders explain the phenomenon by pointing to a genuine shortage of high-quality Chinese data sets: when training a model, teams can only rely on purchased foreign-language annotated data sets as "foreign aid". If the data sets used for training overlap, the models produce similar outputs, leading to exactly these kinds of embarrassing "own goal" incidents.
As for the alternatives: using existing large models to help generate training data is prone to insufficient data cleaning; reusing tokens leads to overfitting; and training only sparse large models is not a long-term solution.
The industry has gradually formed a consensus:
The road to AGI will continue to place extremely high demands on both data quantity and data quality.
Driven by this situation, over the past two months many domestic teams have successively open sourced Chinese data sets. Beyond general-purpose data sets, dedicated open source Chinese data sets have also been released for vertical domains such as programming and medicine.
High-quality data sets are available but few
New breakthroughs in large models rely heavily on high-quality and rich data sets.
According to the scaling laws described in OpenAI's "Scaling Laws for Neural Language Models", increasing the amount of training data on its own can make a pre-trained model perform better.
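As a rough reference, the data-only scaling relationship reported in that paper is a power law in the number of training tokens. A sketch in LaTeX, with the constants quoted approximately from the paper:

```latex
% Data-only scaling law from "Scaling Laws for Neural Language Models" (Kaplan et al.)
% Values are approximate; L is test loss in nats per token, D is the number of training tokens.
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D},
\qquad \alpha_D \approx 0.095,\quad D_c \approx 5.4 \times 10^{13}\ \text{tokens}
```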
And this is not just OpenAI's view.
DeepMind also pointed out in its Chinchilla paper that most previous large models were under-trained, and it proposed an optimal training formula that has since become a recognized standard in the industry.
△ Among mainstream large models, Chinchilla has the fewest parameters but the most adequate training
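The practical consequence of the Chinchilla result is a compute-optimal allocation rule: parameters and training tokens should be scaled roughly in equal proportion, and the commonly cited approximation is about 20 training tokens per parameter. A sketch of the rule, with the constants treated as approximate:

```latex
% Chinchilla compute-optimal allocation (Hoffmann et al.), approximate form:
% for a compute budget C, scale parameters N and training tokens D together.
N_{\mathrm{opt}} \propto C^{0.5},\qquad D_{\mathrm{opt}} \propto C^{0.5},\qquad
D_{\mathrm{opt}} \approx 20\,N_{\mathrm{opt}}
% e.g. Chinchilla itself: 70B parameters trained on roughly 1.4T tokens.
```

Under this rule, a 100-billion-parameter model would want on the order of two trillion training tokens, which makes the Chinese data shortage described below all the more acute.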
However, the mainstream data sets used for training are mainly in English, such as Common Crawl, BooksCorpus, Wikipedia, ROOT, etc. In the most widely used of these, Common Crawl, Chinese data accounts for only 4.8%.

What about Chinese data sets? Public ones do exist, as Zhou Ming, founder and CEO of Lanzhou Technology and one of the most accomplished Chinese researchers in NLP today, confirmed to QbitAI: there are named entity data sets such as MSRA-NER and Weibo-NER, as well as CMRC2018, CMRC2019, ExpMRC2022 and others that can be found on GitHub. But their total volume is a drop in the bucket compared with English data sets. Moreover, some of them are dated and do not reflect the latest NLP research concepts (research on new concepts often appears only in English on arXiv).

So although high-quality Chinese data sets exist, they are few in number and cumbersome to use. This is the severe reality every team doing large model research has to face. At an earlier forum of Tsinghua University's Department of Electronic Engineering, Tang Jie, a professor in Tsinghua's Department of Computer Science, shared that when preparing pre-training data for the 100-billion-parameter model ChatGLM-130B, the team found that less than 2TB of Chinese data remained usable after cleaning.

Solving the shortage of high-quality data sets in the Chinese-speaking world is urgent. One effective approach is to train large models directly on English data. On Chatbot Arena, the large-scale anonymous arena leaderboard rated by human players, GPT-3.5 ranks second in the non-English ranking (GPT-4 is first). Bear in mind that 96% of GPT-3.5's training data is in English; excluding other languages, the share of Chinese data used in training is so small that it is measured in fractions of a percent.
The other solution is to build new high-quality Chinese data sets and supply them to the large models.
Open source data sets: everyone gathers firewood
Having recognized this situation, many domestic large model teams chose the second path and set about building data sets from their private data stores.
Baidu has content-ecosystem data, Tencent has official account data, Zhihu has Q&A data, and Alibaba has e-commerce and logistics data.
With their different accumulations of private data, companies can build core competitive barriers in specific scenarios and fields. Rigorous collection, sorting, filtering, cleaning and labeling of these data helps ensure the effectiveness and accuracy of the trained models.
Meanwhile, large model teams whose private data advantage is less obvious have begun crawling data across the entire web (foreseeably, the volume of crawled data is enormous).
To build its Pangu large model, Huawei crawled 80TB of text from the Internet and eventually cleaned it down to a 1TB Chinese data set; the Chinese data set used to train Inspur Source 1.0 reached 5,000GB (compared with the 570GB training data set of GPT-3); and the recently released Tianhe Tianyuan large model is the result of the Tianjin Supercomputing Center collecting global web data and incorporating various open source training data and professional-domain data sets.
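None of these teams have published their exact cleaning pipelines, but as an illustration of why 80TB of raw crawl can shrink to roughly 1TB of usable text, here is a minimal, hypothetical Python sketch of the typical filtering steps (length and symbol-ratio heuristics plus hash-based deduplication). All thresholds and function names are illustrative assumptions, not any team's actual pipeline:

```python
import hashlib
import re

def looks_like_clean_chinese(text: str,
                             min_chars: int = 200,
                             min_chinese_ratio: float = 0.6,
                             max_symbol_ratio: float = 0.2) -> bool:
    """Cheap quality heuristics: enough text, mostly Chinese, not symbol soup."""
    if len(text) < min_chars:
        return False
    chinese = len(re.findall(r"[\u4e00-\u9fff]", text))
    symbols = len(re.findall(r"[^\w\s\u4e00-\u9fff]", text))
    return (chinese / len(text) >= min_chinese_ratio
            and symbols / len(text) <= max_symbol_ratio)

def deduplicate_and_filter(docs):
    """Yield documents that pass the quality heuristics, dropping exact duplicates."""
    seen = set()
    for doc in docs:
        text = doc.strip()
        if not looks_like_clean_chinese(text):
            continue
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:   # exact-duplicate removal; real pipelines also use
            continue         # fuzzy deduplication such as MinHash
        seen.add(digest)
        yield text

# Usage: only a small fraction of raw crawled pages typically survives.
raw_pages = ["你好" * 300, "你好" * 300, "<html>...</html>", "short"]
cleaned = list(deduplicate_and_filter(raw_pages))
print(f"kept {len(cleaned)} of {len(raw_pages)} documents")
```

Real pipelines typically add language identification, perplexity filtering with a small reference model, and fuzzy deduplication, which is how raw crawls shrink by one to two orders of magnitude.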
At the same time, over the past two months an "everyone gathers firewood" phenomenon has emerged around Chinese data sets:
Many teams have successively released open source Chinese data sets to make up for the deficiencies or imbalances in the current Chinese open source landscape.
Some of them are organized as follows:
- CodeGPT: A code-related conversation data set generated by having GPT converse with GPT; the institution behind it is Fudan University.
- CBook-150k: A Chinese book corpus, with download and extraction methods for 150,000 Chinese books covering many fields such as humanities, education, technology, military affairs and politics; the organization behind it is Fudan University.
- RefGPT: To avoid the high cost of manual annotation, the authors propose a method for automatically generating fact-grounded dialogues and release part of their data, including 50,000 multi-turn Chinese dialogues; behind it are NLP practitioners from Shanghai Jiao Tong University, Hong Kong Polytechnic University and other institutions.
- COIG: Short for "Chinese Open Instruction Generalist", a larger and more diverse instruction-tuning corpus whose quality is ensured by manual verification; the joint institutions behind it include the Beijing Academy of Artificial Intelligence (BAAI), University of Sheffield, University of Michigan, Dartmouth College, Zhejiang University, Beihang University, and Carnegie Mellon University.
- Awesome Chinese Legal Resources: Chinese legal data resources, collected and organized by Shanghai Jiao Tong University.
- Huatuo: A Chinese medical instruction data set constructed from a medical knowledge graph and the GPT-3.5 API; on this basis LLaMA was instruction fine-tuned to improve its question-answering performance in the medical field; the project comes from Harbin Institute of Technology.
- Baize: Uses a small number of "seed questions" to let ChatGPT chat with itself and automatically collects the exchanges into a high-quality multi-turn conversation data set; a team from the University of California, San Diego (UCSD) working with Sun Yat-sen University and MSRA has open sourced the data collected with this method (a minimal sketch of the self-chat idea follows the list).
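For a sense of how Baize-style self-chat collection might work, here is a minimal, hypothetical sketch using the OpenAI Python client. The seed question, model name, turn count, and the follow-up-question prompt are illustrative assumptions, not the Baize team's actual configuration:

```python
from openai import OpenAI  # assumes the `openai` package is installed and an API key is configured

client = OpenAI()
MODEL = "gpt-3.5-turbo"  # illustrative choice; Baize used ChatGPT via the API

def self_chat(seed_question: str, num_turns: int = 4) -> list[dict]:
    """Let the model play both user and assistant, starting from one seed question."""
    transcript = [{"role": "user", "content": seed_question}]
    for _ in range(num_turns):
        # Assistant answers the latest user turn.
        answer = client.chat.completions.create(
            model=MODEL, messages=transcript
        ).choices[0].message.content
        transcript.append({"role": "assistant", "content": answer})
        # Ask the model to play the user and write a natural follow-up question.
        follow_up = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user",
                       "content": "Given this conversation, write only the user's "
                                  f"next follow-up question:\n{transcript}"}],
        ).choices[0].message.content
        transcript.append({"role": "user", "content": follow_up})
    return transcript

# Each transcript becomes one multi-turn training sample.
sample = self_chat("How do I deduplicate a large Chinese text corpus?")
```

The appeal of this design is that one cheap seed question yields an entire multi-turn dialogue without any manual annotation.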
As more Chinese data sets are open sourced and brought into the spotlight, the industry's attitude is one of welcome. Zhang Peng, founder and CEO of Zhipu AI, put it this way:
High-quality Chinese data has simply been hidden away out of sight. Now that everyone is aware of the problem, there will naturally be corresponding solutions, such as open sourcing data.
In short, it is developing in a good direction, isn't it?
It is worth noting that in addition to pre-training data, human feedback data is also indispensable at this stage.
Ready-made examples are before us:
Compared with GPT-3, the important buff that ChatGPT adds is RLHF (Reinforcement Learning from Human Feedback): fine-tuning on high-quality labeled data generated through human feedback produces a large model that is aligned with human intent.
The most direct way to provide human feedback is to tell the AI assistant "your answer is wrong", or to click like or dislike directly next to the reply it generates.
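As an illustration of what such feedback looks like as training data, here is a minimal, hypothetical Python sketch that logs like/dislike clicks as (prompt, response, rating) records, which could later be paired up to train an RLHF reward model. The field names and file path are assumptions for illustration:

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class FeedbackRecord:
    prompt: str     # what the user asked
    response: str   # what the assistant answered
    rating: int     # +1 for "like", -1 for "dislike"
    timestamp: str

def log_feedback(prompt: str, response: str, liked: bool,
                 path: str = "feedback.jsonl") -> None:
    """Append one human-feedback record; records with opposite ratings on the
    same prompt can later form preferred/rejected pairs for a reward model."""
    record = FeedbackRecord(
        prompt=prompt,
        response=response,
        rating=1 if liked else -1,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")

# Usage: called when the user clicks like/dislike next to an answer.
log_feedback("用Python怎么去重?", "可以用 set() 去除重复元素……", liked=True)
```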
Whoever gets used first can collect a wave of user feedback and start the snowball rolling, which is one of the reasons everyone is rushing to release large models.
Now, domestic ChatGPT-like products, from Baidu's Wenxin Yiyan and Fudan's MOSS to Zhipu's ChatGLM, all provide feedback options.
But in the eyes of most users who try them, the main attribute of these large model products is still that of a "toy".
When they encounter an incorrect or unsatisfactory answer, they simply close the dialogue window, which does not help the large model behind it collect human feedback.