Since the release of ChatGPT, the popularity of conversational models has only grown. While we admire the impressive performance of these models, we should also keep in mind the enormous computing power and massive data support behind them.
As far as data is concerned, quality is crucial, and OpenAI has invested heavily in data collection and annotation. Multiple studies have shown that ChatGPT can be a more reliable data annotator than humans. If the open source community can obtain large amounts of dialogue data from powerful language models such as ChatGPT, it can train better-performing dialogue models. This has been demonstrated by the Alpaca family of models: Alpaca, Vicuna, and Koala. For example, Vicuna reached roughly 90% of ChatGPT's quality by instruction-tuning the LLaMA model on user-shared conversations collected from ShareGPT. There is growing evidence that data is the primary driver in training powerful language models.
ShareGPT is a data-sharing website where users upload ChatGPT conversations they find interesting. The data on ShareGPT is open but scattered, and researchers have to collect and organize it themselves. With a high-quality, wide-ranging dataset available, the open source community could develop conversation models with far less effort.
Against this backdrop, a recent project called UltraChat has systematically constructed an ultra-high-quality conversation dataset. The authors use two independent ChatGPT Turbo APIs talking to each other to generate multi-round conversation data.
Specifically, the project aims to build an open source, large-scale, multi-round dialogue dataset on top of the Turbo APIs, to help researchers develop powerful language models with general dialogue capabilities. In addition, out of privacy protection and other considerations, the project does not directly use data from the Internet as prompts. To ensure the quality of the generated data, the researchers use two independent ChatGPT Turbo APIs during generation: one model plays the role of the user and generates questions or instructions, while the other model generates replies.
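The dual-API setup described above can be sketched as a simple "self-chat" loop. The sketch below is illustrative, not the project's actual code: the system prompts, the `generate` helper (standing in for a ChatGPT Turbo API call), and the role-flipping trick are assumptions about how such a loop is typically wired up.

```python
# Hypothetical sketch of the dual-API self-chat loop: one session is prompted
# to act as the user, the other as the assistant. When calling the user-model,
# the dialogue is shown from its perspective, so roles must be flipped.

def swap_roles(messages):
    """Flip user/assistant roles so the user-model sees the dialogue from its side."""
    flipped = {"user": "assistant", "assistant": "user"}
    return [{"role": flipped[m["role"]], "content": m["content"]} for m in messages]

def self_chat(opening_question, generate, num_turns=3):
    """Generate a multi-turn dialogue from a seed question.

    `generate(system, messages)` stands in for a chat-API call; two logically
    independent sessions (two system prompts) are used, as in UltraChat.
    """
    user_system = "You are a curious user. Ask natural follow-up questions."
    assistant_system = "You are a helpful assistant."
    dialogue = [{"role": "user", "content": opening_question}]
    for _ in range(num_turns):
        # Assistant-model answers the latest user message.
        reply = generate(assistant_system, dialogue)
        dialogue.append({"role": "assistant", "content": reply})
        # User-model produces the next question, seeing flipped roles.
        follow_up = generate(user_system, swap_roles(dialogue))
        dialogue.append({"role": "user", "content": follow_up})
    return dialogue
```

In a real run, `generate` would wrap an API call with the given system prompt prepended to the message list; using two separate sessions keeps the user-model's instructions hidden from the reply-model.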
Letting ChatGPT generate freely from a few seed conversations and questions is prone to problems such as narrow topics and repeated content, making it difficult to guarantee the diversity of the data. To this end, UltraChat systematically classified and designed the topics and task types covered by the conversation data, and carried out detailed prompt engineering for both the user model and the reply model. The data consists of three parts: questions about world knowledge, generation of new materials on request, and rewriting of existing materials.
These three parts cover most of users' requirements for AI models. At the same time, each part poses different challenges and requires different construction methods.
For example, the main challenge for the first part is how to cover common knowledge in human society as widely as possible within a total of hundreds of thousands of conversations. To this end, the researchers filtered and constructed questions from two sources: automatically generated topics and entities drawn from Wikidata.
The challenges in the second and third parts mainly come from how to simulate user instructions, and how to keep the user model's outputs in subsequent conversation rounds diverse without deviating from the ultimate goal of the conversation (generating materials or rewriting materials as required). For this, the researchers carefully designed and experimented with the input prompts of the user model. After construction was completed, the authors also post-processed the data to mitigate the hallucination problem.
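The text does not say what the anti-hallucination post-processing looks like. One plausible and commonly used step is pattern-based filtering of replies that contain unverifiable or artifact-laden content; the patterns below are purely illustrative assumptions, not the project's actual rules.

```python
import re

# Illustrative post-processing filter (assumed, not UltraChat's actual method):
# drop replies containing self-referential role leaks or unverifiable URLs,
# two frequent sources of hallucinated content in model-generated dialogues.

SUSPECT_PATTERNS = [
    r"as an ai language model",  # role-leak artifact from the reply model
    r"https?://\S+",             # generated URLs frequently do not exist
]

def keep_reply(reply: str) -> bool:
    """Return False if the reply matches any suspect pattern (case-insensitive)."""
    return not any(re.search(p, reply, re.IGNORECASE) for p in SUSPECT_PATTERNS)
```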
Currently, the project has released the first two parts of the data, totaling 1.24 million conversations, which should make it the largest related dataset in the open source community. The content covers rich, real-world conversations; the final part of the data will be released later.
The world-question data comes from 30 representative and diverse meta-topics, as shown in the figure below:
In addition, the project collected the 10,000 most frequently used named entities from Wikidata; used the ChatGPT API to generate 5 meta-questions for each entity; for each meta-question, generated 10 more specific questions and 20 related but more general questions; then sampled 200,000 specific questions, 250,000 general questions, and 50,000 meta-questions, and generated 3–7 dialogue rounds for each question.
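The stated per-entity counts imply the size of each candidate pool before sampling, which is worth checking against the sampled figures. A minimal sketch of the arithmetic (the `rounds_for` helper is an illustrative name, not from the project):

```python
import random

# Back-of-the-envelope check of the question-construction pipeline above.
entities = 10_000
meta_total = entities * 5        # 5 meta-questions per entity -> 50,000
specific_pool = meta_total * 10  # 10 specific questions per meta-question
general_pool = meta_total * 20   # 20 general questions per meta-question

# Sampled sizes quoted in the text; each must fit inside its candidate pool.
sampled = {"meta": 50_000, "specific": 200_000, "general": 250_000}
assert sampled["meta"] <= meta_total
assert sampled["specific"] <= specific_pool
assert sampled["general"] <= general_pool

def rounds_for(question, rng=random):
    """Each sampled question seeds a dialogue of 3-7 rounds."""
    return rng.randint(3, 7)
```

Note that all 50,000 meta-questions are used, while the specific and general questions are subsampled from pools of 500,000 and 1,000,000 candidates respectively.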
Next let’s look at a specific example:
We tested the search function on the UltraChat data platform. For example, entering "music" retrieves 10,000 sets of music-related ChatGPT conversation data, each of which is a multi-round conversation.
Searching for the keyword "math" returns 3,346 sets of multi-round conversations:
Currently, UltraChat already covers a wide range of fields, including medicine, education, sports, environmental protection, and other topics. The authors also tried supervised instruction fine-tuning of the open source LLaMA-7B model on UltraChat, and found that after only 10,000 training steps the results were already very impressive. Some examples follow:
World knowledge: List 10 good universities in China and 10 in the United States, respectively
Imagination question: What are the possible consequences after space travel becomes possible?
Syllogism: Is a whale a fish?
Hypothetical question: Prove that Jackie Chan is better than Bruce Lee
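The supervised fine-tuning mentioned above requires flattening each multi-turn dialogue into a single training sequence. The article does not describe the exact recipe; a common approach, sketched below under assumed role tags and tokenizer, is to compute the loss only on assistant turns by masking user tokens with `-100` (the default ignore index of PyTorch's cross-entropy loss).

```python
# Common multi-turn SFT preprocessing (illustrative, not UltraChat's exact code):
# concatenate turns with role tags, and mask user tokens out of the loss.

IGNORE_INDEX = -100  # PyTorch CrossEntropyLoss ignores labels with this value

def build_example(turns, tokenize):
    """Flatten alternating user/assistant strings into (input_ids, labels).

    `tokenize(str) -> list[int]` stands in for a real tokenizer.
    """
    input_ids, labels = [], []
    for i, text in enumerate(turns):
        role = "User" if i % 2 == 0 else "Assistant"
        ids = tokenize(f"{role}: {text}\n")
        input_ids.extend(ids)
        # Train only on what the assistant said; ignore the user's tokens.
        labels.extend(ids if role == "Assistant" else [IGNORE_INDEX] * len(ids))
    return input_ids, labels
```

With this layout, a 3–7 round conversation becomes one long sequence, and the model learns to produce assistant turns conditioned on the full preceding dialogue.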
Overall, UltraChat is a high-quality, wide-coverage ChatGPT conversation dataset that can be combined with other datasets to significantly improve the quality of open source conversation models. At present only the English version has been released, but a Chinese version of the data will follow. Interested readers are welcome to explore it.
The above is the detailed content of Implementing Tsinghua UltraChat multi-round conversations using multiple ChatGPT APIs.