BELLE is based on Stanford Alpaca and optimized for Chinese; model tuning uses only data produced by ChatGPT and does not include any other data.
It has been almost four months since the initial release of ChatGPT, and when GPT-4 was released last week, ChatGPT was immediately upgraded to the new model. But it is an open secret that neither ChatGPT nor GPT-4 is likely to be open sourced. Combined with the huge investment in computing power and the massive training data required, there are many hurdles for the research community to overcome in replicating them.
Faced with the onslaught of large models such as ChatGPT, open source replacements are a good choice. At the beginning of this month, Meta "open sourced" a new series of large models, LLaMA (Large Language Model Meta AI), with parameter counts ranging from 7 billion to 65 billion. The 13-billion-parameter LLaMA model outperforms the 175-billion-parameter GPT-3 "on most benchmarks" and can run on a single V100 GPU.
A few days later, Stanford fine-tuned Alpaca, a new 7-billion-parameter model based on LLaMA 7B. They used the technique introduced in the Self-Instruct paper, with some modifications, to generate 52K pieces of instruction data. In preliminary human evaluations, the Alpaca 7B model performed similarly to text-davinci-003 (GPT-3.5) on the Self-Instruct instruction evaluation set.
Unfortunately, Alpaca's seed tasks are all in English, and the collected data is also in English, so the trained model is not optimized for Chinese. Is there a better way to improve the performance of dialogue models in Chinese? The project introduced next addresses this problem well.
BELLE (Bloom-Enhanced Large Language model Engine), an open-source Chinese dialogue large model with 7 billion parameters, is here. It is based on Stanford's Alpaca, but optimized for Chinese and with some modifications to the data-generation code. Moreover, model tuning uses only data produced by ChatGPT and does not contain any other data.
In terms of data, the project open sources data-collection code adapted from Alpaca. With this code, about 1 million pieces of Chinese data were generated and combined with the 50,000 pieces of English data from Alpaca; the model was then trained on top of BLOOMZ-7B, and the checkpoint has been uploaded to Hugging Face.
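For intuition, here is a minimal sketch (not from the project's codebase) of how the two instruction sets could be merged into a single training file, assuming both are stored as JSON lists of instruction/input/output records; all file names here are illustrative.

import json

# Illustrative file names; the actual paths depend on how the data was exported.
with open("belle_generated_cn.json", encoding="utf-8") as f:
    chinese_data = json.load(f)   # ~1M Chinese instruction records
with open("alpaca_data_en.json", encoding="utf-8") as f:
    english_data = json.load(f)   # ~50K English records from Alpaca

# Both sets follow the Alpaca schema: instruction / input / output.
merged = chinese_data + english_data

with open("train_merged.json", "w", encoding="utf-8") as f:
    json.dump(merged, f, ensure_ascii=False, indent=2)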
Hugging Face Address: https://huggingface.co/BelleGroup
The project author stated: This project aims to promote the development of the Chinese dialogue large model open source community.
Project address: https://github.com/LianjiaTech/BELLE
The project mainly includes the following four parts:
Data release
1. zh_seed_tasks.jsonl: contains 175 seed tasks; a sample entry is shown below, and a small loading sketch follows this list
{"id ": "seed_task_20", "name": "horror_movie_opening", "instruction": "You need to write a creative opening scene for a horror movie.", "instances": [{"input": "","output ":" The sun had set, leaving behind a dark town. A gentle breeze blew through the empty streets, sending a chill through anyone who ventured outside. The only sound was the slight rustling of leaves blown by the wind. Sound. Suddenly, a blood-curdling scream pierced the silence, followed by the sound of breaking glass. A house turned on its lights, and a figure could be seen running towards the center of the town. When > The figure became more and more When I got closer, I could clearly see that it was a young woman, covered in blood."}],"is_classification": false}
2. prompt_cn.txt: the prompt used to generate the data
3. 0.5M pieces of generated data
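As a quick orientation (this snippet is not part of the released code), the seed task file can be read line by line as standard JSONL; the file name matches the release above, everything else is illustrative.

import json
import random

# zh_seed_tasks.jsonl stores one JSON object per line, following the schema shown above.
seed_tasks = []
with open("zh_seed_tasks.jsonl", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            seed_tasks.append(json.loads(line))

print(f"Loaded {len(seed_tasks)} seed tasks")  # expected: 175

# Sample a few seed instructions, e.g. as in-context examples for a generation prompt.
for task in random.sample(seed_tasks, k=3):
    print(task["name"], "->", task["instruction"])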
Data generation
Follow Alpaca’s method:
pip install -r requirements.txt
export OPENAI_API_KEY=YOUR_API_KEY
python generate_instruction.py generate_instruction_following_data
By default, the Completion API is used with the text-davinci-003 model. To use the Chat API with the gpt-3.5-turbo model instead, pass the following parameters:
python generate_instruction.py generate_instruction_following_data
--api=chat --model_name=gpt-3.5-turbo
The output is written to Belle.train.json and can be manually filtered before use.
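The two modes correspond to different OpenAI endpoints. Below is a rough sketch of what each call looks like with the openai Python package as it existed in early 2023 (the 0.27-style interface); the prompt variable stands in for the text assembled from prompt_cn.txt and sampled seed tasks, and the sampling parameters are illustrative.

import openai

openai.api_key = "YOUR_API_KEY"
# Placeholder: in the real script the prompt is built from prompt_cn.txt plus sampled seed tasks.
prompt = "..."

# Default mode: Completion API with text-davinci-003.
completion = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=1024,
    temperature=1.0,
)
generated = completion.choices[0].text

# With --api=chat --model_name=gpt-3.5-turbo: Chat API instead.
chat = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=1024,
    temperature=1.0,
)
generated = chat.choices[0].message["content"]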
Model tuning
The project fine-tunes the BLOOMZ-7B1-mt model on Belle.train.json; the specific training parameters are given in the project repository.
In addition, the project trains the model on instruction-learning datasets of different sizes (200,000, 600,000, 1 million, and 2 million samples) and releases a corresponding model version for each size.
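For readers who want a concrete picture of the tuning step, here is a minimal supervised fine-tuning sketch with Hugging Face transformers on BLOOMZ-7B1-mt. It is not the project's actual training script: the prompt template, hyperparameters, sequence length, and the assumption that Belle.train.json is a JSON list are all illustrative, and full fine-tuning of a 7B model requires multiple high-memory GPUs in practice.

import json
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "bigscience/bloomz-7b1-mt"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Belle.train.json: assumed to be a list of instruction/input/output records (Alpaca schema).
with open("Belle.train.json", encoding="utf-8") as f:
    records = json.load(f)

def to_text(r):
    # Illustrative prompt template; the project's real template may differ.
    prompt = r["instruction"] + ("\n" + r["input"] if r.get("input") else "")
    return {"text": prompt + "\n" + r["output"] + tokenizer.eos_token}

dataset = Dataset.from_list([to_text(r) for r in records])
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="belle-7b-sft",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,
        num_train_epochs=3,
        learning_rate=2e-5,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("belle-7b-sft")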
Model usage examples
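The released checkpoints can be tried with the standard transformers generation API. Below is a minimal sketch, assuming the BelleGroup/BELLE-7B-2M repository on Hugging Face and a "Human: ... Assistant:" prompt format; check the model card for the exact repository id and prompt template.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id assumed from the BelleGroup Hugging Face page; check the model card for the exact name.
model_id = "BelleGroup/BELLE-7B-2M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"  # device_map requires the accelerate package
)

# Prompt format assumed from early BELLE examples; verify it against the model card.
prompt = "Human: 请写一首关于春天的诗\n\nAssistant: "
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    top_p=0.85,
    temperature=0.35,
)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))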
Limitations and usage restrictions
The SFT model trained on the current data and base model still has a number of limitations in terms of effectiveness.