Google Research used its own BIG-Bench benchmark to build the "BIG-Bench Mistake" dataset and to evaluate how often popular language models make errors and how well they can correct them. The initiative aims to improve the quality and accuracy of language models and to better support applications in intelligent search and natural language processing.
Google researchers say they created the "BIG-Bench Mistake" dataset specifically to evaluate the error rates and self-correction abilities of large language models, filling a gap: no previous dataset was designed to assess these capabilities.
The researchers ran five tasks from the BIG-Bench benchmark using the PaLM language model. They then modified the generated chain-of-thought trajectories to inject logical errors, and used the model again to detect the errors in those trajectories.
To improve the dataset's accuracy, the Google researchers repeated this process, producing a dedicated benchmark dataset, "BIG-Bench Mistake", containing 255 logical errors.
The researchers note that the logical errors in BIG-Bench Mistake are deliberately obvious, making it a good testing standard: a language model can start by practicing on simple logical errors and gradually improve its ability to identify them.
Using this dataset to test models on the market, the researchers found that although most language models can identify logical errors that arise during reasoning and correct themselves, the process is "less than ideal" and often requires human intervention to fix the model's output.
(Image source: Google Research press release.) According to the report, Google says the self-correction abilities of even "the most advanced large language models today" are relatively limited: the best-performing model in the tests found only 52.9% of the logical errors.
Google researchers also claim that the BIG-Bench Mistake dataset will help improve models' self-correction ability: after fine-tuning on the relevant tasks, "even small models generally perform better than large models with zero-shot prompting."
Accordingly, Google argues that for model error correction, proprietary small models can be used to "supervise" large models. Rather than teaching large language models to correct their own errors, deploying small, specialized supervisor models helps improve efficiency, reduce AI deployment costs, and make fine-tuning easier.