
Large models under self-reward: Meta's Llama 2 optimizes itself and surpasses GPT-4

Jan 23, 2024, 01:15 PM
Tags: Meta, New York University, self-reward method

Is Artificial Intelligence Feedback (AIF) going to replace RLHF?


In the field of large models, fine-tuning is an important step for improving performance. As the number of open-source large models grows, many fine-tuning methods have been developed, some of which have achieved good results.

Recently, researchers from Meta and New York University used a "self-rewarding" method that lets a large model generate its own fine-tuning data, and the results came as quite a surprise.

In the new method, the authors fine-tuned Llama 2 70B over three iterations, and the resulting model outperformed a number of important existing models on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4.
The paper attracted widespread attention within hours of being posted on arXiv.

Although the code has not yet been open-sourced, the method is clearly described in the paper and should not be difficult to reproduce.


It is well known that tuning large language models (LLMs) with human preference data can greatly improve the instruction-following performance of pre-trained models. For the GPT series, OpenAI proposed the now-standard approach of reinforcement learning from human feedback (RLHF), in which a reward model is learned from human preferences, then frozen and used to train the LLM with reinforcement learning. This method has been hugely successful.

A more recent idea is to avoid training a reward model entirely and to use human preference data to train the LLM directly, as in direct preference optimization (DPO). In both cases, tuning is bottlenecked by the size and quality of the human preference data, and in the case of RLHF it is also bottlenecked by the quality of the frozen reward model trained from that data.
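For context, the DPO objective mentioned above trains the policy directly on preference pairs (a preferred response y_w and a rejected response y_l for a prompt x) against a frozen reference model. A standard form of the loss, as given in the DPO literature rather than in this article, is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
  \left[\log\sigma\!\left(
    \beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
    -\beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}
  \right)\right]
```

Here π_θ is the model being tuned, π_ref is the frozen reference model (typically the supervised fine-tuned model), and β controls how far the policy may drift from the reference. The self-rewarding setup keeps this style of objective but generates the preference pairs itself, as described below.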

In the new work from Meta, the authors propose training a self-improving reward model that is not frozen but is continuously updated while the LLM is being tuned, precisely to avoid this bottleneck.

The key to this approach is to develop an agent that possesses all the capabilities required during training (rather than splitting them between a reward model and a language model), allowing transfer between the instruction-following and reward-modeling tasks in the same way that instruction-following pre-training and multi-task training enable task transfer by training on many tasks at once.

The authors therefore introduce self-rewarding language models: agents that both act as instruction-following models, generating responses for given prompts, and can also generate and evaluate new instruction-following examples to add to their own training set.

The new approach trains these models with a framework similar to iterative DPO. Starting from a seed model, as shown in Figure 1, each iteration includes a self-instruction creation step in which the model generates candidate responses for newly created prompts, and rewards are then assigned by the same model. The latter is achieved through LLM-as-a-Judge prompting, which can itself be viewed as an instruction-following task. A preference dataset is built from the generated data, and the next iteration of the model is trained with DPO.
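To make the loop concrete, here is a minimal Python sketch of one such iteration following the description above. The helper functions (generate_prompts, generate_responses, judge_score, dpo_train) are hypothetical placeholders for whatever generation and training stack is used; this is an illustration of the procedure, not the paper's implementation.

```python
def self_rewarding_iteration(model, seed_prompts, n_candidates=4):
    """One self-rewarding iteration: the same model creates prompts, generates
    candidate responses, scores them as a judge, and is then trained with DPO
    on the resulting preference pairs. All helpers are hypothetical placeholders."""
    # 1. Self-instruction creation: the model writes new prompts from few-shot examples.
    new_prompts = generate_prompts(model, seed_prompts)

    preference_pairs = []
    for prompt in new_prompts:
        # 2. Generate several candidate responses for each newly created prompt.
        candidates = generate_responses(model, prompt, n=n_candidates)

        # 3. LLM-as-a-Judge: the same model scores each candidate (e.g. on a 0-5 scale).
        scored = [(judge_score(model, prompt, c), c) for c in candidates]
        scored = [(s, c) for s, c in scored if s is not None]
        if len(scored) < 2:
            continue

        # 4. Keep the highest- and lowest-scored responses as a preference pair.
        scored.sort(key=lambda sc: sc[0])
        (worst_score, worst), (best_score, best) = scored[0], scored[-1]
        if best_score > worst_score:
            preference_pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})

    # 5. Train the next model with DPO on the self-generated preference data.
    return dpo_train(model, preference_pairs)
```

Running this function repeatedly, feeding each returned model back in, yields the sequence of improved models (iterations 1, 2, 3) evaluated in the article.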


  • Paper title: Self-Rewarding Language Models

  • Paper link: https://arxiv.org/abs/2401.10020

Self-rewarding language models

The proposed method first assumes access to a basic pre-trained language model and a small amount of human-annotated seed data, and then aims to build a model that possesses two skills:

1. Instruction following: given a prompt describing a user request, the ability to generate a high-quality, helpful (and harmless) response.

2. Self-instruction creation: the ability to generate and evaluate new instruction-following examples to add to its own training set.

These skills enable the model to perform self-alignment, i.e. they are the components used to iteratively train the model itself using Artificial Intelligence Feedback (AIF).

Self-instruction creation involves generating candidate responses and then letting the model itself judge their quality, i.e. it acts as its own reward model, removing the need for an external one. This is achieved through the LLM-as-a-Judge mechanism [Zheng et al., 2023b], i.e. by formulating response evaluation as an instruction-following task. The self-created AIF preference data is then used as the training set.
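To make the judging step concrete, below is a minimal sketch of how such a judge_score helper could look: response evaluation is phrased as just another instruction, and the model's own numeric verdict becomes the reward. The prompt template and the model.generate interface are illustrative assumptions; the paper's actual LLM-as-a-Judge prompt uses a more detailed additive 5-point rubric.

```python
import re

# Simplified judge prompt in the spirit of LLM-as-a-Judge. Treat this template as an
# illustrative assumption, not the paper's verbatim prompt.
JUDGE_TEMPLATE = """Review the user's question and the corresponding response.
Award the response a score from 0 to 5 for relevance, helpfulness, and quality.

User: {prompt}
Response: {response}

After a brief justification, end with a line of the form "Score: <points>"."""


def judge_score(model, prompt, response):
    """Ask the same model to act as the judge and parse the numeric score it assigns.
    `model.generate` is a hypothetical text-generation interface."""
    judgement = model.generate(JUDGE_TEMPLATE.format(prompt=prompt, response=response))
    match = re.search(r"Score:\s*([0-5])", judgement)
    return int(match.group(1)) if match else None
```

Because the judging prompt is itself an instruction, improving the model's instruction-following ability also tends to improve its judging, which is exactly the feedback loop the authors rely on.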

During fine-tuning, the same model is therefore used in both roles: as a "learner" and as a "judge". Building on this emerging judge role, the model can further improve its performance through contextual fine-tuning.

The overall self-alignment procedure is iterative and proceeds by building a series of models, each an improvement over the last. Importantly, because the model can both improve its generative ability and, through that same generative mechanism, act as its own reward model, the reward model itself can improve across iterations, which differs from standard approaches in which the reward model is fixed.

The researchers believe this raises the ceiling on how much such models can improve themselves in the future and removes a restrictive bottleneck.

Figure 1 shows an overview of the method.


Experiment


In the experiments, the researchers used Llama 2 70B as the base pre-trained model. They found that self-rewarding LLM alignment not only improved instruction-following performance but also improved reward-modeling capability compared with the baseline seed model.

This means that during iterative training, the model can provide itself with a higher-quality preference dataset at each iteration than at the previous one. Although this effect is likely to saturate in practice, it raises the intriguing possibility that the resulting reward model (and hence the LLM) ends up better than one trained solely on the original human-written seed data.

In terms of instruction-following ability, the experimental results are shown in Figure 3.

The researchers also evaluated the self-rewarding models on the AlpacaEval 2 leaderboard; the results are shown in Table 1. They observed the same trend as in the head-to-head evaluation: the win rate against GPT-4 Turbo increased across training iterations, from 9.94% at iteration 1 to 15.38% at iteration 2 and 20.44% at iteration 3. The iteration 3 model also outperformed many existing models, including Claude 2, Gemini Pro, and GPT-4 0613.

The reward-modeling evaluation results are shown in Table 2. The conclusions include:


  • EFT improves over the SFT baseline: using IFT + EFT improved all five reward-modeling metrics compared with IFT alone. For example, pairwise accuracy agreement with humans increased from 65.1% to 78.7%.

  • Self-training improves reward-modeling ability. After a round of self-rewarding training, the model becomes better at assigning rewards to itself for the next iteration, and its instruction-following ability also improves.

  • The importance of the LLM-as-a-Judge prompt. The researchers tried various prompt formats and found that the LLM-as-a-Judge prompt achieved higher pairwise accuracy when using the SFT baseline.

The authors argue that self-rewarding training not only improves the model's instruction-following ability, but also its reward-modeling ability across iterations.

Although this is only a preliminary study, it looks like an exciting research direction: such models may become better at assigning rewards in future iterations, improving instruction following and creating a virtuous cycle.

This method also opens up certain possibilities for more complex judgment methods. For example, large models can verify the accuracy of their answers by searching a database, resulting in more accurate and reliable output.

Reference content: https://www.reddit.com/r/MachineLearning/comments/19atnu0/r_selfrewarding_language_models_meta_2024/
