Significantly improving GPT-4/Llama2 performance without RLHF: Peking University team proposes Aligner, a new alignment paradigm

Table of Contents

Background

How to train an Aligner model?

Aligner vs existing alignment paradigm

Weak-to-strong Generalization

Team Introduction


PHPz

Feb 07, 2024 pm 10:06 PM


Background

Large language models (LLMs) have demonstrated powerful capabilities, but they can also produce unpredictable and harmful output, such as offensive responses, false information, and leaked private data, which harms users and society. Ensuring that the behavior of these models aligns with human intentions and values is an urgent challenge.

Although reinforcement learning from human feedback (RLHF) offers a solution, it faces multiple challenges: a complex training architecture, high sensitivity to hyperparameters, and reward models that are unstable across data sets. These factors make RLHF difficult to implement, to make effective, and to reproduce. To overcome these challenges, the Peking University team proposed Aligner, a new and efficient alignment paradigm. Its core is to learn the correction residual between aligned and unaligned answers, thereby bypassing the cumbersome RLHF process. Drawing on the ideas of residual learning and scalable oversight, Aligner simplifies the alignment process: it uses a Seq2Seq model to learn implicit residuals and optimizes alignment through copy and residual-correction steps.

Compared with RLHF, which requires training multiple models, Aligner achieves alignment simply by adding a module after the model to be aligned. Furthermore, the computational resources required depend primarily on the desired alignment effect rather than on the size of the upstream model. Experiments show that Aligner-7B can significantly improve the helpfulness and safety of GPT-4, increasing helpfulness by 17.5% and safety by 26.9%. These results show that Aligner is an efficient and effective alignment method and a feasible way to improve model performance.

In addition, using the Aligner framework, the authors enhance the performance of a strong model (Llama-70B) with the supervision signal of a weak model (Aligner-13B), achieving weak-to-strong generalization and providing a practical approach to superalignment.

Paper address: https://arxiv.org/abs/2402.02416

Project homepage & open source address: https://aligner2024.github.io

Title: Aligner: Achieving Efficient Alignment through Weak-to-Strong Correction

What is Aligner?

Aligner is built on one core insight: correcting an unaligned answer is easier than generating an aligned one.

As an efficient alignment method, Aligner has the following features:

As an autoregressive Seq2Seq model, Aligner is trained on a Query-Answer-Correction (Q-A-C) data set to learn the difference between aligned and unaligned answers, thereby achieving more accurate model alignment. For example, when aligning a 70B LLM, Aligner-7B greatly reduces the number of parameters that must be trained: 16.67 times fewer than DPO and 30.7 times fewer than RLHF.

The Aligner paradigm achieves weak-to-strong generalization. It uses the supervision signal of an Aligner model with a small number of parameters to fine-tune LLMs with a large number of parameters, which significantly improves the performance of the strong model. For example, fine-tuning Llama2-70B under Aligner-13B supervision improved its helpfulness and safety by 8.2% and 61.6%, respectively.

Because Aligner is plug-and-play and insensitive to the upstream model's parameters, it can align models such as GPT-3.5, GPT-4 and Claude 2 whose parameters are not accessible. With just one training run, Aligner-7B aligns 11 models and improves their helpfulness and safety, including closed-source models, open-source models, and models with and without safety alignment. Among them, it improves the helpfulness and safety of GPT-4 by 17.5% and 26.9% respectively.

Aligner overall performance

The authors show that Aligners of various sizes (7B, 13B, 70B) can improve performance for both API-based models and open-source models (with and without safety alignment). In general, as the Aligner becomes larger, its performance gradually improves and the density of information it can provide during correction increases, which makes the corrected answers safer and more helpful.


How to train an Aligner model?

1. Query-Answer (Q-A) data collection

The authors obtain queries from various open-source data sets, including Stanford Alpaca, user-shared conversations from ShareGPT, HH-RLHF, and others. The queries are deduplicated and filtered for quality before being used to generate answers and corrected answers. Uncorrected answers were generated with various open-source models such as Alpaca-7B, Vicuna-(7B,13B,33B), Llama2-(7B,13B)-Chat, and Alpaca2-(7B,13B).
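The deduplication and quality-filtering step can be pictured with a minimal Python sketch. The article does not say how the authors implemented either step, so the normalization rule, the `min_words` threshold and the function names below are assumptions chosen only for illustration.

```python
import re


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so near-identical queries compare equal."""
    return re.sub(r"\s+", " ", query.strip().lower())


def clean_queries(queries, min_words=3):
    """Drop duplicates and very short queries, keeping first occurrences in order.

    The word-count threshold is a hypothetical stand-in for the article's
    unspecified quality filter.
    """
    seen = set()
    kept = []
    for query in queries:
        key = normalize(query)
        if key in seen or len(key.split()) < min_words:
            continue
        seen.add(key)
        kept.append(query.strip())
    return kept


raw = [
    "How do I reset my router?",
    "how do I  reset my router?",  # duplicate after normalization
    "Hi",                          # too short to be a useful query
    "What is residual learning?",
]
cleaned = clean_queries(raw)
```

Each surviving query would then be sent to the upstream models to produce the uncorrected answers.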

2. Answer correction

The authors use GPT-4, Llama2-70B-Chat and human annotators to correct the answers in the Q-A data set according to the 3H criteria for large language models (helpfulness, safety, honesty).

Answers that already meet the criteria are left as they are. The correction process follows a set of well-defined principles that establish constraints for training the Seq2Seq model, with a focus on making answers more helpful and safer. The distribution of answers changes significantly after correction. The following figure shows the effect of the correction on the data set:

[Figure: distribution of answers in the data set before and after correction]

3. Model training

Based on the above process, the authors constructed a new corrected data set $\mathcal{M} = \{(x^{(i)}, y_o^{(i)}, y_c^{(i)})\}_{i=1}^{N}$, where $x$ represents the user's query, $y_o$ is the original answer to the query, and $y_c$ is the answer corrected according to the established principles.

The model training process is relatively simple. The authors train a conditional Seq2Seq model $\mu_\phi(y_c \mid y_o, x)$, parameterized by $\phi$, such that the original answers $y_o$ are redistributed to the aligned answers $y_c$.

The aligned answer generation process based on the upstream large language model $\pi_\theta$ is:

$$\pi'(y_c \mid x) = \sum_{y_k} \mu_\phi(y_c \mid y_k, x)\,\pi_\theta(y_k \mid x) \geq \mu_\phi(y_c \mid y_o, x)\,\pi_\theta(y_o \mid x)$$

The training loss is as follows:

$$-\mathbb{E}_{\mathcal{M}}[\log \pi'(y_c \mid x)] \leq -\mathbb{E}_{\mathcal{M}}[\log \mu_\phi(y_c \mid y_o, x)] - \mathbb{E}_{\mathcal{M}}[\log \pi_\theta(y_o \mid x)]$$

The second term does not depend on the Aligner parameters $\phi$. The training objective of Aligner can therefore be derived as:

$$\min_{\phi} \mathcal{L}_{\text{Aligner}}(\phi, \mathcal{M}) = -\mathbb{E}_{\mathcal{M}}[\log \mu_\phi(y_c \mid y_o, x)]$$
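To make the objective concrete: it is the average negative log-likelihood that the Aligner assigns to each corrected answer, given the query and the original answer. The sketch below computes that quantity in plain Python from per-token probabilities. The probabilities are made up, and the function names are hypothetical; a real implementation would obtain them from the Seq2Seq model's output distribution.

```python
import math


def sequence_nll(token_probs):
    """Negative log-likelihood of one corrected answer.

    token_probs[i] is the probability the model assigned to the i-th token of
    the corrected answer, given the query, the original answer, and the
    previously generated corrected tokens.
    """
    return -sum(math.log(p) for p in token_probs)


def aligner_loss(batch_token_probs):
    """Average the per-sequence NLL over the batch (an empirical expectation)."""
    total = sum(sequence_nll(probs) for probs in batch_token_probs)
    return total / len(batch_token_probs)


# Made-up token probabilities for two corrected answers (illustrative only).
batch = [
    [0.5, 0.5],  # log-likelihood = 2 * log(0.5)
    [0.25],      # log-likelihood = log(0.25)
]
loss = aligner_loss(batch)
```

Nothing about the upstream model appears in this computation, which is why the term involving the upstream model can be dropped from the training objective.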

The following figure shows the intermediate process of Aligner:

[Figure: animation of Aligner's intermediate correction process]

It is worth noting that neither the training stage nor the inference stage of Aligner requires access to the parameters of the upstream model. At inference time, Aligner needs only the user's question and the initial answer generated by the upstream large language model, and it then generates an answer that is more consistent with human values.

Correcting existing answers rather than answering directly makes it easier for Aligner to align with human values and significantly reduces the requirements on model capability.
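The plug-and-play inference flow described above can be sketched as a two-stage text pipeline. Both models are treated as black-box callables; the type aliases, function names and toy stand-in models are hypothetical and exist only so the sketch runs without a real LLM.

```python
from typing import Callable

# Both models are black boxes that map text to text.
UpstreamModel = Callable[[str], str]
AlignerModel = Callable[[str, str], str]


def aligned_answer(query: str, upstream: UpstreamModel, aligner: AlignerModel) -> str:
    """Generate an initial answer, then let the Aligner correct it.

    Only the upstream model's text output is used; its weights are never touched.
    """
    initial = upstream(query)
    return aligner(query, initial)


# Toy stand-ins (hypothetical behaviour, not real models).
def toy_upstream(query: str) -> str:
    return "Just guess the password."


def toy_aligner(query: str, answer: str) -> str:
    # Copy answers that look fine; rewrite ones that look unsafe.
    if "guess the password" in answer:
        return "I can't help with bypassing a password, but the official account recovery page can reset it."
    return answer


result = aligned_answer("How do I get into an account?", toy_upstream, toy_aligner)
```

Because the pipeline only passes strings, the upstream model could equally be an API-only model whose parameters are not available.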

Aligner vs existing alignment paradigm

Aligner vs SFT

In contrast to Aligner, SFT directly learns a cross-domain mapping from the query semantic space to the answer semantic space. Learning this mapping relies on the upstream model to infer and simulate various contexts in the semantic space, which is much harder than learning a correction signal.

The Aligner training paradigm can be regarded as a form of residual learning (residual correction). The authors introduce a "copy and correct" learning paradigm in Aligner. Aligner thus essentially learns a residual mapping from the answer semantic space to the corrected-answer semantic space, two spaces whose distributions are closer to each other.

To this end, the authors constructed Q-A-A data in different proportions from the Q-A-C training data set and first trained Aligner on it to learn the identity mapping (also called copy mapping); this is the warm-up step. The entire Q-A-C training data set is then used for training. This residual learning paradigm is also used in ResNet to address the vanishing-gradient problem caused by stacking very deep neural networks. Experimental results show that the model achieves its best performance when the warm-up ratio is 20%.
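A minimal sketch of how the warm-up data might be built: a fraction of the Q-A-C triples is sampled and the correction is replaced by the original answer, giving Q-A-A triples whose target is a plain copy. Random sampling, the seed and the function name are assumptions; the article only states that Q-A-A data was constructed in different proportions and that 20% worked best.

```python
import random


def build_warmup_set(qac_data, ratio=0.2, seed=0):
    """Sample a fraction of Q-A-C triples and turn them into Q-A-A triples.

    The training target becomes the original answer itself, so the model
    first learns the identity (copy) mapping.
    """
    rng = random.Random(seed)
    k = int(len(qac_data) * ratio)
    sampled = rng.sample(qac_data, k)
    return [(q, a, a) for (q, a, _correction) in sampled]


# Placeholder triples standing in for (query, answer, correction).
qac = [(f"q{i}", f"a{i}", f"c{i}") for i in range(10)]
warmup = build_warmup_set(qac, ratio=0.2)
# Stage 1 would train on `warmup`; stage 2 would train on the full `qac` list.
```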

Aligner vs RLHF

RLHF trains a reward model (RM) on a human preference data set and uses this reward model to fine-tune LLMs with the PPO algorithm so that their behavior is consistent with human preferences.

Specifically, the reward model needs to map human preference data from a discrete space to a continuous numerical space for optimization. Compared with a Seq2Seq model, which generalizes well in text space, this kind of numerical reward model generalizes weakly in text space, which makes the effect of RLHF unstable across different models.

Aligner learns the difference (residual) between aligned and unaligned answers by training a Seq2Seq model, thereby avoiding the RLHF process and achieving more generalizable performance than RLHF.

Aligner vs. Prompt Engineering

Prompt engineering is a common way to elicit the capabilities of LLMs. However, it has some key problems: prompts are difficult to design, and different models need different designs; the final effect depends on the capabilities of the model; and when the model's capabilities are not enough to solve the task, multiple iterations may be required, which wastes the context window. The limited context window of small models constrains the effect of prompt engineering, and for large models, occupying too much context greatly increases the cost of training.

Aligner can support the alignment of any model. After a single training run, it can align 11 different types of models without occupying the context window of the original model. It is worth noting that Aligner can be seamlessly combined with existing prompt engineering methods to achieve a "1+1>2" effect.

In general, Aligner shows the following significant advantages:

1. Aligner is simpler to train. Compared with RLHF's complex reward model learning and the reinforcement learning (RL) fine-tuning built on that model, Aligner's implementation is more direct and easier to operate. Given the many engineering tuning details involved in RLHF and the inherent instability and hyperparameter sensitivity of RL algorithms, Aligner greatly reduces engineering complexity.

2. Aligner needs less training data and has a clear alignment effect. An Aligner-7B model trained on 20K data can improve the helpfulness of GPT-4 by 12% and its safety by 26%, and improve the helpfulness of Vicuna 33B by 29% and its safety by 45.3%, whereas RLHF requires more preference data and careful hyperparameter tuning to achieve this effect.

3. Aligner does not need access to the model weights. While RLHF has proven effective for model alignment, it relies on directly training the model. Its applicability is limited for non-open-source, API-based models such as GPT-4 and for models that need fine-tuning on downstream tasks. In contrast, Aligner does not manipulate the original parameters of the model and achieves flexible alignment by externalizing the alignment requirements into an independent alignment module.

4. Aligner is insensitive to model type. Under the RLHF framework, fine-tuning different models (such as Llama2 or Alpaca) requires not only re-collecting preference data but also adjusting training parameters in the reward model training and RL phases. Aligner can support the alignment of any model with one-time training. For example, after being trained only once on a corrected data set, Aligner-7B can align 11 different models (including open-source models and API models such as GPT) and improve helpfulness and safety by 21.9% and 23.8% respectively.

5. Aligner's demand for training resources is more flexible. Fine-tuning a 70B model with RLHF is still extremely computationally demanding, requiring hundreds of GPUs, because RLHF also has to load a reward model, an actor model and a critic model of comparable parameter count. Therefore, in terms of training resources consumed per unit time, RLHF actually requires more computing resources than pre-training.

In comparison, Aligner provides a more flexible training strategy, allowing users to choose the training scale of Aligner based on their actual computing resources. For example, to align a 70B model, users can choose Aligner models of different sizes (7B, 13B, 70B, etc.) according to the resources available.

This flexibility not only reduces the absolute demand for computing resources, but also makes efficient alignment possible under limited resources.

Weak-to-strong Generalization

Weak-to-strong generalization asks whether labels produced by a weak model can be used to train a strong model so that the strong model's performance improves. OpenAI uses this as an analogy for the superalignment problem. Specifically, they use ground-truth labels to train weak models.

OpenAI researchers conducted some preliminary experiments. For example, on a text classification task, the training data set was divided into two halves. The inputs and ground-truth labels of the first half are used to train the weak model, while the second half keeps only the inputs, and its labels are produced by the weak model. When training the strong model, only these weak labels are used as the supervision signal.

The purpose of training the weak model on ground-truth labels is to give it the ability to solve the task, but the inputs used to generate weak labels are not the same as the inputs used to train the weak model. This paradigm is similar to the concept of "teaching": a weak model is used to guide a strong model.
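The data flow of this setup can be sketched in a few lines of Python. The keyword-based "weak model" is a deliberately trivial, hypothetical stand-in, and the tiny sentiment data set is invented; only the split-then-relabel structure reflects the description above.

```python
def split_half(dataset):
    """Split the data set into a first and a second half."""
    mid = len(dataset) // 2
    return dataset[:mid], dataset[mid:]


def weak_to_strong_data(dataset, train_weak):
    """Return (input, weak_label) pairs for training the strong model."""
    first, second = split_half(dataset)
    weak_model = train_weak(first)  # trained on ground-truth labels
    # Ground-truth labels of the second half are discarded on purpose.
    return [(text, weak_model(text)) for text, _hidden_label in second]


def train_keyword_classifier(examples):
    """Toy weak learner: remember the words seen in positive examples."""
    positive_words = set()
    for text, label in examples:
        if label == 1:
            positive_words.update(text.lower().split())

    def predict(text):
        return int(any(word in positive_words for word in text.lower().split()))

    return predict


data = [
    ("great movie", 1),
    ("dull plot", 0),
    ("great acting", 1),
    ("boring scenes", 0),
]
weak_labeled = weak_to_strong_data(data, train_keyword_classifier)
```

The strong model would then be trained on `weak_labeled` alone, never seeing the ground-truth labels of the second half.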

The authors propose a novel weak-to-strong generalization paradigm based on the properties of Aligner.

The authors' core idea is to let Aligner act as a "supervisor standing on the shoulders of giants." Unlike OpenAI's method of directly supervising the "giant", Aligner corrects the stronger model's answers through weak-to-strong correction and thereby provides more accurate labels.

Specifically, the corrected data used to train Aligner contains annotations from GPT-4, human annotators, and larger models. The authors then use Aligner to generate weak labels (i.e. corrections) on a new Q-A data set and use these weak labels to fine-tune the original model.

Experimental results show that this paradigm can further improve the alignment performance of the model.

Experimental results

Aligner vs SFT/RLHF/DPO

The authors use Aligner's Query-Answer-Correction training data set to fine-tune Alpaca-7B with SFT, RLHF and DPO respectively.

For evaluation, prompts from the open-source BeaverTails and HarmfulQA test sets are used. The answers generated by the fine-tuned models are compared, in terms of helpfulness and safety, with the answers of the original Alpaca-7B model after correction by Aligner. The results are as follows:

[Figure: helpfulness and safety comparison of Aligner with SFT, RLHF and DPO]

The results show that Aligner has clear advantages over mature LLM alignment paradigms such as SFT, RLHF and DPO, and is significantly ahead on both helpfulness and safety.

Analysis of specific cases shows that a model fine-tuned with the RLHF/DPO paradigm may lean toward conservative answers in order to improve safety, yet fail to maintain safety when improving helpfulness, leading to more dangerous information in its answers.

Aligner vs Prompt Engineering

The performance improvements of Aligner-13B and the CAI/Self-Critique methods on the same upstream model were compared; the results are shown in the figure below. Aligner-13B improves GPT-4 more than the CAI/Self-Critique methods in both helpfulness and safety, which shows that the Aligner paradigm has a clear advantage over commonly used prompt engineering methods.

It is worth noting that in the experiment the CAI prompts are used only at inference time to encourage the model to revise its own answers, which is also a form of Self-Refine.

[Figure: Aligner-13B compared with CAI/Self-Critique on the same upstream models]

In addition, the authors conducted a further exploration. They used Aligner to correct answers produced with the CAI method and directly compared the answers before and after Aligner's correction. The results are shown in the figure below.

[Figure: comparison of CAI answers before and after Aligner correction]

Method A: CAI + Aligner. Method B: CAI only.

After Aligner is used to make a second revision of the CAI-revised answers, the answers improve significantly in helpfulness without losing safety. This shows that Aligner is not only highly competitive when used alone, but can also be combined with other existing alignment methods to further improve their performance.

Weak-to-strong Generalization

[Table: weak-to-strong generalization results]

Method: the weak-to-strong training data set consists of (q, a, a′) triples, where q is a question from the Aligner training data set (50K), a is the answer generated by the Alpaca-7B model, and a′ is the aligned answer given by Aligner-7B for (q, a). SFT uses only a′ as the ground-truth label, whereas in RLHF and DPO training a′ is treated as preferred over a.

The authors used Aligner to correct the original answers on a new Q-A data set, took the corrected answers as weak labels, and used these weak labels as supervision signals to train a larger model. This process is similar to OpenAI's training paradigm.

The authors train strong models on the weak labels with three methods: SFT, RLHF and DPO. The results in the table above show that when the upstream model is fine-tuned with SFT, the weak labels of Aligner-7B and Aligner-13B improve the performance of the Llama2 series of strong models in all scenarios.
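The different ways the (q, a, a′) triples are consumed can be illustrated with a short Python sketch. The dictionary keys (`prompt`, `target`, `chosen`, `rejected`) are hypothetical names chosen for readability, not the authors' data format, and the example triple is invented.

```python
def to_sft_example(q, a, a_prime):
    """SFT uses only the corrected answer a' as the training target."""
    return {"prompt": q, "target": a_prime}


def to_preference_pair(q, a, a_prime):
    """RLHF and DPO treat the correction a' as preferred over the original a."""
    return {"prompt": q, "chosen": a_prime, "rejected": a}


# One invented (q, a, a') triple: query, upstream answer, Aligner correction.
triples = [
    (
        "How can I stay awake all night?",
        "Drink ten energy drinks.",
        "Short naps, bright light and moderate caffeine help; very high caffeine doses are unsafe.",
    ),
]

sft_data = [to_sft_example(*triple) for triple in triples]
preference_data = [to_preference_pair(*triple) for triple in triples]
```

The same weak labels therefore yield either a supervised target or a preference pair, depending on which of the three training methods is used.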

Outlook: Potential research directions of Aligner

As an innovative alignment method, Aligner has great research potential. In the paper, the authors propose several application scenarios for Aligner, including:

1. Multi-turn dialogue. In multi-turn conversations, the challenge of sparse rewards is particularly prominent. In question-and-answer (QA) conversations, scalar supervision signals are typically available only at the end of the conversation.

This sparsity problem is further amplified in multi-turn dialogue (such as continuous QA scenarios), making it difficult for reinforcement learning from human feedback (RLHF) to be effective. Aligner's potential to improve alignment in multi-turn dialogue is worth further exploration.

2. Aligning the reward model with human values. In the multi-stage process of building reward models from human preferences and fine-tuning large language models (LLMs), ensuring that LLMs are aligned with specific human values (e.g. fairness, empathy) is a major challenge.

Handing the value alignment task to an Aligner module outside the model and training Aligner on a specific corpus not only provides new ideas for value alignment, but also enables Aligner to correct the output of the upstream model so that it reflects specific values.

3. Streaming and parallel processing with MoE-Aligner. By specializing and integrating Aligners, one can create a more powerful and comprehensive mixture-of-experts (MoE) Aligner that meets multiple combined safety and value alignment needs. At the same time, further improving Aligner's parallel processing capabilities to reduce inference-time overhead is a feasible direction.

4. Fusion during model training. By integrating an Aligner layer after a specific weight layer, the output can be intervened on in real time during model training. This not only improves alignment efficiency, but also helps optimize the model training process.

Team Introduction

This work was completed independently by Yang Yaodong's research team at the AI Security and Governance Center of the Institute of Artificial Intelligence, Peking University. The team works extensively on alignment techniques for large language models; its work includes BeaverTails (NeurIPS 2023), an open-source safety alignment preference data set at the million scale, and SafeRLHF (ICLR 2024 Spotlight), a safety alignment algorithm for large language models. These techniques have been adopted by multiple open-source models. The team also wrote the industry's first comprehensive survey of AI alignment, accompanied by the resource website www.alignmentsurvey.com, which systematically discusses the AI alignment problem from four perspectives: Learning from Feedback, Learning under Distribution Shift, Assurance, and Governance. The team's views on alignment and superalignment were featured on the cover of issue 5 of Sanlian Life Weekly in 2024.
