Which human preference optimization algorithm is better? Follow the master to understand DPO, IPO and KTO


PHPz

Aug 05, 2024 pm 09:19 PM


Methods that collect human labels on the relative quality of model-generated content, and then fine-tune unsupervised large language models to match these preferences through reinforcement learning from human feedback (RLHF), have given conversational AI a huge boost. However, because RLHF is a complex and often unstable process, research on aligning model outputs with human preferences by directly optimizing a loss function has become a hot topic.

This article summarizes a Hugging Face blog post that compares the performance of three common human preference optimization algorithms. The authors ran extensive experiments, across different models and hyperparameters, to evaluate three practical methods for tuning language models on preferences without reinforcement learning. The three methods are direct preference optimization (DPO), identity preference optimization (IPO), and Kahneman-Tversky optimization (KTO).

Figure: the five authors of the original blog post.

TL;DR

In this blog, the authors evaluate three promising LLM alignment algorithms: direct preference optimization (DPO), identity preference optimization (IPO), and Kahneman-Tversky optimization (KTO). The experiments were performed on two high-quality 7B-parameter LLMs that had been supervised fine-tuned but not yet aligned to human preferences. The authors found that while it is possible to identify a best-performing algorithm, some key hyperparameters must be tuned to obtain the best results.

Alignment without reinforcement learning

Figure: schematic diagram of the principle of DPO (https://arxiv.org/abs/2305.18290)

Direct preference optimization (DPO) has emerged as a promising approach for aligning large language models (LLMs) with human or AI preferences. Unlike traditional alignment methods based on reinforcement learning, DPO recasts the alignment objective as a simple loss function that can be optimized directly on a preference dataset {(x, y_w, y_l)}, where x is a prompt and y_w, y_l are the preferred and dispreferred responses, respectively.
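To make the loss concrete, here is a minimal Python sketch of the per-example DPO objective as given in the DPO paper (https://arxiv.org/abs/2305.18290): the negative log-sigmoid of β times the difference between the policy-to-reference log-ratios of y_w and y_l. The function name and the sample log-probabilities are illustrative assumptions, not code from the blog or from any library.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities (sketch)."""
    # Log-ratio of policy to reference model for each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # How much more the policy prefers y_w over y_l than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), written to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities, for illustration only.
loss_same = dpo_loss(-10.0, -12.0, -10.0, -12.0)    # policy equals reference
loss_better = dpo_loss(-9.0, -13.0, -10.0, -12.0)   # policy favors y_w more strongly
```

When the policy matches the reference, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward y_w and away from y_l.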

Figure: example of a human preference tuning dataset

DPO's simplicity and ease of use have made it popular, and it has been used successfully to train models such as Zephyr and Intel's NeuralChat. The success of DPO has inspired researchers to study new loss functions, which fall into two main directions:

Robustness: One disadvantage of DPO is that it quickly overfits on human preference datasets. To avoid this, researchers at Google DeepMind introduced identity preference optimization (IPO), which adds a regularization term to the DPO loss and allows the model to be trained to convergence without tricks such as early stopping.

Dispensing with paired preference data: Like most alignment methods, DPO requires a paired preference dataset in which annotators label which model response is better according to a set of criteria (such as helpfulness or harmfulness). In practice, creating this data is time-consuming and costly. ContextualAI recently proposed an interesting alternative called Kahneman-Tversky Optimization (KTO), which defines the loss function entirely in terms of individual samples labeled "good" or "bad" (such as the thumbs-up or thumbs-down icons seen in chat UIs). These labels are easier to obtain, and KTO is arguably a promising way to continuously update chat models running in production environments.
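The IPO objective mentioned above can be sketched in a few lines. Following the IPO paper, the loss regresses the difference between the policy-to-reference log-ratios of the preferred and dispreferred responses toward a fixed target of 1/(2β), rather than passing it through a sigmoid. The function name and inputs are illustrative assumptions; library implementations may differ in details such as how sequence log-probabilities are normalized.

```python
def ipo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example IPO loss (sketch): squared distance of the margin to 1/(2*beta)."""
    # h = log[pi(y_w)/pi_ref(y_w)] - log[pi(y_l)/pi_ref(y_l)]
    h = ((policy_chosen_logp - ref_chosen_logp)
         - (policy_rejected_logp - ref_rejected_logp))
    # Fixed regression target for the margin.
    target = 1.0 / (2.0 * beta)
    return (h - target) ** 2
```

The loss is zero exactly when the margin hits the target and grows again if the margin overshoots it, which is the regularizing effect: the policy is not rewarded for pushing the preferred and dispreferred responses arbitrarily far apart.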

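Note that each of these methods comes with its own hyperparameters, the most important of which is β, a weight that controls how strongly the reference model is preferred. Now that these methods are available through third-party libraries (such as Hugging Face TRL), the natural question is: "Which combination of method and hyperparameters produces the best chat model?"

This article aims to answer that question through an experimental analysis of the three methods, examining key hyperparameters such as β and the number of training steps one at a time, and then evaluating the resulting models with MT-Bench, a common benchmark for measuring the capabilities of chat models.

Source code: https://github.com/huggingface/alignment-handbook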

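Links

The related resources are available at the following addresses:

Code and configuration files for the hyperparameter scan: https://github.com/huggingface/alignment-handbook/tree/main/recipes/pref_align_scan
Collection of the datasets and models used in this article: https://huggingface.co/collections/alignment-handbook/dpo-vs-kto-vs-ipo-65a69c5f03548d61dbe29ef8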

Experimental setup

Two main factors need to be considered in alignment experiments: the model to be optimized and the dataset. To gather more evidence, the authors considered two models, OpenHermes-2.5-Mistral-7B and Zephyr-7B-β-sft, and two alignment datasets: Intel's orca_dpo_pairs dataset and the ultrafeedback-binarized dataset.

orca_dpo_pairs dataset: https://huggingface.co/datasets/Intel/orca_dpo_pairs
ultrafeedback-binarized dataset: https://huggingface.co/datasets/HuggingFaceH4/ultrafeedback_binarized

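In the first experiment, the authors used OpenHermes-2.5-Mistral-7B because it is one of the best 7B-class chat models that has not been subjected to any alignment method. They then used Intel's orca_dpo_pairs dataset, which contains 13k prompts where the chosen response is generated by GPT-4 and the rejected response by Llama Chat 13b. This is also the dataset used for NeuralChat and NeuralHermes-2.5-Mistral-7B.

Because KTO does not inherently require paired preference data, the authors simply labeled the GPT-4 responses as "good" and the Llama Chat 13b responses as "bad". Although GPT-4's responses are generally likely to be preferred over those of Llama Chat 13b, Llama Chat 13b may occasionally produce the better response; the authors consider such cases rare enough to ignore.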
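The labeling step described above can be sketched as follows: each paired record is split into two unpaired records, one labeled good and one labeled bad. The function name and the field names (`prompt`, `chosen`, `rejected`, `completion`, `label`) are assumptions chosen for illustration; the blog does not show the authors' conversion code.

```python
def pairs_to_unpaired(pairs):
    """Split (prompt, chosen, rejected) records into single labeled examples."""
    unpaired = []
    for record in pairs:
        # The preferred response becomes a "good" (True) example...
        unpaired.append({"prompt": record["prompt"],
                         "completion": record["chosen"],
                         "label": True})
        # ...and the dispreferred response becomes a "bad" (False) example.
        unpaired.append({"prompt": record["prompt"],
                         "completion": record["rejected"],
                         "label": False})
    return unpaired

# Toy record, for illustration only.
toy_pairs = [{"prompt": "Q", "chosen": "good answer", "rejected": "bad answer"}]
toy_unpaired = pairs_to_unpaired(toy_pairs)
```

Each input pair yields two output examples, so a dataset of N pairs becomes 2N labeled samples, with the caveat noted above that a small fraction of the "bad" labels may be wrong.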

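The second experiment performed preference alignment on the Zephyr-7b-β-sft model using the ultrafeedback-binarized dataset, which contains 66k prompts, each paired with a chosen and a rejected response. This dataset was previously used to train the original Zephyr model, which at the time was the best 7B-class model on many automated benchmarks and human evaluations.

Experiment configuration

The alignment-handbook provides a simple way to configure a single experiment; these parameters configure the run_dpo.py script.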


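The authors' configuration for the Zephyr experiments is broadly similar.

The chat template is inferred automatically from the base chat model: OpenHermes-2.5 uses the ChatML format and Zephyr uses the H4 format. For users who want their own chat format, the tokenizers library now supports user-defined chat templates written as Jinja format strings: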

# Example of the Zephyr chat template
"{% for message in messages %}\n {% if message['role'] == 'user' %}\n {{ '<|user|>\n' + message['content'] + eos_token }}\n {% elif message['role'] == 'system' %}\n {{ '<|system|>\n' + message['content'] + eos_token }}\n {% elif message['role'] == 'assistant' %}\n {{ '<|assistant|>\n' + message['content'] + eos_token }}\n {% endif %}\n {% if loop.last and add_generation_prompt %}\n {{ '<|assistant|>' }}\n {% endif %}\n {% endfor %}"


This template formats a conversation as follows:

# <|system|>
# You are a friendly chatbot who always responds in the style of a pirate.</s>
# <|user|>
# How many helicopters can a human eat in one sitting?</s>
# <|assistant|>
# Ah, me hearty matey! But yer question be a puzzler! A human cannot eat a helicopter in one sitting, as helicopters are not edible. They be made of metal, plastic, and other materials, not food!
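As a rough illustration of what the Jinja template above produces, the following pure-Python sketch reproduces the same layout without a template engine. The function name `format_zephyr_chat` and the default `eos_token="</s>"` are assumptions made for this example; in practice the tokenizer renders the template itself, and the exact whitespace may differ from the Jinja output.

```python
def format_zephyr_chat(messages, eos_token="</s>", add_generation_prompt=False):
    """Render messages in the <|role|> layout used by the Zephyr template (sketch)."""
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ("system", "user", "assistant"):
            continue  # the template only handles these three roles
        # Role tag on one line, then the content followed by the EOS token.
        parts.append(f"<|{role}|>\n{message['content']}{eos_token}")
    if add_generation_prompt:
        # Cue the model to begin an assistant turn.
        parts.append("<|assistant|>")
    return "\n".join(parts)

example_messages = [
    {"role": "system", "content": "You are a friendly chatbot."},
    {"role": "user", "content": "Hello!"},
]
example_prompt = format_zephyr_chat(example_messages, add_generation_prompt=True)
```

Setting `add_generation_prompt=True` appends a bare `<|assistant|>` tag, which is what cues the model to generate the next assistant reply at inference time.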


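Hyperparameter sweep

The authors swept β over the values 0.01, 0.1, 0.2, ..., 0.9 for each of the three methods: DPO, IPO and KTO. The value 0.01 was included because the authors observed that some alignment algorithms are particularly sensitive to this parameter. All experiments were trained for a single epoch, and all other hyperparameters, including the random seed, were held fixed.

The authors then ran the experiments one by one using the base configurations defined above, submitting each run as a Slurm job: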

# Define an array containing the base configs we wish to fine tune
configs=("zephyr" "openhermes")
# Define an array of loss types
loss_types=("sigmoid" "kto_pair" "ipo")
# Define an array of beta values
betas=("0.01" "0.1" "0.2" "0.3" "0.4" "0.5" "0.6" "0.7" "0.8" "0.9")

# Outer loops over base configs and loss types
for config in "${configs[@]}"; do
    for loss_type in "${loss_types[@]}"; do
        # Inner loop for beta values
        for beta in "${betas[@]}"; do
            # Determine the job name and model revision based on loss type.
            # Braces around config keep the shell from reading "$config_" as a variable name.
            job_name="${config}_${loss_type}_beta_${beta}"
            model_revision="${loss_type}-${beta}"
            # Submit the job
            sbatch --job-name=${job_name} recipes/launch.slurm dpo pref_align_scan config_$config deepspeed_zero3 \
            "--beta=${beta} --loss_type=${loss_type} --output_dir=data/$config-7b-align-scan-${loss_type}-beta-${beta} --hub_model_revision=${model_revision}"
        done
    done
done
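For a sense of scale, the sweep above can be enumerated in a few lines of Python. This is only an illustrative sketch that mirrors the arrays in the shell script; it does not submit any jobs, and the helper name `scan_jobs` is hypothetical.

```python
from itertools import product

CONFIGS = ["zephyr", "openhermes"]
LOSS_TYPES = ["sigmoid", "kto_pair", "ipo"]  # "sigmoid" selects the DPO loss
BETAS = ["0.01", "0.1", "0.2", "0.3", "0.4", "0.5", "0.6", "0.7", "0.8", "0.9"]

def scan_jobs():
    """Return one (job_name, model_revision) pair per run in the sweep."""
    jobs = []
    for config, loss_type, beta in product(CONFIGS, LOSS_TYPES, BETAS):
        job_name = f"{config}_{loss_type}_beta_{beta}"
        model_revision = f"{loss_type}-{beta}"
        jobs.append((job_name, model_revision))
    return jobs

all_jobs = scan_jobs()
```

Two base configs, three loss types and ten β values give 2 × 3 × 10 = 60 training runs, each identified by its own job name and Hub model revision.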


Results

The authors evaluated all models with MT-Bench, a multi-turn dialogue benchmark. The benchmark uses GPT-4 to judge model performance in eight categories: writing, roleplay, reasoning, math, coding, extraction, STEM, and humanities. Although it has some shortcomings, MT-Bench is still a good way to evaluate conversational LLMs.

Zephyr-7b-β-SFT

Figure: MT-Bench scores of the Zephyr model for different values of β.

For the Zephyr model, the authors found that the best performance was achieved at a β value of 0.01. This holds for all three algorithms tested, and an interesting follow-up experiment would be a finer-grained scan in the range 0.0 to 0.2. While DPO achieves the highest MT-Bench score, KTO (paired) achieves better results in all settings except one. IPO, despite its stronger theoretical guarantees, appears to be worse than the base model in all settings but one.

Figure: best MT-Bench results of each algorithm on the Zephyr model, by category.

The strengths and weaknesses of these models can be identified by breaking down each algorithm's best results across the categories evaluated by MT-Bench. As the breakdown shows, there is still much room for improvement in reasoning, coding, and math questions.

OpenHermes-7b-2.5

Although the ranking of the algorithms on OpenHermes matches that observed on Zephyr, i.e. DPO > KTO > IPO, the optimal value of β differs for each algorithm. The best choices of β for DPO, KTO and IPO are 0.6, 0.3 and 0.01, respectively.

Figure: MT-Bench scores of the OpenHermes model for different values of β.

OpenHermes-7b-2.5 is clearly a stronger base model: preference alignment improves its MT-Bench score by only 0.3.

Figure: best MT-Bench results of the three algorithms on the OpenHermes model, by category.

Summary

In this blog post, the authors emphasized the importance of choosing the right hyperparameters for preference alignment. The experiments showed that DPO outperforms KTO in the paired preference setting, while IPO appears to perform poorly despite its stronger theoretical guarantees.

These experimental results are reproducible, and the code and configuration files are available in the alignment-handbook. The best-performing models and datasets are also available.

Outlook

The authors will continue to explore new human preference alignment algorithms and evaluate their performance. For now at least, DPO is the most robust and best-performing alignment algorithm for large language models. KTO is also promising because, whereas both DPO and IPO require paired preference data, KTO can be applied to any dataset that contains positive and negative labels.

Original link: https://huggingface.co/blog/pref-tuning?continueFlag=480af4490eaf8a2f4544fe3658589730
