High-scoring paper at COLM, the first conference dedicated to large language models: the preference search algorithm PairS makes LLM-based text evaluation more efficient

Aug 05, 2024 pm 02:31 PM

The AIxiv column is where this site publishes academic and technical content. Over the past few years, the column has published more than 2,000 reports covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please feel free to submit it or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com

The authors of the paper are all from the Language Technology Lab at the University of Cambridge. The first author is Liu Yinhong, a third-year doctoral student supervised by professors Nigel Collier and Ehsan Shareghi. His research interests include large models, text evaluation, and data generation. The co-first author is Zhou Han, a second-year doctoral student supervised by professors Anna Korhonen and Ivan Vulić. His research interest is efficient large models.

Large language models (LLMs) exhibit excellent instruction-following and task-generalization capabilities. This ability comes from training LLMs on instruction-following data and with reinforcement learning from human feedback (RLHF). In the RLHF training paradigm, the reward model is aligned with human preferences using ranked comparison data. This strengthens the alignment of LLMs with human values, so that they generate responses that are more helpful to humans and adhere to those values.

Recently, COLM, the first conference dedicated to large models, announced its acceptance results. One of the high-scoring papers analyzes the scoring bias that is hard to avoid or correct when an LLM is used as a text evaluator, and proposes recasting the evaluation problem as a preference ranking problem. On that basis the authors designed PairS, an algorithm that searches and sorts using pairwise preferences. By exploiting uncertainty and the transitivity assumption for LLMs, PairS produces efficient and accurate preference rankings and shows higher agreement with human judgment on multiple test sets.

  • Paper link: https://arxiv.org/abs/2403.16950

  • Paper title: Aligning with Human Judgment: The Role of Pairwise Preference in Large Language Model Evaluators

  • GitHub address: https://github.com/cambridgeltl/PairS

What are the problems with large model evaluation?

A large body of recent work has shown that LLMs perform well at evaluating text quality, forming a new paradigm for reference-free evaluation of generative tasks that avoids the cost of human annotation. However, LLM evaluators are highly sensitive to prompt design and are affected by several biases, including position bias, verbosity bias, and context bias. These biases prevent LLM evaluators from being fair and trustworthy, leading to inconsistency and misalignment with human judgment.

Previous work developed calibration techniques to reduce bias in LLM predictions. We first systematically analyze how effective these calibration techniques are at aligning pointwise LLM evaluators. As shown in Figure 2 above, existing calibration methods still fail to align the LLM evaluator well, even when supervision data is provided.

As shown in Formula 1, we argue that the main cause of misaligned evaluation is not a biased prior over the LLM's evaluation score distribution, but a misalignment of the evaluation standard, that is, of the LLM evaluator's likelihood. We believe that LLM evaluators apply evaluation criteria more consistent with those of humans when they make pairwise comparisons, so we explore a new LLM evaluation paradigm to promote better-aligned judgments.
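
Formula 1 is not reproduced in this text. A Bayes' rule decomposition that is consistent with the description above can be sketched as follows; the symbols (s for the evaluation score, x for the text being evaluated) are assumptions chosen for illustration and may differ from the paper's notation.

```latex
p_{\mathrm{LLM}}(s \mid x) \;\propto\;
  \underbrace{p_{\mathrm{LLM}}(x \mid s)}_{\text{likelihood (evaluation standard)}}
  \;\cdot\;
  \underbrace{p_{\mathrm{LLM}}(s)}_{\text{prior over scores}}
```

Read this way, calibration adjusts the prior term, while the argument here is that the likelihood term is what departs from human judgment.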

Inspiration from RLHF

As shown in Figure 1 below, inspired by how reward models are aligned through preference data in RLHF, we believe an LLM evaluator can produce more human-aligned predictions by generating a preference ranking. Some recent work has begun to obtain preference rankings by asking an LLM to perform pairwise comparisons. However, the complexity and scalability of preference ranking have been largely overlooked: these methods ignore the transitivity assumption, so the number of comparisons grows as O(N^2), which makes the evaluation process expensive and often infeasible.
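
The quadratic cost mentioned above can be seen in a minimal sketch of exhaustive pairwise ranking. The function name `win_count_rank` and the `prefers` callback are hypothetical; `prefers(a, b)` stands in for one LLM call that returns whether `a` is judged better than `b`, and here a table of made-up quality scores plays that role.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple


def win_count_rank(
    items: Sequence[str], prefers: Callable[[str, str], bool]
) -> Tuple[List[str], int]:
    """Rank items best-first by number of pairwise wins over all pairs."""
    wins = {item: 0 for item in items}
    n_comparisons = 0
    for a, b in combinations(items, 2):  # N * (N - 1) / 2 pairs
        n_comparisons += 1
        winner = a if prefers(a, b) else b
        wins[winner] += 1
    ranking = sorted(items, key=lambda item: wins[item], reverse=True)
    return ranking, n_comparisons


# Toy judge: hidden quality scores stand in for the LLM (illustrative only).
quality = {"A": 0.9, "B": 0.4, "C": 0.7, "D": 0.1, "E": 0.5, "F": 0.3}
ranking, n_comparisons = win_count_rank(
    list(quality), lambda a, b: quality[a] > quality[b]
)
```

With six candidates this already takes 15 comparisons, and the count grows quadratically with the number of candidates.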

PairS: Efficient Preference Search Algorithm

In this work, we propose two pairwise preference search algorithms, PairS-greedy and PairS-beam. PairS-greedy is based on a full transitivity assumption and on merge sort, and obtains a global preference ranking with only O(N log N) comparisons. The transitivity assumption means that, for example, for three candidates A, B, and C, whenever the LLM prefers A≻B and B≻C, it also prefers A≻C. Under this assumption, traditional sorting algorithms can be used directly to obtain a preference ranking from pairwise preferences.
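
The idea of sorting with a pairwise judge can be sketched as an ordinary merge sort whose comparator is the preference call. This is a minimal illustration under the full transitivity assumption, not the authors' implementation; `merge_sort_rank`, `prefers`, and the toy quality table are hypothetical names introduced here.

```python
from typing import Callable, List, Sequence, Tuple


def merge_sort_rank(
    items: Sequence[str], prefers: Callable[[str, str], bool]
) -> List[str]:
    """Rank items best-first using merge sort over a pairwise comparator.

    `prefers(a, b)` is a hypothetical stand-in for one LLM call that
    returns True when candidate `a` is judged better than candidate `b`.
    """
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort_rank(items[:mid], prefers)
    right = merge_sort_rank(items[mid:], prefers)

    merged: List[str] = []
    i = j = 0
    while i < len(left) and j < len(right):
        # One comparison per step; transitivity makes the others unnecessary.
        if prefers(left[i], right[j]):
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


# Toy judge: hidden quality scores stand in for the LLM (illustrative only).
quality = {"A": 0.9, "B": 0.4, "C": 0.7, "D": 0.1}
call_log: List[Tuple[str, str]] = []


def toy_prefers(a: str, b: str) -> bool:
    call_log.append((a, b))  # record every comparison that was requested
    return quality[a] > quality[b]


ranking = merge_sort_rank(list(quality), toy_prefers)
```

On these four candidates the sketch makes 5 comparisons, against 6 for all pairs; the gap widens as the number of candidates grows.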

LLMs are not perfectly transitive, however, so we also designed the PairS-beam algorithm. Under a looser transitivity assumption, we derive and simplify the likelihood function of a preference ranking. PairS-beam performs a beam search guided by likelihood values within each merge operation of merge sort, and uses the uncertainty of preferences to shrink the pairwise comparison space. PairS-beam can trade off comparison complexity against ranking quality, and efficiently provides a maximum likelihood estimate (MLE) of the preference ranking. Figure 3 below shows an example of how PairS-beam performs a merge operation.
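
A single merge step with beam search can be sketched roughly as follows. This is a simplified illustration of the idea described above, not the paper's algorithm: the scoring rule (sum of log-probabilities of the comparisons made), the branching rule (branch only when the preference probability is within `uncertainty` of 0.5), and all names (`beam_merge`, `pref_prob`, `beam_size`, `uncertainty`) are assumptions made for this example.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

State = Tuple[float, int, int, Tuple[str, ...]]  # (log-lik, i, j, merged so far)


def beam_merge(
    left: Sequence[str],
    right: Sequence[str],
    pref_prob: Callable[[str, str], float],
    beam_size: int = 2,
    uncertainty: float = 0.2,
) -> Tuple[List[str], int]:
    """Merge two best-first lists, branching only on uncertain comparisons.

    `pref_prob(a, b)` is a hypothetical judge returning P(a is better than b).
    Returns the highest-scoring merged list and the number of distinct
    comparisons requested.
    """
    cache: Dict[Tuple[str, str], float] = {}

    def prob(a: str, b: str) -> float:
        if (a, b) not in cache:
            # Clamp so that log() below is always defined.
            cache[(a, b)] = min(max(pref_prob(a, b), 1e-9), 1 - 1e-9)
        return cache[(a, b)]

    active: List[State] = [(0.0, 0, 0, ())]
    finished: List[Tuple[float, Tuple[str, ...]]] = []
    while active:
        candidates: List[State] = []
        for score, i, j, merged in active:
            if i == len(left) or j == len(right):
                # One side is exhausted: append the remainder, no more calls.
                finished.append((score, merged + tuple(left[i:]) + tuple(right[j:])))
                continue
            p = prob(left[i], right[j])
            take_left = (score + math.log(p), i + 1, j, merged + (left[i],))
            take_right = (score + math.log(1 - p), i, j + 1, merged + (right[j],))
            if abs(p - 0.5) < uncertainty:
                candidates += [take_left, take_right]  # uncertain: keep both
            else:
                candidates.append(take_left if p > 0.5 else take_right)
        # Keep only the most likely partial merges.
        active = sorted(candidates, key=lambda s: s[0], reverse=True)[:beam_size]
    best = max(finished, key=lambda f: f[0])
    return list(best[1]), len(cache)


# Toy judge: a sigmoid over hidden quality gaps (illustrative only).
quality = {"A": 0.9, "B": 0.6, "C": 0.5, "D": 0.1}


def toy_prob(a: str, b: str) -> float:
    return 1.0 / (1.0 + math.exp(-10.0 * (quality[a] - quality[b])))


merged, n_comparisons = beam_merge(["A", "C"], ["B", "D"], toy_prob, uncertainty=0.25)
```

In this sketch, a larger `beam_size` or `uncertainty` explores more alternative merges at the cost of more comparisons, which is one way to picture the trade-off between comparison complexity and ranking quality.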

Experimental results

We tested on multiple representative datasets, including the closed-ended summarization tasks NewsRoom and SummEval and the open-ended story generation task HANNA, and compared against several baseline methods for pointwise LLM evaluation: the unsupervised direct scoring, G-Eval, and GPTScore, and the supervised UniEval and BARTScore. As shown in Table 1 below, PairS agrees with human ratings more closely than these baselines on every task. With GPT-4-turbo, it even achieves state-of-the-art results.

In the paper, we also compare against two baseline methods for preference ranking, win rate and Elo rating. PairS reaches the same ranking quality with only about 30% of the number of comparisons. The paper also provides further insights into how pairwise preferences can be used to quantify the transitivity of LLM evaluators, and how pairwise evaluators can benefit from calibration.
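
For readers unfamiliar with the rating baseline, the standard Elo update after one pairwise comparison can be sketched as below. The constants (scale 400, K-factor 32) are conventional defaults assumed for illustration; the settings used in the paper's experiments are not stated here.

```python
from typing import Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(
    r_a: float, r_b: float, a_wins: bool, k: float = 32.0
) -> Tuple[float, float]:
    """Return the updated ratings of A and B after one comparison."""
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    new_a = r_a + k * (s_a - e_a)
    new_b = r_b + k * ((1.0 - s_a) - (1.0 - e_a))
    return new_a, new_b


# Two candidates start level; the judge prefers A once.
rating_a, rating_b = elo_update(1000.0, 1000.0, a_wins=True)
```

Ratings of this kind typically need many comparisons per candidate before they stabilize, which is the cost that a search-based ranking tries to avoid.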

For more research details, please refer to the original paper.
