
The Bengio team proposes a new multi-modal benchmark, targeting the weaknesses of Claude 3.5 and GPT-4o

王林
Release: 2024-06-29 00:06:53
Original
AIxiv is the column through which this site publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 submissions, covering top laboratories at major universities and companies around the world, and has effectively promoted academic exchange and dissemination. If you have excellent work to share, feel free to submit it or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com

The author of this article, Zhang Tianyu, studies at the Mila artificial intelligence institute in Canada under Turing Award winner Professor Yoshua Bengio. His doctoral work focuses on multimodality, GFlowNets, multi-agent reinforcement learning, and applications of AI to climate change. He has published papers at top machine learning conferences such as ICML, ICLR, and ICASSP; representative work includes Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation (CLAP).

To reach the ultimate goal of artificial general intelligence (AGI), models must first be able to complete tasks that humans find easy. One of the key guiding principles in large-model development is therefore how to make machines think and reason like humans; technologies such as attention mechanisms and Chain-of-Thought were inspired by this.

However, many people may not realize that cognitive tasks that are very simple for humans are often accompanied by very complex reasoning processes. As an example, try filling in the occluded text gaps based on the image below:

[Image: VCR example with a partially occluded Chinese caption]

(Correct answer: Machine learning researchers from around the world are excited about the new GPU. Its cutting-edge features make large-scale experiments more efficient and cheaper, even though it is as big as a stove.)

For most native Chinese speakers, this task should not be difficult, and you can probably get the answer in just a few seconds. But inferring the complete text from the exposed parts still involves a very complex reasoning process: contemporary neuroscience research shows that recovering partially occluded objects requires heavy involvement of the prefrontal cortex, the region responsible for high-level decision-making.

We know that current vision-language models (VLMs) can perform object recognition and text recognition very accurately. But when the occluded part is text, when the model's optical character recognition (OCR) fails, and when the only key information is the few remaining pixels of the occluded text, can the model emulate the human reasoning process and complete this task?

To this end, the team of Turing Award winner Yoshua Bengio proposed a new visual question answering task: Visual Caption Restoration (VCR). Let us use this task to probe the reasoning capabilities of vision-language models: how far are current vision-language models from human cognitive levels?


  • Paper title: VCR: Visual Caption Restoration
  • Paper link: arxiv.org/abs/2406.06462
  • Code repository: github.com/tianyu-z/VCR (includes the data generation code for model evaluation and pretraining)
  • Hugging Face link: huggingface.co/vcr-org

VCR dataset introduction

To build the VCR task, the researchers designed a pipeline that generates VCR composite images from image-caption pairs. In this pipeline, the visibility of the text in the image is controlled by the size of the white rectangles covering the text, which in turn controls the difficulty of the task.
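The actual generation code is in the repository linked above. As a rough illustration of the idea (not the authors' implementation), the sketch below renders a caption on a white strip under the main image and covers most of the text line with a white rectangle, so that the number of pixel rows left visible controls the difficulty; the real pipeline covers only selected spans of the caption, and all function and parameter names here are invented for illustration.

```python
from PIL import Image, ImageDraw, ImageFont

def make_vcr_image(main_image_path, caption, visible_rows=6,
                   font_path="DejaVuSans.ttf", font_size=24):
    """Illustrative sketch: stack a caption strip under an image and occlude
    the caption text, leaving only `visible_rows` pixel rows exposed at the
    top and bottom of the glyphs. Smaller `visible_rows` = harder task."""
    main = Image.open(main_image_path).convert("RGB")
    font = ImageFont.truetype(font_path, font_size)

    # Render the caption on a white strip as wide as the main image.
    strip_h = font_size + 16
    strip = Image.new("RGB", (main.width, strip_h), "white")
    draw = ImageDraw.Draw(strip)
    draw.text((8, 8), caption, fill="black", font=font)

    # Cover the middle of the text line with a white rectangle, keeping only
    # a few pixel rows visible at the top and bottom of the characters.
    top = 8 + visible_rows
    bottom = 8 + font_size - visible_rows
    if bottom > top:
        draw.rectangle([8, top, strip.width - 8, bottom], fill="white")

    # Stack the original image and the partially occluded caption strip.
    out = Image.new("RGB", (main.width, main.height + strip_h), "white")
    out.paste(main, (0, 0))
    out.paste(strip, (0, main.height))
    return out
```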

With this pipeline, the researchers generated the VCR-wiki dataset from Wikipedia main image-introduction pairs, in both English and Chinese. Each language has two difficulty levels, "Easy" and "Hard":

  • "Easy" difficulty VCR task can make the OCR model invalid ;
  • "Difficulty" VCR task only retain 1-2 top and bottom for each occluded text The height of pixels, but still allows users of the corresponding language to complete the task.

For each language and difficulty, the test and validation sets each contain 5,000 samples, and the remaining samples form the training set.
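The dataset is distributed under the vcr-org organization on Hugging Face. The sketch below shows how one might load a split with the `datasets` library; the repository name used here is an assumption, so check huggingface.co/vcr-org for the actual dataset identifiers and field names.

```python
from datasets import load_dataset

# Repository name is assumed for illustration; see huggingface.co/vcr-org for the real IDs.
ds = load_dataset("vcr-org/VCR-wiki-en-easy-test", split="test")

sample = ds[0]
print(sample.keys())  # inspect the available fields (image, caption, etc.)
```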

[Image: VCR-wiki dataset examples]

The example at the beginning of the article poses only a small challenge for humans; it does not show the upper limit of human performance on this task, nor the thinking and skills humans use when solving it. A sample VCR task at "Hard" difficulty is shown below; readers are encouraged to try filling in the occluded text themselves.

[Image: a "Hard" difficulty VCR example in Chinese]

(Correct answer: The Great Treatise, a treatise on mathematics and astronomy compiled by Ptolemy in ancient Greece around 140 AD, which described the complex motion paths of the stars and planets. Until the Middle Ages and the early Renaissance, the geocentric model proposed in the book was adopted by the Islamic world and Europe...)

How do humans complete partially obscured text?


There is a concept in education and cognitive science called meta-cognition. When designing AI, we humans, as the teachers, can use observations of our own thinking processes as a reference to help the models, as the students, learn more efficiently. Thinking about how humans complete VCR tasks can therefore be instructive for model design.

The picture below shows one of the author's own approaches to solving a VCR task, for reference. It may look like many steps, but in essence it is just repeatedly gathering information from different regions of the image and cross-checking it to increase confidence in the answer.

[Image: the author's step-by-step reasoning process for a VCR example]

At first glance, the author had only a vague guess; as more of the picture was read and new information obtained, the guess was gradually verified. Even after reading everything and starting to fill in the blanks, one keeps comparing different pieces of information to confirm the answer. When a hypothesis is inconsistent with other information, it is discarded and a new one is tried.

Human evaluation results

How good are humans at VCR tasks? The accuracy of native or fluent speakers of English and Chinese on the Easy and Hard settings is as follows:

Counting errors on dates, place names, and personal names, average human accuracy on Chinese Easy is about 98.58%, and on Chinese Hard about 91.84%. Excluding those errors, humans are nearly perfect on Chinese Easy, and accuracy on Chinese Hard reaches 96.63%. The VCR task is clearly very easy for humans.

Existing model results

The authors tested an "all-star lineup": Claude 3 Opus, Claude 3.5 Sonnet, Gemini 1.5 Pro, GPT-4o, GPT-4 Turbo, Qwen-VL-Max, Reka Core, and some of the best-performing open-source models available today.

The following figure shows the performance of each model on the Easy setting of the Chinese VCR-Wiki:

[Image: model accuracy on the Chinese VCR-Wiki, Easy setting]

The red boxes indicate the accuracy with which a model can restore the occluded text when given both the image (VI) and the text embedded in the image (TEI) as context. The blue boxes indicate the accuracy when only the embedded text (TEI) is given as context, without the image (VI).

[Image: evaluation results]

The results show that:

  • The vast majority of models currently cannot do this task;
  • The vast majority of models do not make good use of the image: providing the image (VI) does not improve their accuracy.
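To make the two settings described above (with and without the image as input) concrete, here is a rough sketch of how the comparison might be run against a generic chat-style VLM API; the `client.chat` call, message format, and field names are placeholders rather than the authors' evaluation code.

```python
def restore_caption(client, model, stacked_image, text_strip_image, with_image):
    """Query a hypothetical chat-style VLM under the two VCR settings:
    with_image=True  -> VI + TEI: the full composite (picture + occluded caption strip)
    with_image=False -> TEI only: just the strip containing the occluded text.
    """
    prompt = ("Part of the caption in this image is covered by white boxes. "
              "Restore the covered text exactly.")
    image = stacked_image if with_image else text_strip_image
    response = client.chat(  # placeholder API call
        model=model,
        messages=[{"role": "user", "content": prompt, "images": [image]}],
    )
    return response.text

# Accuracy is then computed separately for each setting by comparing the restored
# spans with the ground-truth caption, so the benefit of seeing the image (VI) is
# exactly the gap between the two settings.
```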

On the Chinese Hard setting, the models ran into even greater trouble. The best performer, GPT-4o, achieves only 2.2% accuracy, and apart from CogVLM2-Chinese and Qwen-VL-Max, most models are close to 0%.

In Hard mode, current models can hardly answer correctly at any meaningful rate, let alone approach human performance.

English VCR evaluation results

The author also tested the current best open source and closed source visual-language models on the English VCR-Wiki. Before showing the test results, please take a look at two examples of the English VCR-Wiki task:

English Easy example:

[Image: English Easy VCR example]

(Correct answer: Since the United States Post Office issued its first stamp in 1847, over 4,000 stamps have been issued and over 800 people featured. Many of these people...)

English Hard example:

[Image: English Hard VCR example]

(Correct answer: Lincoln is the luxury vehicle division of American automobile manufacturer Ford. Marketed among the top luxury vehicle brands in the United States, for...)

The test results of the English VCR-Wiki shown in the article are as follows:

[Image: model results on the English VCR-Wiki]

Looking at the results as a whole, the models perform better in English than in Chinese in both the Easy and Hard settings. This contradicts the common intuition that partially occluded Chinese characters should be easier to complete because of their distinctive modular structure. Perhaps this is because English enjoys a larger advantage over Chinese in both the quantity and quality of pretraining data.

Among the many models tested, GPT-4o performs best among the closed-source models, and CogVLM2 performs best among the open-source models.

An interesting phenomenon is that adding the image clearly helps CogVLM2 (a 20.3% improvement in the Hard setting), whereas for GPT-4o the results actually drop. A similar phenomenon appears in the Chinese tests. The author believes this is caused by the models' architecture; for details, readers are welcome to consult the CogVLM series of papers and code.

In addition, closed-source models generally achieve better results than open-source models, which may be attributable to better training strategies or larger parameter counts. Even so, the models still struggle under the "Hard" setting. Open-source models can partially handle the "Easy" setting, but under the Hard setting most of them fail at this task that is very simple for humans.

Overview of related tasks

VQA

The input of a VQA (Visual Question Answering) task is a question and a related image. Because there is no unique standard answer, evaluating VQA is very challenging. Traditional VQA methods focus mainly on direct queries about elements visible in the image, without involving the complex relationship between the text embedded in the image and the overall image context.

In some VQA benchmarks where text carries a large share of the information in the image, a model's vision module may not even need to be aligned with its language module to perform well. The pipeline goes: the image is fed to an OCR vision module, the OCR module outputs the characters it finds in the image, and these characters are passed as context to the language module. The VQA task thus degenerates into a QA task that needs no image information; the vision-language alignment ability that was supposed to be compared across different VLMs is ignored, while OCR ability is emphasized instead.
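As an illustration of that degenerate pipeline, the sketch below chains an off-the-shelf OCR tool with a text-only language model, so the image itself never reaches the language model; `pytesseract` is used only as a stand-in OCR module here, and `answer_with_llm` is a placeholder for any text-only QA model.

```python
from PIL import Image
import pytesseract  # stand-in OCR module for this illustration


def degenerate_vqa(image_path, question, answer_with_llm):
    """Text-heavy VQA 'solved' without any vision-language alignment:
    OCR extracts the characters, and a text-only LM answers from them."""
    ocr_text = pytesseract.image_to_string(Image.open(image_path))
    prompt = f"Text found in the image:\n{ocr_text}\n\nQuestion: {question}\nAnswer:"
    return answer_with_llm(prompt)  # placeholder for any text-only QA model
```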

OCR

The Optical Character Recognition (OCR) task typically takes as input an image containing complete characters and outputs a string representing the characters in the image, without needing to consider the overall context of the image.

Models pretrained with OCR can extract embedded text from an input image even when that text is incomplete or blurry. However, as the text becomes more blurred or occluded, recovering the original text from the visible parts alone becomes difficult, and OCR methods are of limited use in this situation.

In short, the VQA task has no standard answers, so evaluating the quality of a model's responses remains an open problem; the OCR task, on the other hand, does not require context to complete, so it cannot test whether a model has really learned to exploit contextual information.

The irreplaceability of the VCR task

Visual Caption Restoration (VCR) builds a bridge between VQA and OCR.

  • The unique challenge of the VCR task is that it requires the model to precisely align visual and textual information, in sharp contrast to OCR's simpler text-extraction task. In OCR, the main concern is recognizing the visible characters, without needing to understand their contextual relevance in the image's narrative. In contrast, VCR requires the model to jointly use the available partial, pixel-level text cues and the visual context to accurately reconstruct the occluded content. This tests not only the model's ability to process embedded text and visual elements, but also its ability to maintain internal consistency, similar to the human cognitive process of understanding and responding through contextual and visual cues.
  • Unlike VQA, each VCR question has a unique answer, so evaluation can be done by accuracy, making the metric unambiguous (a naive sketch follows after this list).
  • By adjusting the proportion of text that is covered, the difficulty of the task can be controlled, providing a rich testing environment.

Like OCR, the VCR task can also serve as a training task for VLMs. The authors have released the transform code, which can generate VCR task images from any given image-caption pair.
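Because each occluded span has a unique ground-truth answer, evaluation can be as simple as exact matching after light normalization. The sketch below is only a naive illustration and not necessarily the exact metric used in the paper.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of occluded spans restored exactly (after simple normalization).
    Naive sketch; the paper's metric may match spans in a more refined way."""
    def norm(s):
        return " ".join(s.strip().lower().split())

    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)
```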

Summary

The Visual Caption Restoration (VCR) task proposed in this paper uses a seemingly simple caption-restoration problem to cleverly expose the gap in reasoning ability between existing vision-language models and humans on higher-order cognitive tasks. We believe this task can inspire more effective VLM training, evaluation, and inference methods, further narrowing the gap between multimodal models and human cognition.

Source: jiqizhixin.com