GPT-4 once amazed countless people by solving the famous Internet meme "Chihuahua or blueberry muffin".
Now, however, it stands accused of "cheating"!
The new test reuses every picture from the original meme, but shuffles their order and layout.
The latest version of GPT-4 is known for its all-in-one capabilities. Surprisingly, however, it miscounted the images in the shuffled picture, and it even misidentified Chihuahuas that it had recognized correctly in the original.
Why, then, does GPT-4 perform so well on the original image?
UCSC Assistant Professor Xin Eric Wang, who conducted the test, speculates that the original image is simply too popular on the Internet. He believes GPT-4 encountered the original and its answers many times during training and memorized them.
Yann LeCun, one of the three winners of the 2018 Turing Award, also took notice and said:
Be careful about testing on the training set.
How popular is the original picture? It is not only famous on the Internet; it has even become a classic problem in computer vision and has appeared many times in related papers.
Setting aside the influence of the original image, where exactly do GPT-4's capabilities fall short? Many netizens proposed their own tests.
To rule out the possibility that an overly complicated layout was to blame, some people switched to a simple 3x3 grid, and GPT-4 still made a lot of mistakes.
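The rearrangement tests described above share one idea: keep the same pictures but change where they sit, so that a memorized answer for the original layout no longer helps. Below is a minimal Python sketch of that idea. The function name `shuffle_grid` and the tile file names are hypothetical, and the tiles are plain placeholders rather than real images; this is not the procedure any of the testers actually used.

```python
import random


def shuffle_grid(tiles, labels, cols, seed=0):
    """Shuffle tiles together with their ground-truth labels, then
    arrange both into rows of `cols` items (hypothetical helper)."""
    if len(tiles) != len(labels):
        raise ValueError("tiles and labels must have the same length")
    if cols <= 0 or len(tiles) % cols != 0:
        raise ValueError("number of tiles must be a positive multiple of cols")

    # One shared permutation keeps every tile paired with its label.
    order = list(range(len(tiles)))
    random.Random(seed).shuffle(order)

    shuffled_tiles = [tiles[i] for i in order]
    shuffled_labels = [labels[i] for i in order]

    # Cut the flat lists into rows, in reading order.
    tile_grid = [shuffled_tiles[r:r + cols] for r in range(0, len(order), cols)]
    label_grid = [shuffled_labels[r:r + cols] for r in range(0, len(order), cols)]
    return tile_grid, label_grid


# Illustrative data only: nine placeholder tiles for a 3x3 grid.
tiles = [f"tile_{i}.png" for i in range(9)]
labels = ["chihuahua" if i % 2 == 0 else "muffin" for i in range(9)]
tile_grid, label_grid = shuffle_grid(tiles, labels, cols=3, seed=42)
```

Because the labels are permuted with the tiles, `label_grid` can serve as the answer key for whatever layout the model is shown.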
Someone else cropped out a few of the pictures and sent them to GPT-4 one at a time, and it got 5 out of 5 correct.
Xin Eric Wang, however, believes that putting these easily confused images together is the very heart of the challenge.
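To compare the "all together" and "one at a time" settings on equal terms, the model's answers can be scored against a ground-truth grid. The following Python sketch assumes the answers have already been parsed into label lists; `score_grid` is a hypothetical helper, and the predictions below are made-up values for illustration, not actual GPT-4 output.

```python
def score_grid(predicted, expected):
    """Compare predicted labels with ground truth in reading order
    (hypothetical helper, not part of any library)."""
    pred_flat = [label for row in predicted for label in row]
    exp_flat = [label for row in expected for label in row]

    # Pairwise comparison; missing or extra predictions never count as correct.
    correct = sum(p == e for p, e in zip(pred_flat, exp_flat))
    return {
        # False when the model reports the wrong number of images.
        "count_matches": len(pred_flat) == len(exp_flat),
        "correct": correct,
        "total": len(exp_flat),
        "accuracy": correct / len(exp_flat) if exp_flat else 0.0,
    }


# Illustrative data only.
expected = [
    ["chihuahua", "muffin", "chihuahua"],
    ["muffin", "chihuahua", "muffin"],
    ["chihuahua", "muffin", "chihuahua"],
]
predicted = [
    ["chihuahua", "muffin", "muffin"],
    ["muffin", "chihuahua", "muffin"],
    ["chihuahua", "chihuahua", "chihuahua"],
]
grid_result = score_grid(predicted, expected)
print(grid_result["correct"], grid_result["total"])  # prints: 7 9
```

A single-image test fits the same function by passing one-element rows, so both settings report the same metrics.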
In the end, someone combined two well-known prompting tricks, telling the AI to "take a deep breath" and to "think step by step", and got the correct result.
However, GPT-4's answer included the phrase "This is an example of a visual pun or a famous meme", which also suggests that the original image may indeed be in the training data.
Finally, someone also tried the "Teddy or fried chicken" test, which often appears alongside the original, and found that GPT-4 could not tell the two apart well either.
As for this "blueberry or chocolate bean" test, that is asking a bit too much...
In academia, the "nonsense" that large models produce is called the hallucination problem, and visual hallucination in multimodal large models has recently become a hot research direction.
One study at EMNLP 2023 built the GVIL dataset, which contains 1,600 data points, and systematically evaluated how models handle visual illusions.
The study found that larger models are more susceptible to visual illusions, which brings them closer to human perception.
Another recent study focuses on evaluating two types of hallucination: bias and interference.
The study pointed out that GPT-4V often gets confused when interpreting multiple images together and performs better when the images are sent separately, which is consistent with the observations from the "Chihuahua or muffin" test.
Popular mitigation measures, such as self-correction and chain-of-thought prompting, do not effectively solve these problems, and tests show that other multimodal models such as LLaVA and Bard have similar issues.
In addition, the study found that GPT-4V is better at interpreting images with a Western cultural background or images containing English text.
For example, GPT-4V correctly counts the seven dwarfs in Snow White, but counts the seven Calabash Brothers (gourd dolls) as 10.
Reference links:
[1] https://twitter.com/xwang_lk/status/1723389615254774122
[2] https://arxiv.org/abs/2311.00047
[3] https://arxiv.org/abs/2311.03287