Some time ago, researchers from LMSYS Org (led by UC Berkeley) made headlines with their large language model "qualifying tournament", the Chatbot Arena.
This time, the team brought not only four new contenders, but also a (quasi) Chinese-language leaderboard.
There is no doubt that as long as GPT-4 joins the battle, it takes first place.
Unexpectedly, however, Claude not only beat GPT-3.5 (the model that made OpenAI's name) to take second place, but also trailed GPT-4 by only 50 points.
In contrast, third-ranked GPT-3.5 is only 72 points ahead of Vicuna, the strongest open-source model, which has 13 billion parameters.
The 14-billion-parameter "pure RNN model" RWKV-4-Raven-14B relied on its excellent performance to surpass many Transformer-based models and rank sixth: apart from Vicuna, RWKV wins more than 50% of its non-tie games against every other open-source model.
In addition, the team created two separate rankings: an "English only" list and a "non-English" list (most of which is Chinese).
It can be seen that the rankings of many models have changed significantly.
For example, ChatGLM-6B, which was trained on more Chinese data, indeed performed better on the non-English list, and GPT-3.5 successfully surpassed Claude to rank second.
The main contributors to this update are Sheng Ying, Lianmin Zheng, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica.
Sheng Ying is one of the three founders of LMSYS Org (the other two are Lianmin Zheng and Hao Zhang) and a doctoral student in the Department of Computer Science at Stanford University.
She is also an author of the popular FlexGen system, which can run 175B-parameter model inference on a single GPU and has currently received 8k stars on GitHub.
Paper address: https://arxiv.org/abs/2303.06865
Project address: https://github.com/FMInference/FlexGen
Personal homepage: https://sites.google.com/view/yingsheng/home
"Open Source" vs. "Closed Source"

With the help of the community, the team collected a total of 13k anonymous votes and made some interesting discoveries.
Among the three proprietary models, Anthropic's Claude is more popular with users than GPT-3.5-turbo.
Moreover, Claude remained very competitive even when pitted against the most powerful model, GPT-4.
Judging from the win-rate chart below, of the 66 non-tie games between GPT-4 and Claude, Claude won 32 (48%).
Fraction of Model A wins in all non-tie A vs. B battles
However, there is still a big gap between the other open-source models and these three proprietary models.

In particular, GPT-4 leads the rankings with an Elo score of 1274, nearly 200 points higher than the best open-source alternative on the list, Vicuna-13B. After removing ties, GPT-4 won 82% of its games against Vicuna-13B, and even won 79% against the previous-generation GPT-3.5-turbo.

It is worth noting, however, that the open-source models on the leaderboard generally have far fewer parameters than the proprietary models, ranging from 3 billion to 14 billion. In fact, recent advances in LLMs and data curation have made it possible to achieve significant performance improvements with smaller models. Google's latest PaLM 2 is a good example: it achieves better performance than its predecessor at smaller model sizes. The team is therefore optimistic that open-source language models will catch up.

In the image below, a user asked a tricky question that requires careful reasoning and planning. While Claude and GPT-4 provided similar answers, Claude's response was slightly better. Due to the random nature of sampling, however, the team found that this outcome could not always be replicated: sometimes GPT-4 produces the same answer as Claude, but it failed in this generation trial. The team also noticed that GPT-4 behaves slightly differently through the OpenAI API than through the ChatGPT interface, which may be due to different prompts, sampling parameters, or other unknown factors.

An example of a user preferring Claude over GPT-4

In the figure below, although both Claude and GPT-4 have impressive capabilities, they still struggle with this type of complex reasoning problem.

An example of a user judging both Claude and GPT-4 to be wrong

Beyond these tricky cases, there are also many simple problems that require no complex reasoning or knowledge.
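The leaderboard scores above are Elo ratings. As a minimal sketch of how such ratings relate to win probabilities (the exact K-factor and any LMSYS-specific adjustments are assumptions here, not details taken from the article), the standard Elo expected-score and update rules look like this:

```python
def elo_expected(rating_a: float, rating_b: float) -> float:
    """Expected score (win probability) of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float,
               score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one game.

    score_a is 1.0 for an A win, 0.5 for a tie, 0.0 for a loss.
    The K-factor of 32 is a conventional choice, not LMSYS's actual value.
    """
    expected_a = elo_expected(rating_a, rating_b)
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# A ~200-point gap (e.g. GPT-4 at 1274 vs. a model near 1074) implies an
# expected win probability of roughly 76% under the pure Elo model.
print(round(elo_expected(1274, 1074), 2))  # -> 0.76
```

Note that the pure Elo formula predicts about a 76% win rate for a 200-point gap, while the observed GPT-4 vs. Vicuna-13B win rate is 82%; small discrepancies like this are expected with limited game counts.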
In these cases, open-source models such as Vicuna can perform comparably to GPT-4, so a slightly weaker (but smaller or cheaper) large language model (LLM) may be able to replace a more powerful model like GPT-4.

The Chatbot Arena has never been more competitive since the three powerful proprietary models joined. Because the open-source models lose quite a few games against the proprietary models, their Elo scores have all dropped.

Finally, the team also plans to open some APIs so that users can register their own chatbots to participate in ranked matches.

When will GPT-4 be dethroned?
Elo Score Changes