News on March 28: according to the latest benchmark report released by LMSYS Org, Claude-3's score surpassed GPT-4 by a narrow margin, making it the platform's top-ranked large language model.
First, an introduction to LMSYS Org: it is a research organization jointly founded by the University of California, Berkeley, the University of California, San Diego, and Carnegie Mellon University.
The organization runs Chatbot Arena, a benchmark platform for large language models (LLMs) that uses crowdsourcing to test models anonymously and at random. Its ratings are based on the Elo rating system, which is widely used in competitive games such as chess.
Each round, the platform randomly pairs two different models to chat with a user, who then votes anonymously for whichever performed better; the ratings are derived from these votes, which makes the results relatively fair overall. A sketch of how such a vote updates ratings follows below.
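For readers unfamiliar with Elo, here is a minimal sketch in Python of how a single pairwise vote might shift two models' ratings. This is illustrative only, not LMSYS's actual implementation; the K-factor of 32 is an assumed constant, and the function names are hypothetical.

```python
# Minimal Elo update sketch -- illustrative only, not LMSYS's actual code.
# A user's vote for model A counts as a "win" for A; ratings shift in
# proportion to how surprising the result was.

K = 32  # assumed K-factor controlling how fast ratings move


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, a_won: bool) -> tuple[float, float]:
    """Return new (rating_a, rating_b) after one head-to-head vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + K * (score_a - exp_a)
    new_b = rating_b + K * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Example: two closely rated models, user votes for the first one.
a, b = update(1253.0, 1251.0, a_won=True)
print(round(a), round(b))  # ~1269 and ~1235: a near-even matchup shifts each rating by about K/2
```

Because a single vote moves ratings only modestly, a model must win consistently across many anonymous matchups to climb the leaderboard, which is part of why the crowdsourced ranking is considered fairly robust.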
Since Chatbot Arena launched last year, GPT-4 has held the top spot firmly and has even become a de facto gold standard for evaluating large models.
But yesterday, Anthropic's Claude 3 Opus edged out GPT-4 by a narrow margin, 1253 to 1251, pushing OpenAI's LLM out of the top spot. Because the scores are so close, and accounting for the margin of error, LMSYS ranks Claude 3 Opus and GPT-4 as tied for first place, with a preview version of GPT-4 also sharing the top position.
Even more impressive, Claude 3 Haiku made it into the top ten. Haiku is Anthropic's compact, on-device-class model, roughly equivalent in positioning to Google's Gemini Nano. It is far smaller than Opus, which reportedly has trillions of parameters, and is therefore much faster. According to LMSYS data, Haiku ranks seventh on the list, with performance comparable to GPT-4.