Table of Contents
First-hand actual measurement of Claude3
Which one is correct?
Claude 3 Series Model
Home Technology peripherals AI Is the era of GPT-4 over? Netizens around the world tested Claude 3 and were shocked

Is the era of GPT-4 over? Netizens around the world tested Claude 3 and were shocked

Mar 06, 2024 pm 01:00 PM
ai Model arrangement

The plain text direction of the large model has been rolled to the end?

Last night, OpenAI’s biggest competitor Anthropic released a new generation of AI large model series - Claude 3.

This series contains three models, ranked from weakest to strongest, namely Claude 3 Haiku, Claude 3 Sonnet and Claude 3 Opus. Among them, Opus, the most capable, has scored higher than GPT-4 and Gemini 1.0 Ultra in multiple benchmark tests, setting new industry benchmarks in multiple dimensions such as mathematics, programming, multi-language understanding, and vision.

Anthropic states that Claude 3 Opus possesses knowledge at the level of a human undergraduate.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

After the release of the new model, Claude brings support for multi-modal capabilities for the first time (the Opus version has an MMMU score of 59.4%, exceeding GPT-4V, on par with Gemini 1.0 Ultra). Users can now upload photos, charts, documents and other types of unstructured data for AI to analyze and answer.

In addition, these three models also retain the consistent advantages of the Claude series models, namely the long context window. The initial stage supports a context window of 200K tokens, but Anthropic said that all three models support a context input of 1 million tokens (for specific customers), which is equivalent to the English version of "Moby Dick" or "Harry Potter and the Deathly Hallows" 》length.

However, in terms of pricing, the most powerful Claude 3 is also much more expensive than GPT-4 Turbo: GPT-4 Turbo has an input/output charge of 10/per million tokens. $30; while the Claude 3 Opus is $15/75.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Opus and Sonnet models are now available in claude.ai and the Claude API, with Haiku models coming soon. Amazon Cloud Technologies has announced that their new model is now available on Amazon Bedrock. Anthropic announced the official demo, the details are as follows:

After Anthropic’s official announcement, many researchers who got the opportunity to try it out also shared their experiences. Some say that Claude 3 Sonnet has solved a puzzle that only GPT-4 could solve before.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

However, some people say that in terms of actual experience, Claude 3 did not completely defeat GPT-4.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

First-hand actual measurement of Claude3

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Address: https ://claude.ai/

Does Claude 3 really surpass GPT-4 in performance as officially claimed? At present, most people think that it does have some meaning.

The following are some of the actual measurement results:

First of all, let’s do a brain teaser. Which month has twenty-eight days? The actual correct answer is every month. It seems that Claude 3 is not good at doing this kind of questions yet.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Then we tested the areas that Claude 3 is good at. From the official introduction, we can see that Claude is good at "understanding and processing images", including Extract text from images, convert UI to front-end code, understand complex equations, transcribe handwritten notes, and more.

For large models, it is often difficult to distinguish between fried chicken and teddy. When we input a picture containing teddy and fried chicken, Claude 3 gave this The answer "This image is a collage of dogs and chicken nuggets or nuggets that bear a striking resemblance to the dogs themselves..." is a passing question.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Then asked how many people were in it, Claude 3 also answered correctly, "This animation depicts seven small cartoon characters."

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Claude 3 can extract text from photos, even the vertical sequence of Chinese and Japanese can be correctly recognized:

GPT-4时代已过?全球网友实测Claude 3,只有震撼

If I use memes from the Internet, how will it respond? Regarding the picture of visual error, GPT-4 and Claude3 gave opposite guesses:

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Which one is correct?

In addition to understanding images, Claude is also capable of processing long texts. The full series of large models released this time can provide 200k context windows and accept more than 1 million token inputs.

What is the effect? We gave it a recent paper "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits" published by Microsoft and the National University of Science and Technology, and asked it to summarize the main points of the article in the form of 1, 2, and 3. We recorded it. Time, the time to output the overall answer is about 15 seconds.

But this is only the output effect of Claude 3 Sonnet. If you use the Claude Pro version, it will be faster, but it will cost $20 a month.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

It is worth noting that Claude now requires that the size of the uploaded article does not exceed 10MB. If it exceeds, there will be a prompt:

GPT-4时代已过?全球网友实测Claude 3,只有震撼

In Claude 3's blog, Anthropic proposed that the coding capabilities of the new model have been greatly improved. Someone directly threw the basic ASCII code to Claude and found that it was stress-free:

GPT-4时代已过?全球网友实测Claude 3,只有震撼

We should be able to confirm that Claude 3 has stronger coding capabilities than GPT-4.

Some time ago, Karpathy, who had just resigned from OpenAI, proposed a "word segmenter" challenge. Specifically, he put his 2 hour and 13 minute tutorial video into LLM and had it translated into the format of a book chapter or blog post about tokenizers.

Faced with this task, Claude 3 took it. The following are the results posted by AnthropicAI research engineer Emmanuel Ameisen:

GPT-4时代已过?全球网友实测Claude 3,只有震撼

GPT-4时代已过?全球网友实测Claude 3,只有震撼


Perhaps it is no longer related to interests, Karpathy gave a relatively full and objective evaluation:

From a style point of view, it is indeed quite good! If you look closely, you'll notice some subtle issues/illusions. Regardless, it's impressive to have a system that works almost out of the box. I'm looking forward to playing more with the Claude 3, it looks like a strong model.

If there's anything relevant I have to say, it's that people should be extremely careful when making assessment comparisons, and not just because the assessments themselves are worse than you think , but also because many evaluation results are overfitted in undefined ways, and because the comparisons made can be misleading. The encoding rate (HumanEval) of GPT-4 is not 67%. Whenever I see this comparison used in place of coding performance, the corners of my eyes start to twitch.

Based on the above various tricky test results, some people have already shouted "Anthropic is so back".

Finally, anthropopic also launched a prompt library that contains prompt content in multiple directions. If you want to learn more about Claude 3’s new features, give it a try.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Link: https://docs.anthropic.com/claude/prompt-library

Claude 3 Series Model

## The three versions of the #Claude 3 series models are Claude 3 Opus, Claude 3 Sonnet and Claude 3 Haiku.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Among them, Claude 3 Opus is the most intelligent model, supporting a 200k tokens context window and achieving current SOTA performance on highly complex tasks. . The model handles open prompts and unseen scenes with excellent fluency and human-level understanding. Claude 3 Opus shows us the limits of what is possible with generative AI.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Claude 3 Sonnet delivers the ideal balance between intelligence and speed, especially for enterprise workloads. It delivers powerful performance at a lower cost than similar models and is designed for high durability in large-scale AI deployments. Claude 3 Sonnet supports a context window of 200k tokens.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Claude 3 Haiku is the fastest and most compact model with near real-time responsiveness. Interestingly, the context window it supports is also 200k. The model is able to answer simple queries and requests at unparalleled speed, allowing users to build seamless AI experiences that mimic human interactions.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Let’s take a closer look at the features and performance of the Claude 3 series models.

Comprehensively surpass GPT-4 and achieve a new SOTA level of intelligence

As the model with the highest level of intelligence in the Claude 3 series, Opus has the highest level of intelligence in the AI ​​system It is better than competing products on most evaluation benchmarks, including undergraduate level expert knowledge (MMLU), graduate level expert reasoning (GPQA), basic mathematics (GSM8K) and other benchmarks. Moreover, Opus demonstrates near-human-level understanding and fluency on complex tasks, leading the frontier of general intelligence.

Additionally, all Claude 3 Series models, including Opus, feature performance in analytics and predictions, granular content creation, code generation, and conversation in non-English languages ​​such as Spanish, Japanese, and French Enhanced capabilities.

The following figure shows the comparison between the Claude 3 model and competing models on multiple performance benchmarks. It can be seen that the strongest Opus is better than OpenAI's GPT-4.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Near real-time response

Claude 3 model can support real-time customer chat , automated replenishment, and data extraction are tasks where response must be immediate and real-time.

Haiku is the fastest and most cost-effective model on the market in the smart category. It can read an arXiv platform paper (~10k tokens) containing dense chart and graphical information in less than three seconds.

For the vast majority of jobs, Sonnet is 2x faster and more intelligent than Claude 2 and Claude 2.1. It excels at tasks that require fast responses, such as knowledge retrieval or sales automation. The Opus is similar in speed to the Claude 2 and 2.1, but with a higher level of intelligence.

Powerful visual capabilities

Claude 3 has features comparable to other head models Complex visual functions. They can process data in a variety of visual formats, including photos, charts, graphs, and technical diagrams.

Anthropic says some of their customers have more than 50% of their knowledge bases programmed in various data formats, such as PDFs, flowcharts or presentation slides. Therefore, the new model's powerful visual capabilities are very helpful.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Fewer rejection replies

The previous Claude model often made unnecessary rejections, indicating a lack of contextual understanding by the model. Anthropic has made meaningful progress in this area: Opus, Sonnet, and Haiku are significantly less likely to reject an answer than previous generations of models, even when user prompts are close to the system's bottom line. As shown below, the Claude 3 model exhibits a more nuanced understanding of requests, is able to identify truly harmful prompts, and refuses to answer harmless prompts much less frequently.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Accuracy improvement

To evaluate the accuracy of the model, Anthropic A large number of complex, factual questions are used to address known weaknesses in the current model. Anthropic classifies answers into correct answers, incorrect answers (or hallucinations), and uncertain answers, where the model does not know the answer, rather than providing incorrect information. Compared to Claude 2.1, Opus doubled the accuracy (or correct answers) on these challenging open-ended questions while also reducing incorrect answers.

In addition to producing more trustworthy responses, Anthropic will enable citations in the Claude 3 model so that the model can point to precise sentences in reference material to substantiate responses.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

##Long context and near-perfect recall

Claude 3 Series Models will initially offer 200K context windows at launch. However, officials say that all three models are capable of receiving inputs of more than 1 million tokens, and this capability will be provided to specific users who require enhanced processing capabilities.

In order to effectively handle long contextual cues, the model needs strong recall capabilities. The Needle In A Haystack (NIAH) assessment measures a model's ability to accurately recall information from large amounts of data. Anthropic enhanced the robustness of this benchmark by testing it on a different crowdsourced document base using 30 random Needle/question pairs in each prompt. Claude 3 Opus not only achieves near-perfect recall but also exceeds 99% accuracy. And in some cases, it even identified limitations in the assessment itself, realizing that the "needle" sentences appeared to have been artificially inserted into the original text.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Safe and easy to use

Anthropic said , which has established dedicated teams to track and mitigate security risks. The company is also developing methods such as Constitutional AI to improve model security and transparency and mitigate privacy concerns that new models may raise.

While the Claude 3 model series has made progress in key indicators of biological knowledge, network-related knowledge and autonomy compared to previous models, according to the research, the new model is at the forefront of AI Within Security Level 2 (ASL-2).

In terms of user experience, Claude 3 is better at following complex multi-step instructions than previous models, and is better able to adhere to brand and response guidelines, so that it can better develop trustworthy applications. Additionally, Anthropic says Claude 3 models are now better at producing popular structured output in formats like JSON, making it easier to guide Claude for use cases like natural language classification and sentiment analysis.

What is written in the technical report

Currently, Anthropic has released a 42-page technical report "The Claude 3 Model Family: Opus, Sonnet, Haiku".

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Report address: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

We saw the training data, evaluation criteria and more detailed experimental results of the Claude 3 series models.

In terms of training data, Claude 3 series models are trained on a proprietary mix of data publicly available on the Internet as of August 2023, as well as non-public data from third-party, data labeling services Data provided by vendors and paid contractors, data within Claude.

Claude 3 Series models have been extensively evaluated on multiple metrics including:

  • Reasoning ability
  • Multi-language ability
  • Long context
  • Reliability/factuality
  • Multi-modal ability

The first is the evaluation results on reasoning, programming and question and answer tasks , Claude 3 series models were compared with competing models on a series of industry-standard benchmarks for reasoning, reading comprehension, mathematics, science and programming. The results showed that they not only surpassed their previous models, but also achieved new SOTA in most cases. .

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Anthropic on the Law School Admission Test (LSAT), Multistate Bar Examination (MBE), American Mathematical Competition 2023 Math Competition, and Graduate Record Examination The Claude 3 series models were evaluated on the (GRE) General Examination, and the specific results are shown in Table 2 below.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Claude 3 series models have multi-modal (image and video frame input) capabilities and are great at solving complex multi-modal problems beyond simple text understanding Significant progress has been made on inference challenges.

A typical example is the performance of the Claude 3 model on the AI2D Scientific Chart Benchmark, a visual question-and-answer assessment that involves chart parsing and answering corresponding questions in a multiple-choice format .

Claude 3 Sonnet achieved SOTA level in 0-shot setting - 89.2%, followed by Claude 3 Opus (88.3%) and Claude 3 Haiku (80.6%), specific results As shown in Table 3 below.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

## In response to this technical report, Fu Yao, a doctoral student at the University of Edinburgh, gave his own analysis immediately.

First of all, in his opinion, the several models evaluated have basically no distinction in several indicators such as MMLU / GSM8K / HumanEval. What really needs to be concerned about is why the best one is The model still has 5% error on GSM8K.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

He believes that what can really distinguish the models is MATH and GPQA. These super difficult problems are the goals that AI models should aim for next. .

GPT-4时代已过?全球网友实测Claude 3,只有震撼

The areas where improvements are greater compared to Claude’s previous model are finance and medicine.

GPT-4时代已过?全球网友实测Claude 3,只有震撼

In terms of vision, the visual OCR capabilities of Claude 3 make people see its huge potential in data collection. .

GPT-4时代已过?全球网友实测Claude 3,只有震撼

In addition, he also found some other trends:

GPT-4时代已过?全球网友实测Claude 3,只有震撼

GPT-4时代已过?全球网友实测Claude 3,只有震撼

Judging from the current evaluation benchmarks and experience, Claude 3 has made great strides in terms of intelligence level, multi-modal capabilities and speed. improvement. With the further optimization and application of the new series of models, we may see a more diversified large model ecosystem.

Blog address: https://www.anthropic.com/news/claude-3-family

The above is the detailed content of Is the era of GPT-4 over? Netizens around the world tested Claude 3 and were shocked. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. How to Fix Audio if You Can't Hear Anyone
4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
WWE 2K25: How To Unlock Everything In MyRise
1 months ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

How to configure Debian Apache log format How to configure Debian Apache log format Apr 12, 2025 pm 11:30 PM

This article describes how to customize Apache's log format on Debian systems. The following steps will guide you through the configuration process: Step 1: Access the Apache configuration file The main Apache configuration file of the Debian system is usually located in /etc/apache2/apache2.conf or /etc/apache2/httpd.conf. Open the configuration file with root permissions using the following command: sudonano/etc/apache2/apache2.conf or sudonano/etc/apache2/httpd.conf Step 2: Define custom log formats to find or

How Tomcat logs help troubleshoot memory leaks How Tomcat logs help troubleshoot memory leaks Apr 12, 2025 pm 11:42 PM

Tomcat logs are the key to diagnosing memory leak problems. By analyzing Tomcat logs, you can gain insight into memory usage and garbage collection (GC) behavior, effectively locate and resolve memory leaks. Here is how to troubleshoot memory leaks using Tomcat logs: 1. GC log analysis First, enable detailed GC logging. Add the following JVM options to the Tomcat startup parameters: -XX: PrintGCDetails-XX: PrintGCDateStamps-Xloggc:gc.log These parameters will generate a detailed GC log (gc.log), including information such as GC type, recycling object size and time. Analysis gc.log

How to implement file sorting by debian readdir How to implement file sorting by debian readdir Apr 13, 2025 am 09:06 AM

In Debian systems, the readdir function is used to read directory contents, but the order in which it returns is not predefined. To sort files in a directory, you need to read all files first, and then sort them using the qsort function. The following code demonstrates how to sort directory files using readdir and qsort in Debian system: #include#include#include#include#include//Custom comparison function, used for qsortintcompare(constvoid*a,constvoid*b){returnstrcmp(*(

How to optimize the performance of debian readdir How to optimize the performance of debian readdir Apr 13, 2025 am 08:48 AM

In Debian systems, readdir system calls are used to read directory contents. If its performance is not good, try the following optimization strategy: Simplify the number of directory files: Split large directories into multiple small directories as much as possible, reducing the number of items processed per readdir call. Enable directory content caching: build a cache mechanism, update the cache regularly or when directory content changes, and reduce frequent calls to readdir. Memory caches (such as Memcached or Redis) or local caches (such as files or databases) can be considered. Adopt efficient data structure: If you implement directory traversal by yourself, select more efficient data structures (such as hash tables instead of linear search) to store and access directory information

How debian readdir integrates with other tools How debian readdir integrates with other tools Apr 13, 2025 am 09:42 AM

The readdir function in the Debian system is a system call used to read directory contents and is often used in C programming. This article will explain how to integrate readdir with other tools to enhance its functionality. Method 1: Combining C language program and pipeline First, write a C program to call the readdir function and output the result: #include#include#include#includeintmain(intargc,char*argv[]){DIR*dir;structdirent*entry;if(argc!=2){

How to configure firewall rules for Debian syslog How to configure firewall rules for Debian syslog Apr 13, 2025 am 06:51 AM

This article describes how to configure firewall rules using iptables or ufw in Debian systems and use Syslog to record firewall activities. Method 1: Use iptablesiptables is a powerful command line firewall tool in Debian system. View existing rules: Use the following command to view the current iptables rules: sudoiptables-L-n-v allows specific IP access: For example, allow IP address 192.168.1.100 to access port 80: sudoiptables-AINPUT-ptcp--dport80-s192.16

How to learn Debian syslog How to learn Debian syslog Apr 13, 2025 am 11:51 AM

This guide will guide you to learn how to use Syslog in Debian systems. Syslog is a key service in Linux systems for logging system and application log messages. It helps administrators monitor and analyze system activity to quickly identify and resolve problems. 1. Basic knowledge of Syslog The core functions of Syslog include: centrally collecting and managing log messages; supporting multiple log output formats and target locations (such as files or networks); providing real-time log viewing and filtering functions. 2. Install and configure Syslog (using Rsyslog) The Debian system uses Rsyslog by default. You can install it with the following command: sudoaptupdatesud

Where is the Debian Nginx log path Where is the Debian Nginx log path Apr 12, 2025 pm 11:33 PM

In the Debian system, the default storage locations of Nginx's access log and error log are as follows: Access log (accesslog):/var/log/nginx/access.log Error log (errorlog):/var/log/nginx/error.log The above path is the default configuration of standard DebianNginx installation. If you have modified the log file storage location during the installation process, please check your Nginx configuration file (usually located in /etc/nginx/nginx.conf or /etc/nginx/sites-available/ directory). In the configuration file

See all articles