New evaluations of RAG (Retrieval Augmented Generation) systems seem to be released every day, and many of them focus on the retrieval stage of the framework in question. However, the generative side, how the model synthesizes and expresses the retrieved information, may be equally important in practice. Many practical applications show that a system must not only return data from the context, but also transform that information into a more sophisticated response.
To this end, we conducted several experiments to evaluate and compare the generation capabilities of three models: GPT-4, Claude 2.1, and Claude 3 Opus. This article details our research methods and results, the nuances of these models we encountered along the way, and why these findings matter to those building with generative AI.
Readers who want to reproduce these experiments can find everything they need in the GitHub repository (https://github.com/Arize-ai/LLMTest_NeedleInAHaystack).
Figure 1: Chart created by the author
While the retrieval part of a retrieval augmented generation system is responsible for identifying and fetching the most relevant information, it is the generation phase that takes this raw data and transforms it into a coherent, meaningful, and contextual response. The generation step synthesizes the retrieved information, fills in the gaps, and presents it in a way that is easy to understand and relevant to the user's query, giving users a complete, understandable interpretation of the relevant information that they can explore and question more deeply.
In many real-world applications, the value of RAG systems lies not only in their ability to locate specific facts or information, but also in their ability to integrate and contextualize information within a broader framework. The generation phase enables RAG systems to go beyond simple fact retrieval and provide truly intelligent and adaptive responses.
The initial test we ran involved generating a date string from two randomly retrieved numbers, one representing the month and the other the day; the model's task was to combine them into the corresponding date. For example, the random numbers 4827143 and 17 represent April 17th.
The numbers were placed at different depths within contexts of different lengths. The models initially had a rather difficult time accomplishing this task.
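To make the setup concrete, here is a minimal sketch of how two number "needles" can be buried at chosen depths in a long context and turned into a prompt. The helper name, needle phrasing, and filler text are illustrative assumptions; the actual test harness lives in the repository linked above.

```python
def build_haystack(filler_text: str, month_needle: str, day_needle: str,
                   month_depth: float, day_depth: float) -> str:
    """Insert the two needle sentences at fractional depths (0.0-1.0) of the filler text."""
    words = filler_text.split()
    insertions = [(int(len(words) * month_depth), month_needle),
                  (int(len(words) * day_depth), day_needle)]
    # Insert the deeper needle first so the earlier index stays valid.
    for position, needle in sorted(insertions, reverse=True):
        words.insert(position, needle)
    return " ".join(words)


filler = "The quick brown fox jumps over the lazy dog. " * 2000  # stand-in for a long context
context = build_haystack(
    filler,
    month_needle="The special month number is 4827143.",
    day_needle="The special day number is 17.",
    month_depth=0.25,
    day_depth=0.75,
)

prompt = (
    context
    + "\n\nFind the two special numbers hidden in the document above "
    + "and combine them into the date they represent."
)
```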
Figure 2: Initial test results
While both models performed poorly overall, Claude 2.1 did significantly better than GPT-4 in our initial tests, with a success rate almost four times higher. It is here that the verbose nature of the Claude model, which tends to provide detailed, explanatory answers, seemed to give it a clear advantage, producing more accurate results than GPT-4's initially terse answers.
Motivated by these unexpected results, we introduced a new variable into the experiment: we instructed GPT-4 to "explain yourself, then answer the question," a prompt that encourages more detailed responses similar to those the Claude model produces naturally. The impact of this small adjustment turned out to be far-reaching.
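In practice, the adjustment amounts to appending a single instruction to the prompt. The wording below is an assumption based on the description above, not the verbatim template from our harness.

```python
question = "What date do the two special numbers in the document represent?"
document = "<long context containing the two hidden numbers>"  # placeholder

# Original, terse-style prompt: ask for the answer directly.
terse_prompt = f"{document}\n\n{question} Respond with the date only."

# Adjusted prompt: ask the model to reason out loud first, mimicking Claude's
# naturally verbose style of answering.
verbose_prompt = f"{document}\n\n{question} Explain yourself, then answer the question."
```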
Figure 3: Initial test results with the targeted prompt
The performance of the GPT-4 model improved significantly, achieving perfect results in subsequent tests. The Claude model's results also improved.
This experiment not only highlights differences in how language models handle generation tasks, but also demonstrates the potential impact of prompt engineering on their performance. Claude's strength appears to be verbosity, which turns out to be a replicable strategy for GPT-4, suggesting that the way a model works through and presents its reasoning can significantly affect its accuracy in generation tasks. Across all of our experiments, including the seemingly small "explain yourself" instruction in the prompt played a role in improving model performance.
Figure 4: Four further tests used to evaluate the generation
We performed four more tests to evaluate the ability of these mainstream models to synthesize and convert retrieved information into various formats.
As expected, each model showed strong performance in string concatenation, which also reiterates the previous understanding that text manipulation is a fundamental strength of language models.
Figure 5: Currency formatting test results
In the currency formatting test, Claude 3 and GPT-4 performed almost flawlessly, while Claude 2.1's performance was generally poor. Accuracy did not vary much across token lengths, but was generally lower when the needle was placed closer to the beginning of the context window.
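As a rough illustration of how such a formatting task can be graded automatically, the sketch below checks whether the correctly formatted currency string appears in the model's output. The target format and grading rule are assumptions; the grader in the repository may differ.

```python
def format_currency(amount: float) -> str:
    """Expected target format, e.g. 1234567.8 -> '$1,234,567.80'."""
    return f"${amount:,.2f}"


def grade_response(model_output: str, amount: float) -> bool:
    """Count the answer as correct if the properly formatted string appears in the output."""
    return format_currency(amount) in model_output


print(format_currency(1234567.8))                                      # $1,234,567.80
print(grade_response("The total value is $1,234,567.80.", 1234567.8))  # True
```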
Figure 6: Official test results from the Haystack website
Despite achieving excellent results in the earlier generation tests, Claude 3's accuracy decreased in a retrieval-only experiment. In theory, simply retrieving numbers should be easier than manipulating them, which makes this drop in performance surprising and an area we plan to test further. If anything, this counterintuitive drop only further confirms the idea that both retrieval and generation should be tested when developing with RAG.
Across these generation tasks we observed that, although both Claude and GPT-4 handle trivial tasks such as string manipulation well, their respective strengths and weaknesses become obvious in more complex scenarios (https://arize.com/blog-course/research-techniques-for-better-retrieved-generation-rag/). LLMs are still not very good at math! Another key result is that introducing the "explain yourself" prompt significantly improved GPT-4's performance, underscoring how much it matters how a model is prompted and whether it is asked to articulate its reasoning before giving an accurate answer.
These findings have broader implications for LLM evaluation. When comparing models like the verbose Claude and the initially less verbose GPT-4, it becomes clear that RAG evaluation (https://arize.com/blog-course/rag-evaluation/) criteria must go beyond the previous emphasis on correctness alone. The verbosity of model responses introduces a variable that can significantly affect their perceived performance. This nuance suggests that future model evaluations should consider average response length as a noteworthy factor, both to better understand a model's capabilities and to ensure a fairer comparison.
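One simple way to fold this into an evaluation, sketched below with made-up results, is to report average response length next to accuracy so that verbosity is visible in the comparison rather than hidden inside it.

```python
# Hypothetical (model_answer, is_correct) pairs from a single evaluation run.
results = [
    ("April 17th.", True),
    ("Let me explain: the two special numbers map to the month of April and the day 17, "
     "so the date is April 17th.", True),
    ("17 April 4827143", False),
]

accuracy = sum(1 for _, correct in results if correct) / len(results)
avg_words = sum(len(answer.split()) for answer, _ in results) / len(results)

print(f"accuracy: {accuracy:.2f}, average response length: {avg_words:.1f} words")
```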
Translator: Zhu Xianzhong, 51CTO community editor and expert blogger, lecturer, computer teacher at a university in Weifang, and a veteran freelance programmer.
Original title: Tips for Getting the Generation Part Right in Retrieval Augmented Generation, Author: Aparna Dhinakaran
Link: nce.com/tips-for-getting-the-generation-part-right-in-retrieval-augmented-generation-7deaa26f28dc