New evaluations of RAG (Retrieval Augmented Generation) systems seem to be released every day, and many of them focus on the retrieval stage of the framework in question. However, the generative side, how the model synthesizes and expresses the retrieved information, may be equally important in practice. Many practical applications show that a system must not only return data from the context, but also transform that information into a more sophisticated response.
To this end, we conducted several experiments to evaluate and compare the generation capabilities of three models: GPT-4, Claude 2.1, and Claude 3 Opus. This article details our research methods and results, the nuances of these models we encountered along the way, and why these findings matter to those building with generative AI.
Readers who want to reproduce these experiments can find everything they need in the GitHub repository (https://github.com/Arize-ai/LLMTest_NeedleInAHaystack).
Figure 1: Chart created by the author
While the retrieval part of a retrieval augmented generation system is responsible for identifying and fetching the most relevant information, it is the generation phase that takes this raw data and transforms it into a coherent, meaningful, and contextual response. The generation step synthesizes the retrieved information, fills in the gaps, and presents it in a way that is easy to understand and relevant to the user's query, giving users a complete, understandable interpretation of the relevant information that they can explore and question more deeply.
In many real-world applications, the value of RAG systems lies not only in their ability to locate specific facts or information, but also in their ability to integrate and contextualize information within a broader framework. The generation phase enables RAG systems to go beyond simple fact retrieval and provide truly intelligent and adaptive responses.
The initial test we ran involved generating a date string from two randomly retrieved numbers, one representing the month and the other the day; the model's task was to combine them into the corresponding date. For example, the random numbers 4827143 and 17 represent April 17th.
The numbers were placed at different depths within contexts of different lengths. The models initially had a rather difficult time accomplishing this task.
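To make the setup concrete, here is a minimal sketch of how two number "needles" can be buried at chosen depths in a long context and turned into a prompt. The helper name, needle phrasing, and filler text are illustrative assumptions; the actual test harness lives in the repository linked above.

```python
def build_haystack(filler_text: str, month_needle: str, day_needle: str,
                   month_depth: float, day_depth: float) -> str:
    """Insert the two needle sentences at fractional depths (0.0-1.0) of the filler text."""
    words = filler_text.split()
    insertions = [(int(len(words) * month_depth), month_needle),
                  (int(len(words) * day_depth), day_needle)]
    # Insert the deeper needle first so the earlier index stays valid.
    for position, needle in sorted(insertions, reverse=True):
        words.insert(position, needle)
    return " ".join(words)


filler = "The quick brown fox jumps over the lazy dog. " * 2000  # stand-in for a long context
context = build_haystack(
    filler,
    month_needle="The special month number is 4827143.",
    day_needle="The special day number is 17.",
    month_depth=0.25,
    day_depth=0.75,
)

prompt = (
    context
    + "\n\nFind the two special numbers hidden in the document above "
    + "and combine them into the date they represent."
)
```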
Figure 2: Initial test results
While both models performed poorly overall, Claude 2.1 did significantly better than GPT-4 in our initial tests, with a success rate almost four times higher. It is here that the verbose nature of the Claude model, which tends to provide detailed, explanatory answers, seemed to give it a clear advantage, producing more accurate results than GPT-4's initially terse answers.
Motivated by these unexpected results, we introduced a new variable into the experiment: we instructed GPT-4 to "explain yourself, then answer the question," a prompt that encourages more detailed responses similar to those the Claude model produces naturally. The impact of this small adjustment turned out to be far-reaching.
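In practice, the adjustment amounts to appending a single instruction to the prompt. The wording below is an assumption based on the description above, not the verbatim template from our harness.

```python
question = "What date do the two special numbers in the document represent?"
document = "<long context containing the two hidden numbers>"  # placeholder

# Original, terse-style prompt: ask for the answer directly.
terse_prompt = f"{document}\n\n{question} Respond with the date only."

# Adjusted prompt: ask the model to reason out loud first, mimicking Claude's
# naturally verbose style of answering.
verbose_prompt = f"{document}\n\n{question} Explain yourself, then answer the question."
```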
Figure 3: Initial test results with the targeted prompt
The performance of the GPT-4 model improved significantly, achieving perfect results in subsequent tests. The Claude model's results also improved.
This experiment not only highlights differences in how language models handle generation tasks, but also demonstrates the potential impact of prompt engineering on their performance. Claude's strength appears to be verbosity, which turns out to be a replicable strategy for GPT-4, suggesting that the way a model works through and presents its reasoning can significantly affect its accuracy in generation tasks. Across all of our experiments, including the seemingly small "explain yourself" instruction in the prompt played a role in improving model performance.
Figure 4: Four further tests used to evaluate the generation
We performed four more tests to evaluate the ability of these mainstream models to synthesize and convert retrieved information into various formats.
As expected, each model showed strong performance in string concatenation, which also reiterates the previous understanding that text manipulation is a fundamental strength of language models.
Figure 5: Currency formatting test results
In the currency formatting test, Claude 3 and GPT-4 performed almost flawlessly, while Claude 2.1's performance was generally poor. Accuracy did not vary much across token lengths, but was generally lower when the needle was placed closer to the beginning of the context window.
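As a rough illustration of how such a formatting task can be graded automatically, the sketch below checks whether the correctly formatted currency string appears in the model's output. The target format and grading rule are assumptions; the grader in the repository may differ.

```python
def format_currency(amount: float) -> str:
    """Expected target format, e.g. 1234567.8 -> '$1,234,567.80'."""
    return f"${amount:,.2f}"


def grade_response(model_output: str, amount: float) -> bool:
    """Count the answer as correct if the properly formatted string appears in the output."""
    return format_currency(amount) in model_output


print(format_currency(1234567.8))                                      # $1,234,567.80
print(grade_response("The total value is $1,234,567.80.", 1234567.8))  # True
```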
Figure 6: Official test results from the Haystack website
Despite achieving excellent results in the earlier generation tests, Claude 3's accuracy decreased in a retrieval-only experiment. In theory, simply retrieving numbers should be easier than manipulating them, which makes this drop in performance surprising and an area we plan to test further. If anything, this counterintuitive drop only further confirms the idea that both retrieval and generation should be tested when developing with RAG.
Across these generation tasks we observed that, although both Claude and GPT-4 handle trivial tasks such as string manipulation well, their respective strengths and weaknesses become obvious in more complex scenarios (https://arize.com/blog-course/research-techniques-for-better-retrieved-generation-rag/). LLMs are still not very good at math! Another key result is that introducing the "explain yourself" prompt significantly improved GPT-4's performance, underscoring how much it matters how a model is prompted and whether it is asked to articulate its reasoning before giving an accurate answer.
These findings have broader implications for LLM evaluation. When comparing models like the verbose Claude and the initially less verbose GPT-4, it becomes clear that RAG evaluation (https://arize.com/blog-course/rag-evaluation/) criteria must go beyond the previous emphasis on correctness alone. The verbosity of model responses introduces a variable that can significantly affect their perceived performance. This nuance suggests that future model evaluations should consider average response length as a noteworthy factor, both to better understand a model's capabilities and to ensure a fairer comparison.
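One simple way to fold this into an evaluation, sketched below with made-up results, is to report average response length next to accuracy so that verbosity is visible in the comparison rather than hidden inside it.

```python
# Hypothetical (model_answer, is_correct) pairs from a single evaluation run.
results = [
    ("April 17th.", True),
    ("Let me explain: the two special numbers map to the month of April and the day 17, "
     "so the date is April 17th.", True),
    ("17 April 4827143", False),
]

accuracy = sum(1 for _, correct in results if correct) / len(results)
avg_words = sum(len(answer.split()) for answer, _ in results) / len(results)

print(f"accuracy: {accuracy:.2f}, average response length: {avg_words:.1f} words")
```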
Translator: Zhu Xianzhong, 51CTO community editor and expert blogger, lecturer, computer teacher at a university in Weifang, and a veteran freelance programmer.
Original title: Tips for Getting the Generation Part Right in Retrieval Augmented Generation, Author: Aparna Dhinakaran
Link: nce.com/tips-for-getting-the-generation-part-right-in-retrieval-augmented-generation-7deaa26f28dc