Summarization is a natural language generation (NLG) task whose goal is to compress long texts into short summaries. It can be applied to many kinds of content, such as news articles, source code, and cross-lingual text.
With the emergence of large language models (LLMs), the traditional approach of fine-tuning on task-specific datasets no longer seems adequate. This naturally raises the question: how well do LLMs actually generate summaries?
To answer this question, researchers from Peking University examined the issue in detail in the paper "Summarization is (Almost) Dead". They evaluated LLM performance on a range of summarization tasks (single-news, multi-news, dialogue, source code, and cross-lingual summarization) using human-annotated evaluation datasets built for the study. Quantitative and qualitative comparisons of LLM-generated summaries, human-written summaries, and summaries from fine-tuned models revealed that human evaluators significantly preferred the LLM-generated summaries.
After sampling and examining 100 summarization-method papers published at ACL, EMNLP, NAACL, and COLING over the past three years, the researchers found that the main contribution of roughly 70% of them was to propose a summarization method and validate its effectiveness on standard datasets. Hence the study's claim that "summarization is (almost) dead".
Nevertheless, the researchers note that the field still faces challenges: issues such as the need for higher-quality reference datasets and improved evaluation methods remain to be resolved.
Paper link: https://arxiv.org/pdf/2309.09558.pdf
Methods and results
For the single-news, multi-news, and dialogue summarization tasks, the authors built data by simulating the construction processes of the CNN/DailyMail and Multi-News datasets. For the cross-lingual summarization task, they adopted the strategy proposed by Zhu et al., and for the code summarization task, the method proposed by Bahrami et al.
With the datasets constructed, the next step is the models. Specifically, the paper uses BART and T5 for the single-news task; Pegasus and BART for the multi-news task; T5 and BART for the dialogue task; mT5 and mBART for the cross-lingual task; and CodeT5 for the source-code task.
In the experiments, the study asked human evaluators to compare the overall quality of the different summaries. According to the results in Figure 1, LLM-generated summaries outperform both human-written summaries and summaries from fine-tuned models on all tasks.
This raises the question of why LLMs can outperform human-written summaries, which are traditionally assumed to be flawless. Preliminary observations indicate that LLM-generated summaries are notably fluent and coherent.
The paper further recruited annotators to identify hallucinations in human-written and LLM-generated summary sentences. As shown in Table 1, human-written summaries exhibit the same number of hallucinations as GPT-4-generated summaries, or more. On specific tasks such as multi-news and code summarization, human-written summaries show markedly worse factual consistency.
Table 2 shows the proportion of hallucinations in human-written summaries and GPT-4-generated summaries.
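The per-system hallucination proportions reported in Table 2 amount to simple counting over binary sentence-level annotations. A minimal sketch of that tally — the label data below is invented for illustration, not taken from the paper:

```python
def hallucination_rate(labels):
    """Proportion of summary sentences annotators flagged as hallucinated.

    labels: list of 0/1 flags, one per summary sentence (1 = hallucinated).
    """
    return sum(labels) / len(labels) if labels else 0.0

# Hypothetical annotations for ten sentences from each system
human_labels = [0, 1, 0, 0, 1, 0, 1, 0, 0, 0]
gpt4_labels  = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

print(f"human: {hallucination_rate(human_labels):.0%}, "
      f"GPT-4: {hallucination_rate(gpt4_labels):.0%}")
# → human: 30%, GPT-4: 10%
```

The real study aggregates such flags per task and per system; this sketch only shows the arithmetic behind a single cell of such a table.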
The paper also finds that human-written reference summaries can lack fluency. As shown in Figure 2(a), they sometimes contain incomplete information, and as shown in Figure 2(b), some of them exhibit hallucinations.
The study further observes that summaries generated by fine-tuned models tend to have a fixed, rigid length, whereas LLMs can adjust the output length to the input. Moreover, when the input covers multiple topics, the fine-tuned models' summaries cover those topics poorly, as shown in Figure 3, while LLMs capture all the topics.
The results in Figure 4 show that the human preference score for the large models exceeds 50%, indicating a strong preference for their summaries and highlighting the capability of LLMs at text summarization.
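A preference score like the one in Figure 4 comes from pairwise human judgments: each annotator picks the better of two summaries (or calls a tie), and the score is the fraction of comparisons a system wins. A small sketch of that computation, with ties counted as half a win and the verdict data invented for illustration:

```python
from collections import Counter

def preference_score(judgments, system="llm"):
    """Fraction of pairwise comparisons won by `system`.

    judgments: list of verdict strings, e.g. "llm", "human", or "tie".
    Ties contribute half a win to each side.
    """
    counts = Counter(judgments)
    total = sum(counts.values())
    return (counts[system] + 0.5 * counts["tie"]) / total

# Hypothetical annotator verdicts for LLM vs. human-written summaries
judgments = ["llm", "llm", "human", "tie", "llm", "llm", "human", "llm"]
print(f"LLM preference score: {preference_score(judgments):.1%}")
# → LLM preference score: 68.8%
```

A score above 50% means the system wins more comparisons than it loses, which is the threshold the article refers to.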