Not long after the release of ChatGPT, Microsoft launched the "New Bing" to great fanfare: not only did its stock price surge, the product even threatened to displace Google and usher in a new era of search engines.
But is New Bing really the right way to use a large language model? Are the generated answers actually useful to users? And how trustworthy are the citations attached to each sentence?
Recently, Stanford researchers collected a large number of user queries from different sources and ran a human evaluation of four popular generative search engines: Bing Chat, NeevaAI, perplexity.ai, and YouChat.
Paper link: https://arxiv.org/pdf/2304.09848.pdf
The experiments found that responses from existing generative search engines are fluent and appear informative, but frequently contain unsupported statements and inaccurate citations.
On average, only 51.5% of generated sentences are fully supported by their citations, and only 74.5% of citations actually support the sentence they accompany.
The researchers argue that these numbers are alarmingly low for systems that may soon serve as a primary tool for information-seeking users, especially given that many generated sentences merely sound plausible; generative search engines still need substantial further work.
Personal homepage: https://cs.stanford.edu/~nfliu/
First author Nelson Liu is a fourth-year PhD student in the Stanford NLP Group, advised by Percy Liang. He received his bachelor's degree from the University of Washington. His main research interest is building practical NLP systems, especially for information-seeking applications.
## Don't Trust Generative Search Engines

In March 2023, Microsoft reported that "roughly one third of daily preview users use [Bing] Chat daily" and that Bing Chat had served 45 million chats in the first month of its public preview. In other words, integrating large language models into search engines clearly has a market, and may well change how people access the Internet through search.
At present, however, generative search engines built on large language models still suffer from low accuracy; more to the point, their accuracy has never been systematically evaluated, and the limitations of this new kind of search engine are not yet well understood.
Verifiability is key to making such systems trustworthy: if every sentence in a generated answer comes with a citation to an external web page as supporting evidence, users can easily check the accuracy of the answer for themselves.
By collecting queries of different types and from different sources, the researchers conducted a human evaluation of four commercial generative search engines: Bing Chat, NeevaAI, perplexity.ai, and YouChat.
## Evaluation metrics

The evaluation covers four dimensions:

- Fluency: whether the generated text is coherent and readable.
- Perceived utility: whether the response is helpful to the user, i.e., whether the information in the answer addresses the question.
- Citation recall: the proportion of generated statements about the external world that are supported by their citations.
- Citation precision: the proportion of generated citations that support their associated sentences.

Fluency. Annotators are shown the user query, the generated response, and the statement "The response is fluent and semantically coherent", and rate their agreement on a five-point Likert scale.

Perceived utility. As with fluency, annotators rate their agreement with the statement that the response is a helpful and informative answer to the query.

Citation recall. Citation recall is the proportion of verification-worthy sentences that are fully supported by their associated citations. Computing it therefore requires (1) identifying the sentences in the response that are worth verifying and (2) assessing whether each such sentence is supported by its citations.

When identifying sentences worth verifying, the researchers treat every generated statement about the external world as verification-worthy, even those that seem obvious or trivial, because what looks like obvious "common sense" to some readers may in fact be wrong. The goal of a search engine should be to provide a source for every generated statement about the outside world so that readers can easily check any claim in the response; verifiability should not be sacrificed for simplicity. In practice, annotators verify all generated sentences except first-person statements about the system itself, such as "As a language model, I am not capable of...", and questions addressed to the user, such as "Would you like to know more?".

Assessing whether a verification-worthy statement is fully supported by its citations follows the attributable-to-identified-sources (AIS) evaluation framework: annotators make a binary judgment, counting a statement as supported if a generic listener would agree that "according to the cited web page(s), one can conclude ...".

Citation precision. To measure citation precision, annotators judge whether each citation provides full, partial, or no support for its associated sentence:

- Full support: all of the information in the sentence is backed by the citation.
- Partial support: some of the information in the sentence is backed by the citation, but other parts are missing or contradicted.
- No support: the cited web page is completely irrelevant to, or contradicts, the sentence.

For sentences with multiple associated citations, annotators additionally make an AIS-style judgment of whether the cited web pages, taken together, fully support the sentence.
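To make the two citation metrics concrete, here is a minimal Python sketch of how they could be computed from annotator judgments. The data structures and function names (`Sentence`, `citation_recall`, `citation_precision`) are illustrative assumptions, not the paper's released code, and whether partial support should count toward precision is likewise an assumption of this sketch.

```python
from dataclasses import dataclass, field

# Per-citation judgment labels, mirroring the full / partial / no-support scheme.
FULL, PARTIAL, NONE = "full", "partial", "none"

@dataclass
class Sentence:
    verification_worthy: bool      # is this a statement about the external world?
    fully_supported: bool          # AIS judgment over all of its citations together
    citation_judgments: list = field(default_factory=list)

def citation_recall(sentences):
    """Fraction of verification-worthy sentences fully supported by their citations."""
    worthy = [s for s in sentences if s.verification_worthy]
    return sum(s.fully_supported for s in worthy) / len(worthy) if worthy else 0.0

def citation_precision(sentences):
    """Fraction of generated citations that support their associated sentence."""
    judgments = [j for s in sentences for j in s.citation_judgments]
    if not judgments:
        return 0.0
    # Counting partial support toward precision is an assumption of this sketch.
    return sum(j in (FULL, PARTIAL) for j in judgments) / len(judgments)

# Toy response: two verification-worthy sentences with three citations in total.
response = [
    Sentence(True, True, [FULL]),
    Sentence(True, False, [PARTIAL, NONE]),
]
print(citation_recall(response))     # 0.5
print(citation_precision(response))  # ≈ 0.667
```

The two metrics penalize different failure modes: recall drops when statements lack adequate citation support, while precision drops when the attached citations do not actually back the sentences they accompany.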
## Experimental results

On fluency and perceived utility, all four engines generate responses that read smoothly and seem useful. Among them, Bing Chat receives the lowest fluency/utility ratings (4.40/4.34), followed by NeevaAI (4.43/4.48), perplexity.ai (4.51/4.56), and YouChat (4.59/4.62).

Across query categories, short retrieval-style questions that call for a single factual answer tend to be rated more fluent than long ones; harder questions often require synthesizing information from multiple tables or web pages, and that synthesis lowers the overall fluency.

On citation quality, existing generative search engines frequently fail to cite web pages fully or correctly: on average, only 51.5% of generated sentences are fully supported by their citations (recall), and only 74.5% of citations fully support their associated sentences (precision). For systems that already serve millions of users, and whose responses often pack in a great deal of information, these numbers are unacceptable.

Citation recall and precision also vary widely across engines: perplexity.ai achieves the highest recall (68.7), ahead of NeevaAI (67.6), Bing Chat (58.7), and YouChat (11.1), while Bing Chat achieves the highest precision (89.5), followed by perplexity.ai (72.7), NeevaAI (72.0), and YouChat (63.6).

Across query types, citation recall on NaturalQuestions queries with long answers is nearly 11 points higher than on non-NaturalQuestions queries (58.5 vs. 47.8). Similarly, recall on NaturalQuestions queries with short answers is nearly 10 points higher than on those without (63.4 for queries with short answers, 53.6 for queries with only long answers, and 53.4 for queries with neither). Recall drops further on questions with little web-page support: on open-ended AllSouls essay questions, for example, citation recall is only 44.3.