Let’s first look at the definition of the task, starting with a relatively simple example:
For example, during the Shanghai lockdown, a certain self-media account claimed that "Li Liqun was caught sneaking downstairs to buy meat." Based on this claim alone, we actually cannot determine whether he really sneaked downstairs to buy meat and was caught. To verify the authenticity of the statement, the most intuitive idea is to look for evidence: information that can be collected and that helps us judge whether the statement is true. For example, in the image below, Li Liqun himself stepped in to refute the rumor, and his response can be used as evidence.
The claim above is relatively simple: it requires only straightforward evidence and no reasoning over that evidence. Let's look at a more complex example. Consider the claim: in 2019, a total of 120,800 people took the Chengdu high school entrance examination, but the enrollment plan was only 43,000. This claim is harder to verify. Suppose we find a relevant document reporting on the 2019 Chengdu high school entrance examination:
...A total of 120,800 people took the high school entrance examination this year. This is the total number of examinees for the whole of Chengdu, covering its 20 districts, including the High-tech Zone and Tianfu New District. A few months ago, the Education Bureau announced the 2019 general high school enrollment plan. The number of planned places has further increased, and the chances of getting into a general high school are even greater. ......
In 2019, the enrollment plan for the central urban area (13 districts) is 43,015 people.
This document contains a lot of information related to the claim, but the parts that are directly relevant and can help us verify it are the second half of the paragraph quoted above and the single sentence that appears many paragraphs later. From these pieces of evidence we can see that 120,800 people did take the high school entrance examination across the 20 districts of Chengdu, and the enrollment plan for the central urban area (only 13 districts) was indeed about 43,000. The numbers are correct, but a concept has been swapped: the examinee count covers 20 districts, while the enrollment plan covers only 13, which misleads readers. To verify this kind of claim, we often need to extract the directly relevant evidence from one or more documents and, at the same time, reason over the extracted evidence. To promote research on Chinese fact-checking systems, we propose such an evidence-based Chinese dataset.
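To make the task concrete, here is a minimal Python sketch of what a single evidence-based fact-checking instance looks like and the label a verifier is expected to produce; the field names are illustrative, not the official CHEF schema.

```python
from dataclasses import dataclass, field
from typing import List

# Label set used by FEVER-style datasets, including CHEF.
LABELS = ("Supported", "Refuted", "Not enough information")

@dataclass
class Instance:
    claim: str                                            # the statement to be verified
    documents: List[str] = field(default_factory=list)    # retrieved documents
    evidence: List[str] = field(default_factory=list)     # fine-grained evidence sentences
    label: str = "Not enough information"

# The Chengdu example above, expressed in this format (paraphrased).
example = Instance(
    claim="In 2019, 120,800 students took the Chengdu high school entrance exam, "
          "but the enrollment plan was only 43,000.",
    evidence=[
        "A total of 120,800 people took the exam this year, across all 20 districts of Chengdu.",
        "In 2019, the enrollment plan for the central urban area (13 districts) is 43,015 people.",
    ],
    # The numbers are correct but the scopes (20 districts vs. 13 districts) differ,
    # so the claim is misleading and would be labeled Refuted.
    label="Refuted",
)
```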
According to a survey of fact-checking [1], current fact-checking datasets can be roughly divided into two categories: artificial and natural.
Artificial: Annotators are asked to rewrite sentences from Wikipedia into claims, and the relevant paragraphs of the source document serve as evidence for verifying these claims. If the rewrite is a synonymous paraphrase, the claim is supported by the evidence (Supported); if entities in the sentence are replaced, or modifications such as negation are added, the claim is refuted by the evidence (Refuted).
This annotation paradigm originated with FEVER [2], and many well-known later datasets such as TabFact [3] follow it. The advantage of this kind of artificial dataset is scale: annotators can be asked to write on the order of 100,000 claims, which is very suitable for training neural networks, and the relevant evidence is also easy to obtain. The disadvantage is that these claims are not the ones we encounter in daily life or that circulate among the general public. For example, nobody would rewrite a sentence from Li Liqun's Wikipedia page into "He was caught sneaking downstairs to buy meat." Moreover, this kind of dataset assumes that Wikipedia contains all the knowledge needed to verify the claims, which is a fairly strong assumption that is often not met in real-life scenarios; the simplest problem is that Wikipedia lags behind current events.
Natural: Claims crawled directly from fact-checking platforms. A relatively well-known foreign organization is PolitiFact, which often checks statements made by Trump. The advantage of this kind of dataset is that its claims are the ones the general public encounters every day and wants to know the truth about; they are also the claims that human fact-checkers actually have to sift through.
If we ultimately want to build a system that can, to some extent, replace human fact-checkers, the input to that system needs to be this kind of claim. The disadvantage of this kind of dataset is also obvious: the number of claims that have been verified by humans is very limited. As the table shows, most such datasets are an order of magnitude smaller than the artificially constructed ones.
On the other hand, finding evidence is a very difficult problem. Existing datasets generally use the fact-checking articles themselves as evidence [4], or issue the claim as a Google search query [5][6] and use the returned search snippets (shown in the red box) as evidence.
There are two problems with these methods of finding evidence:
In response to the problems above, we built CHEF, which has the following characteristics:
The construction of the dataset consists of four parts: data collection, claim annotation, evidence retrieval, and data validation.
The raw claims were mainly crawled from four Chinese fact-checking websites (listed in the registry maintained by the Duke Reporters' Lab), two of which are in Simplified Chinese: the China Rumor Refuting Center and Tencent Jiaozhen. The two Traditional Chinese sources are Taiwanese platforms: MyGoPen and the Taiwan FactCheck Center. The vast majority (90%) of the claims crawled from fact-checking websites are false, which is quite intuitive: most widely circulated rumors are false and are exactly what the fact-checking platforms debunk. Following previous work (PublicHealth [7]), we therefore crawled headlines from China News Network as true claims, yielding a dataset with relatively balanced labels.
Compared with relatively mature foreign fact-checking organizations, the articles published by Chinese fact-checking platforms are less standardized. PolitiFact, for example, states explicitly what the claim is, what the verdict summary is, and what the evidence and reasoning details are (as shown in the image above). Chinese articles generally do not mark these explicitly, so we asked annotators to read each article and extract the claim it verifies. The extracted claims were also cleaned to reduce the biases they contain.
Previous work [8] has shown that the claims in fact-checking datasets carry fairly strong biases (for example, false claims tend to contain negative words), and pre-trained language models such as BERT can exploit these biases to verify claims without looking at any evidence. Our cleaning includes turning rhetorical questions into declarative sentences and removing words likely to leak the label, such as "breaking" and "shocking." After extracting the claims, we also asked annotators to label them based on the fact-checking articles. We adopt the same three-way classification as FEVER and its successors: Supported, Refuted, and Not enough information (NEI). Refuted is the largest class and NEI the smallest.
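As a rough illustration of this cleaning step (which was in fact done manually by the annotators), the sketch below applies rule-based substitutions; the biased-word list and the rhetorical-question rule are illustrative assumptions, not the annotators' actual guidelines.

```python
import re

# Illustrative list of emotionally loaded words that may leak the label
# ("shocking", "breaking", "stunned", "urgent").
BIASED_WORDS = ["震惊", "重磅", "惊呆", "紧急"]

def clean_claim(claim: str) -> str:
    """Rough, rule-based approximation of the manual claim cleaning."""
    # Drop loaded words that correlate with the Refuted label.
    for w in BIASED_WORDS:
        claim = claim.replace(w, "")
    # Turn a rhetorical question into a declarative sentence by
    # stripping the question particle and question mark (very rough).
    claim = re.sub(r"(吗|呢)?[?？]$", "。", claim)
    # Remove leftover leading punctuation.
    return claim.lstrip("!！,，、。 ").strip()

print(clean_claim("震惊!李立群偷偷下楼买肉被抓了吗?"))
# -> 李立群偷偷下楼买肉被抓了。
```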
We issue each claim as a query to Google search and then filter the returned documents, removing those published after the claim and those from misinformation-spreading platforms, keeping the top five documents. Annotators were then asked to select up to five sentences per claim as fine-grained evidence.
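A minimal sketch of this document-filtering step, assuming the raw search results have already been obtained from a search engine (issuing the Google query itself is out of scope here); the blocklist and helper names are hypothetical.

```python
from datetime import date
from typing import List, NamedTuple

class Doc(NamedTuple):
    url: str
    published: date
    text: str

# Hypothetical blocklist of domains known to spread misinformation;
# the actual list used for CHEF is not reproduced here.
BLOCKED_DOMAINS = {"example-rumor-site.com"}

def filter_documents(claim_date: date, search_results: List[Doc], k: int = 5) -> List[Doc]:
    """Filter raw search results as described above and keep the top-k documents."""
    kept = []
    for doc in search_results:
        # Drop documents published after the claim.
        if doc.published > claim_date:
            continue
        # Drop documents from platforms known for spreading misinformation.
        if any(domain in doc.url for domain in BLOCKED_DOMAINS):
            continue
        kept.append(doc)
    return kept[:k]
```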
Statistics for the claims and evidence in the dataset are as follows: the documents returned for each claim contain 3,691 words on average, the fine-grained evidence sentences finally extracted by annotators contain 126 words on average, and the rule-based snippets returned by Google contain 68 words on average. A simple comparison of these numbers suggests that using the returned documents plus annotated sentences provides more contextual information than using the snippets directly.
To ensure labeling consistency, we added a round of data validation: 3% of the labeled claims (310 in total) were randomly sampled and redistributed to five annotators for re-labeling. Fleiss' kappa reached 0.74, slightly higher than FEVER's 0.68 and Snopes' [5] 0.70, indicating that the annotation quality is not inferior to previously constructed datasets. The claims in CHEF fall into five topics: society, public health, politics, science, and culture. Unlike European and American fact-checking platforms, which focus on politics, Chinese platforms pay more attention to public health issues, such as COVID-19, health care, and medical treatment. The other major topic is society, e.g. fraud, school admissions, and social events.
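For reference, Fleiss' kappa on such a re-annotated sample can be computed as follows; this is a generic sketch using statsmodels with toy ratings, not the authors' script.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy annotation matrix: one row per claim, one column per annotator,
# values are label ids (0 = Supported, 1 = Refuted, 2 = NEI).
# In CHEF, 310 claims (3% of the data) were re-labeled by 5 annotators.
ratings = np.array([
    [1, 1, 1, 1, 2],
    [0, 0, 0, 0, 0],
    [2, 2, 1, 2, 2],
    [1, 1, 1, 1, 1],
])

# Convert the (claims x annotators) matrix into per-category counts,
# then compute Fleiss' kappa.
counts, _ = aggregate_raters(ratings)
print(round(fleiss_kappa(counts, method="fleiss"), 2))
```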
There are four main challenges in verifying the claims:
As in previous classic fact-checking datasets (such as FEVER), a machine learning system needs to first select relevant sentences from the given documents as evidence (evidence retrieval), and then verify the claim against that evidence (claim verification).
Building on previous work, this article proposes two broad categories of baseline systems: pipeline and joint. Pipeline: evidence retrieval and claim verification are two separate modules; the evidence retrieval module first extracts the evidence, which is then concatenated with the claim and passed to the claim verification module for classification (a structural sketch follows below).
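Below is a minimal sketch of the pipeline structure. For simplicity it ranks evidence sentences with TF-IDF cosine similarity and uses a logistic-regression verifier, whereas the paper's baselines use stronger (e.g. BERT-based) retrievers and classifiers; the toy training data are placeholders.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.pipeline import make_pipeline

def retrieve_evidence(claim, sentences, k=5):
    """Rank candidate sentences by TF-IDF cosine similarity to the claim."""
    # Character n-grams avoid the need for a Chinese word segmenter.
    vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
    mat = vec.fit_transform([claim] + sentences)
    sims = cosine_similarity(mat[0], mat[1:]).ravel()
    top = sims.argsort()[::-1][:k]
    return [sentences[i] for i in top]

# Claim verification: classify the claim concatenated with its evidence.
# Toy training examples (placeholders); real baselines are trained on CHEF itself.
train_texts = [
    "成都中考人数12.08万 [SEP] 全市20个区共12.08万人参加中考",
    "李立群偷偷下楼买肉被抓 [SEP] 李立群本人发视频辟谣",
    "某地发生地震 [SEP] 暂无相关报道",
]
train_labels = ["Supported", "Refuted", "NEI"]
verifier = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
    LogisticRegression(max_iter=1000),
)
verifier.fit(train_texts, train_labels)

def verify(claim, document_sentences):
    """Pipeline: retrieve evidence first, then classify claim + evidence."""
    evidence = retrieve_evidence(claim, document_sentences)
    return verifier.predict([claim + " [SEP] " + " ".join(evidence)])[0]
```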
Joint: the evidence retrieval and claim verification modules are optimized jointly. Three different models are used. The first is the joint model that is state of the art on FEVER [10], which uses a multi-task learning framework to learn evidence labeling and claim verification at the same time. The second treats evidence extraction as a latent variable [11]: each sentence of the returned documents is tagged 0 or 1, the sentences tagged 1 are kept as evidence and classified together with the claim, and the model is trained with REINFORCE. The third is similar to the second, except that it uses HardKuma and the reparameterization trick for joint training [12] instead of policy gradients.
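The latent-variable variant can be sketched as follows: sample a binary mask over candidate sentences, classify the claim from the selected sentences, and update the selector with REINFORCE. This is a simplified PyTorch sketch in which pre-computed vectors stand in for the BERT encoders used in the paper, and no variance-reduction baseline is applied.

```python
import torch
import torch.nn as nn

class JointSelectorVerifier(nn.Module):
    """Minimal latent-variable sketch: score sentences given the claim, sample a
    0/1 mask, and classify the claim from the average of the selected sentences."""

    def __init__(self, dim, n_labels=3):
        super().__init__()
        self.selector = nn.Linear(2 * dim, 1)          # scores each sentence given the claim
        self.classifier = nn.Linear(2 * dim, n_labels)

    def forward(self, claim_vec, sent_vecs):
        # claim_vec: (dim,), sent_vecs: (n_sents, dim)
        pairs = torch.cat([sent_vecs, claim_vec.expand_as(sent_vecs)], dim=-1)
        probs = torch.sigmoid(self.selector(pairs)).squeeze(-1)   # selection probabilities
        mask = torch.bernoulli(probs)                             # sample 0/1 per sentence
        log_prob = (mask * probs.clamp_min(1e-8).log()
                    + (1 - mask) * (1 - probs).clamp_min(1e-8).log()).sum()
        evidence = (mask.unsqueeze(-1) * sent_vecs).sum(0) / mask.sum().clamp_min(1.0)
        logits = self.classifier(torch.cat([claim_vec, evidence], dim=-1))
        return logits, log_prob

# One REINFORCE training step (random tensors stand in for encoder outputs).
model = JointSelectorVerifier(dim=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
claim_vec, sent_vecs = torch.randn(8), torch.randn(6, 8)
label = torch.tensor(1)                                  # 0/1/2 = Supported/Refuted/NEI

logits, log_prob = model(claim_vec, sent_vecs)
ce = nn.functional.cross_entropy(logits.unsqueeze(0), label.unsqueeze(0))
# Classification loss plus the REINFORCE term: an unbiased (but high-variance,
# baseline-free) gradient estimate for the expected classification loss.
loss = ce + ce.detach() * log_prob
opt.zero_grad()
loss.backward()
opt.step()
```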
The main results of the experiment are shown in the figure below:
More fine-grained evidence is not always better. As shown below, the evidence extractor in the pipeline system achieves the best results when five sentences are selected as fine-grained evidence; with 10 or 15 sentences the results get progressively worse. Our guess is that the extra sentences introduce a lot of noise, which interferes with the claim verification model's judgment.
Most claims are longer than 10 words, and the longer the claim, the better the models perform. We suspect the main reason is that longer claims are more detailed, making it easier to collect detailed evidence that helps the model judge. When claims are relatively short, the gap between the baseline models is small; when claims are long, better retrieved evidence leads to better verification, which again illustrates the importance of evidence retrieval.
Claims from the science domain are the hardest to verify, with model performance basically not exceeding 55. On the one hand, relevant evidence is harder to collect; on the other hand, claims about scientific topics are relatively complex and often require implicit reasoning to resolve.
As shown in the figure, even though we introduced additional Supported claims, the dataset still suffers from class imbalance: the models perform much worse on the NEI class than on Supported and Refuted. Future work could study how to adapt claim verification models to class-imbalanced fact-checking datasets, or use data augmentation to increase the number of NEI instances during training; for example, FEVEROUS [13] randomly discards the evidence of some claims during training and relabels those claims as NEI.
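A minimal sketch of that FEVEROUS-style augmentation: during training, drop the evidence for a random fraction of claims and relabel them as NEI (the 10% rate here is an arbitrary illustrative choice, not a value from the paper).

```python
import random

def augment_with_nei(examples, rate=0.1, seed=0):
    """Randomly discard evidence for a fraction of training claims and
    relabel them as NEI, to counter the class imbalance."""
    rng = random.Random(seed)
    augmented = []
    for ex in examples:
        ex = dict(ex)                    # copy so the original data is untouched
        if ex["label"] != "NEI" and rng.random() < rate:
            ex["evidence"] = []          # throw away the gold evidence
            ex["label"] = "NEI"          # without evidence, the claim is unverifiable
        augmented.append(ex)
    return augmented
```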
The above is an overview of CHEF, the first evidence-based Chinese fact-checking dataset, jointly released by Tsinghua, Cambridge, and UIC, covering public health, society, and other domains.