In recent years, with the rapid development of question answering and multi-modal understanding technology, visual question answering (VQA) has become increasingly popular. Large-scale VQA datasets such as VQA, CLEVR, and Visual7W have been released one after another, greatly advancing the field. However, most current VQA data consists of artificially written questions, such as "What color are her eyes?", which annotators make up after seeing the picture. Such manually generated data tends to be relatively simple, low in quality, and even biased. Therefore, in this work, we propose ChiQA, a large-scale Chinese image question answering dataset built on real questions from users of QQ Browser.
ChiQA contains over 40,000 real user queries and over 200,000 question-image pairs. The data and some baseline models have been published on GitHub, and the work has been accepted as a long paper at CIKM 2022.
Paper address: https://arxiv.org/abs/2208.03030
GitHub address: https://github.com/benywon/ChiQA
Comparison of single-modal question answering tasks
Question answering is one of the most important tasks in artificial intelligence and natural language processing. In recent years, with the release of large-scale datasets (such as SQuAD and NaturalQuestions) and the introduction of large-scale pre-trained language models (such as BERT and GPT), question answering has developed rapidly. However, most current question answering tasks are unimodal: the questions, resources, and answers are all text. From the perspective of cognitive intelligence and practical applications, multi-modal resources such as images can often provide richer information and answers. For example, for the question "What are the dimensions of the iPhone 13?", a size comparison chart of the different iPhone 13 models would be clearer and more intuitive. Some further examples are shown below:
Figure 1: Some examples that are suitable for using pictures to answer user questions
In recent years, question answering datasets and tasks for multi-modal understanding have been proposed one after another, such as VQA 1.0 and 2.0, CLEVR, and GQA. In most image question answering datasets, the system provides annotators with some artificially generated or real images, and the annotators manually write questions targeting specific attributes or entities. However, this data collection process inevitably has several flaws:
1) All questions are image-dependent, that is, the annotator asks the question after seeing the picture. In large-scale data construction, artificially generated questions often lack diversity and are biased by the annotators' subjective factors. Models trained on this kind of data, where the annotator looks at the resource first and then asks the question, can often achieve very good results by looking only at the question without looking at the background resource;
2) In traditional VQA data, the answer is often a simple entity, a relationship, or a short description of a specific region. For practical image question answering, however, many textual answers are unnecessary. For example, for the question "What does an alpaca look like?", a lengthy text describing the appearance of an alpaca is redundant. In addition, this kind of short entity description often leads annotators to pay attention only to local relationships and to neglect information about the overall structure of the image;
3) Finally, most previous resources focus on English, and there is very little image question answering data in Chinese.
In this work, in response to the above problems, we propose a large-scale Chinese image question answering dataset, ChiQA (Chinese Image Question Answering). We started from real user search queries in the mobile QQ Browser, retrieved several related pictures for each query through a specific API, and then handed the pictures to professionally trained annotators for three-level annotation, indicating whether a picture answers the user's question perfectly (2 points), partially (1 point), or not at all (0 points). For ChiQA, there are three notable features:
In the end we collected more than 40,000 questions, each with about 5 related pictures, for a total of more than 200,000 question-picture pairs. Every picture is scored on the three-level 2-1-0 scale.
Some examples in ChiQA are shown below:
Figure 3: Some example samples in ChiQA.
Data collection: All questions come from real user queries
The entire data collection process can be divided into four steps. The overall flow chart is as follows:
Figure 4: Data collection process
A major feature of ChiQA is that all questions come from real user queries. However, if we randomly sample queries from a search engine's user logs, most of them have no question answering intent.
So we first need to filter for queries with question answering intent. In this work, we use an internally constructed weak supervision method to train a binary classifier that determines whether a query has question answering intent. In human evaluation, this intent model achieved 90% precision and 80% recall. We used this model to sample user queries and obtained approximately 75,000 queries that the model judged to have question answering intent; these entered the next round.
After getting the questions, we send them to an open API for Google image search (Google Images API - SerpApi) to retrieve related images. The API returns the 100 most relevant images for each query. To ensure the quality of the final data, we removed images whose height or width was less than 200 pixels, as well as images that were too long or too wide.
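The size filter described above can be sketched as follows. This is a minimal illustration, not the authors' code: the 200-pixel minimum comes from the text, while the `RetrievedImage` type, the function names, and the aspect-ratio cutoff of 3.0 are assumptions, since the article does not say how "too long or too wide" was defined. Taking the first 5 surviving images follows the next step of the pipeline.

```python
from dataclasses import dataclass

MIN_SIDE = 200          # pixels; threshold stated in the article
MAX_ASPECT_RATIO = 3.0  # hypothetical cutoff for "too long or too wide"


@dataclass
class RetrievedImage:
    """Hypothetical record for one image returned by the search API."""
    url: str
    width: int
    height: int


def keep_image(img, min_side=MIN_SIDE, max_aspect=MAX_ASPECT_RATIO):
    """Return True if the image is large enough and not too elongated."""
    if img.width < min_side or img.height < min_side:
        return False
    ratio = max(img.width, img.height) / min(img.width, img.height)
    return ratio <= max_aspect


def top_candidates(images, k=5):
    """Keep the retrieval order and return the first k images that pass."""
    return [img for img in images if keep_image(img)][:k]
```

Because the list is filtered in retrieval order, the candidates handed to annotators are still the most relevant images that meet the size constraints.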
After obtaining the original images, we take the first 5 images that pass the filter and ask the annotator to annotate the query together with these 5 images. We designed an annotation interface specifically for this task, as shown in the figure below.
Figure 5: ChiQA annotation interface
During the annotation process, we asked the annotators to mark three aspects:
1) Question annotation
Since this work focuses on image question answering, and many common user questions have nothing to do with images (or are difficult to answer with images), we first ask the annotator to mark whether the question can be regarded as an image question answering question. For example:
If a question is "the difference between xxx and xxx", then it is considered a question with image question answering intent;
If a question is vague, ambiguous, or contains opinions that are not based on factual inference, then it is classified as having no image question answering intent and does not take part in the subsequent image annotation process.
Some query annotation examples are shown in Figure 6:
Figure 6: Example of query annotation
2) Image annotation
For each valid query from the previous step, we label its 5 candidate images. The annotation standard is a three-level 0-1-2 scale, where:
0 points means that the picture cannot be used to answer the question at all, and 2 points means that the picture is of acceptable quality and can fully answer the question on its own. A score of 1 lies between the two: the picture is related to the query but does not answer it directly, and the user may need further queries or reasoning to reach the final answer. Some examples of 0, 1, and 2 points are shown below:
Figure 7: Examples of image annotation and scoring for the question "How to use different prepositions"
3) Quality control
We adopt a strict quality control procedure throughout the annotation process. Specifically, we first invite 3 annotation teams to conduct trial annotation and select the team with the best annotation quality to annotate all the remaining data. Second, during annotation we divide the data into batches. For each batch, we sample one-fifth of the data for manual verification. If the pass rate is below 90%, the batch is returned and re-labeled until its accuracy reaches 90%.
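The batch acceptance rule can be sketched roughly as below. The one-fifth sampling fraction and the 90% threshold come from the text; the function name, the `verify` callback (standing in for the manual check), and the handling of an empty batch are assumptions made for illustration.

```python
import random


def batch_passes(batch, verify, sample_fraction=0.2, min_pass_rate=0.9, rng=None):
    """Spot-check a random fraction of a batch and apply the pass-rate rule.

    `verify` is a hypothetical callback that returns True when a reviewer
    agrees with the annotation of one item.
    """
    if not batch:
        return True  # assumption: nothing to verify in an empty batch
    rng = rng or random.Random()
    sample_size = max(1, round(len(batch) * sample_fraction))
    sample = rng.sample(batch, sample_size)
    passed = sum(1 for item in sample if verify(item))
    return passed / sample_size >= min_pass_rate
```

A batch for which `batch_passes` returns False would go back to the annotators, and the check would be repeated after re-labeling.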
During data collection, we found that if the data is randomly sampled and annotated, it often contains some simple patterns, and a large number of such patterns may bias the final model. Therefore, we designed an active learning annotation process. Specifically, we first ask annotators to annotate a batch of data. Once this batch is annotated, we use it to train a cross-modal text-image matching model. After the model is trained, we use it to "select" new samples: if the model is very uncertain about its prediction for a new sample (that is, the entropy of the final classification prediction is particularly large), we consider the sample relatively difficult for the model, so it is kept for the next round with a higher probability. Otherwise, the model is already very confident about the sample, so it is kept for the next round with a lower probability.
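A minimal sketch of this entropy-based selection is given below. The article only states that higher prediction entropy means a higher chance of being kept; the linear mapping from normalized entropy to keep probability, the `floor` value, and all function names are assumptions for illustration.

```python
import math
import random


def entropy(probs):
    """Shannon entropy (in nats) of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def keep_probability(probs, floor=0.1):
    """Map prediction entropy to a keep probability in [floor, 1].

    Requires at least two classes. `floor` is a hypothetical minimum so that
    confident samples are still occasionally kept.
    """
    max_entropy = math.log(len(probs))
    normalized = entropy(probs) / max_entropy
    return floor + (1 - floor) * normalized


def select_for_next_round(samples, predict, rng=None):
    """Keep each sample with a probability that grows with model uncertainty.

    `predict` is a hypothetical function returning the matching model's
    probabilities over the three labels (0, 1, 2) for one sample.
    """
    rng = rng or random.Random()
    return [s for s in samples if rng.random() < keep_probability(predict(s))]
```

With three labels, a uniform prediction is kept with probability close to 1, while a fully confident prediction is kept only with the floor probability.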
We found that the active learning data selection process does make the dataset less biased. The labeled data from the first stage contained some subtle biases. For example, questions containing the word "Tips" were marked as valid questions, but almost all of their corresponding images were marked as unanswerable (0 points). A model is therefore likely to predict the final result directly from the question, without looking at the image. With active learning, such high-confidence, biased shortcut samples are less likely to be selected in the next round, which reduces the impact of this pattern.
We randomly sampled 2,500 items from the annotated data and asked different annotators to re-annotate them. If the new annotation matches the previous result, the item is retained for the test set. If it differs, we ask an "expert" who knows the task very well to re-annotate the item. In the end we obtained 2,362 test items and more than 40,000 training items. The statistics of the training set and the test set are shown in the following figure:
Figure 8: The statistical information of the training set and the test set in ChiQA
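The agreement check used when building the test set can be sketched as follows. This is only an illustration: the function and variable names are hypothetical, and since the article does not spell out what happens to disputed items after expert review, the sketch simply returns them separately.

```python
def split_by_agreement(first_pass, second_pass):
    """Split re-annotated items into agreed labels and disputed item ids.

    Both arguments map an item id to its 0/1/2 label from one annotation pass.
    """
    agreed, disputed = {}, []
    for item_id, label in first_pass.items():
        if second_pass.get(item_id) == label:
            agreed[item_id] = label    # both passes match: test-set candidate
        else:
            disputed.append(item_id)   # mismatch: send to the expert annotator
    return agreed, disputed
```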
After annotating the data, we perform statistics and analysis on the data in ChiQA.
1) Common query word analysis
We use Jieba word segmentation to segment the queries, and display the words in the queries in the following word cloud according to frequency:
You can see that the most common query words in ChiQA are "difference", "illustration", "location", and so on. This is consistent with our intuition, because these words do correspond to questions that are well suited to being answered with pictures.
2) Domain analysis
We use an internal domain classifier to classify all queries. The final result is shown in the figure below:
It can be seen that our data covers many domains, and no single domain accounts for an absolute majority. This ensures that our data is evenly distributed. We also count the interrogative words in the questions. The results are shown in the following figure:
You can see that "what" and "how to" questions account for the majority in ChiQA, while other question types also make up a considerable proportion.
3) Image analysis
In addition to the questions, we also analyzed the images in ChiQA. Since most images are language-independent, we use DETR, an object detection model widely recognized in the industry for its strong performance, to mine the entities in each image. DETR maps entities in an image to the standard MS-COCO categories, such as "person" and "dog". We mine entities for each image in ChiQA and show the distribution of the most frequent entities in the following figure:
You can see that more than 30 entities appear at least 1,000 times in ChiQA, which shows that ChiQA is an image dataset that is evenly distributed and covers most domains. The entities that appear most often are "person", "mobile phone", "car", and so on. This is similar to the distribution of the questions.
4) Reasoning skills
To better understand the data, we also analyzed the reasoning skills that ChiQA requires. We focus on 5 skills that require reasoning:
We randomly sampled 200 items from ChiQA and labeled them according to the above 5 standards; some items may require more than one reasoning skill. The result is shown below.
It can be seen that, in addition to Grounding, more than 80% of ChiQA data requires an in-depth understanding of the text and contrast relationships in images. This is very different from most previous VQA data. In addition, quite a few questions require logic and comparison, indicating that the data in ChiQA is quite difficult. We believe that this analysis of reasoning skills can help us better understand the data and provide some prior guidance for subsequent model design.
In the ChiQA dataset, annotations have three levels (0, 1, 2), so in the experiments we evaluate models with both ranking metrics and ordinary classification metrics. The metrics are divided into three categories.
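To illustrate how a ranking metric can use the graded 0/1/2 labels, here is a small sketch of NDCG@k. The article does not name its metrics at this point, so NDCG is chosen here only as one common graded ranking metric; the function names and the exponential gain formulation are assumptions.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain for labels listed in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))


def ndcg_at_k(labels, scores, k):
    """NDCG@k for one question.

    `labels` are the 0/1/2 annotations of its candidate images and `scores`
    are the model's matching scores for the same images.
    """
    ranked = [label for _, label in
              sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)]
    ideal_dcg = dcg(sorted(labels, reverse=True)[:k])
    return dcg(ranked[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0
```

Averaging this value over all questions gives a dataset-level score; questions whose images are all labeled 0 contribute 0 under this sketch.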
Baseline Model
We experimented with multiple commonly used models on the ChiQA data set. Following the previous image-text matching work, we first encode the image and text with encoders respectively, then perform cross-modal fusion of their representations, and finally use a prediction layer to obtain the matching score. In the models listed below, adding ♣ means that the model has been pre-trained, and adding ♦ means that it has not been pre-trained.
The results are as follows:
The metrics of the above models on the test set are shown in the figure. We can see that directly applying previous state-of-the-art cross-modal methods performs poorly, with metrics only slightly better than a random scoring model. This means that the ChiQA data is difficult, and that models relying only on large-scale weakly supervised contrastive learning, such as ALBEF* and Wenlan, may not be able to distinguish the fine-grained information required for visual question answering. Furthermore, the poor performance of these models shows that ChiQA differs from previous weakly supervised image-text matching data: weakly supervised image-text matching focuses on relevance, while ChiQA also requires that the image can answer the question.
Finally, the models fine-tuned on ChiQA improve greatly over the baseline but are still far from human performance, so there is still a lot of room for improvement on the ChiQA dataset.
With the development of the Internet, users have higher demands for question answering, and systems need to provide more intuitive and convenient answers. Especially in recent years, multimedia content has become increasingly abundant, and more and more question answering content based on pictures and videos has reached the public. The QQ Browser Lab Lizhi team was the first in the industry to launch a picture question answering project, in April this year. For example, if a user searches for the difference between kiwi fruit and kiwi fruit, the results are displayed intuitively to the user in the form of pictures, as shown in the figure below:
Since going online, this way of serving questions that can be directly satisfied by pictures has achieved good results. We have observed that user behavior metrics (such as CTR and query reformulation rate) have improved significantly compared with traditional results, indicating that this "new Q&A" based on pictures and similar content is a product offering that better meets user needs.
Introduction to the author team
The QQ Browser Search Technology Center team is responsible for search technology research and development within Tencent PCG's information platform and service line. Relying on Tencent's content ecosystem and driving product innovation through user research, it serves users' diverse information needs, including images and text, news, novels, long and short videos, and services. On the algorithm side, building on natural language processing, deep learning, multi-modal understanding and generation, and knowledge computation and application, the team works on content understanding, relevance and ranking, multi-modal search, intelligent question answering, multi-language translation, search recommendation, and other directions, exploring and applying the industry's advanced technologies to create a better search experience. On the engineering side, it builds an industrialized middle-platform system for search technology and refines a high-performance, high-availability, low-cost retrieval system at the scale of tens of billions, providing basic search engine services for the search scenarios of Tencent PCG's content businesses. It currently supports multiple PCG product lines such as QQ Browser, Tencent Video, Tencent News, and Tencent Weishi.