Recently, the paper "Generative Dense Retrieval: Memory Can Be a Burden" from the Xiaohongshu search algorithm team was accepted by EACL 2024, an international conference in the field of natural language processing. For Oral, the acceptance rate is 11.32% (144/1271).
In their paper, they proposed a novel information retrieval paradigm - Generative Dense Retrieval (GDR). This paradigm can well solve the challenges faced by traditional generative retrieval (GR) when dealing with large-scale data sets. It is inspired by the memory mechanism.
#In past practice, GR relied on its unique memory mechanism to achieve in-depth interaction between queries and document libraries. However, this method that relies on language model autoregressive coding has obvious limitations when processing large-scale data, including fuzzy fine-grained document features, limited document library size, and difficulty in index update.
The GDR proposed by Xiaohongshu adopts a two-stage retrieval idea from coarse to fine. First, it uses the limited memory capacity of the language model to realize the mapping of queries to documents, and then The precise mapping of documents to documents is completed through the vector matching mechanism. GDR effectively alleviates the inherent drawbacks of GR by introducing a vector matching mechanism for dense set retrieval.
In addition, the team also designed a "memory-friendly document cluster identifier construction strategy" and a "document cluster adaptive negative sampling strategy". The retrieval performance of the two stages is improved respectively. Under multiple settings of the Natural Questions data set, GDR not only demonstrated SOTA's Recall@k performance, but also achieved good scalability while retaining the advantages of deep interaction, opening up new possibilities for future research on information retrieval. sex.
Text search tools have important research and application value. Traditional search paradigms, such as sparse retrieval (SR) based on word matching and dense retrieval (DR) based on semantic vector matching, although each has its own merits, with the rise of pre-trained language models, based on this The generative retrieval paradigm began to emerge. The beginnings of the generative retrieval paradigm were primarily based on semantic matches between queries and candidate documents. By mapping queries and documents into the same semantic space, the retrieval problem of candidate documents is transformed into a dense retrieval of vector matching degrees. This groundbreaking retrieval paradigm takes advantage of pre-trained language models and brings new opportunities to the field of text search. However, the generative retrieval paradigm still faces challenges. On the one hand, existing pre-training
During the training process, the model uses the given query as the context to autoregressively generate identifiers of relevant documents. This process enables the model to memorize the candidate corpus. After the query enters the model, it interacts with the model parameters and is autoregressively decoded, which implicitly produces deep interaction between the query and the candidate corpus, and this deep interaction is exactly what SR and DR lack. Therefore, GR can show excellent retrieval performance when the model can accurately memorize candidate documents.
Although GR's memory mechanism is not impeccable. Through comparative experiments between the classic DR model (AR2) and the GR model (NCI), we have confirmed that the memory mechanism will bring at least three major challenges:
We separately calculated the probability of an error for NCI and AR2 when decoding each bit of the document identifier from coarse to fine. For AR2, we find the identifier corresponding to the most relevant document for a given query through vector matching, and then count the first error steps of the identifier to obtain the step-by-step decoding error rate corresponding to AR2. As shown in Table 1, NCI performs well in the first half of decoding, while the error rate is higher in the second half, and the opposite is true for AR2. This shows that NCI can better complete the coarse-grained mapping of the semantic space of candidate documents through the overall memory database. However, since the selected features during the training process are determined by search, its fine-grained mapping is difficult to remember accurately, so it performs poorly in fine-grained mapping.
As shown in Table 2, we use 334K candidate document library size (first row) and 1M candidate Document size (second row) NCI model trained and tested with R@k metric. The results show that NCI dropped 11 points on R@100, compared to AR2 which only dropped 2.8 points. In order to explore the reason why NCI performance significantly decreases as the size of the candidate document library increases, we further tested the test results of the NCI model trained on the 1M document library when using 334K as the candidate document library (third row). Compared to the first line, NCI's burden of memorizing more documents results in a significant decrease in its recall performance, indicating that the model's limited memory capacity limits its ability to memorize large-scale candidate document libraries.
When new When a document needs to be added to the candidate library, the document identifier needs to be updated, and the model needs to be retrained to re-memorize all documents. Otherwise, outdated mappings (query to document identifier and document identifier to document) will significantly reduce retrieval performance.
The above problems hinder the application of GR in real scenarios. For this reason, after analysis, we believe that the matching mechanism of DR has a complementary relationship with the memory mechanism, so we consider introducing it into GR to retain the memory mechanism while suppressing its disadvantages. We proposed a new paradigm of generative dense retrieval (Generative Dense Retrieval, GDR):
Taking the query as input, we use the language model Memorize the candidate document library and autoregressively generate k relevant document clusters (CID) to complete the following mapping:
In this process, CID The generation probability is:
where
is all query embeddings generated by the encoder,
is the one-dimensional query representation produced by the encoder. This probability is also stored as an inter-cluster matching score and participates in subsequent operations. Based on this, we use standard cross-entropy loss to train the model:
We further retrieve candidate documents from candidate document clusters and complete intra-cluster matching:
We introduce a document encoder to extract candidate documents Characterization, this process will be completed offline. Based on this, calculate the similarity between the documents in the cluster and the query as the within-cluster matching score:
In this process, NLL loss is used to train the model:
Finally, we calculate the weighted value of the inter-cluster matching score and the intra-cluster matching score of the document and sort them, Select the Top K among them as the retrieved relevant documents:
where beta is set to 1 in our experiment.
#In order to make full use of the limited memory capacity of the model to achieve deep interaction between the query and the candidate document library, we propose Memory-friendly document cluster identifier construction strategy. This strategy first uses the model memory capacity as the benchmark to calculate the upper limit of the number of documents in the cluster:
On this basis, it is further constructed through the K-means algorithm Document cluster identifier to ensure that the memory burden of the model does not exceed its memory capacity:
The two-stage retrieval framework of GDR determines that the negative samples within the cluster account for a larger proportion during the intra-cluster matching process. To this end, in the second stage of training, we use document cluster division as the benchmark and explicitly enhance the weight of negative samples within the cluster to obtain better intra-cluster matching effect:
The data set used in the experiment is Natural Questions (NQ), It contains 58K training pairs (query and related documents) and 6K validation pairs, along with a library of 21M candidate documents. Each query has multiple related documents, which places higher requirements on the recall performance of the model. To evaluate the performance of GDR on document bases of different sizes, we constructed different settings such as NQ334K, NQ1M, NQ2M, and NQ4M by adding the remaining passages from the full 21M corpus to NQ334K. GDR generates CIDs on each dataset separately to prevent the semantic information of the larger candidate document library from leaking into the smaller corpus. We adopt BM25 (Anserini implementation) as the SR baseline, DPR and AR2 as the DR baseline, and NCI as the GR baseline. Evaluation metrics include R@k and Acc@k.
On the NQ data set, GDR improves the R@k indicator by an average of 3.0, while on the Acc@ k index ranks second. This shows that GDR maximizes the advantages of the memory mechanism in deep interaction and the matching mechanism in fine-grained feature discrimination through a coarse-to-fine retrieval process.
We noticed that when the candidate corpus is expanded to a larger At the scale, the R@100 decrease rate of SR and DR remains below 4.06%, while the decrease rate of GR exceeds 15.25% in all three expansion directions. In contrast, GDR achieves an average R@100 reduction rate of 3.50%, which is similar to SR and DR by focusing the memory content on a fixed volume of corpus coarse-grained features.
GDR- in Table 3 bert and GDR-ours respectively represent the corresponding model performance under the traditional and our CID construction strategies. Experiments have proven that using a memory-friendly document cluster identifier construction strategy can significantly reduce the memory burden, thereby leading to better retrieval performance. In addition, Table 4 shows that the document cluster adaptive negative sampling strategy used in GDR training enhances fine-grained matching capabilities by providing more discriminative signals within document clusters.
When there is a new document When adding the candidate document library, GDR adds the new document to the nearest document cluster cluster center and assigns it a corresponding identifier. At the same time, the document encoder extracts the vector representation and updates the vector index, thereby completing the rapid expansion of new documents. As shown in Table 6, in the setting of adding new documents to the candidate corpus, NCI's R@100 drops by 18.3 percentage points, while GDR's performance only drops by 1.9 percentage points. This shows that GDR alleviates the difficult scalability of the memory mechanism by introducing a matching mechanism and maintains good recall effects without retraining the model.
Limited by the characteristics of language model autoregressive generation, although GDR introduces a vector matching mechanism in the second stage, compared with GR, it achieves The retrieval efficiency is significantly improved, but there is still a large room for improvement compared to DR and SR. We look forward to more research in the future to help alleviate the delay problem caused by the introduction of memory mechanisms into the retrieval framework.
In this study, we deeply explored the dual role of memory mechanism in information retrieval. Blade effect: On the one hand, this mechanism realizes in-depth interaction between the query and the candidate document library, making up for the shortcomings of intensive retrieval; on the other hand, the limited memory capacity of the model and the complexity of updating the index make it difficult to deal with large-scale and dynamic problems. It is difficult to change the candidate document library. In order to solve this problem, we innovatively combine the memory mechanism and the vector matching mechanism in a hierarchical manner to achieve the effect of leveraging the strengths and avoiding weaknesses of the two and complementing each other.
# We propose a new text retrieval paradigm, Generative Dense Retrieval (GDR). GDR This paradigm performs a two-stage retrieval from coarse to fine for a given query. First, the memory mechanism autoregressively generates document cluster identifiers to map the query to document clusters, and then uses vector matching. The mechanism calculates the similarity between the query and the document to complete the mapping of document clusters to documents.
The memory-friendly document cluster identifier construction strategy ensures that the memory burden of the model does not exceed its memory capacity and improves the inter-cluster matching effect. The document cluster adaptive negative sampling strategy enhances the training signal for distinguishing negative samples within the cluster and increases the matching effect within the cluster. Extensive experiments have proven that GDR can achieve excellent retrieval performance on large-scale candidate document libraries and can efficiently respond to document library updates.
As a successful attempt to integrate the advantages of traditional retrieval methods, the generative intensive retrieval paradigm has good recall performance, strong scalability, and robust performance in scenarios with massive candidate document libraries. advantage. As large language models continue to improve in their understanding and generation capabilities, the performance of generative intensive retrieval will be further improved, opening up a broader world for information retrieval.
Paper address: https://www.php.cn/link/9e69fd6d1c5d1cef75ffbe159c1f322e
The above is the detailed content of Xiaohongshu interprets information retrieval from the memory mechanism and proposes a new paradigm to obtain EACL Oral. For more information, please follow other related articles on the PHP Chinese website!