Why are late interaction models becoming the standard for next-generation RAG?

The AIxiv column is where this site publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 reports covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work that you would like to share, please feel free to contribute or contact us for coverage. Submission email: liyazhou@jiqizhixin.com; zhaoyunfeng@jiqizhixin.com

Zhang Yingfeng: co-founder of Infra, with many years of experience in search, AI, and infrastructure development. He is currently working on building the core products of next-generation RAG.

In the development of RAG systems, a good Reranker model is an indispensable component and is routinely used in evaluations. This is because queries served by vector search alone suffer from a low hit rate, which has led to a two-stage ranking architecture that uses vector search for coarse screening and a Reranker model for fine ranking.

There are currently two main types of architecture for ranking models:

1. Dual encoder. Taking the BERT model as an example, it encodes the query and the document separately and finally passes each through a Pooling layer so that the output contains only one vector. At query time, only the similarity of the two vectors needs to be computed, as shown in the figure below. Dual encoders can be used in both the Ranking and Reranking stages, and vector search is in fact this kind of ranking model. Because a dual encoder encodes the query and the document separately, it cannot capture the complex interactions between query tokens and document tokens, which loses a lot of semantics. However, because only a vector search is needed to complete ranking and scoring, execution efficiency is very high.

2. Cross-Encoder. It uses a single encoder model to encode the query and the document at the same time. It can capture the complex interactions between query and document, so it can provide more accurate ranking results. A Cross-Encoder does not output vectors for the query and document tokens; instead, it adds a classifier that directly outputs a similarity score for the query and the document. The drawback is that every document must be encoded together with the query at query time, which makes ranking very slow, so a Cross-Encoder can only be used to rerank final results. For example, reranking just the Top 10 results of the initial screening still takes several seconds.
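The dual-encoder scoring path from item 1 can be sketched as follows. This is a minimal illustration, not any particular model: the token vectors are random stand-ins for encoder output, mean pooling is assumed as the Pooling layer (one common choice), and the function names are hypothetical.

```python
import numpy as np

def mean_pool(token_vectors: np.ndarray) -> np.ndarray:
    """Collapse a (num_tokens, dim) matrix into one (dim,) vector.
    Mean pooling is an assumption; other pooling schemes exist."""
    return token_vectors.mean(axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def dual_encoder_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Score = similarity of the two pooled vectors; no token-level interaction."""
    return cosine(mean_pool(query_tokens), mean_pool(doc_tokens))

# Random stand-ins for per-token encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
query_tokens = rng.normal(size=(8, 128))
doc_tokens = rng.normal(size=(40, 128))
score = dual_encoder_score(query_tokens, doc_tokens)
```

Because the document vector depends only on the document, it can be computed once and indexed. The cost is that all token-level detail is averaged away before the query is ever seen.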

Since the start of this year, another line of work represented by ColBERT [Reference 1] has attracted wide attention in the RAG development community. As shown in the figure below, it has several characteristics that differ markedly from the two types of ranking model above. First, unlike a Cross Encoder, ColBERT still encodes the query and the document separately. This separation allows document encoding to be done offline, so only the query is encoded at query time, and processing is much faster than with a Cross Encoder.

Second, compared with a dual encoder, ColBERT outputs multiple vectors rather than a single vector. This is because it takes the final output layer of the Transformer directly, whereas a dual encoder converts the multiple vectors into a single output vector through a Pooling layer and thereby loses some semantics.

For ranking, ColBERT introduces a late-interaction similarity function and names it maximum similarity (MaxSim). It is computed as follows: for each query token vector, the similarity to the vectors of all document tokens is computed, and the maximum score for that query token is tracked. The total score for the query and the document is the sum of these maximum cosine scores. As the figure below shows, for a query with 32 token vectors (the maximum query length is 32) and a document with 128 tokens, 32*128 similarity operations are needed.
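The MaxSim rule described above can be sketched in a few lines of NumPy. This is an illustrative sketch only: the vectors are random stand-ins for ColBERT output, and the function names are hypothetical rather than taken from any library.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """Sum, over query tokens, of the best cosine match among document tokens."""
    q = l2_normalize(query_vecs)   # (M, dim)
    d = l2_normalize(doc_vecs)     # (N, dim)
    sim = q @ d.T                  # (M, N): all M * N token-pair similarities
    return float(sim.max(axis=1).sum())

# 32 query tokens and 128 document tokens, as in the example in the text.
rng = np.random.default_rng(0)
query_vecs = rng.normal(size=(32, 128))
doc_vecs = rng.normal(size=(128, 128))
score = maxsim(query_vecs, doc_vecs)
```

The `(M, N)` similarity matrix makes the 32*128 cost explicit. The document matrix does not depend on the query, which is what allows it to be computed offline.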

By comparison, then, the Cross Encoder can be called an Early Interaction Model, while the work represented by ColBERT can be called a Late Interaction Model.

The following figure compares the ranking models above in terms of performance and ranking quality. A late interaction model can capture the complex interactions between query and document during ranking while avoiding the cost of encoding document tokens at query time, so it can deliver good ranking quality together with fast ranking: at the same data scale, ColBERT can be more than 100 times as efficient as a Cross Encoder. The late interaction model is therefore a very promising ranking model. A natural question follows: can a late interaction model be used directly in RAG to replace the two-stage architecture of vector search plus fine ranking?

To answer this, we need to consider some engineering issues with ColBERT:

1. ColBERT's MaxSim late-interaction similarity function is far more efficient than a Cross Encoder, but its computational cost is still very large compared with ordinary vector search. Because the similarity between a query and a document is a multi-vector calculation, MaxSim costs M * N times as much as an ordinary vector similarity calculation (M is the number of tokens in the query, N is the number of tokens in the document). In response, the authors of ColBERT released ColBERT v2 in 2021 [Reference 2], which improves the quality of the generated embeddings by using a Cross Encoder and model distillation, and uses compression to quantize the generated document vectors, thereby improving MaxSim performance. The RAGatouille project [Reference 3], a wrapper around ColBERT v2, has become a solution for high-quality RAG ranking. However, ColBERT v2 is only an algorithm library, and it remains difficult to use end-to-end in enterprise-level RAG systems.

2. ColBERT is a pre-trained model whose training data comes from search engine queries and returned results, and these texts are short. For example, 32 query tokens and 128 document tokens are typical length limits. When ColBERT is applied to real data, anything beyond the limit is truncated, which is unfriendly to long-document retrieval.

To address these issues, the open source AI-native database Infinity provides a Tensor data type in its latest version and natively offers an end-to-end ColBERT solution. With Tensor as a data type, the multiple vectors output by ColBERT encoding can be stored directly in one Tensor, so the similarity between Tensors directly yields the MaxSim score. To deal with the heavy computation of MaxSim, Infinity offers two optimizations. The first is binary quantization, which shrinks a Tensor to only 1/32 of its original size without changing the relative order of the MaxSim results. This option is mainly used for the Reranker, because the corresponding Tensors have to be fetched based on the results of the preceding coarse screening stage. The second is the Tensor Index. ColBERT v2 is in fact the Tensor Index implementation released by the authors of ColBERT. Infinity uses EMVB [Reference 4], which can be regarded as an improvement on ColBERT v2, mainly through quantization and pre-filtering, with SIMD instructions introduced on key operations to speed up the implementation. A Tensor Index can only serve a Ranker, not a Reranker. In addition, for long texts that exceed the token limit, Infinity introduces the Tensor Array type:

As the figure below illustrates, a document that exceeds the ColBERT limit is split into multiple paragraphs. Each paragraph is encoded into its own Tensor, and all of these Tensors are stored in the same row as the original document. When MaxSim is computed, the query is scored against each paragraph separately, and the maximum value is taken as the score of the whole document.
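The document-level scoring rule just described (score each paragraph, keep the maximum) can be sketched as follows. This is a sketch under stated assumptions, not Infinity's implementation: the paragraph tensors are random unit vectors standing in for separately encoded paragraphs, and `tensor_array_score` is a hypothetical name.

```python
import numpy as np

def normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def maxsim(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """MaxSim over unit-length token vectors."""
    return float((query_vecs @ doc_vecs.T).max(axis=1).sum())

def tensor_array_score(query_vecs: np.ndarray, paragraph_tensors: list) -> float:
    """Score a long document as the best MaxSim over its paragraph tensors."""
    return max(maxsim(query_vecs, p) for p in paragraph_tensors)

# One long document stored as three paragraph tensors of up to 128 tokens each.
rng = np.random.default_rng(0)
query_vecs = normalize(rng.normal(size=(32, 128)))
paragraph_tensors = [normalize(rng.normal(size=(n, 128))) for n in (128, 128, 60)]
doc_score = tensor_array_score(query_vecs, paragraph_tensors)
```

Taking the maximum means a long document is ranked by its single most relevant paragraph, so unrelated paragraphs in the same document do not dilute the score.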

With Infinity, therefore, a late interaction model can be introduced end-to-end to serve RAG with high quality. So should ColBERT be used as a Ranker or as a Reranker? Below we use Infinity to run an evaluation on a real data set. The latest version of Infinity implements the most comprehensive hybrid search solution to date: recall methods include vector search, full-text search, sparse vector search, the Tensor search described above, and any combination of these, and several Reranker methods are provided, such as RRF and the ColBERT Reranker. We therefore include various combinations of hybrid search and Reranker in the evaluation.
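For readers unfamiliar with RRF (Reciprocal Rank Fusion), the following sketch shows how ranked lists from several recall routes can be merged. It is illustrative only: the constant `k = 60` is the value commonly used in the RRF literature, the source does not state which parameters Infinity uses, and the document ids are made up.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse ranked lists: each document scores sum(1 / (k + rank)) over the lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results from three recall routes.
full_text = ["d1", "d2", "d3"]
dense = ["d2", "d1", "d4"]
sparse = ["d2", "d3", "d1"]
fused = rrf_fuse([full_text, dense, sparse])
```

RRF uses only ranks, not raw scores, so it can combine routes whose scores are not comparable (for example BM25 scores and cosine similarities).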

We use the MLDR data set for the evaluation. MTEB [Reference 5] is a benchmark for evaluating the quality of Embedding models, and MLDR (Multi Long Document Retrieval) is one of its data sets, containing a total of 200,000 long-text documents. The evaluation uses BGE-M3 [Reference 6] as the Embedding model and Jina-ColBERT [Reference 7] to generate Tensors. The evaluation script is available in the Infinity repository [Reference 8].

Evaluation 1: Is ColBERT effective as a Reranker? We use BGE-M3 to generate dense vectors and sparse vectors from the 200,000 MLDR documents and insert them into the Infinity database. The table contains 4 columns, which store the original text, vectors, sparse vectors, and Tensors respectively, and the corresponding full-text index, vector index, and sparse vector index are built. The evaluation covers all recall combinations, including single-route, two-route, and three-route recall, as shown below:

The evaluation metric is nDCG@10. Other parameters: when the RRF Reranker is used, coarse screening returns the Top N = 1000; there are 800 queries in total, with an average query length of about 10 tokens.
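As a reminder of what nDCG@10 measures, here is a minimal sketch of the metric. It assumes linear gain (the relevance grade itself), which coincides with the exponential-gain variant for binary relevance; the function names are hypothetical and this is not the benchmark's own script.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain of the first k relevance grades."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the returned ranking divided by the DCG of an ideal ranking.

    ranked_relevances: grades of the returned documents, in rank order.
    all_relevances: grades of every judged document for the query.
    """
    ideal_dcg = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# One relevant document for the query, returned at rank 3.
value = ndcg_at_k([0, 0, 1, 0], [1], k=10)
```

In this example the single relevant document is discounted by `log2(4) = 2`, so the query scores 0.5. A benchmark score is typically the mean of this value over all queries.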

As the figure shows, every recall scheme improves significantly after the ColBERT Reranker is applied. As a late interaction model, ColBERT provides ranking quality comparable to the models at the top of MTEB's Reranker leaderboard, but with 100x the performance, which allows reranking at a much larger scale. The results in the figure rerank the Top 100. Reranking the Top 1000 with ColBERT does not change the scores significantly while performance drops noticeably, so it is not recommended. Traditionally, an external Reranker based on a Cross Encoder incurs second-level latency even for the Top 10. Infinity, however, implements a high-performance ColBERT Reranker internally, so even reranking the Top 100 or Top 1000 does not affect the user experience, while the recall range is greatly widened and the final ranking quality can improve significantly. In addition, the ColBERT Reranker computation only needs a pure CPU architecture, which also greatly reduces deployment cost.

Evaluation 2: This comparison uses ColBERT as a Ranker rather than a Reranker, so a Tensor Index has to be built on the Tensor column. To measure the accuracy loss introduced by the Tensor Index, a brute-force search was also performed.

It can be seen that, compared with using ColBERT as a Reranker, using it as a Ranker brings no significant improvement even with lossless brute-force search, and the ranking quality with a Tensor Index is even lower than with the Reranker. Moreover, queries are much slower when ColBERT is used as a Ranker. The MLDR data set contains 200,000 documents, about 2 GB; after conversion to Tensor data with Jina-ColBERT, it grows to 320 GB. This is because the Tensor data type must store the vector for every token of a document, and the ColBERT model outputs 128-dimensional vectors, so by default the data volume expands by 2 orders of magnitude. Even with a Tensor Index built, a query over this much data takes 7 seconds on average, and the results are no better.
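The storage figures above can be checked with a little arithmetic. The sketch below assumes each vector component is stored as a 4-byte float32 and reads the sizes in the text as decimal gigabytes; neither assumption is stated in the source.

```python
DIM = 128                      # ColBERT vector dimension, from the text
BYTES_PER_FLOAT32 = 4          # assumption: components stored as float32
NUM_DOCS = 200_000             # MLDR documents, from the text
RAW_BYTES = 2 * 10**9          # about 2 GB of raw text, from the text
TENSOR_BYTES = 320 * 10**9     # Tensor data size, from the text

bytes_per_token = DIM * BYTES_PER_FLOAT32                      # 512 bytes per token vector
tokens_per_doc = TENSOR_BYTES / NUM_DOCS / bytes_per_token     # implied average tokens per document
expansion = TENSOR_BYTES / RAW_BYTES                           # growth relative to the raw text
binary_bytes = TENSOR_BYTES / 32                               # size after 1-bit quantization

print(f"{bytes_per_token} bytes per token vector")
print(f"{tokens_per_doc:.0f} tokens per document on average")
print(f"{expansion:.0f}x expansion over the raw text")
print(f"{binary_bytes / 10**9:.0f} GB after binary quantization")
```

Under these assumptions each document carries roughly 3,000 token vectors of 512 bytes each, which accounts for the 160x expansion, consistent with the "2 orders of magnitude" in the text.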

It is therefore clear that ColBERT is far more beneficial as a Reranker than as a Ranker. The current best RAG retrieval solution is 3-way hybrid search (full-text search + vector + sparse vector) plus a ColBERT Reranker. Some readers may ask: using the ColBERT Reranker requires adding a separate Tensor column, and that column is 2 orders of magnitude larger than the original data set. Is it worth it? First, Infinity provides binary quantization for Tensors. When ColBERT is used as a Reranker, this has little effect on the ranking results, but it reduces the final data to only 1/32 of the original Tensor size. Second, some people will still consider this overhead too high. From the user's perspective, however, it is well worth spending more storage in exchange for higher ranking quality and lower cost (the ranking process does not require a GPU). Finally, I believe that Late Interaction models with slightly lower quality but greatly reduced storage overhead will appear soon. As data infrastructure, Infinity is transparent to these changes, and leaving these trade-offs to users is a wise choice.
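The 1/32 figure follows from storing one bit per dimension instead of a 32-bit float. The sketch below shows one simple way to do this, keeping only the sign of each component. The sign-based scheme and all function names are assumptions for illustration; the source does not describe Infinity's actual quantizer.

```python
import numpy as np

def binarize(tensor: np.ndarray) -> np.ndarray:
    """Keep one bit per dimension (1 if positive) and pack 8 bits into each byte."""
    bits = (tensor > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)

def unpack_signs(packed: np.ndarray, dim: int) -> np.ndarray:
    """Recover a +1 / -1 matrix from the packed bits."""
    bits = np.unpackbits(packed, axis=-1)[..., :dim]
    return bits.astype(np.int32) * 2 - 1

def binary_maxsim(q_packed: np.ndarray, d_packed: np.ndarray, dim: int) -> int:
    """MaxSim computed on the sign vectors instead of the original floats."""
    q = unpack_signs(q_packed, dim)
    d = unpack_signs(d_packed, dim)
    return int((q @ d.T).max(axis=1).sum())

rng = np.random.default_rng(0)
tensor = rng.normal(size=(128, 128)).astype(np.float32)   # 128 tokens x 128 dims
packed = binarize(tensor)
ratio = tensor.nbytes / packed.nbytes
print(ratio)  # 32.0
```

A production implementation would more likely compare the packed bits directly with XOR and popcount rather than unpacking them. The unpacked form is used here only to keep the arithmetic easy to follow.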

The above is Infinity's multi-way recall evaluation on the MLDR data set. Results on other data sets may differ, but the overall conclusion will not change: 3-way hybrid search plus Tensor-based reranking is currently the recall method with the highest-quality search results.

It can be seen that ColBERT and its late interaction model have great application value in RAG scenarios. The work above concerns text-based content generation. Recently, late interaction models have also achieved SOTA results in multi-modal scenarios. One example is ColPali [Reference 9], which changes the RAG workflow, as shown in the figure below.

When RAG faces documents with complex formats, the current SOTA approach uses a document recognition model to identify the document layout, then calls the corresponding model for each identified structure, such as charts and pictures, to convert it into text, which is then saved in various formats into the database that supports RAG. ColPali eliminates these steps and directly uses a multi-modal model to generate embeddings. At question time, answers can be drawn directly from the charts in the document.

The training of ColPali is similar to that of ColBERT: it also uses query-document page pairs to capture the semantic association between the query and the document's multi-modal data, except that it uses PaliGemma [Reference 10] to generate the multi-modal embeddings. Compared with BiPali, which also uses PaliGemma to generate embeddings but does not use the Late Interaction mechanism, the nDCG@5 comparison is 81.3 vs 58.8. This gap is the difference between "excellent" and "cannot work at all".

Therefore, although 4 years have passed since ColBERT appeared, the application of Late Interaction models in RAG has only just begun. It will certainly expand the usage scenarios of RAG and provide high-quality semantic recall for complex RAG scenarios, including multi-modal ones. Infinity is already ready for end-to-end use of these models. You are welcome to follow and star Infinity, https://github.com/infiniflow/infinity, which is committed to becoming the best AI-native database.

References

1. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT, SIGIR 2020.

2. ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction, arXiv:2112.01488, 2021.

4. Efficient Multi-vector Dense Retrieval with Bit Vectors, ECIR 2024.

5. https://huggingface.co/mteb

6. https://huggingface.co/BAAI/bge-m3

7. https://huggingface.co/jinaai/jina-colbert-v1-en

8. https://github.com/infiniflow/infinity/tree/main/python/benchmark/mldr_benchmark

9. ColPali: Efficient Document Retrieval with Vision Language Models, arXiv:2407.01449, 2024.

10. https://github.com/google-research/big_vision/tree/main/big_vision/configs/proj/paligemma
