Retrieval-Augmented Generation (RAG) applications face a trade-off between two approaches: embed an entire document to preserve broad context, or break it into smaller chunks for more precise retrieval.
Embedding the whole document captures global information but can blur important details, while smaller chunks preserve details but often lose the surrounding context.
Late chunking offers a solution: it keeps the full document context while still splitting the text into smaller, retrieval-friendly chunks.
This article introduces late chunking as a better alternative to traditional naive chunking and walks through its implementation step by step.
In a RAG pipeline, documents are broken into smaller chunks before being embedded and stored in a vector database. Each chunk is embedded independently and used for retrieval at query time. However, this "naive chunking" approach often loses important long-range context.
The problem is that traditional chunking does not consider how information is connected when it splits a document. For example, in a document about Paris, the phrase "the city" may end up in a different chunk from the one that mentions "Paris". Without the full context, the retrieval model struggles to link these references, producing inaccurate results. The problem is even worse in long documents, where critical context is scattered across multiple sections.
Late chunking solves this by changing when the document is split. Instead of breaking the document into chunks first, late chunking embeds the entire document with a long-context model, and only then splits it into smaller chunks.
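To make the difference concrete, here is a minimal, self-contained sketch of the two orders of operations. The toy_token_embeddings helper is purely illustrative (it just generates pseudo-random vectors, not real contextual embeddings); the point is only that naive chunking embeds each chunk in isolation, while late chunking embeds the whole document first and pools per chunk afterwards.
<code>import zlib
import numpy as np

def toy_token_embeddings(text):
    """Stand-in for an embedding model: one vector per word (illustrative only)."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    words = text.split()
    return words, rng.normal(size=(len(words), 8))

def naive_chunking(chunk_texts):
    """Split first, then embed each chunk in isolation: context outside the chunk is lost."""
    return [toy_token_embeddings(chunk)[1].mean(axis=0) for chunk in chunk_texts]

def late_chunking_sketch(document, spans):
    """Embed the whole document once, then mean-pool the token embeddings of each chunk span."""
    _, token_embeddings = toy_token_embeddings(document)  # full-document context
    return [token_embeddings[start:end].mean(axis=0) for start, end in spans]

toy_document = "Berlin is the capital of Germany. The city has 3.85 million inhabitants."
toy_chunks = ["Berlin is the capital of Germany.", "The city has 3.85 million inhabitants."]
toy_spans = [(0, 6), (6, 12)]  # word offsets of each chunk within the document

print(len(naive_chunking(toy_chunks)), "chunk embeddings (naive)")
print(len(late_chunking_sketch(toy_document, toy_spans)), "chunk embeddings (late)")</code>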
Main advantages of late chunking:
Using a long-context model such as Jina's jinaai/jina-embeddings-v2-base-en (which supports up to 8192 tokens), late chunking lets large portions of text be embedded effectively before they are split into chunks.
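If you want to check whether a document actually fits into that window before committing to a single embedding pass, a quick token count with the model's own tokenizer is enough. The snippet below is a small sketch under that assumption; document_text is a placeholder for your own document.
<code>from transformers import AutoTokenizer

# Sketch: count tokens with the model's tokenizer to confirm the document
# fits the 8192-token context window before embedding it in one pass.
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)

document_text = "Berlin is the capital and largest city of Germany."  # placeholder document
n_tokens = len(tokenizer.encode(document_text))
print(f"{n_tokens} tokens; fits in a single pass: {n_tokens <= 8192}")</code>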
Here is a step-by-step guide to implementing late chunking with Jina's long-context embedding model. You can get a Jina API key for free here, and we'll use the following input text as a demonstration:
<code>input_text = """Berlin is the capital and largest city of Germany, both by area and by population. Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits. The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."""</code>
First, use your Jina API key and the helper function below to break the input text into chunks. These chunks come with span annotations, which we will use later to split the document embeddings. Jina's API uses natural boundaries such as paragraph or sentence breaks, so each chunk stays meaningful and retains its intent.
<code>import json
import requests

def custom_tokenize_jina_api(input_text: str):
    # Jina Segmenter API endpoint
    url = 'https://segment.jina.ai/'
    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer ENTER_YOUR_JINA_API_KEY'
    }
    data = {
        "content": input_text,
        "tokenizer": "o200k_base",
        "return_tokens": "true",
        "return_chunks": "true",
        "max_chunk_length": "1000"
    }
    # Make the API request
    response = requests.post(url, headers=headers, json=data)
    response_data = response.json()
    chunks = response_data.get("chunks", [])
    # Build (start, end) token-span annotations, one per chunk
    i = 1
    j = 1
    span_annotations = []
    for x in response_data['tokens']:
        if j == 1:
            j = len(x)
        else:
            j = len(x) + i
        span_annotations.append((i, j))
        i = j
    return chunks, span_annotations

chunks, span_annotations = custom_tokenize_jina_api(input_text)
print(chunks)
print(span_annotations)</code>
<code>['Berlin is the capital and largest city of Germany, both by area and by population.\n\n',
 "Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.\n\n",
 'The city is also one of the states of Germany, and is the third smallest state in the country in terms of area.']
[(1, 17), (17, 44), (44, 69)]</code>
First, tokenize the entire document with a tokenizer compatible with the long-context model, such as Jina's embeddings-v2-base-en. Next, create embeddings for each token using the long-context transformer model. This means every token in the document gets its own embedding, capturing its meaning in the context of the full document.
<code>from transformers import AutoModel
from transformers import AutoTokenizer

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)

inputs = tokenizer(input_text, return_tensors='pt')
model_output = model(**inputs)
model_output[0].shape</code>
<code>torch.Size([1, 71, 768])  # 71 is the number of tokens in the entire document</code>
Once you have token embeddings for the entire document, you can perform late chunking. Use the span annotations from step one to split these tokens into smaller chunks. Then apply mean pooling to average the token embeddings within each chunk, producing a single embedding per chunk. We now have chunk embeddings that carry the contextual information of the entire document.
<code>def late_chunking(
    model_output: 'BatchEncoding', span_annotation: list, max_length=None
):
    token_embeddings = model_output[0]
    outputs = []
    for embeddings, annotations in zip(token_embeddings, span_annotation):
        if max_length is not None:
            # remove annotations which go beyond the max-length of the model
            annotations = [
                (start, min(end, max_length - 1))
                for (start, end) in annotations
                if start < (max_length - 1)
            ]
        # mean-pool the token embeddings inside each span to get one chunk embedding
        pooled_embeddings = [
            embeddings[start:end].sum(dim=0) / (end - start)
            for start, end in annotations
            if (end - start) >= 1
        ]
        pooled_embeddings = [
            embedding.detach().cpu().numpy() for embedding in pooled_embeddings
        ]
        outputs.append(pooled_embeddings)
    return outputs</code>
<code>embeddings = late_chunking(model_output, [span_annotations])[0]
len(embeddings)</code>
<code>3  # matches the number of chunks from step 1</code>
To understand the advantages of late chunking, let's compare it with traditional chunking:
<code>embeddings_traditional_chunking = model.encode(chunks)</code>
<code>import numpy as np

cos_sim = lambda x, y: np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

q = "Berlin"
berlin_embedding = model.encode(q)

print(q)
print('\n')
for chunk, new_embedding, trad_embeddings in zip(chunks, embeddings, embeddings_traditional_chunking):
    print(chunk.strip())
    print(f'Late chunking:', cos_sim(berlin_embedding, new_embedding))
    print(f'Traditional chunking:', cos_sim(berlin_embedding, trad_embeddings))
    print("------------------------------------------------------------------")</code>
<code>Berlin

Berlin is the capital and largest city of Germany, both by area and by population.
Late chunking: 0.84954596
Traditional chunking: 0.84862185
------------------------------------------------------------------
Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.
Late chunking: 0.82489026
Traditional chunking: 0.70843375
------------------------------------------------------------------
The city is also one of the states of Germany, and is the third smallest state in the country in terms of area.
Late chunking: 0.84980094
Traditional chunking: 0.7534553
------------------------------------------------------------------</code>
As you can see for the second and third chunks, traditional chunking yields similarity scores of 0.70–0.75 against the query "Berlin". With late chunking, which maintains the context of the entire document, these scores rise to 0.82–0.85. This shows that late chunking does a better job of preserving context and creating more meaningful embeddings, which leads to more accurate retrieval results.
Late chunking is a significant improvement for document retrieval systems, especially in RAG pipelines. By waiting until the entire document is embedded before splitting it, late chunking preserves the full context in each chunk, which leads to more accurate and meaningful embeddings.
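Using these embeddings at query time is straightforward: store the pooled chunk embeddings, embed the incoming query, and rank chunks by cosine similarity. The helper below is an illustrative sketch that reuses the chunks, embeddings, and model variables from the walkthrough above; retrieve_top_k is not part of any library.
<code>import numpy as np

def retrieve_top_k(query, chunk_texts, chunk_embeddings, model, k=2):
    """Rank late-chunked embeddings against a query by cosine similarity (illustrative helper)."""
    query_embedding = model.encode(query)
    scores = [
        np.dot(query_embedding, emb) / (np.linalg.norm(query_embedding) * np.linalg.norm(emb))
        for emb in chunk_embeddings
    ]
    ranked = sorted(zip(scores, chunk_texts), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]

for score, chunk in retrieve_top_k("How many people live in Berlin?", chunks, embeddings, model):
    print(round(float(score), 4), chunk.strip())</code>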