Retrieval-Augmented Generation (RAG) applications face a trade-off between two approaches: embed an entire document to preserve broad context, or break it into smaller chunks for more precise retrieval.
Embedding the whole document captures global information but can blur important details, while smaller chunks preserve details but often lose the surrounding context.
Late chunking offers a solution: it keeps the full document context while still splitting the text into smaller, retrieval-friendly chunks.
This article introduces late chunking as a better alternative to traditional naive chunking and walks through its implementation step by step.
In a RAG pipeline, documents are broken into smaller chunks before being embedded and stored in a vector database. Each chunk is embedded independently and used for retrieval at query time. However, this "naive chunking" approach often loses important long-range context.
The problem is that traditional chunking does not consider how information is connected when it splits a document. For example, in a document about Paris, the phrase "the city" may end up in a different chunk from the one that mentions "Paris". Without the full context, the retrieval model struggles to link these references, producing inaccurate results. The problem is even worse in long documents, where critical context is scattered across multiple sections.
Late chunking solves this by changing when the document is split. Instead of breaking the document into chunks first, late chunking embeds the entire document with a long-context model, and only then splits it into smaller chunks.
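To make the difference concrete, here is a minimal, self-contained sketch of the two orders of operations. The toy_token_embeddings helper is purely illustrative (it just generates pseudo-random vectors, not real contextual embeddings); the point is only that naive chunking embeds each chunk in isolation, while late chunking embeds the whole document first and pools per chunk afterwards.
<code>import zlib
import numpy as np

def toy_token_embeddings(text):
    """Stand-in for an embedding model: one vector per word (illustrative only)."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    words = text.split()
    return words, rng.normal(size=(len(words), 8))

def naive_chunking(chunk_texts):
    """Split first, then embed each chunk in isolation: context outside the chunk is lost."""
    return [toy_token_embeddings(chunk)[1].mean(axis=0) for chunk in chunk_texts]

def late_chunking_sketch(document, spans):
    """Embed the whole document once, then mean-pool the token embeddings of each chunk span."""
    _, token_embeddings = toy_token_embeddings(document)  # full-document context
    return [token_embeddings[start:end].mean(axis=0) for start, end in spans]

toy_document = "Berlin is the capital of Germany. The city has 3.85 million inhabitants."
toy_chunks = ["Berlin is the capital of Germany.", "The city has 3.85 million inhabitants."]
toy_spans = [(0, 6), (6, 12)]  # word offsets of each chunk within the document

print(len(naive_chunking(toy_chunks)), "chunk embeddings (naive)")
print(len(late_chunking_sketch(toy_document, toy_spans)), "chunk embeddings (late)")</code>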
Main advantages of late chunking:
Using a long-context model such as Jina's jinaai/jina-embeddings-v2-base-en (which supports up to 8192 tokens), late chunking lets large portions of text be embedded effectively before they are split into chunks.
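If you want to check whether a document actually fits into that window before committing to a single embedding pass, a quick token count with the model's own tokenizer is enough. The snippet below is a small sketch under that assumption; document_text is a placeholder for your own document.
<code>from transformers import AutoTokenizer

# Sketch: count tokens with the model's tokenizer to confirm the document
# fits the 8192-token context window before embedding it in one pass.
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)

document_text = "Berlin is the capital and largest city of Germany."  # placeholder document
n_tokens = len(tokenizer.encode(document_text))
print(f"{n_tokens} tokens; fits in a single pass: {n_tokens <= 8192}")</code>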
Here is a step-by-step guide to implementing late chunking with Jina's long-context embedding model. You can get a Jina API key for free here, and we'll use the following input text as a demonstration:
<code>input_text = """Berlin is the capital and largest city of Germany, both by area and by population. Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits. The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."""</code>
First, use your Jina API key and the helper function below to break the input text into chunks. These chunks come with span annotations, which we will use later to split the document embeddings. Jina's API uses natural boundaries such as paragraph or sentence breaks, so each chunk stays meaningful and retains its intent.
<code>import json
import requests

def custom_tokenize_jina_api(input_text: str):
    # Jina Segmenter API endpoint
    url = 'https://segment.jina.ai/'
    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer ENTER_YOUR_JINA_API_KEY'
    }
    data = {
        "content": input_text,
        "tokenizer": "o200k_base",
        "return_tokens": "true",
        "return_chunks": "true",
        "max_chunk_length": "1000"
    }
    # Make the API request
    response = requests.post(url, headers=headers, json=data)
    response_data = response.json()
    chunks = response_data.get("chunks", [])
    # Build (start, end) token-span annotations, one per chunk
    i = 1
    j = 1
    span_annotations = []
    for x in response_data['tokens']:
        if j == 1:
            j = len(x)
        else:
            j = len(x) + i
        span_annotations.append((i, j))
        i = j
    return chunks, span_annotations

chunks, span_annotations = custom_tokenize_jina_api(input_text)
print(chunks)
print(span_annotations)</code>
<code>['Berlin is the capital and largest city of Germany, both by area and by population.\n\n',
 "Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.\n\n",
 'The city is also one of the states of Germany, and is the third smallest state in the country in terms of area.']
[(1, 17), (17, 44), (44, 69)]</code>
First, tokenize the entire document with a tokenizer compatible with the long-context model, such as Jina's embeddings-v2-base-en. Next, create embeddings for each token using the long-context transformer model. This means every token in the document gets its own embedding, capturing its meaning in the context of the full document.
<code>from transformers import AutoModel
from transformers import AutoTokenizer

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)

inputs = tokenizer(input_text, return_tensors='pt')
model_output = model(**inputs)
model_output[0].shape</code>
<code>torch.Size([1, 71, 768])  # 71 is the number of tokens in the entire document</code>
Once you have token embeddings for the entire document, you can perform late chunking. Use the span annotations from step one to split these tokens into smaller chunks. Then apply mean pooling to average the token embeddings within each chunk, producing a single embedding per chunk. We now have chunk embeddings that carry the contextual information of the entire document.
<code>def late_chunking(
    model_output: 'BatchEncoding', span_annotation: list, max_length=None
):
    token_embeddings = model_output[0]
    outputs = []
    for embeddings, annotations in zip(token_embeddings, span_annotation):
        if max_length is not None:
            # remove annotations which go beyond the max-length of the model
            annotations = [
                (start, min(end, max_length - 1))
                for (start, end) in annotations
                if start < (max_length - 1)
            ]
        # mean-pool the token embeddings inside each span to get one chunk embedding
        pooled_embeddings = [
            embeddings[start:end].sum(dim=0) / (end - start)
            for start, end in annotations
            if (end - start) >= 1
        ]
        pooled_embeddings = [
            embedding.detach().cpu().numpy() for embedding in pooled_embeddings
        ]
        outputs.append(pooled_embeddings)
    return outputs</code>
<code>embeddings = late_chunking(model_output, [span_annotations])[0]
len(embeddings)</code>
<code>3  # matches the number of chunks from step 1</code>
To understand the advantages of late chunking, let's compare it with traditional chunking:
<code>embeddings_traditional_chunking = model.encode(chunks)</code>
<code>import numpy as np

cos_sim = lambda x, y: np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

q = "Berlin"
berlin_embedding = model.encode(q)

print(q)
print('\n')
for chunk, new_embedding, trad_embeddings in zip(chunks, embeddings, embeddings_traditional_chunking):
    print(chunk.strip())
    print(f'Late chunking:', cos_sim(berlin_embedding, new_embedding))
    print(f'Traditional chunking:', cos_sim(berlin_embedding, trad_embeddings))
    print("------------------------------------------------------------------")</code>
<code>Berlin

Berlin is the capital and largest city of Germany, both by area and by population.
Late chunking: 0.84954596
Traditional chunking: 0.84862185
------------------------------------------------------------------
Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.
Late chunking: 0.82489026
Traditional chunking: 0.70843375
------------------------------------------------------------------
The city is also one of the states of Germany, and is the third smallest state in the country in terms of area.
Late chunking: 0.84980094
Traditional chunking: 0.7534553
------------------------------------------------------------------</code>
As you can see for the second and third chunks, traditional chunking yields similarity scores of 0.70–0.75 against the query "Berlin". With late chunking, which maintains the context of the entire document, these scores rise to 0.82–0.85. This shows that late chunking does a better job of preserving context and creating more meaningful embeddings, which leads to more accurate retrieval results.
Late chunking is a significant improvement for document retrieval systems, especially in RAG pipelines. By waiting until the entire document is embedded before splitting it, late chunking preserves the full context in each chunk, which leads to more accurate and meaningful embeddings.
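Using these embeddings at query time is straightforward: store the pooled chunk embeddings, embed the incoming query, and rank chunks by cosine similarity. The helper below is an illustrative sketch that reuses the chunks, embeddings, and model variables from the walkthrough above; retrieve_top_k is not part of any library.
<code>import numpy as np

def retrieve_top_k(query, chunk_texts, chunk_embeddings, model, k=2):
    """Rank late-chunked embeddings against a query by cosine similarity (illustrative helper)."""
    query_embedding = model.encode(query)
    scores = [
        np.dot(query_embedding, emb) / (np.linalg.norm(query_embedding) * np.linalg.norm(emb))
        for emb in chunk_embeddings
    ]
    ranked = sorted(zip(scores, chunk_texts), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]

for score, chunk in retrieve_top_k("How many people live in Berlin?", chunks, embeddings, model):
    print(round(float(score), 4), chunk.strip())</code>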