Late Chunking for RAG: Implementation With Jina AI

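Retrieval-augmented generation (RAG) applications face a recurring trade-off: embed the entire document for better context, or break it into smaller chunks for more precise retrieval.

Embedding the entire document captures global information, while shorter chunks preserve detail but can lose the overall context.

Late chunking offers a way out: it keeps the full document context while still splitting the document into smaller, easier-to-handle chunks.

This article introduces late chunking as a better alternative to traditional naive chunking and walks through its implementation step by step.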

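Naive Chunking and Its Limitations

In a RAG pipeline, documents are broken into smaller chunks before being embedded and stored in a vector database. Each chunk is processed independently and used for retrieval at query time. However, this "naive chunking" approach often loses important long-range context.

The problem is that traditional chunking does not consider how pieces of information relate to each other when it segments a document. For example, in a document about Paris, the phrase "the city" may land in a different chunk from the one that contains "Paris". Without the full context, the retrieval model can struggle to connect such references, which leads to inaccurate results. The problem is worse in long documents, where important context is scattered across multiple sections.

Late Chunking: Preserving Context in Document Splitting

Late chunking solves this problem by changing when the document is split. Instead of splitting the document into chunks first, late chunking embeds the entire document with a long-context model. Only then is the document split into smaller chunks, so each chunk's embedding is built from token embeddings that have already seen the whole document.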
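
To make the lost-reference problem concrete, here is a minimal sketch of naive, sentence-level chunking. The `naive_chunk` helper and the sample text are invented for this illustration and are not part of any library. The sketch only shows that, after splitting, the later chunks no longer contain the word they refer to.

```python
import re
from typing import List


def naive_chunk(text: str) -> List[str]:
    """Split text into one chunk per sentence, with no shared context."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]


# Hypothetical sample text, written only for this illustration.
document = (
    "Paris is the capital of France. "
    "The city is known for its museums. "
    "It hosts millions of visitors every year."
)

chunks = naive_chunk(document)
for index, chunk in enumerate(chunks):
    # Only the first chunk names the subject; the others rely on "The city" / "It".
    print(index, "Paris" in chunk, repr(chunk))
```

An embedding computed from the second or third chunk alone has no access to the word "Paris", which is the gap late chunking is meant to close.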

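Key advantages of late chunking:

With a long-context model such as Jina's jinaai/jina-embeddings-v2-base-en (which supports up to 8,192 tokens), late chunking can embed large portions of text effectively before they are split into chunks.

Implementing Late Chunking

This is a step-by-step guide to implementing late chunking with Jina's long-context embedding model. A Jina API key is available for free from Jina AI. We will use the following input text as a demo: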

<code>input_text = """Berlin is the capital and largest city of Germany, both by area and by population.
Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.
The city is also one of the states of Germany, and is the third smallest state in the country in terms of area."""</code>

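Step 1: Get the Chunks and Span Annotations

First, use your Jina API key and the helper function below to split the input text into chunks. The chunks come with span annotations, which are used later to split the document's token embeddings. Jina's API uses natural boundaries such as paragraph and sentence breaks so that each chunk is meaningful and keeps its meaning.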

<code>import json
import requests

def custom_tokenize_jina_api(input_text: str):
    url = 'https://segmenter.jina.ai/'  # Jina Segmenter API endpoint
    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer ENTER_YOUR_JINA_API_KEY'
    }
    data = {
        "content": input_text,
        "tokenizer": "o200k_base",
        "return_tokens": "true",
        "return_chunks": "true",
        "max_chunk_length": "1000"
    }
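    # Make the API request and fail early on HTTP errors
    response = requests.post(url, headers=headers, json=data)
    response.raise_for_status()
    response_data = response.json()
    chunks = response_data.get("chunks", [])
    # Build one (start, end) token span per chunk; response_data['tokens']
    # holds one token list per chunk. Counting starts at 1, which leaves
    # index 0 for the embedding model's leading special token.
    i = 1
    j = 1
    span_annotations = []
    for x in response_data['tokens']:
        if j == 1:
            # First chunk: the span ends at the chunk's token count
            j = len(x)
        else:
            # Later chunks: the span starts where the previous one ended
            j = len(x) + i
        span_annotations.append((i, j))
        i = j
    return chunks, span_annotations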
chunks, span_annotations = custom_tokenize_jina_api(input_text)

print(chunks)
print(span_annotations)</code>

<code>['Berlin is the capital and largest city of Germany, both by area and by population.\n\n', "Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.\n\n", 'The city is also one of the states of Germany, and is the third smallest state in the country in terms of area.']
[(1, 17), (17, 44), (44, 69)]</code>
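
The request above counts tokens with `o200k_base`, which is not the embedding model's own tokenizer, so the spans may not line up exactly with the model's tokens. One possible alternative, sketched below, derives token spans from character offsets instead. It assumes you already have per-token character offsets, such as those a Hugging Face fast tokenizer returns with `return_offsets_mapping=True`, that special tokens carry the empty range `(0, 0)`, and that the chunks concatenate back to the original text. The helper names `chunk_char_spans` and `char_spans_to_token_spans` are hypothetical, and the offsets in the example are written by hand rather than produced by a real tokenizer.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def chunk_char_spans(chunks: List[str]) -> List[Span]:
    """Character (start, end) range of each chunk, assuming the chunks
    concatenate back to the original text."""
    spans, position = [], 0
    for chunk in chunks:
        spans.append((position, position + len(chunk)))
        position += len(chunk)
    return spans


def char_spans_to_token_spans(
    char_spans: List[Span], token_offsets: List[Span]
) -> List[Span]:
    """Map chunk character ranges to half-open token index ranges.

    token_offsets[i] is the (start, end) character range of token i.
    Tokens with an empty range (assumed to be special tokens) are skipped.
    """
    token_spans = []
    for chunk_start, chunk_end in char_spans:
        inside = [
            index
            for index, (tok_start, tok_end) in enumerate(token_offsets)
            if tok_end > tok_start and chunk_start <= tok_start < chunk_end
        ]
        if inside:
            token_spans.append((inside[0], inside[-1] + 1))
    return token_spans


# Hand-written toy offsets for "Hi there. Bye now." (not real tokenizer output).
toy_chunks = ["Hi there. ", "Bye now."]
toy_offsets = [(0, 0), (0, 2), (3, 8), (8, 9), (10, 13), (14, 17), (17, 18), (0, 0)]
toy_token_spans = char_spans_to_token_spans(chunk_char_spans(toy_chunks), toy_offsets)
print(toy_token_spans)  # [(1, 4), (4, 7)]
```

Spans produced this way are half-open, so they can be used directly as slice bounds on a token embedding tensor.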

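Step 2: Tokenize the Text and Generate Token-Level Document Embeddings

First, tokenize the entire document with a tokenizer that is compatible with the long-context model, such as Jina's jina-embeddings-v2-base-en. Then use the long-context transformer model to create an embedding for each token. This means that every word or token in the document gets its own embedding that captures its meaning.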

<code>from transformers import AutoModel
from transformers import AutoTokenizer

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
inputs = tokenizer(input_text, return_tensors='pt')
model_output = model(**inputs)
model_output[0].shape</code>

<code>torch.Size([1, 71, 768]) # 71 is the number of tokens in the entire document</code>

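Step 3: Late Chunking

Once you have token embeddings for the entire document, you can perform late chunking. Use the span annotations from Step 1 to split the tokens into smaller chunks. Then apply mean pooling to average the embeddings within each chunk, which produces a single embedding per chunk. The resulting chunk embeddings carry strong contextual information from the whole document.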

<code>def late_chunking(
    model_output: 'BatchEncoding', span_annotation: list, max_length=None
):
    # Token-level embeddings, shape (batch, num_tokens, hidden_size)
    token_embeddings = model_output[0]
    outputs = []
    for embeddings, annotations in zip(token_embeddings, span_annotation):
        if (
            max_length is not None
        ):  # remove annotations which go beyond the max-length of the model
            annotations = [
                (start, min(end, max_length - 1))
                for (start, end) in annotations
                if start < (max_length - 1)
            ]
        # Mean-pool the token embeddings inside each non-empty span
        pooled_embeddings = [
            embeddings[start:end].sum(dim=0) / (end - start)
            for start, end in annotations
            if (end - start) >= 1
        ]
        pooled_embeddings = [
            embedding.detach().cpu().numpy() for embedding in pooled_embeddings
        ]
        outputs.append(pooled_embeddings)
    return outputs</code>

<code>embeddings = late_chunking(model_output, [span_annotations])[0]
len(embeddings)</code>

<code>3 # matches the number of chunks in Step 1</code>

Step 4: Comparing Late Chunking With Traditional Chunking

To understand the benefit of late chunking, let's compare it with traditional chunking:

<code>embeddings_traditional_chunking = model.encode(chunks)</code>

<code>import numpy as np

cos_sim = lambda x, y: np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
q = "Berlin"
berlin_embedding = model.encode(q)

print(q)
print('\n')
for chunk, new_embedding, trad_embeddings in zip(chunks, embeddings, embeddings_traditional_chunking):
  print(chunk.strip())
  print(f'Late chunking:', cos_sim(berlin_embedding, new_embedding))
  print(f'Traditional chunking:', cos_sim(berlin_embedding, trad_embeddings))
  print("------------------------------------------------------------------")</code>

<code>Berlin

Berlin is the capital and largest city of Germany, both by area and by population.
Late chunking: 0.84954596
Traditional chunking: 0.84862185
------------------------------------------------------------------
Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.
Late chunking: 0.82489026
Traditional chunking: 0.70843375
------------------------------------------------------------------
The city is also one of the states of Germany, and is the third smallest state in the country in terms of area.
Late chunking: 0.84980094
Traditional chunking: 0.7534553
------------------------------------------------------------------</code>

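As the second and third chunks show, traditional chunking yields similarity scores of about 0.71 to 0.75 against the word "Berlin". With late chunking, which keeps the context of the whole document, those scores rise to about 0.82 to 0.85. This suggests that late chunking does a better job of preserving context and producing more meaningful embeddings, which should lead to more accurate retrieval.

Conclusion

Late chunking is a significant improvement for document retrieval systems, especially in RAG pipelines. By waiting until the document has been fully embedded before splitting it, late chunking preserves the full context in each chunk. This leads to more accurate and more meaningful embeddings.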
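
At query time, scores like these are what decide which chunks get retrieved. The following is a minimal sketch of that ranking step in plain Python. The `rank_chunks` helper, the `top_k` parameter, and the vectors are all hypothetical and chosen for illustration; a real pipeline would use the chunk embeddings produced in Step 3 and usually a vector database.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_chunks(
    query_vector: Sequence[float],
    chunk_vectors: List[Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[int, float]]:
    """Return (chunk_index, score) pairs for the top_k most similar chunks."""
    scored = [
        (index, cosine_similarity(query_vector, vector))
        for index, vector in enumerate(chunk_vectors)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 2-dimensional vectors, not real embeddings.
toy_query = [1.0, 0.0]
toy_chunk_vectors = [[0.9, 0.1], [0.0, 1.0], [0.6, 0.8]]
ranking = rank_chunks(toy_query, toy_chunk_vectors)
print([index for index, _ in ranking])  # [0, 2]
```

Higher similarity for the chunks that say "Its" or "The city" instead of "Berlin" means they are more likely to make it into the top results for a query about Berlin.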
