
Anthropic's Contextual Retrieval: An Implementation Guide

William Shakespeare
Published: 2025-03-02 09:34:12

Retrieval-Augmented Generation (RAG) enhances AI models by integrating external knowledge. However, traditional RAG often fragments documents, losing critical context and hurting accuracy. Anthropic's Contextual Retrieval addresses this by adding a concise contextual explanation to each document chunk before embedding. This significantly reduces retrieval errors and improves downstream task performance. This article explains Contextual Retrieval and walks through its implementation.


Contextual Retrieval explained

Traditional RAG approaches split documents into smaller chunks for easier retrieval, but this can strip out essential context. For example, a chunk might state "its more than 3.85 million inhabitants make it the EU's most populous city" without ever specifying which city. This lack of context hurts retrieval accuracy.

Contextual Retrieval solves this by prepending a short, chunk-specific summary to each chunk before embedding. The previous example becomes:

<code>contextualized_chunk = """Berlin is the capital and largest city of Germany, known for being the EU's most populous city within its limits.
Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.
"""</code>
Anthropic's internal tests across diverse datasets (codebases, scientific papers, fiction) show that Contextual Retrieval reduces retrieval errors by up to 49% when contextualized embeddings are paired with contextual BM25.
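As a minimal illustration of why this works (pure Python; the overlap score is a toy stand-in for BM25, and all names here are hypothetical), prepending the generated context lets lexical matching find a chunk that the query terms would otherwise miss:

```python
# Sketch: prepend the LLM-generated context to the chunk before indexing.
# A query term like "Berlin" only matches once the context is added.

def contextualize(chunk: str, context: str) -> str:
    """Prepend the generated context to the chunk before embedding/indexing."""
    return f"{context}\n{chunk}"

def term_overlap_score(query: str, document: str) -> int:
    """Toy stand-in for BM25: count query terms present in the document."""
    doc_terms = set(document.lower().split())
    return sum(1 for term in query.lower().split() if term in doc_terms)

raw_chunk = ("Its more than 3.85 million inhabitants make it the "
             "European Union's most populous city.")
context = "Describes the population of Berlin in Germany."

query = "Berlin population"
print(term_overlap_score(query, raw_chunk))                          # 0: no match
print(term_overlap_score(query, contextualize(raw_chunk, context)))  # 2: both terms match
```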


实现上下文检索

This section outlines a step-by-step implementation using a sample document.

Step 1: Knowledge base definition

Define the input text that will serve as the knowledge base:
<code># Input text for the knowledge base
input_text = """Berlin is the capital and largest city of Germany, both by area and by population.
Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.
The city is also one of the states of Germany and is the third smallest state in the country in terms of area.
Paris is the capital and most populous city of France.
It is situated along the Seine River in the north-central part of the country.
The city has a population of over 2.1 million residents within its administrative limits, making it one of Europe's major population centers."""</code>
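In this toy example each sentence sits on its own line, so the chunks listed in the next step can be produced with a trivial newline split (a sketch only; a real pipeline would use a proper text splitter):

```python
# Sketch: split the sample document into one chunk per sentence.
# Each sentence is on its own line here, so splitting on newlines suffices.
input_text = """Berlin is the capital and largest city of Germany, both by area and by population.
Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.
The city is also one of the states of Germany and is the third smallest state in the country in terms of area.
Paris is the capital and most populous city of France.
It is situated along the Seine River in the north-central part of the country.
The city has a population of over 2.1 million residents within its administrative limits, making it one of Europe's major population centers."""

chunks = [line.strip() for line in input_text.splitlines() if line.strip()]
print(len(chunks))  # 6
```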

Step 2: Chunk creation

Split the document into smaller, self-contained chunks (here, sentences):

<code># Splitting the input text into smaller chunks
test_chunks = [
    'Berlin is the capital and largest city of Germany, both by area and by population.',
    "\n\nIts more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.",
    '\n\nThe city is also one of the states of Germany and is the third smallest state in the country in terms of area.',
    '\n\n# Paris is the capital and most populous city of France.',
    '\n\n# It is situated along the Seine River in the north-central part of the country.',
    "\n\n# The city has a population of over 2.1 million residents within its administrative limits, making it one of Europe's major population centers."
]</code>

Step 3: Prompt template definition

Define the prompt used to generate each chunk's context (Anthropic's suggested template):

<code>from langchain.prompts import ChatPromptTemplate, PromptTemplate, HumanMessagePromptTemplate

# Define the prompt for generating contextual information
anthropic_contextual_retrieval_system_prompt = """<document>
{WHOLE_DOCUMENT}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{CHUNK_CONTENT}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""

# Build the final prompt template from the system prompt string
anthropic_contextual_retrieval_final_prompt = ChatPromptTemplate.from_messages([
    HumanMessagePromptTemplate.from_template(anthropic_contextual_retrieval_system_prompt)
])</code>
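To see what the model actually receives, the template can be rendered by hand with plain string formatting (a shortened stand-in for the full template above; the document and chunk values are illustrative):

```python
# Sketch: preview the filled-in prompt using ordinary str.format().
prompt_template = """<document>
{WHOLE_DOCUMENT}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{CHUNK_CONTENT}
</chunk>
Please give a short succinct context to situate this chunk within the overall document."""

rendered = prompt_template.format(
    WHOLE_DOCUMENT="Berlin is the capital of Germany. It has 3.85 million inhabitants.",
    CHUNK_CONTENT="It has 3.85 million inhabitants.",
)
print(rendered)
```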
Step 4: LLM initialization

Choose an LLM (here, OpenAI's GPT-4o):
<code>import os
from langchain_openai import ChatOpenAI

# Load environment variables
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

# Initialize the model instance
llm_model_instance = ChatOpenAI(
    model="gpt-4o",
)</code>
Step 5: Chain creation

Connect the prompt and the LLM:
<code>from langchain_core.output_parsers import StrOutputParser

# Chain the prompt with the model instance
contextual_chunk_creation = anthropic_contextual_retrieval_final_prompt | llm_model_instance | StrOutputParser()</code>
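LangChain's `|` operator builds a Runnable pipeline; conceptually it is just function composition, as in this plain-Python analogue (all names hypothetical, the lambdas are stand-ins for the real prompt, model, and parser):

```python
# Sketch: the `prompt | llm | parser` chain as ordinary function composition.
def pipe(*stages):
    def run(value):
        for stage in stages:   # feed each stage's output into the next
            value = stage(value)
        return value
    return run

fake_prompt = lambda vars: f"Context for: {vars['CHUNK_CONTENT']}"
fake_llm = lambda prompt: prompt.upper()   # stand-in for the model call
fake_parser = lambda msg: msg.strip()      # stand-in for StrOutputParser

chain = pipe(fake_prompt, fake_llm, fake_parser)
print(chain({"CHUNK_CONTENT": "Berlin"}))  # CONTEXT FOR: BERLIN
```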
Step 6: Chunk processing

Generate the context for each chunk:

<code># Process each chunk and generate contextual information
for test_chunk in test_chunks:
    res = contextual_chunk_creation.invoke({
        "WHOLE_DOCUMENT": input_text,
        "CHUNK_CONTENT": test_chunk
    })
    print(res)
    print('-----')</code>

Reranking for enhanced precision

Reranking further refines retrieval by prioritizing the most relevant chunks, which improves accuracy and reduces cost. In Anthropic's tests, adding reranking lowered the retrieval error rate from 5.7% to 1.9%, a 67% improvement.
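Conceptually, a reranking pass re-scores a retrieved candidate set against the query and keeps only the top-k chunks. A minimal sketch (pure Python; the overlap score is a toy stand-in for a real reranker model, and all names are hypothetical):

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase word tokens, punctuation stripped."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def overlap_score(query: str, text: str) -> int:
    # Toy relevance score: shared terms. A production system would call
    # a cross-encoder or a reranking API here instead.
    return len(tokens(query) & tokens(text))

def rerank(query: str, candidates: list[str], score_fn, k: int = 2) -> list[str]:
    """Sort candidates by score_fn(query, candidate), highest first, keep top k."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:k]

candidates = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Seine flows through Paris.",
]
top = rerank("capital of Germany", candidates, overlap_score, k=1)
print(top)  # ['Berlin is the capital of Germany.']
```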

Additional considerations

For smaller knowledge bases (under 200,000 tokens), including the entire knowledge base directly in the prompt may be more effective than a retrieval system. Additionally, prompt caching (available with Claude) can significantly reduce cost and improve response times.
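A rough size check can guide that decision. The sketch below uses the common ~4-characters-per-token heuristic for English text (an assumption, not a real tokenizer) against the ~200,000-token guideline above:

```python
# Sketch: decide between full-context prompting and a retrieval pipeline
# based on a rough token estimate (~4 chars/token, a heuristic assumption).

def estimate_tokens(text: str) -> int:
    return len(text) // 4

def use_full_context(knowledge_base: str, limit: int = 200_000) -> bool:
    """True if the knowledge base plausibly fits in the prompt directly."""
    return estimate_tokens(knowledge_base) < limit

small_kb = "Berlin is the capital of Germany. " * 100
print(use_full_context(small_kb))  # True: ~850 estimated tokens, well under the limit
```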

Conclusion

Anthropic's Contextual Retrieval offers a simple yet powerful way to improve RAG systems. The combination of contextual embeddings, contextual BM25, and reranking substantially improves accuracy. Further exploration of other retrieval techniques is recommended.
