Retrieval-Augmented Generation (RAG) enhances AI models by integrating external knowledge into the generation process. However, traditional RAG often fragments documents into chunks, losing key context and hurting accuracy.
Traditional RAG
Traditional RAG methods split documents into smaller chunks for efficient retrieval, but this can strip out essential context. For example, a chunk might state "its more than 3.85 million inhabitants make it the European Union's most populous city" without specifying which city it refers to. This missing context hurts retrieval accuracy.
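To make the failure mode concrete, here is a minimal sketch (plain Python, no retrieval library) of naive sentence-level chunking producing a chunk that no longer names its subject:

```python
# Naive chunking sketch: split a document into sentence-level chunks.
document = (
    "Berlin is the capital and largest city of Germany. "
    "Its more than 3.85 million inhabitants make it the European Union's "
    "most populous city, as measured by population within city limits."
)

# Each sentence becomes an isolated chunk, embedded with no surrounding text.
chunks = [s.strip() + "." for s in document.rstrip(".").split(". ")]

# The second chunk never mentions Berlin, so its embedding cannot tell
# the retriever which city "its" refers to.
print(chunks[1])
```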
Contextual Retrieval Explained
Contextual retrieval addresses this problem by prepending a short, chunk-specific summary to each chunk before it is embedded. The earlier example becomes:
<code>contextualized_chunk = """Berlin is the capital and largest city of Germany, known for being the EU's most populous city within its limits.
Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.
"""</code>
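The transformation itself is mechanical: once the context sentence has been generated, it is simply prepended to the raw chunk before embedding. A minimal sketch (the helper name `contextualize` is illustrative, not from the article):

```python
def contextualize(context: str, chunk: str) -> str:
    """Prepend a chunk-specific summary so the chunk is self-contained when embedded."""
    return f"{context}\n{chunk}"

context = "This chunk describes Berlin, the capital of Germany."
chunk = ("Its more than 3.85 million inhabitants make it the European Union's "
         "most populous city, as measured by population within city limits.")
print(contextualize(context, chunk))
```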
Implementing Contextual Retrieval
This section outlines a step-by-step implementation using an example document.
Step 1: Chunk Creation
<code># Input text for the knowledge base
input_text = """Berlin is the capital and largest city of Germany, both by area and by population.
Its more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.
The city is also one of the states of Germany and is the third smallest state in the country in terms of area.
Paris is the capital and most populous city of France.
It is situated along the Seine River in the north-central part of the country.
The city has a population of over 2.1 million residents within its administrative limits, making it one of Europe's major population centers."""</code>
<code># Split the input text into smaller chunks
test_chunks = [
    'Berlin is the capital and largest city of Germany, both by area and by population.',
    "\n\nIts more than 3.85 million inhabitants make it the European Union's most populous city, as measured by population within city limits.",
    '\n\nThe city is also one of the states of Germany and is the third smallest state in the country in terms of area.',
    '\n\n# Paris is the capital and most populous city of France.',
    '\n\n# It is situated along the Seine River in the north-central part of the country.',
    "\n\n# The city has a population of over 2.1 million residents within its administrative limits, making it one of Europe's major population centers."
]</code>
Step 2: Prompt Template Definition
Define the prompt for generating contextual information (Anthropic's template is used):
<code>from langchain.prompts import ChatPromptTemplate, PromptTemplate, HumanMessagePromptTemplate

# Define the prompt for generating contextual information
anthropic_contextual_retrieval_system_prompt = """<document>
{WHOLE_DOCUMENT}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{CHUNK_CONTENT}
</chunk>
Please give a short succinct context to situate this chunk within the overall document for the purposes of improving search retrieval of the chunk. Answer only with the succinct context and nothing else."""

# Wrap the template so it can be chained with the LLM below
anthropic_contextual_retrieval_final_prompt = ChatPromptTemplate.from_messages([
    HumanMessagePromptTemplate(
        prompt=PromptTemplate(
            input_variables=["WHOLE_DOCUMENT", "CHUNK_CONTENT"],
            template=anthropic_contextual_retrieval_system_prompt,
        )
    )
])</code>
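Because the template is plain text with two placeholders, it can be sanity-checked without any API call. A standalone sketch using `str.format` on an equivalent copy of the template:

```python
# Standalone copy of the contextual-retrieval template for a quick check.
template = """<document>
{WHOLE_DOCUMENT}
</document>
Here is the chunk we want to situate within the whole document
<chunk>
{CHUNK_CONTENT}
</chunk>
Please give a short succinct context to situate this chunk within the overall document."""

# Render with toy values to confirm both placeholders are filled.
rendered = template.format(
    WHOLE_DOCUMENT="Berlin is the capital of Germany.",
    CHUNK_CONTENT="It is the capital.",
)
print(rendered)
```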
Step 3: LLM Initialization
<code>import os
from langchain_openai import ChatOpenAI

# Set the API key (replace with your own, or load it from the environment)
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_API_KEY"

# Initialize the model instance
llm_model_instance = ChatOpenAI(
    model="gpt-4o",
)</code>
Chain the prompt and the LLM:
<code>from langchain_core.output_parsers import StrOutputParser

# Chain the prompt with the model instance
contextual_chunk_creation = anthropic_contextual_retrieval_final_prompt | llm_model_instance | StrOutputParser()</code>
Generate the context for each chunk:
<code># Process each chunk and generate contextual information
for test_chunk in test_chunks:
    res = contextual_chunk_creation.invoke({
        "WHOLE_DOCUMENT": input_text,
        "CHUNK_CONTENT": test_chunk
    })
    print(res)
    print('-----')</code>
Reranking for Enhanced Precision
Reranking further refines retrieval by prioritizing the most relevant chunks. This improves accuracy and reduces cost. In Anthropic's tests, reranking reduced the retrieval error rate from 5.7% to 1.9%, a 67% improvement.
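Reranking orders the retrieved chunks by relevance to the query before they reach the model. A toy sketch using lexical overlap as the score (illustrative only; a real system would use a trained reranker such as a cross-encoder):

```python
# Toy reranker: score each chunk by word overlap with the query.
# Illustrative only -- production rerankers are trained models, not word counts.
def rerank(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    query_terms = set(query.lower().split())

    def score(chunk: str) -> int:
        chunk_terms = set(chunk.lower().replace(".", "").replace(",", "").split())
        return len(query_terms & chunk_terms)

    return sorted(chunks, key=score, reverse=True)[:top_k]

retrieved = [
    "Paris is the capital and most populous city of France.",
    "Berlin is the capital and largest city of Germany.",
    "The city has a population of over 2.1 million residents.",
]
print(rerank("capital of germany", retrieved, top_k=1))
```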
Additional Considerations
For smaller knowledge bases (under 200,000 tokens), including the entire knowledge base directly in the prompt may be more effective than using a retrieval system. In addition, prompt caching (available with Claude) can significantly reduce costs and improve response times.
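Prompt caching works by marking a long, stable prompt prefix (here, the whole document) as cacheable, so repeated chunk-context requests reuse it across calls. A sketch of how such a request payload is structured for Anthropic's Messages API (the field names follow the `anthropic` SDK's prompt-caching format; treat the exact shape as an assumption to verify against current documentation):

```python
# Sketch: a Messages API payload where the full document is marked cacheable,
# so repeated chunk-context requests reuse the cached prefix.
whole_document = "Berlin is the capital and largest city of Germany. ..."

request_payload = {
    "model": "claude-3-5-sonnet-20241022",
    "max_tokens": 200,
    "system": [
        {
            "type": "text",
            "text": f"<document>\n{whole_document}\n</document>",
            # Marks this block as a cacheable prefix across calls.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [
        {"role": "user", "content": "Situate this chunk: It has 3.85 million inhabitants."}
    ],
}
print(request_payload["system"][0]["cache_control"]["type"])
```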
Conclusion
Anthropic's contextual retrieval offers a simple yet powerful way to improve RAG systems. The combination of contextual embeddings, BM25, and reranking significantly improves accuracy. Further exploration of other retrieval techniques is recommended.
The above is the detailed content of "Anthropic's Contextual Retrieval: An Implementation Guide". For more information, please follow other related articles on the PHP Chinese website!