In the rapidly evolving landscape of artificial intelligence, optimizing large language models (LLMs) is not just about pushing the boundaries of what is possible; it is also about ensuring efficiency and cost-effectiveness.
Prompt compression has emerged as a vital technique for enhancing the performance of these models while minimizing computational expenses. With new research appearing almost weekly, keeping up is challenging, but understanding the fundamentals is essential. This article covers the basics of prompt compression, discusses when you should use it, explains its importance in reducing costs in RAG pipelines, and provides examples using OpenAI's API.
If you want to learn more, check out this course on prompt engineering.

What Is Prompt Compression?
Prompt compression is a technique used in natural language processing (NLP) to optimize the inputs given to LLMs by reducing their length without significantly degrading the quality or relevance of the output. This optimization is crucial because the number of tokens in a query directly affects LLM performance.
Tokens are the basic units of text that LLMs work with, representing words or subwords depending on the language model's tokenizer. Reducing the number of tokens in a prompt is beneficial, and sometimes necessary, for several reasons:

- Token limit constraints: LLMs have a maximum token limit for their inputs. Prompts that exceed this limit may be truncated or rejected.
- Processing efficiency and cost reduction: fewer tokens mean faster processing times and lower costs.
Prompt compression mitigates these issues by reducing the token count while preserving the prompt's effectiveness.
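To make token counts concrete, here is a small sketch that measures a prompt's length with the tiktoken library; the prompt text below is just an example:

import tiktoken

# Load the tokenizer used by the gpt-3.5-turbo family of models.
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

prompt = "Summarize the following paragraph: ..."
token_count = len(encoding.encode(prompt))
print(f"The prompt is {token_count} tokens long.")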
Prompt compression is particularly useful in retrieval augmented generation (RAG) pipelines. RAG pipelines combine information retrieval with text generation and are often used in specialized chatbots and other applications where contextual understanding is critical. These pipelines frequently involve providing extensive conversation histories or retrieved documents as prompts, which leads to high token counts and increased expenses. In such cases, prompt compression is essential to maintain essential context while minimizing costs.
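To see why token counts climb so quickly in these pipelines, here is an illustrative sketch of how a RAG prompt is typically assembled; all variable names and strings below are hypothetical:

# Retrieved documents and the running chat history are concatenated into a
# single prompt, so its length grows with every turn and every new document.
retrieved_docs = ["...text of document 1...", "...text of document 2..."]
chat_history = ["user: ...", "assistant: ..."]
question = "What does the report conclude?"

prompt = (
    "Context:\n" + "\n\n".join(retrieved_docs)
    + "\n\nConversation so far:\n" + "\n".join(chat_history)
    + "\n\nQuestion: " + question
)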
Applicability and Limitations of Prompt Compression
It is important to note that prompt compression is not a universal solution. For instance, assistant models like ChatGPT, which are designed for conversational contexts, may not benefit from aggressive prompt compression. These models often do not charge per token and have integrated chat summarization and memory features to manage conversation history effectively, making compression redundant.
It is also worth noting that, even when working with models that do charge per token, excessive compression can lead to a loss of nuance or important details. Striking the right balance between reducing size and maintaining the integrity of the prompt's meaning is key.

How Does Prompt Compression Work?
Prompt compression techniques can be grouped into three main methods: knowledge distillation, encoding, and filtering. Each technique leverages different strengths to optimize the length and efficiency of LLM prompts.
We will discuss each of these techniques, but you can find a more comprehensive treatment in this paper: Efficient Prompting Methods for Large Language Models: A Survey. Throughout this article, I will refer to it as the "survey paper."

Knowledge Distillation

Knowledge distillation is a machine learning technique in which a smaller, simpler model (the student) learns to reproduce the behavior of a larger, more complex model (the teacher). The technique was originally developed to address the computational challenges of training an ensemble of models. In the context of prompt engineering, knowledge distillation can be used to compress a prompt instead of a model.
This is achieved by learning how to compress the hard prompts within LLMs through soft prompt tuning. For detailed insights, refer to section 3.1 and appendix A.1.1 of the survey paper.
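To give a flavor of the mechanism, here is a minimal, hypothetical sketch of a soft prompt: a small set of learnable embeddings that stand in for a long hard prompt. All names and sizes are illustrative; this is a sketch of the idea, not the survey's implementation.

import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable embeddings prepended to the input in place of a hard prompt."""

    def __init__(self, n_soft_tokens: int, embed_dim: int):
        super().__init__()
        self.soft_tokens = nn.Parameter(torch.randn(n_soft_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) embedded user input.
        batch_size = input_embeds.shape[0]
        prefix = self.soft_tokens.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)

# Training would push the model's output given only the soft tokens toward its
# output given the original hard prompt, distilling the prompt into them.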
Encoding

Interestingly, LLMs are proficient in other "languages," such as Base64, which can be exploited for encoding to reduce the token size of a prompt. For example, the prompt "Translate the following text to French: 'Hello, how are you?'" encoded in Base64 is "VHJhbnNsYXRlIHRoZSBmb2xsb3dpbmcgdGV4dCB0byBGcmVuY2g6ICdIZWxsbywgaG93IGFyZSB5b3U/Jw==". You can try prompting your favorite LLM to test it!
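If you want to reproduce the encoded string yourself, here is a quick sketch using Python's standard base64 module:

import base64

prompt = "Translate the following text to French: 'Hello, how are you?'"

# Encode the prompt's UTF-8 bytes as Base64 text.
encoded = base64.b64encode(prompt.encode("utf-8")).decode("ascii")
print(encoded)
# VHJhbnNsYXRlIHRoZSBmb2xsb3dpbmcgdGV4dCB0byBGcmVuY2g6ICdIZWxsbywgaG93IGFyZSB5b3U/Jw==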
Interestingly, some encoding techniques are also used for model jailbreaking, which involves manipulating an LLM to bypass its safety mechanisms. For more details on encoding methods, see section 3.2 and appendix A.1.2 of the survey paper.
Filtering

While the previous two methods try to compress the whole prompt, filtering techniques focus on eliminating unnecessary parts of it to improve the efficiency of the LLM.
The goal is to retain only the most relevant parts of a prompt. In the paper Selective Context, Li et al. (2023) use a self-information metric to filter out redundant information. In the paper LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, researchers at Microsoft break prompts down into key components and dynamically adjust the compression ratio of each part. For further reading, see section 3.3 and appendix A.1.3 of the survey paper.
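As a rough illustration of the self-information idea, the sketch below scores each token by its surprisal under GPT-2 and keeps only the more informative half. It assumes the transformers and torch packages, and it is a simplified stand-in for Selective Context, not the authors' code.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def token_self_information(text: str):
    """Return (token, surprisal) pairs, where surprisal = -log p(token | prefix)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Predictions for tokens 1..n come from positions 0..n-1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    surprisals = -log_probs[torch.arange(len(targets)), targets]
    tokens = tokenizer.convert_ids_to_tokens(targets.tolist())
    return list(zip(tokens, surprisals.tolist()))

text = "Prompt compression reduces the token count of a prompt while keeping its meaning."
pairs = token_self_information(text)

# Keep tokens whose self-information is above the median (roughly a 50% reduction).
threshold = sorted(s for _, s in pairs)[len(pairs) // 2]
kept = [tok for tok, s in pairs if s >= threshold]
print(tokenizer.convert_tokens_to_string(kept))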
How to Implement Prompt Compression in Python
In this section, I will implement and test the Selective Context algorithm, which is popular and widely considered state-of-the-art. If you only want to try out the algorithm, you don't need to install anything: it is already hosted on the Hugging Face platform.
In the Selective Context web app, you can choose the language of the prompt you want to compress (English or Simplified Chinese). You can also set the compression ratio and select whether to filter out sentences, tokens, or phrases.
Implementing and Testing Selective Context Using the OpenAI API

Now, let's work through the Python implementation. We will also test some compressed prompts using the GPT-3.5-Turbo-0125 model.
First, we need to install the selective-context package:

pip install selective-context

We also need to download the en_core_web_sm model from spaCy, which can be done with the following command:

python -m spacy download en_core_web_sm

Now, we initialize the SelectiveContext object. We can choose either curie or gpt-2 as the model, and either en or zh as the language. I will use gpt-2 for this example:

from selective_context import SelectiveContext

sc = SelectiveContext(model_type='gpt-2', lang='en')

We can then call the SelectiveContext object on the text we want to compress. The reduce_ratio parameter sets the compression ratio, and reduce_level selects whether to filter sentences ('sent'), phrases ('phrase'), or tokens ('token'):

context, reduced_content = sc(text, reduce_ratio=0.5, reduce_level='sent')
Original Paragraph and Summary
Below, you can see the API call we will use; we only change the text variable in the format string:

from openai import OpenAI

client = OpenAI(api_key=key)

completion = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=[
        {"role": "user", "content": f"Summarize the following paragraph:\n {text}"}
    ]
)

First, we ask the model to summarize the original, uncompressed paragraph:

[Prompt compression is useful in scenarios such as advanced prompt engineering techniques and retrieval-augmented generation pipelines to reduce token counts and costs while preserving effectiveness and context. However, it is not a universal solution and should be used judiciously, as excessive compression can lead to a loss of important details. It is important to strike a balance between reducing size and maintaining the prompt's meaning.]

As you can see, the model summarized the uncompressed paragraph quite well.

Token-Level Reduction

Now, let's try the token-reduced version. The compressed paragraph looks like this:

[When Should Prompt CompProm compression particularly following Advanced prompt engineering Techniques like chainofthought prompting while highly lengthy prompts reach thousands tokensThis increases processing costs and may exceed token limits of certain modelsProm compression mit these token count while prompt effectivenessRetri-au generationRAG pipelines R combine information retrieval text specialized chat contextual understandingThese pipelines frequently involve providing extensive conversation histories retrieved as prompts leading high token counts increased expensesProm compression essential such essential context while costsApplic prompt compression It prompt universal judFor instance assistant models like ChatGPT designed convers contexts benefit aggressive prompt compressionThese models often do charge per token integrated chat summar memory features conversation history effectively compression redundantIt even working models charge per token excessive compression could nuance importantSt reducing size maintaining prompt's meaning is]

The prompt is 162 tokens long, and here is the model summary:

[Prompt compression is recommended following advanced prompt engineering techniques like chain of thought prompting, especially when dealing with lengthy prompts that reach thousands of tokens. This helps reduce processing costs and ensure that the prompt remains effective for models like RAG pipelines, which involve extensive conversation histories. Prompt compression is essential for maintaining context while keeping costs low, particularly for models like ChatGPT that charge per token and benefit from aggressive compression. However, excessive compression could lead to loss of important nuances, so it is important to strike a balance between reducing size and maintaining the prompt's meaning.]

The summary makes an important mistake: it states that ChatGPT charges per token and benefits from aggressive compression, while the original text says the opposite.

Sentence-Level Reduction

Next, the sentence-level reduction. The compressed paragraph looks like this:

[When Should We Use Prompt Compression? This increases processing times and costs and may exceed token limits of certain models. These pipelines frequently involve providing extensive conversation histories or retrieved documents as prompts, leading to high token counts and increased expenses. Prompt compression is essential in such cases to maintain essential context while minimizing costs. For instance, assistant models like ChatGPT, designed for conversational contexts, may not benefit from aggressive prompt compression. These models often do not charge per token and have integrated chat summarization and memory features to manage conversation history effectively, making compression redundant. ]

The prompt is 129 tokens long, and here is the model summary:

[Prompt compression is necessary when processing times, costs, and token limits may be exceeded in pipelines involving extensive conversation histories or retrieved documents as prompts. However, for certain models like ChatGPT, designed for conversational contexts, aggressive prompt compression may not be beneficial as they already have features to manage conversation history effectively without the need for compression.]

This summary contains no mistakes, but it misses the RAG discussion.

Phrase-Level Reduction

Finally, the phrase-level reduction. The compressed paragraph looks like this:

[When Should Prompt Compression Prompt compression particularly beneficial Advanced prompt engineering techniques Techniques like chainofthought prompting while highly lengthy prompts reach thousands tokens This increases processing costs and may exceed token limits of certain models Prompt compression these issues token count while the prompt's effectiveness Retrieval-augmented generation (RAG) pipelines RAG pipelines combine information retrieval text generation specialized chatbots contextual understanding These pipelines frequently involve providing extensive conversation histories or retrieved as prompts leading high token counts increased expenses Prompt compression essential such cases to maintain essential context while costs Applicability prompt compression It's For instance assistant models like ChatGPT designed conversational contexts may benefit aggressive prompt compression These models often do charge per token have integrated chat summarization memory features manage conversation history effectively making compression redundant It even working models charge per token excessive compression could nuance important details reducing size maintaining the prompt's meaning is ]

This prompt is 183 tokens long. As the evaluation below shows, the model's summary of it repeated the token-level version's mistake, incorrectly stating that ChatGPT benefits from aggressive compression.
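For convenience, all three compressed variants and their summaries can be produced in one loop. Here is a sketch that assumes the sc object, OpenAI client, and text variable defined earlier, and follows the package's (context, reduced_content) return convention shown above:

# Compress the text at each reduction level and summarize the compressed prompt.
for level in ("token", "phrase", "sent"):
    compressed, removed = sc(text, reduce_ratio=0.5, reduce_level=level)
    completion = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[
            {"role": "user", "content": f"Summarize the following paragraph:\n {compressed}"}
        ],
    )
    print(f"--- {level} ---")
    print(completion.choices[0].message.content)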
Evaluating Prompt Compression

Let's compare the three reduction levels side by side:

| Reduction level | Token count | Observations |
| --- | --- | --- |
| Token-level | 162 | Made an important mistake, stating that ChatGPT benefits from aggressive compression. |
| Sentence-level | 129 | Made no mistakes, but missed some context about RAG pipelines. |
| Phrase-level | 183 | Similar to the token-level summary, incorrectly stated that ChatGPT benefits from aggressive compression. |
Overall, prompt compression can significantly reduce token counts while preserving the main ideas. However, it is crucial to strike the right balance to avoid losing important nuances and context.