
Prompt Compression: A Guide With Python Examples

Lisa Kudrow
Release: 2025-03-06 12:17:10

In the rapidly evolving landscape of artificial intelligence, optimizing large language models (LLMs) is not just about pushing the boundaries of what's possible but also about ensuring efficiency and cost-effectiveness.

Prompt compression has emerged as a vital technique for enhancing the performance of these models while minimizing computational expenses. With new research emerging almost weekly, keeping up is challenging, but understanding the fundamentals is essential.

This article covers the basics of prompt compression, discusses when it should be used, explains its importance in reducing costs in RAG pipelines, and provides examples using the gpt-3.5-turbo-0125 model through OpenAI's API.

If you want to learn more, check out this course on prompt engineering.

What Is Prompt Compression?

Prompt compression is a technique used in natural language processing (NLP) to optimize the inputs given to LLMs by reducing their length without significantly altering the quality and relevance of the output. This optimization is crucial due to the impact the number of tokens in queries has on LLM performance.

Tokens are the basic units of text LLMs use, representing words or subwords depending on the language model's tokenizer. Reducing the number of tokens in a prompt is beneficial and sometimes necessary for several reasons:

  • Token limit constraints: LLMs have a maximum token limit for inputs. Exceeding this limit can truncate important information, reducing the output's clarity and the model's effectiveness.
  • Processing efficiency and cost reduction: Fewer tokens mean faster processing times and lower costs.
  • Improved response relevance: A human-readable prompt does not always mean a good prompt. Sometimes the prompts we think are good and informative carry unimportant information for the LLMs, such as stop words ("a," "the," "is," etc.).

Prompt compression reduces the token number by employing strategies such as removing redundant information, summarizing key points, or utilizing specialized algorithms to distill the essence of a prompt while minimizing its token count.
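To see how this plays out in practice, the snippet below uses the tiktoken library (the tokenizer behind OpenAI's models) to count tokens before and after a naive compression pass that simply drops common stop words. The example prompt and stop-word list are only illustrations, not part of any compression library:

import tiktoken

# Tokenizer used by the gpt-3.5-turbo family of models
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

prompt = (
    "Please could you kindly summarize the following article about the history "
    "of the internet in a few short sentences for me?"
)

# A toy compression pass: drop a handful of common stop words
stop_words = {"a", "an", "the", "is", "of", "for", "me", "please", "could", "you", "kindly"}
compressed = " ".join(
    word for word in prompt.split()
    if word.lower().strip(".,?!") not in stop_words
)

print(len(enc.encode(prompt)), "tokens before")     # token count of the original prompt
print(len(enc.encode(compressed)), "tokens after")  # token count after removing stop words
print(compressed)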

When Should We Use Prompt Compression?

Let’s explore the scenarios where we could use prompt compression.

Advanced prompt engineering techniques

Techniques like chain-of-thought prompting, while highly effective, often result in lengthy prompts that can reach thousands of tokens. This increases processing times and costs and may exceed the token limits of certain models.

Prompt compression mitigates these issues by reducing the token count while preserving the prompt's effectiveness.

Retrieval-augmented generation (RAG) pipelines

RAG pipelines combine information retrieval with text generation and are often used in specialized chatbots and other applications where contextual understanding is critical. These pipelines frequently need extensive conversation histories or retrieved documents as prompts, leading to high token counts and increased expenses.

Prompt compression is essential in such cases to maintain essential context while minimizing costs.

Applicability and limitations of prompt compression

It's important to note that prompt compression is not a universal solution and should be used judiciously. For instance, assistant models like ChatGPT, designed for conversational contexts, may not benefit from aggressive prompt compression.

These models often do not charge per token and have integrated chat summarization and memory features to manage conversation history effectively, making compression redundant.

It’s also important to note that even when working with models that charge per token, excessive compression could lead to a loss of nuance or important details. Striking the right balance between reducing size and maintaining the integrity of the prompt’s meaning is key.

How Does Prompt Compression Work?

Prompt compression techniques can be categorized into three main methods: knowledge distillation, encoding, and filtering. Each technique leverages different strengths to optimize the length and efficiency of prompts for LLMs.

While we’ll be talking about each of these techniques, you can find a more comprehensive approach in this paper: Efficient Prompting Methods for Large Language Models: A Survey. Throughout this article, I’ll be referring to this paper as the “survey paper.”

Knowledge distillation

Knowledge distillation is a technique in the field of machine learning, first introduced by Hinton et al. (2015), where a smaller, simpler model (the student) is trained to replicate the behavior of a larger, more complex model (the teacher).

This technique was initially developed to address the computational challenges of training an ensemble of models. In the context of prompt engineering, knowledge distillation can be used to compress the prompt instead of the model.

This is achieved by learning how to compress the hard prompts within LLMs through soft prompt tuning. For detailed insights, refer to section 3.1 and appendix A.1.1 of the survey paper.

Encoding

Encoding methods transform input texts into vectors, reducing prompt length without losing critical information. These vectors capture the prompts' essential meaning, allowing LLMs to process shorter inputs efficiently.

Interestingly, LLMs are proficient in other languages like Base64, which can be utilized in encoding to reduce the token size of the prompt. For example, the prompt “Translate the following text to French: Hello, how are you?” encoded in Base64 is “VHJhbnNsYXRlIHRoZSBmb2xsb3dpbmcgdGV4dCB0byBGcmVuY2g6ICdIZWxsbywgaG93IGFyZSB5b3UnPw==”. You can try prompting your favorite LLM to test it out!
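You can reproduce this with Python's built-in base64 module; whether the encoded form actually uses fewer tokens depends on the tokenizer, so it is worth checking with your model of choice:

import base64

prompt = "Translate the following text to French: 'Hello, how are you?'"

# Encode the prompt to Base64 and decode it back to verify the round trip
encoded = base64.b64encode(prompt.encode("utf-8")).decode("ascii")
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)
print(decoded)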

Surprisingly, some encoding techniques are also used for model jailbreaking, which involves manipulating an LLM to bypass its safety mechanisms. For more details on encoding methods, see section 3.2 and appendix A.1.2 of the survey paper.

Filtering

While the previous two methods try to compress the whole prompt, filtering techniques focus on eliminating unnecessary parts to enhance the efficiency of LLMs.

Filtering techniques evaluate the information content of different parts of a prompt and remove redundant information since not all information in the prompt is beneficial for LLMs. This can be done at various levels, such as sentences, phrases, or tokens.

The goal is to retain only the most relevant parts of the prompt. In the paper Selective Context by Li et al. (2023), researchers use self-information metrics to filter redundant information. In the paper LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models, researchers from Microsoft refine prompts into key components and dynamically adjust compression ratios for each part. For further reading, refer to section 3.3 and appendix A.1.3 of the survey paper.

How to Implement Prompt Compression in Python

In this section, I'll implement and test the Selective Context algorithm, which is popular and considered state-of-the-art. If you only want to test the algorithm, you don't need to install anything: it's already hosted on the HuggingFace platform.

There are also other mainstream compression techniques, like Keep It Simple (KIS), SCLR, and the algorithms from the LLMLingua family, but we won't be able to cover them in this short article.

[Screenshot: the Selective Context demo app hosted on HuggingFace]

In the Selective Context web app, you can choose the language of the prompt you want to compress (English or Simplified Chinese). You can also set the compression ratio and select whether to filter out sentences, tokens, or phrases.

Implementing and testing Selective Context with OpenAI API

Now let's work on the Python implementation. We'll also test a few compressed prompts with the gpt-3.5-turbo-0125 model.

First, we need to install all the required modules. We install the selective-context library using pip:

pip install selective-context

We also need to download the en_core_web_sm model from spaCy, which can be done with the following command:

python -m spacy download en_core_web_sm

Now we need to initialize the SelectiveContext object. We can choose either curie or gpt-2 for the model and en or zh for the language. I will be using gpt-2 for this example.

from selective_context import SelectiveContext

sc = SelectiveContext(model_type='gpt-2', lang='en')

Next, we can call our SelectiveContext object on the text string we want to compress. We can set the reduce_ratio and reduce_level parameters. reduce_level needs to be one of the following: ‘sent’, ‘phrase’, or ‘token’. The object call returns a (context, reduced_content) tuple, where context is the compressed prompt and reduced_content is a list of removed phrases, sentences, or tokens.

context, reduced_content = sc(text, reduce_ratio=0.5, reduce_level='sent')

Now let’s do some examples. I’ll ask the gpt-3.5-turbo-0125 model to summarize the “When should we use prompt compression” section from this article. Then, we’ll compress the section with a 0.5 compression rate using all three reduction levels: sentence, phrase, and token. We will ask the model to summarize the compressed versions again and compare the token count of each prompt and the output of the model.
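Here is a rough sketch of that experiment loop. It assumes text already holds the section as a plain string and uses tiktoken only to count tokens; the dictionary and variable names are mine, not part of the Selective Context API:

import tiktoken
from selective_context import SelectiveContext

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
sc = SelectiveContext(model_type='gpt-2', lang='en')

compressed_versions = {}
for level in ['sent', 'phrase', 'token']:
    # Compress the section at a 0.5 ratio for each reduction level
    context, reduced_content = sc(text, reduce_ratio=0.5, reduce_level=level)
    compressed_versions[level] = context
    print(level, len(enc.encode(context)), "tokens")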

Original paragraph and summarization

Below, you can see the API call I’ll be using—we’ll just change the text variable in the format string. First, we ask the model to summarize the original, uncompressed paragraph:

from openai import OpenAI

client = OpenAI(api_key=key)
completion = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    messages=[
        {"role": "user", "content": f"Summarize the following paragraph:\n {text}"}
    ]
)

By checking completion.usage, we can see that the original prompt is 304 tokens long, and here is the output:

[Prompt compression is useful in scenarios such as advanced prompt engineering techniques and retrieval-augmented generation pipelines to reduce token counts and costs while preserving effectiveness and context. However, it is not a universal solution and should be used judiciously, as excessive compression can lead to a loss of important details. It is important to strike a balance between reducing size and maintaining the prompt's meaning.]

As you can see, the model summarized the uncompressed paragraph quite well.
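For reference, this is how the prompt and completion token counts, as well as the summary text quoted above, can be read from the completion object returned by the call earlier:

# Token accounting reported by the API
print(completion.usage.prompt_tokens)      # 304 for the original, uncompressed prompt
print(completion.usage.completion_tokens)  # tokens in the generated summary
print(completion.usage.total_tokens)

# The summary text itself
print(completion.choices[0].message.content)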

Token-level reduction

Now let's try the token-reduced version. The compressed paragraph looks like this:

[When Should Prompt CompProm compression particularly following Advanced prompt engineering Techniques like chainofthought prompting while highly lengthy prompts reach thousands tokensThis increases processing costs and may exceed token limits of certain modelsProm compression mit these token count while prompt effectivenessRetri-au generationRAG pipelines R combine information retrieval text specialized chat contextual understandingThese pipelines frequently involve providing extensive conversation histories retrieved as prompts leading high token counts increased expensesProm compression essential such essential context while costsApplic prompt compression It prompt universal judFor instance assistant models like ChatGPT designed convers contexts benefit aggressive prompt compressionThese models often do charge per token integrated chat summar memory features conversation history effectively compression redundantIt even working models charge per token excessive compression could nuance importantSt reducing size maintaining prompt’s meaning is]

The prompt is 162 tokens long, and here is the model summary:

[Prompt compression is recommended following advanced prompt engineering techniques like chain of thought prompting, especially when dealing with lengthy prompts that reach thousands of tokens. This helps reduce processing costs and ensure that the prompt remains effective for models like RAG pipelines, which involve extensive conversation histories. Prompt compression is essential for maintaining context while keeping costs low, particularly for models like ChatGPT that charge per token and benefit from aggressive compression. However, excessive compression could lead to loss of important nuances, so it is important to strike a balance between reducing size and maintaining the prompt's meaning.]

The paragraph is not summarized correctly. The nuance about ChatGPT not benefiting from aggressive compression was lost.

Sentence-level reduction

Next, let’s try the sentence-level reduction. The reduced prompt is 129 tokens long:

[When Should We Use Prompt Compression? This increases processing times and costs and may exceed token limits of certain models. These pipelines frequently involve providing extensive conversation histories or retrieved documents as prompts, leading to high token counts and increased expenses. Prompt compression is essential in such cases to maintain essential context while minimizing costs. For instance, assistant models like ChatGPT, designed for conversational contexts, may not benefit from aggressive prompt compression. These models often do not charge per token and have integrated chat summarization and memory features to manage conversation history effectively, making compression redundant.]

Here is the model summary:

[Prompt compression is necessary when processing times, costs, and token limits may be exceeded in pipelines involving extensive conversation histories or retrieved documents as prompts. However, for certain models like ChatGPT, designed for conversational contexts, aggressive prompt compression may not be beneficial as they already have features to manage conversation history effectively without the need for compression.]

This summary doesn’t contain any mistakes but is missing the RAG discussion.

Phrase-level reduction

Lastly, let's test the phrase-level reduction. The prompt is 183 tokens long:

[When Should Prompt Compression Prompt compression particularly beneficial Advanced prompt engineering techniques Techniques like chainofthought prompting while highly lengthy prompts reach thousands tokens This increases processing costs and may exceed token limits of certain models Prompt compression these issues token count while the prompt's effectiveness Retrieval-augmented generation (RAG) pipelines RAG pipelines combine information retrieval text generation specialized chatbots contextual understanding These pipelines frequently involve providing extensive conversation histories or retrieved as prompts leading high token counts increased expenses Prompt compression essential such cases to maintain essential context while costs Applicability prompt compression It's For instance assistant models like ChatGPT designed conversational contexts may benefit aggressive prompt compression These models often do charge per token have integrated chat summarization memory features manage conversation history effectively making compression redundant It even working models charge per token excessive compression could nuance important details reducing size maintaining the prompt’s meaning is]

The model's summary of this compressed prompt was mostly correct and coherent. However, it incorrectly stated that ChatGPT benefits from aggressive prompt compression.

Evaluating prompt compression

By comparing the token counts and the content of the model's summaries at different compression levels, we can see the impact of prompt compression on the model's output:

Compression level | Token count (original: 304) | Accuracy
Token-level | 162 | Lost important nuances about ChatGPT not benefiting from aggressive compression, and made mistakes.
Sentence-level | 129 | Did not make any mistakes, but missed some context about RAG pipelines.
Phrase-level | 183 | Similarly to token-level, incorrectly stated that ChatGPT benefits from aggressive compression.

Overall, prompt compression can significantly reduce the token count while preserving the main ideas. However, it’s essential to strike a balance to avoid losing important nuances and context.

This exercise highlights the need to carefully choose the compression level based on the specific application and the criticality of maintaining certain details in the prompt.

It's worth noting that we performed all experiments with a 0.5 compression rate, which is relatively high. You might want to experiment with various compression rates for different use cases to find the optimal balance between reducing prompt size and maintaining its integrity.
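As a starting point, a quick sweep over several reduce_ratio values (a sketch reusing the sc object and text string from earlier) shows how much each setting trims the prompt before you send anything to the model:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

for ratio in [0.2, 0.35, 0.5, 0.65, 0.8]:
    # Compress at the phrase level and report the resulting token count
    context, _ = sc(text, reduce_ratio=ratio, reduce_level='phrase')
    print(f"reduce_ratio={ratio}: {len(enc.encode(context))} tokens")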

Conclusion

Prompt compression is a powerful technique for optimizing the efficiency and cost-effectiveness of LLMs. We've explored its fundamentals, discussing its importance, various techniques, and implementation details.

As the field of generative AI continues to evolve, staying abreast of the latest developments is crucial. To further enhance your skills and understanding of prompt compression and related techniques, I encourage you to explore the papers referenced in this article, as well as the following comprehensive blog posts and courses from DataCamp:

  • Understanding Prompt Engineering
  • Understanding Prompt Tuning
  • ChatGPT Prompt Engineering for Developers

Prompt Compression FAQs

What is prompt compression, and why is it important?

Prompt compression is a technique used to optimize the inputs given to large language models (LLMs) by reducing their length while maintaining the quality and relevance of the output. It is important because it helps prompts stay within token limits and reduces processing time and costs.

What are some common challenges in implementing prompt compression?

Common challenges include maintaining the balance between compression and preserving essential information, handling diverse types of input data, and ensuring that the compressed prompt still produces high-quality outputs. Additionally, implementing machine-learning based and hybrid approaches can be resource-intensive and complex.

How do you handle prompt compression for multi-turn conversations in chatbots?

For multi-turn conversations, prompt compression must ensure that the context from previous interactions is preserved. Techniques such as selective context filtering can help by retaining critical parts of the conversation while compressing less important information. This approach helps maintain continuity and relevance in the chatbot's responses.
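As a rough illustration of that approach, the helper below (hypothetical, not part of any library) compresses the earlier turns of a conversation with the SelectiveContext object used in this tutorial while passing the latest user message through untouched:

from selective_context import SelectiveContext

sc = SelectiveContext(model_type='gpt-2', lang='en')

def compress_history(turns, latest_user_message, reduce_ratio=0.5):
    # Hypothetical helper: compress earlier chat turns, keep the newest message intact
    history = "\n".join(f"{role}: {content}" for role, content in turns)
    compressed_history, _ = sc(history, reduce_ratio=reduce_ratio, reduce_level='sent')
    return f"{compressed_history}\nuser: {latest_user_message}"

turns = [
    ("user", "Hi, I need help planning a three-day trip to Rome."),
    ("assistant", "Sure! Do you prefer museums, food tours, or historical sites?"),
    ("user", "Mostly historical sites, and I am traveling on a budget."),
]

print(compress_history(turns, "Can you suggest an itinerary for day one?"))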

What resources can I use to improve my prompt compression skills?

To further enhance your skills, explore the referenced research papers, blog posts, and DataCamp courses mentioned in the article. Topics include understanding prompt engineering, prompt tuning, and retrieval-augmented generation (RAG), all of which are crucial for mastering prompt compression. Also, make sure to practice the techniques that you learned.

Are there any ethical considerations when using prompt compression?

Yes, ethical considerations include ensuring that compression does not inadvertently introduce biases or omit critical information that could lead to misleading or harmful outputs. Additionally, prompt compression techniques might also inadvertently jailbreak the model, causing it to behave unpredictably or generate inappropriate content. It is important to monitor the effects of prompt compression on the model's performance and outputs, especially in sensitive applications such as healthcare, finance, or legal advice.
