Imagine you are building a customer support AI that needs to answer questions about your product. Sometimes it needs to pull information from your documentation, while at other times it needs to search the web for the latest updates. Agentic RAG systems come in handy in these kinds of complex AI applications. Think of them as smart research assistants who not only know your internal documentation but also decide when to search the web. In this guide, we will walk through the process of building an agentic QA RAG system using the Haystack framework.
An agentic LLM is an AI system that can autonomously make decisions and take actions based on its understanding of the task. Unlike traditional LLMs that mainly generate text responses, an agentic LLM can do much more: it can think, plan, and act with minimal human input. It assesses its own knowledge and recognizes when it needs more information or external tools. Agentic LLMs don't rely only on static data or indexed knowledge; instead, they decide which sources to trust and how to gather the best insights.
This type of system can also pick the right tools for the job. It can decide when it needs to retrieve documents, run calculations, or automate tasks. What sets these systems apart is their ability to break complex problems into steps and execute them independently, which makes them valuable for research, analysis, and workflow automation.
Traditional RAG systems follow a linear process. When a query is received, the system first identifies the key elements within the request. It then searches the knowledge base, scanning for relevant information that can help design an accurate response. Once the relevant information or data is retrieved, the system processes it to generate a meaningful and contextually relevant response.
The diagram below illustrates this process.
Now, an agentic RAG system enhances this process by evaluating whether the retrieved context can actually answer the query, deciding when to rely on the local knowledge base and when to reach for external tools such as web search, and routing the query accordingly.
The key difference lies in the system’s ability to make intelligent decisions about how to handle queries, rather than following a fixed retrieval-generation pattern.
Haystack is an open-source framework for building production-ready AI applications, LLM applications, RAG pipelines, and search systems. It offers a powerful and flexible way to build LLM applications and lets you integrate models from platforms such as Hugging Face, OpenAI, Cohere, Mistral, and local Ollama. You can also deploy models on cloud services like AWS SageMaker, AWS Bedrock, Azure, and GCP.
Haystack provides robust document stores for efficient data management. It also comes with a comprehensive set of tools for evaluation, monitoring, and data integration, which ensures smooth performance across all layers of your application. Finally, it has a strong community that regularly contributes new integrations from various service providers.
Haystack has two primary concepts for building fully functional GenAI LLM systems: components and pipelines. Let's understand them with a simple RAG example about Japanese anime characters.
Components are the core building blocks of Haystack. They perform tasks such as document storing, document retrieval, text generation, and embedding. Haystack ships with many components you can use directly after installation, and it also provides APIs for building your own components by writing a Python class.
There is also a collection of integrations from partner companies and the community.
Install libraries and set up Ollama
$ pip install haystack-ai ollama-haystack

# On your system, download and install Ollama, then pull the models
$ ollama pull llama3.2:3b
$ ollama pull nomic-embed-text

# And then start the Ollama server
$ ollama serve
Import some components
from haystack import Document, Pipeline
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.generators.ollama import OllamaGenerator
Create documents and a document store
document_store = InMemoryDocumentStore()
documents = [
    Document(content="Naruto Uzumaki is a ninja from the Hidden Leaf Village and aspires to become Hokage."),
    Document(content="Luffy is the captain of the Straw Hat Pirates and dreams of finding the One Piece."),
    Document(content="Goku, a Saiyan warrior, has defended Earth from numerous powerful enemies like Frieza and Cell."),
    Document(content="Light Yagami finds a mysterious Death Note, which allows him to eliminate people by writing their names."),
    Document(content="Levi Ackerman is humanity's strongest soldier, fighting against the Titans to protect mankind."),
]

# Write the documents into the store so the retriever can find them
document_store.write_documents(documents)
Pipelines are the backbone of Haystack's framework. They define the flow of data between components. A pipeline is essentially a directed graph: a component with multiple outputs can connect to other components, and a component can accept multiple inputs.
You can define a pipeline as follows:
pipe = Pipeline()
pipe.add_component("retriever", InMemoryBM25Retriever(document_store=document_store))
pipe.add_component("prompt_builder", PromptBuilder(template=template))
pipe.add_component("llm", OllamaGenerator(model="llama3.2:1b", url="http://localhost:11434"))
pipe.connect("retriever", "prompt_builder.documents")
pipe.connect("prompt_builder", "llm")
You can visualize the pipeline
image_param = {
    "format": "img",
    "type": "png",
    "theme": "forest",
    "bgColor": "f2f3f4",
}
pipe.show(params=image_param)
The pipeline provides modular workflow management, flexible component arrangement, and easy debugging and visualization.
Nodes are the basic processing units connected in a pipeline; these nodes are the components that perform specific tasks.
Examples of nodes from the above pipeline
pipe.add_component("retriever", InMemoryBM25Retriever(document_store=document_store)) pipe.add_component("prompt_builder", PromptBuilder(template=template)) pipe.add_component( "llm", OllamaGenerator(model="llama3.2:1b", url="http://localhost:11434") )
The connection graph defines how components interact.
From the above pipeline, you can visualize the connection graph.
image_param = {
    "format": "img",
    "type": "png",
    "theme": "forest",
    "bgColor": "f2f3f4",
}
pipe.show(params=image_param)
The connection graph of the anime pipeline
This graph structure enables complex data flows, validates that component connections are compatible, and makes it easy to see how data moves through the system.
Now we can query our anime knowledge base using the prompt.
Create a prompt template
template = """ Given only the following information, answer the question. Ignore your own knowledge. Context: {% for document in documents %} {{ document.content }} {% endfor %} Question: {{ query }}? """
This prompt instructs the model to answer using only the information retrieved from the document store.
Query using prompt and retriever
query = "How Goku eliminate people?" response = pipe.run({"prompt_builder": {"query": query}, "retriever": {"query": query}}) print(response["llm"]["replies"])
Response:
This RAG is simple yet conceptually valuable for newcomers. Now that we have covered most of the core concepts of the Haystack framework, we can dive into our main project. If anything new comes up, I will explain it along the way.
We will build an NCERT Physics book-based question-answering RAG for higher secondary students. It will answer queries by taking information from the NCERT books, and if the information is not there, it will search the web to get it.
For this, I will use the Haystack framework for orchestration, Ollama running Llama 3.2:3b locally for generation, nomic-embed-text for embeddings, ChromaDB as the vector store, and the DuckDuckGo API for web search.
The entire system runs locally and is free to use.
We will set up a conda environment with Python 3.12 and install the required packages:
$ conda create --name agenticlm python=3.12
$ conda activate agenticlm

$ pip install haystack-ai ollama-haystack pypdf
$ pip install chroma-haystack duckduckgo-api-haystack
Now create a project directory named qagent.
$ md qagent   # create the project dir
$ cd qagent   # change to the dir
$ code .      # open the folder in VS Code
You can use plain Python files or a Jupyter Notebook for the project; it does not matter. I will use a plain Python file.
Create a main.py file in the project root.
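Here is a sketch of the imports that main.py will need for the rest of this walkthrough. The Chroma and DuckDuckGo class names come from the chroma-haystack and duckduckgo-api-haystack integrations installed above; treat the DuckDuckGo class name in particular as an assumption about that package.

from haystack import Pipeline
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.converters import PyPDFToDocument
from haystack.components.joiners import BranchJoiner
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.routers import ConditionalRouter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack_integrations.components.embedders.ollama import OllamaDocumentEmbedder, OllamaTextEmbedder
from haystack_integrations.components.generators.ollama import OllamaGenerator
from haystack_integrations.components.retrievers.chroma import ChromaEmbeddingRetriever
from haystack_integrations.document_stores.chroma import ChromaDocumentStore
from duckduckgo_api_haystack import DuckduckgoApiWebSearch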
The document store is the most important piece here; it is where we store our embeddings for retrieval, and we use ChromaDB as the embedding store. As you saw in the earlier example, we used InMemoryDocumentStore for fast retrieval because our data was tiny, but for a robust retrieval system we don't rely on the in-memory store: it hogs memory, and we would have to create the embeddings every time we start the system.
The solution is a vector database such as Pinecone, Weaviate, Postgres vector DB, or ChromaDB. I use ChromaDB because it is free, open source, easy to use, and robust.
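A minimal sketch of creating the Chroma store with the ChromaDocumentStore imported earlier; the persist_path value here is an arbitrary choice.

# Persist embeddings to disk so we do not re-embed on every run
document_store = ChromaDocumentStore(persist_path="chroma_store")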
query = "How Goku eliminate people?" response = pipe.run({"prompt_builder": {"query": query}, "retriever": {"query": query}}) print(response["llm"]["replies"])
persist_path is where you want to store your embedding.
PDF files path
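A sketch of collecting the PDF paths, assuming the books are saved in a data folder inside the project:

from pathlib import Path

# Gather every PDF placed in the local "data" folder
file_paths = [str(p) for p in Path("data").glob("*.pdf")]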
This creates a list of files from the data folder, which contains our PDF files.
We will use Haystack's built-in document preprocessors, such as the cleaner, splitter, and file converter, and then use a writer to write the data into the store; a code sketch follows the list below.
Cleaner: It removes extra whitespace, repeated substrings, empty lines, etc. from the documents.
Splitter: It splits the documents in various ways, such as by words, sentences, paragraphs, or pages.
File converter: It uses pypdf to convert the PDFs into Haystack Document objects.
Writer: It writes the documents into the document store; for duplicate documents, it overwrites the previous version.
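A sketch of these four components, using the classes imported earlier; the splitter settings are arbitrary choices you should tune for your documents.

converter = PyPDFToDocument()
cleaner = DocumentCleaner()
splitter = DocumentSplitter(split_by="sentence", split_length=10)  # chunking choice is an assumption
writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE)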
Now set the embedder for document indexing.
Embedder: Nomic Embed Text
We will use the nomic-embed-text embedder, which is effective and freely available on Hugging Face and in Ollama.
Before you run your indexing pipeline, open your terminal and run the commands below to pull the nomic-embed-text and llama3.2:3b models from the Ollama model store:
$ ollama pull nomic-embed-text
$ ollama pull llama3.2:3b
Then start Ollama by typing ollama serve in your terminal.
Now define the embedder component.
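A sketch of the document embedder, pointing at the local Ollama server:

embedder = OllamaDocumentEmbedder(model="nomic-embed-text", url="http://localhost:11434")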
We use the OllamaDocumentEmbedder component for embedding documents; to embed a plain text string (such as a query), use OllamaTextEmbedder instead.
Like our previous toy RAG example, we will start by initiating the Pipeline class.
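A minimal sketch; the variable name indexing_pipeline is my own and is reused in the sketches that follow.

indexing_pipeline = Pipeline()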
Now we will add the components to our pipeline one by one
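A sketch of the add_component calls for the indexing pipeline; the string names are arbitrary but are reused when connecting.

indexing_pipeline.add_component("converter", converter)
indexing_pipeline.add_component("cleaner", cleaner)
indexing_pipeline.add_component("splitter", splitter)
indexing_pipeline.add_component("embedder", embedder)
indexing_pipeline.add_component("writer", writer)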
pipe.add_component("retriever", InMemoryBM25Retriever(document_store=document_store)) pipe.add_component("prompt_builder", PromptBuilder(template=template)) pipe.add_component( "llm", OllamaGenerator(model="llama3.2:1b", url="http://localhost:11434") )
The order in which you add components to the pipeline does not matter, so you can add them in any order; what matters is how you connect them.
Here, order matters, because how you connect the components tells the pipeline how the data will flow through it. It is like plumbing: it doesn't matter in which order or from where you buy your plumbing items, but how you put them together decides whether you get water or not.
The converter converts the PDFs and sends them to the cleaner for cleaning. The cleaner then sends the cleaned documents to the splitter for chunking. Those chunks pass to the embedder for vectorization, and finally the embedder hands the embeddings over to the writer for storage.
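A sketch of the connections that implement this flow; the socket names follow the standard Haystack components.

indexing_pipeline.connect("converter.documents", "cleaner.documents")
indexing_pipeline.connect("cleaner.documents", "splitter.documents")
indexing_pipeline.connect("splitter.documents", "embedder.documents")
indexing_pipeline.connect("embedder.documents", "writer.documents")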
Understood? OK, let me give you a visual graph of the indexing pipeline so you can inspect the data flow.
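A sketch of the visualization call, reusing the image_param dictionary from the toy example above:

image_param = {
    "format": "img",
    "type": "png",
    "theme": "forest",
    "bgColor": "f2f3f4",
}
indexing_pipeline.show(params=image_param)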
template = """ Given only the following information, answer the question. Ignore your own knowledge. Context: {% for document in documents %} {{ document.content }} {% endfor %} Question: {{ query }}? """
Yes, you can easily create a nice Mermaid graph from a Haystack pipeline.
Graph of Indexing Pipeline
I assume you have now fully grasped the idea behind the Haystack pipeline. Give a thank-you to your plumber.
Now we need to create a router to route the data along different paths. In this case, we'll use a ConditionalRouter, which routes based on certain conditions: it evaluates conditions on component output and directs the data flow through different pipeline branches, enabling dynamic decision-making and robust fallback strategies.
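A sketch of such a router. It assumes the QA prompt (defined below) tells the LLM to reply with no_answer when the context is insufficient; the output names go_to_websearch and answer are my own choices.

routes = [
    {
        # The LLM could not answer from local context: forward the query to web search
        "condition": "{{'no_answer' in replies[0]}}",
        "output": "{{query}}",
        "output_name": "go_to_websearch",
        "output_type": str,
    },
    {
        # The LLM answered from local context: return the reply as the final answer
        "condition": "{{'no_answer' not in replies[0]}}",
        "output": "{{replies[0]}}",
        "output_name": "answer",
        "output_type": str,
    },
]
router = ConditionalRouter(routes=routes)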
When the LLM returns a no_answer reply for the context retrieved from the embedding store, the router sends the query to the web search tool to collect relevant data from the internet.
For web search, we could use the DuckDuckGo API or Tavily; here I have used DuckDuckGo.
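A sketch of the web search component, assuming the duckduckgo-api-haystack package exposes DuckduckgoApiWebSearch; top_k is an arbitrary choice.

websearch = DuckduckgoApiWebSearch(top_k=5)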
OK, most of the heavy lifting has been done. Now it is time for prompt engineering.
We will use the Haystack PromptBuilder component to build prompts from templates.
First, we will create a prompt for the QA task.
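Here is a sketch of what the QA template can look like. The exact wording is my own, but the no_answer instruction is what the router above relies on.

template_qa = """
Answer the question using only the given context.
If the context does not contain the answer, reply with exactly "no_answer".

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{ query }}
"""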
This prompt takes the context from the retrieved documents and tries to answer the question; if it does not find relevant context in the documents, it replies with no_answer.
Now for the second prompt: after getting no_answer from the LLM, the system uses the web search tool to gather context from the internet.
Duckduckgo prompt template
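A sketch of the web search template; the wording is my own.

prompt_template_after_websearch = """
Answer the question based on the documents retrieved from web search.

Documents:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{ query }}
"""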
This lets the system fall back to web search and try to answer the query from the retrieved web pages.
Create the prompts using Haystack's PromptBuilder.
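A sketch of the two prompt builders, one per template; the variable names are my own.

prompt_builder = PromptBuilder(template=template_qa)
prompt_builder_after_websearch = PromptBuilder(template=prompt_template_after_websearch)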
We will use Haystack's BranchJoiner to join the two prompt branches together.
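A minimal sketch using the BranchJoiner imported earlier; it accepts a prompt string from either branch and forwards it to the LLM.

prompt_joiner = BranchJoiner(str)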
pipe.add_component("retriever", InMemoryBM25Retriever(document_store=document_store)) pipe.add_component("prompt_builder", PromptBuilder(template=template)) pipe.add_component( "llm", OllamaGenerator(model="llama3.2:1b", url="http://localhost:11434") )
The query pipeline will embed the query, gather contextual resources from the embedding store, and answer our query using the LLM or the web search tool.
It is similar to the indexing pipeline.
Adding components to the query pipeline
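A sketch of the query pipeline and its components, reusing the document_store, prompt builders, joiner, router, and websearch objects defined above. The component names, and the choice of OllamaTextEmbedder plus ChromaEmbeddingRetriever for query-time retrieval, are assumptions.

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", OllamaTextEmbedder(model="nomic-embed-text", url="http://localhost:11434"))
query_pipeline.add_component("retriever", ChromaEmbeddingRetriever(document_store=document_store))
query_pipeline.add_component("prompt_builder", prompt_builder)
query_pipeline.add_component("prompt_joiner", prompt_joiner)
query_pipeline.add_component("llm", OllamaGenerator(model="llama3.2:3b", url="http://localhost:11434"))
query_pipeline.add_component("router", router)
query_pipeline.add_component("websearch", websearch)
query_pipeline.add_component("prompt_builder_after_websearch", prompt_builder_after_websearch)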
template = """ Given only the following information, answer the question. Ignore your own knowledge. Context: {% for document in documents %} {{ document.content }} {% endfor %} Question: {{ query }}? """
Here, for LLM generation, we use the OllamaGenerator component to generate answers with llama3.2:3b or 1b, or whichever LLM you like that supports tool calling.
Connecting all the components together for query flow and answer generation
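A sketch of the connections. The loop from prompt_builder_after_websearch back into prompt_joiner is what lets the web search branch feed the same LLM; socket names follow the components above.

query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
query_pipeline.connect("retriever.documents", "prompt_builder.documents")
query_pipeline.connect("prompt_builder.prompt", "prompt_joiner")
query_pipeline.connect("prompt_joiner", "llm")
query_pipeline.connect("llm.replies", "router.replies")
query_pipeline.connect("router.go_to_websearch", "websearch.query")
query_pipeline.connect("router.go_to_websearch", "prompt_builder_after_websearch.query")
query_pipeline.connect("websearch.documents", "prompt_builder_after_websearch.documents")
query_pipeline.connect("prompt_builder_after_websearch.prompt", "prompt_joiner")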
query = "How Goku eliminate people?" response = pipe.run({"prompt_builder": {"query": query}, "retriever": {"query": query}}) print(response["llm"]["replies"])
In summary of the above connections: the embedded query retrieves context from the document store, the prompt builder turns that context and the query into a prompt, and the LLM answers from it. If the reply is no_answer, the router sends the query to web search, and the web results are built into a new prompt that is joined back in and passed to the LLM for the final answer.
Why not see for yourself?
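A sketch of the call that produces it, reusing image_param from earlier:

query_pipeline.show(params=image_param)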
Query Graph
I know it is a huge graph, but it shows you exactly what is going on in the belly of the beast.
Now it is time to enjoy the fruit of our hard work.
Create a function for easy querying.
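A sketch of such a helper, assuming the final answer always leaves the pipeline through the router's answer output (on the web search path, the second LLM pass no longer returns no_answer, so the router emits answer as well).

def get_answer(query: str) -> str:
    # The same query feeds the embedder, the QA prompt builder, and the router
    response = query_pipeline.run(
        {
            "text_embedder": {"text": query},
            "prompt_builder": {"query": query},
            "router": {"query": query},
        }
    )
    return response["router"]["answer"]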
It is a simple function for answer generation.
Now run your main script to index the NCERT physics book.
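A sketch of the indexing call, feeding the file_paths list created earlier into the converter:

indexing_pipeline.run({"converter": {"sources": file_paths}})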
This is a one-time job; after indexing, you should comment out this line, otherwise it will re-index the books every time you run the script.
And at the bottom of the file, we write our driver code for the query.
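A sketch of the driver code; the sample question is my own and is only meant to resemble the kind of query shown below.

if __name__ == "__main__":
    query = "Give me 5 MCQs on resistivity."
    print(get_answer(query))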
MCQ on resistivity from the book’s knowledge
Another question that is not in the book
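For example, an out-of-book query might look like this (the question is a hypothetical stand-in, not the one used in the original run):

print(get_answer("What is the latest ISRO mission to the Moon?"))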
Output
Let’s try another question.
So, it's working! We can use more data, books, or PDFs for embedding, which will generate more context-aware answers. Also, LLMs such as GPT-4o, Anthropic's Claude, or other cloud LLMs will do the job even better.
Our agentic RAG system demonstrates the flexibility and robustness of the Haystack framework and its power of combining components and pipelines. This RAG can be made production-ready by deploying it to a web service platform and by using a better paid LLM from providers such as OpenAI or Anthropic. You can build a UI using Streamlit or a React-based web SPA for a better user experience.
You can find all the code used in the article here.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Q1. What happens when the system cannot answer a query from the local documents?
A. The system uses its router component to automatically fall back to web search when local knowledge is insufficient, ensuring comprehensive coverage.
Q2. What advantages does the pipeline architecture offer?
A. The pipeline architecture enables modular development, easy testing, and flexible component arrangement, making the system maintainable and extensible.
Q3. How does the connection graph enhance system functionality?
A. The connection graph enables complex data flows and parallel processing, improving system efficiency and flexibility in handling different types of queries.
Q4. Can I use other LLM APIs?
A. Yes, it is very easy: install the necessary integration package for the respective LLM API, such as Gemini, Anthropic, or Groq, and use it with your API keys.