
How to Create Your Own RAG with Free LLM Models and a Knowledge Base


This article explores the implementation of a straightforward yet effective question-answering system that combines modern transformer-based models. The system uses T5 (Text-to-Text Transfer Transformer) for answer generation and Sentence Transformers for semantic similarity matching.

In my previous article, I explained how to create a simple translation API with a web interface using a free foundational LLM model. This time, let’s dive into building a Retrieval-Augmented Generation (RAG) system using free transformer-based LLM models and a knowledge base.

RAG (Retrieval-Augmented Generation) is a technique that combines two key components:

Retrieval: First, it searches through a knowledge base (like documents, databases, etc.) to find relevant information for a given query. This usually involves:

  • Converting text into embeddings (numerical vectors that represent meaning)
  • Finding similar content using similarity measures (like cosine similarity)
  • Selecting the most relevant pieces of information

Generation: Then it uses a language model (like T5 in our code) to generate a response by:

  • Combining the retrieved information with the original question
  • Creating a natural language response based on this context

In the code:

  • The SentenceTransformer handles the retrieval part by creating embeddings (see the sketch after this list)
  • The T5 model handles the generation part by creating answers
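
To make the retrieval half concrete, here is a minimal, self-contained sketch using Sentence Transformers' cosine-similarity utility. The knowledge-base sentences and the query are made-up examples, not part of the project's dataset:

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# A toy knowledge base and a query
knowledge_base = ["Paris is the capital of France.",
                  "The Pacific is the largest ocean on Earth."]
kb_embeddings = encoder.encode(knowledge_base, convert_to_tensor=True)
query_embedding = encoder.encode("What is the capital of France?", convert_to_tensor=True)

# Cosine similarity between the query and every knowledge-base entry
scores = util.cos_sim(query_embedding, kb_embeddings)[0]
print(knowledge_base[scores.argmax().item()])  # -> "Paris is the capital of France."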

Benefits of RAG:

  • More accurate responses since they’re grounded in specific knowledge
  • Reduced hallucination compared to pure LLM responses
  • Ability to access up-to-date or domain-specific information
  • More controllable and transparent than pure generation

System Architecture Overview


The implementation consists of a SimpleQASystem class that orchestrates two main components:

  • A semantic search system using Sentence Transformers
  • An answer generation system using T5

You can download the latest version of the source code here: https://github.com/alexander-uspenskiy/rag_project

System Diagram


RAG Project Setup Guide

This guide will help you set up your Retrieval-Augmented Generation (RAG) project on both macOS and Windows.

Prerequisites

For macOS:

Install Homebrew (if not already installed):

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

Install Python 3.10 using Homebrew:

brew install python@3.10

For Windows:

Download and install Python 3.10 from python.org
Make sure to check “Add Python to PATH” during installation

Project Setup

Step 1: Create Project Directory

macOS:

mkdir RAG_project
cd RAG_project

Windows:

mkdir RAG_project
cd RAG_project

Step 2: Set Up Virtual Environment

macOS:

python3 -m venv venv
source venv/bin/activate

Windows:

python -m venv venv
venv\Scripts\activate
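
Step 3: Install Dependencies

The exact requirements ship with the repository, but at a minimum the code in this article relies on PyTorch, Transformers, and Sentence Transformers (sentencepiece is needed by the T5 tokenizer), so a typical install would be:

pip install torch transformers sentence-transformers sentencepiece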

Core Components

1. Initialization
from transformers import T5Tokenizer, T5ForConditionalGeneration
from sentence_transformers import SentenceTransformer

# Constructor of the SimpleQASystem class
def __init__(self):
    self.model_name = 't5-small'
    self.tokenizer = T5Tokenizer.from_pretrained(self.model_name)
    self.model = T5ForConditionalGeneration.from_pretrained(self.model_name)
    self.encoder = SentenceTransformer('paraphrase-MiniLM-L6-v2')

The system initializes with two primary models:

  • T5-small: a smaller version of the T5 model for generating answers
  • paraphrase-MiniLM-L6-v2: a sentence transformer model for encoding text into meaningful vectors

2. Dataset Preparation

from typing import List, Dict

# Method of the SimpleQASystem class
def prepare_dataset(self, data: List[Dict[str, str]]):
    # Keep the raw answers and pre-compute an embedding for each one
    self.answers = [item['answer'] for item in data]
    self.answer_embeddings = []
    for answer in self.answers:
        embedding = self.encoder.encode(answer, convert_to_tensor=True)
        self.answer_embeddings.append(embedding)

The dataset preparation phase:

  • Extracts answers from the input data
  • Creates embeddings for each answer using the sentence transformer
  • Stores both answers and their embeddings for quick retrieval
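
For illustration, the input data could look like this (the field names follow the prepare_dataset signature; only the 'answer' key is used for embeddings, and the exact sample data in the repository may differ):

data = [
    {"question": "What is the capital of France?",
     "answer": "The capital of France is Paris."},
    {"question": "Who wrote Hamlet?",
     "answer": "Hamlet was written by William Shakespeare."},
]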

How the System Works

1. Question Processing

When a user submits a question, the system follows these steps:

Embedding Generation: The question is converted into a vector representation using the same sentence transformer model used for the answers.

Semantic Search: The system finds the most relevant stored answer by:

  • Computing cosine similarity between the question embedding and all answer embeddings
  • Selecting the answer with the highest similarity score

Context Formation: The selected answer becomes the context for T5 to generate a final response.
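
The "... semantic search logic ..." placeholder in the next snippet could be implemented roughly like this (find_best_context is a hypothetical helper name used here for clarity; the repository keeps this logic inline in get_answer):

from sentence_transformers import util

def find_best_context(question, encoder, answers, answer_embeddings):
    # Encode the question with the same sentence transformer used for the answers
    question_embedding = encoder.encode(question, convert_to_tensor=True)
    # Cosine similarity against every stored answer embedding
    scores = [util.cos_sim(question_embedding, emb).item() for emb in answer_embeddings]
    # The most similar stored answer becomes the context
    return answers[scores.index(max(scores))]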

2. Answer Generation

def get_answer(self, question: str) -> str:
    # ... semantic search logic selects the most similar stored answer as `context` ...
    input_text = f"Given the context, what is the answer to the question: {question} Context: {context}"
    input_ids = self.tokenizer(input_text, max_length=512, truncation=True,
                               padding='max_length', return_tensors='pt').input_ids
    outputs = self.model.generate(input_ids, max_length=50, num_beams=4,
                                  early_stopping=True, no_repeat_ngram_size=2)
    return self.tokenizer.decode(outputs[0], skip_special_tokens=True)

The answer generation process:

  • Combines the question and context into a prompt for T5
  • Tokenizes the input text with a maximum length of 512 tokens
  • Generates an answer using beam search with these parameters:
      • max_length=50: Limits answer length
      • num_beams=4: Uses beam search with 4 beams
      • early_stopping=True: Stops generation when all beams reach an end token
      • no_repeat_ngram_size=2: Prevents repetition of bigrams

3. Answer Cleaning

The generated answer is lightly post-processed before being returned:

  • Removes duplicate consecutive words (case-insensitive)
  • Capitalizes the first letter of the answer
  • Removes extra whitespace
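
A minimal sketch of such a cleaning step matching the bullets above (the method name clean_answer is illustrative; see the repository for the exact implementation):

def clean_answer(self, answer: str) -> str:
    # Split on whitespace, which also removes extra spaces
    words = answer.split()
    # Drop duplicate consecutive words, ignoring case
    cleaned = []
    for word in words:
        if not cleaned or word.lower() != cleaned[-1].lower():
            cleaned.append(word)
    result = ' '.join(cleaned)
    # Capitalize the first letter of the answer
    return result[0].upper() + result[1:] if result else result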

Full Source Code

You can download the latest version of the source code here: https://github.com/alexander-uspenskiy/rag_project


Memory Management:

  • The system explicitly uses CPU to avoid memory issues
  • Embeddings are converted to CPU tensors when needed
  • Input length is limited to 512 tokens
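
A sketch of how the CPU-only constraint might be expressed (illustrative; the repository's exact wiring may differ):

import torch
from transformers import T5ForConditionalGeneration
from sentence_transformers import SentenceTransformer

# Pin both models to the CPU to avoid GPU memory issues
device = torch.device('cpu')
model = T5ForConditionalGeneration.from_pretrained('t5-small').to(device)
encoder = SentenceTransformer('paraphrase-MiniLM-L6-v2', device='cpu')

# Embeddings can be moved to CPU tensors explicitly when needed
embedding = encoder.encode("some answer", convert_to_tensor=True).cpu()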

Error Handling:

  • Comprehensive try-except blocks throughout the code
  • Meaningful error messages for debugging
  • Validation checks for uninitialized components
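
Putting the validation check and the try-except around the steps shown earlier might look like this (a sketch; it reuses the hypothetical find_best_context helper from above, and the error messages are illustrative):

def get_answer(self, question: str) -> str:
    # Validation check for uninitialized components
    if not getattr(self, 'answer_embeddings', None):
        raise ValueError("Dataset not prepared. Call prepare_dataset() first.")
    try:
        context = find_best_context(question, self.encoder,
                                    self.answers, self.answer_embeddings)
        input_text = f"Given the context, what is the answer to the question: {question} Context: {context}"
        input_ids = self.tokenizer(input_text, max_length=512, truncation=True,
                                   padding='max_length', return_tensors='pt').input_ids
        outputs = self.model.generate(input_ids, max_length=50, num_beams=4,
                                      early_stopping=True, no_repeat_ngram_size=2)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
    except Exception as e:
        # Meaningful error message rather than a bare stack trace
        return f"Error while generating an answer: {e}"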

Usage Example

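A minimal end-to-end sketch of how the class might be used (the example dataset is illustrative; the repository ships its own entry point and data):

def main():
    qa_system = SimpleQASystem()

    # Small illustrative knowledge base
    data = [
        {"question": "What is the capital of France?",
         "answer": "The capital of France is Paris."},
        {"question": "What is the largest planet?",
         "answer": "The largest planet in the solar system is Jupiter."},
    ]
    qa_system.prepare_dataset(data)

    print(qa_system.get_answer("What is the capital of France?"))

if __name__ == "__main__":
    main()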

Run the script in a terminal to try it out.

Limitations and Potential Improvements

Scalability:

  • The current implementation keeps all embeddings in memory
  • Could be improved with a vector database for large-scale applications (see the sketch below)
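
As one illustration of that direction, the in-memory list could be swapped for a FAISS index (a sketch, assuming faiss-cpu is installed and qa_system is an initialized SimpleQASystem):

import numpy as np
import faiss  # pip install faiss-cpu

# Build an index over the existing answer embeddings (float32, L2-normalized
# so that inner product equals cosine similarity)
embeddings = np.array([emb.cpu().numpy() for emb in qa_system.answer_embeddings],
                      dtype='float32')
faiss.normalize_L2(embeddings)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)

# Encode a question and retrieve the closest stored answer
query = qa_system.encoder.encode("What is the capital of France?",
                                 convert_to_tensor=True).cpu().numpy()
query = query.reshape(1, -1).astype('float32')
faiss.normalize_L2(query)
scores, indices = index.search(query, 1)
best_answer = qa_system.answers[indices[0][0]]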

Answer Quality:

  • Relies heavily on the quality of the provided answer dataset
  • Limited by the context window of T5-small
  • Could benefit from answer validation or confidence scoring

Performance:

  • Using CPU only might be slower for large-scale applications
  • Could be optimized with batch processing
  • Could implement caching for frequently asked questions (see the sketch below)
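
For instance, repeated questions could be memoized (a sketch; a real service would likely bound the cache and normalize the question text first, and qa_system is assumed to be an initialized SimpleQASystem):

from functools import lru_cache

@lru_cache(maxsize=256)
def cached_answer(question: str) -> str:
    # Identical questions hit the cache instead of re-running retrieval and generation
    return qa_system.get_answer(question)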

Conclusion

This implementation provides a solid foundation for a question-answering system, combining the strengths of semantic search and transformer-based text generation. Feel free to experiment with the generation parameters (max_length, num_beams, early_stopping, no_repeat_ngram_size, etc.) to find a combination that yields more coherent and stable answers. While there’s room for improvement, the current implementation offers a good balance between complexity and functionality, making it suitable for educational purposes and small to medium-scale applications.

Happy coding!
