Building a tiny vector store from scratch

王林
Published: 2024-08-27 06:34:02

As generative AI continues to evolve, vector databases play a crucial role in powering generative AI applications. There are many open-source vector databases available today, such as Chroma and Milvus, as well as popular proprietary ones such as Pinecone and SingleStore. You can read a detailed comparison of different vector databases on this website.

But have you ever wondered how these vector databases work under the hood?

A great way to learn something is to understand how it works under the hood. In this article, we will build a tiny in-memory vector store, "Pixie", from scratch in Python, using only NumPy as a dependency.

Before diving into the code, let's briefly discuss what a vector store is.

What is a vector store?

A vector store is a database designed to store and retrieve vector embeddings efficiently. These embeddings are numerical representations of data (usually text, but also images, audio, etc.) that capture semantic meaning in a high-dimensional space. The key capability of a vector store is efficient similarity search: finding the most relevant data points based on their vector representations (a quick illustration follows the list below). Vector stores can be used for many tasks, such as:

  1. Semantic search
  2. Retrieval-augmented generation (RAG)
  3. Recommendation systems
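
To make this concrete, here is a minimal sketch (using the same sentence-transformers model that appears later in this article, shown here purely for illustration) of how a piece of text becomes a fixed-length NumPy vector, i.e. a point in high-dimensional space:

# Minimal illustration: text in, fixed-length vector out.
# Assumes sentence-transformers is installed; "all-MiniLM-L6-v2" is the
# same model used in the full example later in this article.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embedding = model.encode("Vector stores enable semantic search")

print(embedding.shape)  # (384,) -- one point in a 384-dimensional space
print(embedding[:5])    # the first few components of the vector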

Let's code

In this article, we will build a tiny in-memory vector store called "Pixie". While it won't have all the optimizations of a production-grade system, it will demonstrate the core concepts. Pixie will have two main capabilities:

  1. Storing document embeddings
  2. Performing similarity search

Setting up the vector store

First, we will create a class called Pixie:

import numpy as np
from sentence_transformers import SentenceTransformer
from helpers import cosine_similarity


class Pixie:
    def __init__(self, embedder) -> None:
        self.store: np.ndarray = None
        self.embedder: SentenceTransformer = embedder
  1. First, we import numpy for efficient numerical operations and for storing multi-dimensional arrays.
  2. We also import SentenceTransformer from the sentence_transformers library. We use SentenceTransformer to generate embeddings, but you can use any embedding model that converts text into vectors. In this article, our main focus is the vector store itself, not embedding generation.
  3. Next, we initialize the Pixie class with an embedder. The embedder could live outside the main vector store, but for simplicity we initialize it inside the vector store class.
  4. self.store will hold our document embeddings as a NumPy array.
  5. self.embedder holds the embedding model we will use to convert documents and queries into vectors. (A short usage snippet follows this list.)
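
For example, creating a Pixie instance looks like this (the same embedder model is used in the full example later in this article):

# Create a Pixie instance backed by a sentence-transformers model
embedder = SentenceTransformer("all-MiniLM-L6-v2")
pixie = Pixie(embedder)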

Ingesting documents

To ingest documents/data into our vector store, we will implement the from_docs method:

    def from_docs(self, docs):
        self.docs = np.array(docs)
        self.store = self.embedder.encode(self.docs)
        return f"Ingested {len(docs)} documents"

This method does a few key things:

  1. It takes a list of documents and stores them in self.docs as a NumPy array.
  2. It uses the embedder model to convert each document into a vector embedding. These embeddings are stored in self.store.
  3. It returns a message confirming how many documents were ingested. Our embedder's encode method does the heavy lifting here, converting each text document into a high-dimensional vector representation. (A short ingestion example follows this list.)
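
As a quick illustration, ingesting a few hypothetical documents (the sample strings below are made up for this sketch) would look like this:

# Hypothetical sample documents, for illustration only
docs = [
    "Pixie is a tiny in-memory vector store.",
    "Cosine similarity measures the angle between two vectors.",
    "RAG combines retrieval with text generation.",
]
print(pixie.from_docs(docs))  # -> "Ingested 3 documents"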

Performing similarity search

The heart of our vector store is the similarity search functionality:

    def similarity_search(self, query, top_k=3):
        matches = list()
        q_embedding = self.embedder.encode(query)
        top_k_indices = cosine_similarity(self.store, q_embedding, top_k)
        for i in top_k_indices:
            matches.append(self.docs[i])
        return matches

Let's break it down:

  1. We start by creating an empty list called matches to store our matches.
  2. We encode the user query with the same embedder model used to ingest the documents. This ensures the query vector lives in the same space as our document vectors.
  3. We call the cosine_similarity function (which we will define next) to find the most similar documents.
  4. We use the returned indices to fetch the actual documents from self.docs.
  5. Finally, we return the list of matching documents. (A short search example follows this list.)
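
Continuing the hypothetical example from the ingestion sketch above, a search would look like this:

# Find the two documents most similar to a made-up query
matches = pixie.similarity_search("How do vector stores compare documents?", top_k=2)
for doc in matches:
    print(doc)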

Implementing cosine similarity

import numpy as np


def cosine_similarity(store_embeddings, query_embedding, top_k):
    dot_product = np.dot(store_embeddings, query_embedding)
    magnitude_a = np.linalg.norm(store_embeddings, axis=1)
    magnitude_b = np.linalg.norm(query_embedding)

    similarity = dot_product / (magnitude_a * magnitude_b)

    sim = np.argsort(similarity)
    top_k_indices = sim[::-1][:top_k]

    return top_k_indices

This function does several important things:

  1. It calculates the cosine similarity using the formula: cos(θ) = (A · B) / (||A|| * ||B||)
  2. First, we calculate the dot product between the query embedding and all document embeddings in the store.
  3. Then, we compute the magnitudes (Euclidean norms) of all the vectors.
  4. Lastly, we sort the resulting similarities and return the indices of the top-k most similar documents. We use cosine similarity because it measures the angle between vectors while ignoring their magnitudes, so it can find semantically similar documents regardless of their length.
  5. There are other similarity metrics that you can explore (see the sketch after this list), such as:
    1. Euclidean distance
    2. Dot product similarity

You can read more about cosine similarity here.
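
Here is a small sketch, using made-up toy vectors, that compares cosine similarity with the two alternatives mentioned above:

import numpy as np

docs = np.array([[1.0, 0.0], [0.6, 0.8]])  # two toy "document" vectors
query = np.array([0.8, 0.6])               # a toy "query" vector

# Cosine similarity: angle only, magnitude ignored (higher = more similar)
cosine = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# Euclidean distance: straight-line distance (lower = more similar)
euclidean = np.linalg.norm(docs - query, axis=1)

# Dot product: depends on both angle and magnitude (higher = more similar)
dot = docs @ query

print(cosine, euclidean, dot)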

Piecing everything together

Now that we have built all the pieces, let's understand how they work together:

  1. When we create a Pixie instance, we provide it with an embedding model.
  2. When we ingest documents, we create vector embeddings for each document and store them in self.store.
  3. For a similarity search:
    1. We create an embedding for the query.
    2. We calculate cosine similarity between the query embeddings and all document embeddings.
    3. We return the most similar documents. All the magic happens inside the cosine similarity calculation. By comparing the angle between vectors rather than their magnitude, we can find semantically similar documents even if they use different words or phrasing.

Seeing it in action

Now let's implement a simple RAG system using our Pixie vector store. We'll ingest a story about a "space battle & alien invasion" and then ask questions about it to see how the system generates an answer.

import os
import sys
import warnings

warnings.filterwarnings("ignore")

import ollama
import numpy as np
from sentence_transformers import SentenceTransformer

current_dir = os.path.dirname(os.path.abspath(__file__))
root_dir = os.path.abspath(os.path.join(current_dir, ".."))
sys.path.append(root_dir)

from pixie import Pixie


# creating an instance of a pre-trained embedder model
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# creating an instance of Pixie vector store
pixie = Pixie(embedder)


# generate an answer using llama3 and context docs
def generate_answer(prompt):
    response = ollama.chat(
        model="llama3",
        options={"temperature": 0.7},
        messages=[
            {
                "role": "user",
                "content": prompt,
            },
        ],
    )
    return response["message"]["content"]


with open("example/spacebattle.txt") as f:
    content = f.read()
    # ingesting the data into vector store
    ingested = pixie.from_docs(docs=content.split("\n\n"))
    print(ingested)

# system prompt
PROMPT = """
    User has asked you following question and you need to answer it based on the below provided context. 
If you don't find any answer in the given context then just say 'I don't have answer for that'. 
In the final answer, do not add "according to the context or as per the context". 
You can be creative while using the context to generate the final answer. DO NOT just share the context as it is.

    CONTEXT: {0}
    QUESTION: {1}

    ANSWER HERE:
"""

while True:
    query = input("\nAsk anything: ")
    if len(query) == 0:
        print("Ask a question to continue...")
        continue

    if query == "/bye":
        quit()

    # search similar matches for query in the embedding store
    similarities = pixie.similarity_search(query, top_k=5)
    print(f"query: {query}, top {len(similarities)} matched results:\n")

    print("-" * 5, "Matched Documents Start", "-" * 5)
    for match in similarities:
        print(f"{match}\n")
    print("-" * 5, "Matched Documents End", "-" * 5)

    context = ",".join(similarities)
    answer = generate_answer(prompt=PROMPT.format(context, query))
    print("\n\nQuestion: {0}\nAnswer: {1}".format(query, answer))

    continue

Here is the output:

Ingested 8 documents

Ask anything: What was the invasion about?
query: What was the invasion about?, top 5 matched results:

----- Matched Documents Start -----
Epilogue: A New Dawn
Years passed, and the alliance between humans and Zorani flourished. Together, they rebuilt what had been lost, creating a new era of exploration and cooperation. The memory of the Krell invasion served as a stark reminder of the dangers that lurked in the cosmos, but also of the strength that came from unity. Admiral Selene Cortez retired, her name etched in the annals of history. Her legacy lived on in the new generation of leaders who continued to protect and explore the stars. And so, under the twin banners of Earth and Zorani, the galaxy knew peace—a fragile peace, hard-won and deeply cherished.

Chapter 3: The Invasion
Kael's warning proved true. The Krell arrived in a wave of bio-mechanical ships, each one bristling with organic weaponry and shields that regenerated like living tissue. Their tactics were brutal and efficient. The Titan Fleet, caught off guard, scrambled to mount a defense. Admiral Cortez's voice echoed through the corridors of the Prometheus. "All hands to battle stations! Prepare to engage!" The first clash was catastrophic. The Krell ships, with their organic hulls and adaptive technology, sliced through human defenses like a knife through butter. The outer rim colonies fell one by one, each defeat sending a shockwave of despair through the fleet. Onboard the Prometheus, Kael offered to assist, sharing Zorani technology and knowledge. Reluctantly, Cortez agreed, integrating Kael’s insights into their strategy. New energy weapons were developed, capable of piercing Krell defenses, and adaptive shields were installed to withstand their relentless attacks.

Chapter 5: The Final Battle
Victory on Helios IV was a much-needed morale boost, but the war was far from over. The Krell regrouped, launching a counter-offensive aimed directly at Earth. Every available ship was called back to defend humanity’s homeworld. As the Krell armada approached, Earth’s skies filled with the largest fleet ever assembled. The Prometheus led the charge, flanked by newly built warships and the remaining Zorani vessels that had joined the fight. "This is it," Cortez addressed her crew. "The fate of our species depends on this battle. We hold the line here, or we perish." The space above Earth turned into a maelstrom of fire and metal. Ships collided, energy beams sliced through the void, and explosions lit up the darkness. The Krell, relentless and numerous, seemed unbeatable. In the midst of chaos, Kael revealed a hidden aspect of Zorani technology—a weapon capable of creating a singularity, a black hole that could consume the Krell fleet. It was a desperate measure, one that could destroy both fleets. Admiral Cortez faced an impossible choice. To use the weapon would mean sacrificing the Titan Fleet and potentially Earth itself. But to do nothing would mean certain destruction at the hands of the Krell. "Activate the weapon," she ordered, her voice heavy with resolve. The Prometheus moved into position, its hull battered and scorched. As the singularity weapon charged, the Krell ships converged, sensing the threat. In a blinding burst of light, the weapon fired, tearing the fabric of space and creating a black hole that began to devour everything in its path.

Chapter 1: The Warning
It began with a whisper—a distant signal intercepted by the outermost listening posts of the Titan Fleet. The signal was alien, unlike anything the human race had ever encountered. For centuries, humanity had expanded its reach into the cosmos, colonizing distant planets and establishing trade routes across the galaxy. The Titan Fleet, the pride of Earth's military might, stood as the guardian of these far-flung colonies.Admiral Selene Cortez, a seasoned commander with a reputation for her sharp tactical mind, was the first to analyze the signal. As she sat in her command center aboard the flagship Prometheus, the eerie transmission played on a loop. It was a distress call, but its origin was unknown. The message, when decoded, revealed coordinates on the edge of the Andromeda Sector. "Set a course," Cortez ordered. The fleet moved with precision, a testament to years of training and discipline.

Chapter 4: Turning the Tide
The next battle, over the resource-rich planet of Helios IV, was a turning point. Utilizing the new technology, the Titan Fleet managed to hold their ground. The energy weapons seared through Krell ships, and the adaptive shields absorbed their retaliatory strikes. "Focus fire on the lead ship," Cortez commanded. "We break their formation, we break their spirit." The flagship of the Krell fleet, a massive dreadnought known as Voreth, was targeted. As the Prometheus and its escorts unleashed a barrage, the Krell ship's organic armor struggled to regenerate. In a final, desperate maneuver, Cortez ordered a concentrated strike on Voreth's core. With a blinding flash, the dreadnought exploded, sending a ripple of confusion through the Krell ranks. The humans pressed their advantage, driving the Krell back.
----- Matched Documents End -----


Question: What was the invasion about?
Answer: The Krell invasion was about the Krell arriving in bio-mechanical ships with organic weaponry and shields that regenerated like living tissue, seeking to conquer and destroy humanity.


We have successfully built a tiny in-memory vector store from scratch using Python and NumPy. While it is very basic, it demonstrates the core concepts such as vector storage and similarity search. Production-grade vector stores are far more optimized and feature-rich.

Github repo: Pixie

Happy coding, and may your vectors always point in the right direction!

Source: dev.to