The basic principle of large language models (LLMs) is very simple: predict the next word (or token) in a sequence of words based on the statistical patterns in their training data. However, this seemingly simple capability turns out to be remarkably sophisticated when it can perform a number of amazing tasks such as text summarization, idea generation, brainstorming, code generation, information processing, and content creation. That said, LLMs have no memory, nor do they actually "understand" anything beyond sticking to their basic function: predicting the next word.
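To make this concrete, here is a toy sketch of next-token selection (the vocabulary and logits below are made up, not from a real model): the model scores every candidate token, converts the scores into a probability distribution with softmax, and samples the next token from it.

import numpy as np

# Toy vocabulary and hypothetical model scores (logits) for the next token
vocab = ["Paris", "London", "banana", "the"]
logits = np.array([4.0, 2.5, 0.1, 1.0])

# Softmax turns scores into a probability distribution
probs = np.exp(logits) / np.exp(logits).sum()

# The next token is sampled from that distribution, not chosen deterministically
next_token = np.random.choice(vocab, p=probs)

print(dict(zip(vocab, probs.round(3))), "->", next_token)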
The process of next-word prediction is probabilistic. The LLM has to select each word from a probability distribution. In the process, it often generates false, fabricated, or inconsistent content in an attempt to produce coherent responses and fill gaps with plausible-looking but incorrect information. This phenomenon is called hallucination; it is an inevitable, well-known feature of LLMs that necessitates verification and corroboration of their outputs.

Retrieval-augmented generation (RAG) methods, which make an LLM work with external knowledge sources, minimize hallucinations to some extent, but they cannot completely eliminate them. Although advanced RAGs can provide in-text citations and URLs, verifying these references can be hectic and time-consuming. We therefore need an objective criterion for evaluating the reliability or trustworthiness of an LLM's response, whether it is generated from the model's own knowledge or from an external knowledge base (RAG).

In this article, we will discuss how the output of an LLM can be assessed for trustworthiness by a trustworthy language model that assigns a score to the LLM's output. We will first discuss how we can use a trustworthy language model to assign scores to an LLM's answers and explain trustworthiness. Subsequently, we will develop an example RAG with LlamaParse and LlamaIndex to assess the RAG's answers for trustworthiness.
The entire code of this article is available in a Jupyter Notebook on GitHub.
Assigning a Trustworthiness Score to an LLM's Answer
To demonstrate how we can assign a trustworthiness score to an LLM's response, I will use Cleanlab's Trustworthy Language Model (TLM). TLM uses a combination of uncertainty quantification and consistency analysis to compute a trustworthiness score and an explanation for an LLM's response. We first install the Cleanlab Studio client:
pip install --upgrade cleanlab-studio
Cleanlab supports several proprietary models, such as 'gpt-4o', 'gpt-4o-mini', 'o1-preview', 'claude-3-sonnet', 'claude-3.5-sonnet', 'claude-3.5-sonnet-v2', and others. Here is how TLM assigns a trustworthiness score to gpt-4o's answer. The trustworthiness score ranges from 0 to 1, where higher values indicate greater trustworthiness.
from cleanlab_studio import Studio

studio = Studio("<CLEANLAB_API_KEY>")  # Get your API key from above
tlm = studio.TLM(options={"log": ["explanation"], "model": "gpt-4o"})  # GPT, Claude, etc.

# Set the prompt
out = tlm.prompt("How many vowels are there in the word 'Abracadabra'.?")

# The TLM response contains the actual output 'response', trustworthiness score and explanation
print(f"Model's response = {out['response']}")
print(f"Trustworthiness score = {out['trustworthiness_score']}")
print(f"Explanation = {out['log']['explanation']}")
Model's response = The word "Abracadabra" contains 6 vowels. The vowels are: A, a, a, a, a, and a.
Trustworthiness score = 0.6842228802750124
Explanation = This response is untrustworthy due to a lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either): 5.
Note that gpt-4o miscounts the vowels (the correct answer is 5), and TLM accordingly flags the answer with a relatively low trustworthiness score. claude-3.5-sonnet-v2, by contrast, produces the correct output:

Model's response = Let me count the vowels in 'Abracadabra':
A-b-r-a-c-a-d-a-b-r-a
The vowels are: A, a, a, a, a
There are 5 vowels in the word 'Abracadabra'.
Trustworthiness score = 0.9378276048845285
Explanation = Did not find a reason to doubt trustworthiness.

Let's compare the two models' responses to another question: "Which one of 9.11 and 9.9 is bigger?"
The following code queries both models with the same prompt and displays each response with its trustworthiness score and explanation:
from cleanlab_studio import Studio
import markdown
from IPython.core.display import display, Markdown

# Initialize the Cleanlab Studio with API key
studio = Studio("<CLEANLAB_API_KEY>")  # Replace with your actual API key

# List of models to evaluate
models = ["gpt-4o", "claude-3.5-sonnet-v2"]

# Define the prompt
prompt_text = "Which one of 9.11 and 9.9 is bigger?"

# Loop through each model and evaluate
for model in models:
    tlm = studio.TLM(options={"log": ["explanation"], "model": model})
    out = tlm.prompt(prompt_text)
    md_content = f"""
## Model: {model}

**Response:** {out['response']}

**Trustworthiness Score:** {out['trustworthiness_score']}

**Explanation:** {out['log']['explanation']}

---
"""
    display(Markdown(md_content))
TLM can also compute a trustworthiness score for a response generated by another LLM. Here is how we score the output of the deepseek-r1-distill-llama-70b model, accessed via Groq:
import os
import streamlit as st
from cleanlab_studio import Studio
from langchain_groq.chat_models import ChatGroq
from IPython.core.display import display, Markdown

os.environ["GROQ_API_KEY"] = st.secrets["GROQ_API_KEY"]

# Initialize the deepseek-r1-distill-llama-70b model served by Groq
groq_llm = ChatGroq(model="deepseek-r1-distill-llama-70b", temperature=0.5)
prompt = "Which one of 9.11 and 9.9 is bigger?"

# Get the response from the model
response = groq_llm.invoke(prompt)

# Initialize Cleanlab's studio
studio = Studio("<CLEANLAB_API_KEY>")  # Replace with your actual API key
cleanlab_tlm = studio.TLM(options={"log": ["explanation"]})  # for explanations

# Get the output containing trustworthiness score and explanation
output = cleanlab_tlm.get_trustworthiness_score(prompt, response=response.content.strip())

md_content = f"""
## Model: deepseek-r1-distill-llama-70b

**Response:** {response.content.strip()}

**Trustworthiness Score:** {output['trustworthiness_score']}

**Explanation:** {output['log']['explanation']}

---
"""
display(Markdown(md_content))

Developing a Trustworthy RAG
We will now develop a RAG to demonstrate how to measure the trustworthiness of an LLM's responses within a RAG pipeline. The RAG will be developed by scraping data from the given links, parsing it into markdown format, and creating a vector store.
The following libraries need to be installed for the next code:

pip install llama-parse llama-index-core llama-index-embeddings-huggingface llama-index-llms-cleanlab requests beautifulsoup4 pdfkit nest-asyncio

The next steps will involve scraping data from the given URLs using Python's BeautifulSoup library, saving the scraped data in PDF file(s) using pdfkit, and parsing the PDF(s) into markdown files using LlamaParse, a GenAI-native document parsing platform built with LLMs and for LLM use cases.
We will first configure the LLM to be used by CleanlabTLM and the embedding model (the Hugging Face embedding model BAAI/bge-small-en-v1.5) that will compute the embeddings of the scraped data to create the vector store.
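The configuration could look like the following sketch. It assumes the llama-index-llms-cleanlab and llama-index-embeddings-huggingface integrations installed above, that CleanlabTLM accepts the same options dict as the Studio client, and that you substitute your own API key.

from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.cleanlab import CleanlabTLM

# Use TLM as the RAG's LLM; "log": ["explanation"] asks TLM to also return
# an explanation along with the trustworthiness score
options = {"model": "gpt-4o", "log": ["explanation"]}
Settings.llm = CleanlabTLM(api_key="<CLEANLAB_API_KEY>", options=options)

# Embedding model used to embed the scraped documents for the vector store
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")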
Note: Readers are advised to always check the status of the content/data they are about to scrape and make sure they are allowed to do so.
The following code scrapes data from the given URLs by making HTTP requests and parsing the HTML content with BeautifulSoup; the scraped content is then converted into PDF file(s) with pdfkit.
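Here is a sketch of that step, with placeholder URLs and output file names; note that pdfkit additionally requires the wkhtmltopdf system binary to be installed.

import html

import requests
from bs4 import BeautifulSoup
import pdfkit

urls = ["https://example.com/article"]  # placeholder: pages you are allowed to scrape

for i, url in enumerate(urls):
    # Fetch the page and parse its HTML
    page = requests.get(url, timeout=30)
    soup = BeautifulSoup(page.text, "html.parser")

    # Keep only the readable text, wrapped in minimal HTML for pdfkit
    text = soup.get_text(separator="\n", strip=True)
    pdfkit.from_string(
        f"<html><body><pre>{html.escape(text)}</pre></body></html>",
        f"document_{i}.pdf",
    )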
After generating the PDF(s) from the scraped data, we parse them into markdown files using LlamaParse.
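A sketch of the parsing step, assuming a LlamaCloud API key and the PDF file names generated above:

import nest_asyncio
nest_asyncio.apply()  # LlamaParse runs asynchronously; needed inside notebooks

from llama_parse import LlamaParse

parser = LlamaParse(
    api_key="<LLAMACLOUD_API_KEY>",
    result_type="markdown",  # parse the PDFs into markdown
)

# Placeholder file name(s) matching the PDFs generated in the previous step
documents = parser.load_data(["document_0.pdf"])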
Now we create a vector store and a query engine. We define a custom prompt template to guide the LLM's behavior in answering questions. Finally, we create a query engine with the created index to answer queries. For each query, we retrieve the top 3 nodes from the vector store based on their semantic similarity to the query. The LLM uses these retrieved nodes to generate the final answer.
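A sketch of this step, assuming the documents parsed above and the Settings configured earlier; the prompt template wording here is illustrative, not the article's exact template.

from llama_index.core import VectorStoreIndex, PromptTemplate

# Build the vector store index over the parsed markdown documents
index = VectorStoreIndex.from_documents(documents)

# Custom prompt template guiding how the LLM should answer
qa_template = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Answer the query using only the context above. "
    "If the context is insufficient, say so.\n"
    "Query: {query_str}\n"
    "Answer: "
)

# Query engine retrieving the top 3 most similar nodes for each query
query_engine = index.as_query_engine(
    similarity_top_k=3,
    text_qa_template=qa_template,
)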
Now, let's test the RAG with some queries and examine their corresponding trustworthiness scores.
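Since the exact queries depend on the scraped content, the ones below are placeholders. One simple way to attach a score to each answer, sketched here, is to re-score the (query, answer) pair with the Studio client shown earlier; note this scores the answer without the retrieved context, a simplification of what the CleanlabTLM-backed query engine computes internally.

from cleanlab_studio import Studio

studio = Studio("<CLEANLAB_API_KEY>")
cleanlab_tlm = studio.TLM(options={"log": ["explanation"]})

queries = [
    "Summarize the main points of the scraped article.",  # placeholder query
]

for query in queries:
    # query_engine comes from the previous step
    response = query_engine.query(query)
    answer = response.response

    # Score the (query, answer) pair with TLM
    score = cleanlab_tlm.get_trustworthiness_score(query, response=answer)

    print(f"Query: {query}")
    print(f"Answer: {answer}")
    print(f"Trustworthiness score = {score['trustworthiness_score']}\n")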
That's all folks! If you liked the article, please follow me and connect with me on LinkedIn.