
How to Measure the Reliability of a Large Language Model's Response


The basic principle of Large Language Models (LLMs) is very simple: predict the next word (or token) in a sequence of words based on statistical patterns in their training data. However, this seemingly simple capability turns out to be remarkably powerful, enabling tasks such as text summarization, idea generation, brainstorming, code generation, information processing, and content creation. That said, LLMs have no memory, nor do they actually “understand” anything; they simply stick to their basic function of predicting the next word.

The process of next-word prediction is probabilistic: the LLM selects each word from a probability distribution. In doing so, it often generates false, fabricated, or inconsistent content in an attempt to produce a coherent response, filling gaps with plausible-looking but incorrect information. This phenomenon is called hallucination, an inevitable and well-known feature of LLMs that warrants validation and corroboration of their outputs.
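As a toy illustration of this sampling process (the candidate tokens and probabilities below are made up for demonstration and are not tied to any particular model):

import random

# Hypothetical next-token distribution for the prefix "The capital of France is"
candidates = ["Paris", "Lyon", "a", "the"]
probabilities = [0.90, 0.04, 0.03, 0.03]

# Sampling is non-deterministic: most draws return "Paris",
# but low-probability continuations can still be selected.
next_token = random.choices(candidates, weights=probabilities, k=1)[0]
print(next_token)

The sampling strategy and temperature control how often such low-probability continuations are chosen, which is one reason the same prompt can yield different answers.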

Retrieval-augmented generation (RAG) methods, which let an LLM work with external knowledge sources, do reduce hallucinations to some extent, but they cannot eliminate them completely. Although advanced RAG pipelines can provide in-text citations and URLs, verifying these references can be tedious and time-consuming. Therefore, we need an objective criterion for assessing the reliability or trustworthiness of an LLM’s response, whether it is generated from the model’s own knowledge or from an external knowledge base (RAG).

In this article, we will discuss how the output of an LLM can be assessed for trustworthiness using a trustworthy language model that assigns a score to the LLM’s output. We will first show how to use such a model to score an LLM’s answers and explain the trustworthiness assessment. Subsequently, we will build an example RAG pipeline with LlamaParse and LlamaIndex and assess its answers for trustworthiness.

The entire code of this article is available in a Jupyter notebook on GitHub.

Assigning a Trustworthiness Score to an LLM’s Answer

To demonstrate how we can assign a trustworthiness score to an LLM’s response, I will use Cleanlab’s Trustworthy Language Model (TLM). TLM uses a combination of uncertainty quantification and consistency analysis to compute trustworthiness scores and explanations for LLM responses.

Cleanlab offers a free trial API key, which can be obtained by creating an account on their website. We first need to install Cleanlab’s Python client:

pip install --upgrade cleanlab-studio

Cleanlab supports several proprietary models, such as ‘gpt-4o’, ‘gpt-4o-mini’, ‘o1-preview’, ‘claude-3-sonnet’, ‘claude-3.5-sonnet’, ‘claude-3.5-sonnet-v2’, and others. Here is how TLM assigns a trustworthiness score to gpt-4o’s answer. The trustworthiness score ranges from 0 to 1, where higher values indicate greater trustworthiness.

from cleanlab_studio import Studio
studio = Studio("<CLEANLAB_API_KEY>")  # Get your API key from above
tlm = studio.TLM(options={"log": ["explanation"], "model": "gpt-4o"}) # GPT, Claude, etc
#set the prompt
out = tlm.prompt("How many vowels are there in the word 'Abracadabra'.?")
#the TLM response contains the actual output 'response', trustworthiness score and explanation
print(f"Model's response = {out['response']}")
print(f"Trustworthiness score = {out['trustworthiness_score']}")
print(f"Explanation = {out['log']['explanation']}")

The above code tests gpt-4o’s response to the question “How many vowels are there in the word ‘Abracadabra’?”. The TLM’s output contains the model’s answer (response), the trustworthiness score, and an explanation. Here is the output of this code.

Model's response = The word "Abracadabra" contains 6 vowels. The vowels are: A, a, a, a, a, and a.
Trustworthiness score = 0.6842228802750124
Explanation = This response is untrustworthy due to a lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either):
5.

It can be seen how even the most advanced language models hallucinate on such simple tasks and produce wrong output. Here are the response and trustworthiness score for the same question from claude-3.5-sonnet-v2.

Model's response = Let me count the vowels in 'Abracadabra':
A-b-r-a-c-a-d-a-b-r-a

The vowels are: A, a, a, a, a

There are 5 vowels in the word 'Abracadabra'.
Trustworthiness score = 0.9378276048845285
Explanation = Did not find a reason to doubt trustworthiness.

claude-3.5-sonnet-v2 produces the correct output. Let’s compare the two models’ responses to another question.

from cleanlab_studio import Studio
from IPython.display import display, Markdown

# Initialize the Cleanlab Studio with API key
studio = Studio("<CLEANLAB_API_KEY>")  # Replace with your actual API key

# List of models to evaluate
models = ["gpt-4o", "claude-3.5-sonnet-v2"]

# Define the prompt
prompt_text = "Which one of 9.11 and 9.9 is bigger?"

# Loop through each model and evaluate
for model in models:
   tlm = studio.TLM(options={"log": ["explanation"], "model": model})
   out = tlm.prompt(prompt_text)
  
   md_content = f"""
## Model: {model}

**Response:** {out['response']}

**Trustworthiness Score:** {out['trustworthiness_score']}

**Explanation:** {out['log']['explanation']}

---
"""
   display(Markdown(md_content))

Here is the response of the two models:

[Screenshot: responses and trustworthiness scores of gpt-4o and claude-3.5-sonnet-v2 for this prompt]

We can also generate a trustworthiness score for open-source LLMs. Let’s check the recent, much-hyped open-source LLM DeepSeek-R1. I will use DeepSeek-R1-Distill-Llama-70B, which is based on Meta’s Llama-3.3-70B-Instruct model and distilled from DeepSeek’s larger 671-billion-parameter Mixture of Experts (MoE) model. Knowledge distillation is a machine learning technique that aims to transfer the learnings of a large pre-trained model, the “teacher model,” to a smaller “student model.”

import os
import streamlit as st
from langchain_groq.chat_models import ChatGroq
from cleanlab_studio import Studio
from IPython.display import display, Markdown

os.environ["GROQ_API_KEY"] = st.secrets["GROQ_API_KEY"]
# Initialize the DeepSeek-R1 distilled Llama model served by Groq
groq_llm = ChatGroq(model="deepseek-r1-distill-llama-70b", temperature=0.5)
prompt = "Which one of 9.11 and 9.9 is bigger?"
# Get the response from the model
response = groq_llm.invoke(prompt)
# Initialize Cleanlab's studio
studio = Studio("<CLEANLAB_API_KEY>")  # Replace with your actual API key
cleanlab_tlm = studio.TLM(options={"log": ["explanation"]})  # for explanations
# Get the trustworthiness score and explanation for the Groq model's response
output = cleanlab_tlm.get_trustworthiness_score(prompt, response=response.content.strip())
md_content = f"""
## Model: deepseek-r1-distill-llama-70b
**Response:** {response.content.strip()}
**Trustworthiness Score:** {output['trustworthiness_score']}
**Explanation:** {output['log']['explanation']}
---
"""
display(Markdown(md_content))

Here is the output of the deepseek-r1-distill-llama-70b model.

[Screenshot: response and trustworthiness score of deepseek-r1-distill-llama-70b]

Developing a Trustworthy RAG

We will now develop a RAG pipeline to demonstrate how we can measure the trustworthiness of an LLM’s response in RAG. This pipeline will be built by scraping data from given links, parsing it into markdown format, and creating a vector store.

The following libraries need to be installed for the next code.

pip install llama-parse llama-index-core llama-index-embeddings-huggingface llama-index-llms-cleanlab requests beautifulsoup4 pdfkit nest-asyncio

To render HTML into PDF format, we also need to install the wkhtmltopdf command-line tool from its website.

The following libraries will be imported:

from llama_parse import LlamaParse
from llama_index.core import VectorStoreIndex
import requests
from bs4 import BeautifulSoup
import pdfkit
from llama_index.readers.docling import DoclingReader
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.cleanlab import CleanlabTLM
from typing import Dict, List, ClassVar
from llama_index.core.instrumentation.events import BaseEvent
from llama_index.core.instrumentation.event_handlers import BaseEventHandler
from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.events.llm import LLMCompletionEndEvent
import nest_asyncio
import os

The next steps involve scraping data from the given URLs using Python’s BeautifulSoup library, saving the scraped data to PDF file(s) using pdfkit, and parsing the PDF(s) into markdown files using LlamaParse, a GenAI-native document parsing platform built with and for LLM use cases.

We will first configure the LLM to be used by CleanlabTLM and the embedding model (the Hugging Face embedding model BAAI/bge-small-en-v1.5) that will be used to compute the embeddings of the scraped data for the vector store.

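The code below is a minimal sketch of this configuration step; it assumes CleanlabTLM is initialized with the same API key placeholder and explanation logging used earlier, and you may adjust the options to your needs.

# Configure Cleanlab's TLM as the LlamaIndex LLM and BAAI/bge-small-en-v1.5 as the embedding model
options = {"model": "gpt-4o", "log": ["explanation"]}
llm = CleanlabTLM(api_key="<CLEANLAB_API_KEY>", options=options)
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Register both as global defaults for LlamaIndex
Settings.llm = llm
Settings.embed_model = embed_model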

We will now define a custom event handler, GetTrustworthinessScore, derived from the base event handler class. This handler is triggered at the end of an LLM completion and extracts the trustworthiness score from the response metadata. A helper function, display_response, displays the LLM’s response along with its trustworthiness score.

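Here is a sketch of the handler and helper using LlamaIndex’s instrumentation API; it assumes that CleanlabTLM surfaces the score under the key "trustworthiness_score" in the completion response’s additional_kwargs.

# Event handler that captures the trustworthiness score of the latest LLM completion
class GetTrustworthinessScore(BaseEventHandler):
    events: ClassVar[List[BaseEvent]] = []
    trustworthiness_score: float = 0.0

    @classmethod
    def class_name(cls) -> str:
        return "GetTrustworthinessScore"

    def handle(self, event: BaseEvent, **kwargs) -> None:
        if isinstance(event, LLMCompletionEndEvent):
            # Assumption: CleanlabTLM reports the score in the response's additional_kwargs
            self.trustworthiness_score = event.response.additional_kwargs.get(
                "trustworthiness_score", 0.0
            )
            self.events.append(event)


# Attach the handler to LlamaIndex's root dispatcher
root_dispatcher = get_dispatcher()
event_handler = GetTrustworthinessScore()
root_dispatcher.add_event_handler(event_handler)


def display_response(response):
    """Show a query engine response together with its trustworthiness score."""
    print(f"Response: {response.response}")
    print(f"Trustworthiness score: {round(event_handler.trustworthiness_score, 2)}")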

We will now generate PDFs by scraping data from the given URLs. For demonstration, we will scrape data only from this Wikipedia article about large language models (Creative Commons Attribution-ShareAlike 4.0 License).

Note: Readers are advised to always double-check the status of the content/data they are about to scrape and ensure they are allowed to do so.

The following piece of code scrapes data from the given URLs by making HTTP requests and using the BeautifulSoup Python library to parse the HTML content. The HTML content is cleaned by converting protocol-relative URLs to absolute ones. Subsequently, the scraped content is converted into PDF file(s) using pdfkit.

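The following sketch illustrates this step under stated assumptions: a single Wikipedia URL and hypothetical output file names such as document_0.pdf.

# URLs to scrape; here only the Wikipedia article on large language models
urls = ["https://en.wikipedia.org/wiki/Large_language_model"]
pdf_paths = []

for i, url in enumerate(urls):
    # Fetch the page and parse its HTML
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    # Convert protocol-relative URLs (//host/path) to absolute https:// URLs
    for tag in soup.find_all(["img", "link", "script", "a"]):
        for attr in ("src", "href"):
            value = tag.get(attr)
            if value and value.startswith("//"):
                tag[attr] = "https:" + value

    # Render the cleaned HTML to a PDF file (requires the wkhtmltopdf binary)
    pdf_path = f"document_{i}.pdf"
    pdfkit.from_string(str(soup), pdf_path)
    pdf_paths.append(pdf_path)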

After generating PDF(s) from the scraped data, we parse them using LlamaParse. We set the parsing instructions to extract the content in markdown format and parse the document(s) page-wise, keeping the document name and page number. These extracted entities (pages) are referred to as nodes. The parser iterates over the extracted nodes and updates each node’s metadata by appending a citation header, which facilitates later referencing.

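Here is a sketch of the parsing step; the parsing instruction wording, the LlamaCloud API key placeholder, and the citation-header format are assumptions.

# Parse each PDF into markdown, one document (node) per page
parser = LlamaParse(
    api_key="<LLAMACLOUD_API_KEY>",
    result_type="markdown",
    parsing_instruction="Extract the content of each page in clean markdown.",
    split_by_page=True,
)

documents = []
for pdf_path in pdf_paths:
    pages = parser.load_data(pdf_path)
    for page_number, page in enumerate(pages, start=1):
        # Append a citation header and metadata to each node for later referencing
        page.metadata.update({"file_name": pdf_path, "page_number": page_number})
        page.text = f"Source: {pdf_path}, page {page_number}\n\n" + page.text
        documents.append(page)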

We now create a vector store and a query engine. We define a custom prompt template to guide the LLM’s behavior in answering questions. Finally, we create a query engine on the created index to answer queries. For each query, we retrieve the top 3 nodes from the vector store based on their semantic similarity with the query. The LLM uses these retrieved nodes to generate the final answer.

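A sketch of this step follows; the wording of the prompt template is an assumption.

from llama_index.core import PromptTemplate

# Build the vector store index from the parsed documents
index = VectorStoreIndex.from_documents(documents)

# Custom prompt template guiding the LLM to answer strictly from the retrieved context
qa_template = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Using only the context above, answer the query. "
    "If the context does not contain the answer, say that you do not know.\n"
    "Query: {query_str}\n"
    "Answer: "
)

# Query engine that retrieves the top 3 most similar nodes for each query
query_engine = index.as_query_engine(
    similarity_top_k=3,
    text_qa_template=qa_template,
)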

Now let’s test the RAG with some queries and look at their corresponding trustworthiness scores.

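A sketch of a test query; the question itself is a hypothetical example about the scraped article.

# Ask a question grounded in the scraped Wikipedia article and display the score
query = "What does the context window of a large language model refer to?"  # hypothetical query
response = query_engine.query(query)
display_response(response)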
[Screenshots: the RAG’s responses to the test queries along with their trustworthiness scores]

Assigning a trustworthiness score to an LLM’s response, whether generated through direct inference or RAG, helps define the reliability of the AI’s output and prioritize human verification where needed. This is particularly important for critical domains where a wrong or unreliable response could have severe consequences.

That’s all folks! If you like the article, please follow me on Medium and LinkedIn.

