
Counting the number of Tokens sent to a LLM in Go (part 1)

Patricia Arquette
Published: 2025-01-02 14:18:39


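Introduction

A few weeks ago, I was discussing with the CFO of a business partner company how to implement watsonx.ai capabilities in their own solution. While we were talking about costs, I mentioned the word "token" and suddenly there was panic.

Once I had explained what tokens are, the questions came: "How do we count the tokens we send and receive? How much will it cost us?"

The answer was quite easy. We went to the watsonx.ai studio prompt lab, tried a few simple prompts back and forth, and saw the number of tokens there. I also showed the person some very nice websites where a simple input is enough to find out how many tokens we send to a LLM.

Later I said to myself, why don't I build my own token counter application (and my intention was to write it in Go, as it had been a long time since I last used Golang!). Well, it turned out to be a bit more complicated than that.

First attempt: using a regular expression

My first idea was to use a regular expression, which would give me more or less acceptable results.

I set up the following Go app.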

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "regexp"
    "strings"

    "github.com/sqweek/dialog"
)

// countTokens approximates the number of tokens in a text based on whitespace and punctuation.
func countTokens(text string) int {
    // A simple regex to split text into words and punctuation
    tokenizer := regexp.MustCompile(`\w+|[^\w\s]`)
    tokens := tokenizer.FindAllString(text, -1)
    return len(tokens)
}

func main() {

    // Open a file dialog box and let the user select a text file
    filePath, err := dialog.File().Filter("Text Files", "txt").Load()
    if err != nil {
        if err.Error() == "Cancelled" {
            fmt.Println("File selection was cancelled.")
            return
        }
        log.Fatalf("Error selecting file: %v", err)
    }

    // Output the selected file name
    fmt.Printf("Selected file: %s\n", filePath)

    // Specify the file to read
    //filePath := "input.txt"

    // Open the file
    file, err := os.Open(filePath)
    if err != nil {
        fmt.Printf("Error opening file: %v\n", err)
        return
    }
    defer file.Close()

    // Read the file line by line
    var content strings.Builder
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        content.WriteString(scanner.Text())
        content.WriteString("\n")
    }

    if err := scanner.Err(); err != nil {
        fmt.Printf("Error reading file: %v\n", err)
        return
    }

    // Get the text content
    text := content.String()

    // Count the tokens
    tokenCount := countTokens(text)

    // Output the result
    fmt.Printf("The file contains approximately %d tokens.\n", tokenCount)
}

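
As you can see, I am a fan of GUIs and dialog boxes, so I implemented a dialog box to select the input text file.

And here is the text file (some random text I found).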

The popularity of the Rust language continues to explode; yet, many critical codebases remain authored in C, and cannot be realistically rewritten by hand. Automatically translating C to Rust is thus an appealing course of action. Several works have gone down this path, handling an ever-increasing subset of C through a variety of Rust features, such as unsafe. While the prospect of automation is appealing, producing code that relies on unsafe negates the memory safety guarantees offered by Rust, and therefore the main advantages of porting existing codebases to memory-safe languages.
We instead explore a different path, and explore what it would take to translate C to safe Rust; that is, to produce code that is trivially memory safe, because it abides by Rust's type system without caveats. Our work sports several original contributions: a type-directed translation from (a subset of) C to safe Rust; a novel static analysis based on "split trees" that allows expressing C's pointer arithmetic using Rust's slices and splitting operations; an analysis that infers exactly which borrows need to be mutable; and a compilation strategy for C's struct types that is compatible with Rust's distinction between non-owned and owned allocations.
We apply our methodology to existing formally verified C codebases: the HACL* cryptographic library, and binary parsers and serializers from EverParse, and show that the subset of C we support is sufficient to translate both applications to safe Rust. Our evaluation shows that for the few places that do violate Rust's aliasing discipline, automated, surgical rewrites suffice; and that the few strategic copies we insert have a negligible performance impact. Of particular note, the application of our approach to HACL* results in a 80,000 line verified cryptographic library, written in pure Rust, that implements all modern algorithms - the first of its kind.

After running the code, I get the following output:

The file contains approximately 359 tokens.

It looks fine, but... against which model? There are also different ways to write the regex, so this count does not really mean anything.

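Second attempt: running against a specific model

I figured out that unless we use the specific "tokenizer" for a given LLM, the former method is not accurate. So I started looking at how to obtain accurate results for a model such as gpt 3.5, which has been on the market for a while now. After some research on the net, here is the app I came up with.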

package main

import (
 "bufio"
 "bytes"
 "fmt"
 "log"
 "os"
 "os/exec"

 "github.com/sqweek/dialog"
)

func main() {


 // Open a file dialog box and let the user select a text file
 filePath, err := dialog.File().Filter("Text Files", "txt").Load()
 if err != nil {
  if err.Error() == "Cancelled" {
   fmt.Println("File selection was cancelled.")
   return
  }
  log.Fatalf("Error selecting file: %v", err)
 }

 // Output the selected file name
 fmt.Printf("Selected file: %s\n", filePath)

 // Open the file
 file, err := os.Open(filePath)
 if err != nil {
  fmt.Printf("Error opening file: %v\n", err)
  return
 }
 defer file.Close()

 // Read the file content
 var content bytes.Buffer
 scanner := bufio.NewScanner(file)
 for scanner.Scan() {
  content.WriteString(scanner.Text())
  content.WriteString("\n")
 }

 if err := scanner.Err(); err != nil {
  fmt.Printf("Error reading file: %v\n", err)
  return
 }

 // Specify the model
 model := "gpt-3.5-turbo"

 // Execute the Python script
 cmd := exec.Command("python3", "tokenizer.py", model)
 cmd.Stdin = bytes.NewReader(content.Bytes())
 output, err := cmd.Output()
 if err != nil {
  fmt.Printf("Error running tokenizer script: %v\n", err)
  return
 }

 // Print the token count
 fmt.Printf("Token count: %s", output)
}
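
As the code above shows, the Go program calls a Python script that I found on a Microsoft site; it uses the "tiktoken" library to determine the number of tokens for GPT models. The model name is also hard-coded in the Go program.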

import sys
from tiktoken import encoding_for_model

def count_tokens(model, text):
    enc = encoding_for_model(model)
    tokens = enc.encode(text)
    return len(tokens)

if __name__ == "__main__":
    # Read the model name from the first argument and the text from stdin
    model = sys.argv[1]  # E.g., "gpt-3.5-turbo"
    text = sys.stdin.read()
    print(count_tokens(model, text))

This works fine. For the same text given earlier, I now get a count of 366 tokens, which matches every website I found when I set the model to GPT 3.5.

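What I would like to write is code that is fully in "Golang"... and I would like to be able to run it for all the models (or almost all) that I can find on Huggingface (such as ibm-granite/granite-3.1-8b-instruct).

This will be part 2 of this article (WIP).

So far I am exploring the following (great) GitHub repositories:

  • Tokenizer: https://github.com/sugarme/tokenizer
  • tokenizers: https://github.com/daulet/tokenizers
  • And last but not least -> go-huggingface: https://github.com/gomlx/go-huggingface?tab=readme-ov-file

Conclusion

Thanks for reading; comments are welcome.

And until the second app is out, stay tuned...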


Source: dev.to