Counting the number of tokens sent to an LLM in Go (part 1)

Introduction

A few weeks ago, I was having a discussion with a CFO from a business partner company about implementing watsonx.ai capabilities in their own solution. While we were talking about costs I mentioned the word "token", and all of a sudden there was panic.

After I explained what tokens are, the question came up: "How do I count the tokens we send and receive? How much does it cost us?"

The answer was quite easy. We went to the watsonx.ai studio prompt lab, went back and forth with a few simple prompts, and saw the token counts right there. I also showed the person some very nice websites where you can paste in a simple input and find out how many tokens it sends to an LLM.
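
To make the cost half of that question concrete, here is a minimal sketch of the arithmetic, assuming a provider that bills per 1,000 tokens with separate rates for input and output. The function name `estimateCost` and every price in it are hypothetical placeholders, not actual watsonx.ai pricing.

```go
package main

import "fmt"

// estimateCost returns the cost of one request from its token counts and
// per-1,000-token prices. The billing model is an assumption for illustration.
func estimateCost(inputTokens, outputTokens int, inputPricePer1K, outputPricePer1K float64) float64 {
    inputCost := float64(inputTokens) / 1000 * inputPricePer1K
    outputCost := float64(outputTokens) / 1000 * outputPricePer1K
    return inputCost + outputCost
}

func main() {
    // Made-up numbers: 2,000 tokens sent, 500 received,
    // at invented rates of 0.50 and 1.50 per 1,000 tokens.
    cost := estimateCost(2000, 500, 0.50, 1.50)
    fmt.Printf("Estimated cost: %.4f\n", cost)
}
```

The point is only that the cost scales linearly with the token counts, which is why getting those counts right matters.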

Later I said to myself: why not write my own token counter application? (My intention was to write it in Go, since it had been a long time since I last used Golang!) Well, it turned out to be a bit more complicated than that.

First attempt: using Regex

My first thought was that a Regex would get me more or less acceptable results.

I set up the following Go app.

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "regexp"
    "strings"

    "github.com/sqweek/dialog"
)

// countTokens approximates the number of tokens in a text based on whitespace and punctuation.
func countTokens(text string) int {
    // A simple regex to split text into words and punctuation
    tokenizer := regexp.MustCompile(`\w+|[^\w\s]`)
    tokens := tokenizer.FindAllString(text, -1)
    return len(tokens)
}

func main() {

    // Open a file dialog box and let the user select a text file
    filePath, err := dialog.File().Filter("Text Files", "txt").Load()
    if err != nil {
        if err.Error() == "Cancelled" {
            fmt.Println("File selection was cancelled.")
            return
        }
        log.Fatalf("Error selecting file: %v", err)
    }

    // Output the selected file name
    fmt.Printf("Selected file: %s\n", filePath)

    // Specify the file to read
    //filePath := "input.txt"

    // Open the file
    file, err := os.Open(filePath)
    if err != nil {
        fmt.Printf("Error opening file: %v\n", err)
        return
    }
    defer file.Close()

    // Read the file line by line
    var content strings.Builder
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        content.WriteString(scanner.Text())
        content.WriteString("\n")
    }

    if err := scanner.Err(); err != nil {
        fmt.Printf("Error reading file: %v\n", err)
        return
    }

    // Get the text content
    text := content.String()

    // Count the tokens
    tokenCount := countTokens(text)

    // Output the result
    fmt.Printf("The file contains approximately %d tokens.\n", tokenCount)
}

You will have guessed that I'm a fan of GUIs and dialog boxes, so I implemented a dialog box to select the input text file.

And here is the text file (some random text I found).

The popularity of the Rust language continues to explode; yet, many critical codebases remain authored in C, and cannot be realistically rewritten by hand. Automatically translating C to Rust is thus an appealing course of action. Several works have gone down this path, handling an ever-increasing subset of C through a variety of Rust features, such as unsafe. While the prospect of automation is appealing, producing code that relies on unsafe negates the memory safety guarantees offered by Rust, and therefore the main advantages of porting existing codebases to memory-safe languages.
We instead explore a different path, and explore what it would take to translate C to safe Rust; that is, to produce code that is trivially memory safe, because it abides by Rust's type system without caveats. Our work sports several original contributions: a type-directed translation from (a subset of) C to safe Rust; a novel static analysis based on "split trees" that allows expressing C's pointer arithmetic using Rust's slices and splitting operations; an analysis that infers exactly which borrows need to be mutable; and a compilation strategy for C's struct types that is compatible with Rust's distinction between non-owned and owned allocations.
We apply our methodology to existing formally verified C codebases: the HACL* cryptographic library, and binary parsers and serializers from EverParse, and show that the subset of C we support is sufficient to translate both applications to safe Rust. Our evaluation shows that for the few places that do violate Rust's aliasing discipline, automated, surgical rewrites suffice; and that the few strategic copies we insert have a negligible performance impact. Of particular note, the application of our approach to HACL* results in a 80,000 line verified cryptographic library, written in pure Rust, that implements all modern algorithms - the first of its kind.

After running my code, I get the following output:

The file contains approximately 359 tokens.

Seems fine, but, well… okay, but… against which model? And there are also different ways to write the Regex, so this number doesn't really count for anything!
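
To see how much the choice of Regex moves the number, here is a small sketch that counts the same sentence with two patterns: the one used above, and plain whitespace splitting. The second pattern and the helper name `countMatches` are my own illustrative choices, not part of the original app.

```go
package main

import (
    "fmt"
    "regexp"
)

// Two equally plausible definitions of a "token" (both are illustrative).
var (
    wordsAndPunct = regexp.MustCompile(`\w+|[^\w\s]`) // the pattern used above
    whitespaceRun = regexp.MustCompile(`\S+`)         // plain whitespace splitting
)

// countMatches returns how many non-overlapping matches re finds in text.
func countMatches(re *regexp.Regexp, text string) int {
    return len(re.FindAllString(text, -1))
}

func main() {
    sample := "Rust's type system, without caveats."
    fmt.Println("words and punctuation:", countMatches(wordsAndPunct, sample))
    fmt.Println("whitespace only:", countMatches(whitespaceRun, sample))
}
```

For this sample the first pattern finds 9 matches and the second finds 5, so neither count can be trusted as "the" token count without knowing what the model itself does.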

Second attempt: running against a specific model

What I figured out is that unless we use the specific "tokenizer" for a given LLM, the former method is not accurate. So I started looking at how to get accurate results against a model such as gpt 3.5, which has been on the market for a while now. After some research on the net, this is the app I came up with.
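
Before the app itself, a quick aside on why a model-specific tokenizer gives different counts than a Regex: model tokenizers typically split text into sub-word pieces drawn from a fixed vocabulary, so one "word" can cost several tokens. The sketch below is a toy greedy longest-match splitter with a made-up five-entry vocabulary. It is not the algorithm of any particular model (tiktoken, for instance, uses byte pair encoding), and `splitGreedy` is a hypothetical name.

```go
package main

import "fmt"

// splitGreedy breaks an ASCII word into the longest pieces found in vocab,
// scanning left to right. A byte with no match becomes a single-byte piece.
// This is a toy stand-in, not the tokenizer of any real model.
func splitGreedy(word string, vocab map[string]bool) []string {
    var pieces []string
    for start := 0; start < len(word); {
        end := len(word)
        // Shrink the candidate until it is in the vocabulary or empty.
        for end > start && !vocab[word[start:end]] {
            end--
        }
        if end == start {
            end = start + 1 // unknown byte: emit it on its own
        }
        pieces = append(pieces, word[start:end])
        start = end
    }
    return pieces
}

func main() {
    // Hypothetical vocabulary; real ones hold tens of thousands of entries.
    vocab := map[string]bool{"token": true, "izer": true, "s": true, "un": true, "safe": true}
    for _, w := range []string{"tokenizers", "unsafe"} {
        fmt.Println(w, "->", splitGreedy(w, vocab))
    }
}
```

With this vocabulary, "tokenizers" becomes three pieces and "unsafe" becomes two, while the Regex from the first attempt would count each as one. A different vocabulary gives different counts, which is why the app that follows delegates to the model's real tokenizer.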

package main

import (
    "bufio"
    "bytes"
    "fmt"
    "log"
    "os"
    "os/exec"

    "github.com/sqweek/dialog"
)

func main() {

    // Open a file dialog box and let the user select a text file
    filePath, err := dialog.File().Filter("Text Files", "txt").Load()
    if err != nil {
        if err.Error() == "Cancelled" {
            fmt.Println("File selection was cancelled.")
            return
        }
        log.Fatalf("Error selecting file: %v", err)
    }

    // Output the selected file name
    fmt.Printf("Selected file: %s\n", filePath)

    // Open the file
    file, err := os.Open(filePath)
    if err != nil {
        fmt.Printf("Error opening file: %v\n", err)
        return
    }
    defer file.Close()

    // Read the file content
    var content bytes.Buffer
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        content.WriteString(scanner.Text())
        content.WriteString("\n")
    }

    if err := scanner.Err(); err != nil {
        fmt.Printf("Error reading file: %v\n", err)
        return
    }

    // Specify the model
    model := "gpt-3.5-turbo"

    // Execute the Python script, passing the model as an argument
    // and the file content on stdin
    cmd := exec.Command("python3", "tokenizer.py", model)
    cmd.Stdin = bytes.NewReader(content.Bytes())
    output, err := cmd.Output()
    if err != nil {
        fmt.Printf("Error running tokenizer script: %v\n", err)
        return
    }

    // Print the token count
    fmt.Printf("Token count: %s", output)
}

As the code above shows, it calls a Python script, which I found on a Microsoft site, that relies on the ready-made "tiktoken" library to determine the number of tokens for GPT models. Note that the model name is hard-coded.

import sys
from tiktoken import encoding_for_model

def count_tokens(model, text):
    enc = encoding_for_model(model)
    tokens = enc.encode(text)
    return len(tokens)

if __name__ == "__main__":
    # Read the model name from the command line and the text from stdin
    model = sys.argv[1]  # E.g., "gpt-3.5-turbo"
    text = sys.stdin.read()
    print(count_tokens(model, text))

This works fine. For the same text as before, I now get a count of 366 tokens, which matches every website I found once I set its model to GPT 3.5.
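
The Go program above prints the script's output as raw bytes. If the count is needed as a number, for example to feed a cost estimate, a small helper along these lines could convert it. This is a sketch assuming the script prints a single integer followed by a newline; `parseTokenCount` is a hypothetical name, not part of the original app.

```go
package main

import (
    "fmt"
    "strconv"
    "strings"
)

// parseTokenCount converts the script's stdout (e.g. "366\n") into an int.
func parseTokenCount(output []byte) (int, error) {
    count, err := strconv.Atoi(strings.TrimSpace(string(output)))
    if err != nil {
        return 0, fmt.Errorf("unexpected tokenizer output %q: %w", output, err)
    }
    return count, nil
}

func main() {
    // Stand-in for the bytes returned by cmd.Output() in the program above.
    count, err := parseTokenCount([]byte("366\n"))
    if err != nil {
        fmt.Println("Error:", err)
        return
    }
    fmt.Printf("Token count: %d\n", count)
}
```

Parsing also catches the case where the script prints something other than a number, which the plain `%s` print would pass through silently.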

The thing is, what I really want to write is code that is fully in "Golang"… and I want to be able to run it against all (or almost all) of the models I can find on Huggingface (such as ibm-granite/granite-3.1-8b-instruct).

That will be part 2 of this article (WIP).

So far I'm exploring the following (great) GitHub repos:

tokenizer: https://github.com/sugarme/tokenizer
tokenizers: https://github.com/daulet/tokenizers
And last but not least, go-huggingface: https://github.com/gomlx/go-huggingface?tab=readme-ov-file

Conclusion

Thanks for reading. I'm open to comments.

And until the 2nd app is out, stay tuned…