Counting the number of Tokens sent to a LLM in Go (part 1)-Golang-php.cn

Home

Backend Development

Golang

Counting the number of Tokens sent to a LLM in Go (part 1)

Patricia Arquette

Jan 02, 2025 pm 02:18 PM

Counting the number of Tokens sent to a LLM in Go (part 1)

Introduction

A few weeks ago, I was having a discussion with a CFO from a business partner company, regarding the implementation of watsonx.ai capacities inside their own solution. During the discussion about the costs I pronounced the word “token” and all of a sudden there was panic ?

After explaining what tokens are, there came the question; “How do I count the tokens we send and receive? How much does it cost us?”

The answer was quite easy. We went to watsonx.ai studio prompt lab, went back and forth with some simple prompts and there we saw the number of tokens. I also showed the person some very nice websites where we can find how many tokens we send to a LLM by using simple inputs.

Later on I said to myself, why don’t I make my own token counter application (and my intention is write it in Go language as there’s a long time I didn’t use Golang!). Well I figured it’s a bit more complicated than that ?

First attempt — Using Regex

My first thought was using Regex, I could obtain more or less some acceptable results.

I set up the following Go app.

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
    "regexp"
    "strings"

    "github.com/sqweek/dialog"
)

// countTokens approximates the number of tokens in a text based on whitespace and punctuation.
func countTokens(text string) int {
    // A simple regex to split text into words and punctuation
    tokenizer := regexp.MustCompile(`\w+|[^\w\s]`)
    tokens := tokenizer.FindAllString(text, -1)
    return len(tokens)
}

func main() {

    // Open a file dialog box and let the user select a text file
    filePath, err := dialog.File().Filter("Text Files", "txt").Load()
    if err != nil {
        if err.Error() == "Cancelled" {
            fmt.Println("File selection was cancelled.")
            return
        }
        log.Fatalf("Error selecting file: %v", err)
    }

    // Output the selected file name
    fmt.Printf("Selected file: %s\n", filePath)

    // Specify the file to read
    //filePath := "input.txt"

    // Open the file
    file, err := os.Open(filePath)
    if err != nil {
        fmt.Printf("Error opening file: %v\n", err)
        return
    }
    defer file.Close()

    // Read the file line by line
    var content strings.Builder
    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        content.WriteString(scanner.Text())
        content.WriteString("\n")
    }

    if err := scanner.Err(); err != nil {
        fmt.Printf("Error reading file: %v\n", err)
        return
    }

    // Get the text content
    text := content.String()

    // Count the tokens
    tokenCount := countTokens(text)

    // Output the result
    fmt.Printf("The file contains approximately %d tokens.\n", tokenCount)
}

Copy after login

You’ll figure out that I’m a fan of GUI and dialog boxes, so I implemented a dialog box to select the input text file.

And here is the text file (some random text I found ?).

The popularity of the Rust language continues to explode; yet, many critical codebases remain authored in C, and cannot be realistically rewritten by hand. Automatically translating C to Rust is thus an appealing course of action. Several works have gone down this path, handling an ever-increasing subset of C through a variety of Rust features, such as unsafe. While the prospect of automation is appealing, producing code that relies on unsafe negates the memory safety guarantees offered by Rust, and therefore the main advantages of porting existing codebases to memory-safe languages.
We instead explore a different path, and explore what it would take to translate C to safe Rust; that is, to produce code that is trivially memory safe, because it abides by Rust's type system without caveats. Our work sports several original contributions: a type-directed translation from (a subset of) C to safe Rust; a novel static analysis based on "split trees" that allows expressing C's pointer arithmetic using Rust's slices and splitting operations; an analysis that infers exactly which borrows need to be mutable; and a compilation strategy for C's struct types that is compatible with Rust's distinction between non-owned and owned allocations.
We apply our methodology to existing formally verified C codebases: the HACL* cryptographic library, and binary parsers and serializers from EverParse, and show that the subset of C we support is sufficient to translate both applications to safe Rust. Our evaluation shows that for the few places that do violate Rust's aliasing discipline, automated, surgical rewrites suffice; and that the few strategic copies we insert have a negligible performance impact. Of particular note, the application of our approach to HACL* results in a 80,000 line verified cryptographic library, written in pure Rust, that implements all modern algorithms - the first of its kind.

Copy after login

After running my code I get the following output;

The file contains approximately 359 tokens.

Copy after login

It seems fine, but, well… okay, but… against which model ?? And also there are different ways to implement Regex, so this one does not count at all ?!

Second attempt — running against a specific model

What I figured out was that unless we don’t use the specific “tokenizer” for a given LLM, the former method is not accurate. So I began to look at how to obtain some accurate results against a model such as gpt 3.5 which is on the market for a while now. After doing some research on the net, hereafter the app I came up with.

package main

import (
 "bufio"
 "bytes"
 "fmt"
 "log"
 "os"
 "os/exec"

 "github.com/joho/godotenv"
 "github.com/sqweek/dialog"
)

func main() {


 // Open a file dialog box and let the user select a text file
 filePath, err := dialog.File().Filter("Text Files", "txt").Load()
 if err != nil {
  if err.Error() == "Cancelled" {
   fmt.Println("File selection was cancelled.")
   return
  }
  log.Fatalf("Error selecting file: %v", err)
 }

 // Output the selected file name
 fmt.Printf("Selected file: %s\n", filePath)

 // Open the file
 file, err := os.Open(filePath)
 if err != nil {
  fmt.Printf("Error opening file: %v\n", err)
  return
 }
 defer file.Close()

 // Read the file content
 var content bytes.Buffer
 scanner := bufio.NewScanner(file)
 for scanner.Scan() {
  content.WriteString(scanner.Text())
  content.WriteString("\n")
 }

 if err := scanner.Err(); err != nil {
  fmt.Printf("Error reading file: %v\n", err)
  return
 }

 // Specify the model
 model := "gpt-3.5-turbo"

 // Execute the Python script
 cmd := exec.Command("python3", "tokenizer.py", model)
 cmd.Stdin = bytes.NewReader(content.Bytes())
 output, err := cmd.Output()
 if err != nil {
  fmt.Printf("Error running tokenizer script: %v\n", err)
  return
 }

 // Print the token count
 fmt.Printf("Token count: %s", output)
}

Copy after login

As we can see in the code above, there is a call to a Python app which I found on a Microsoft site which helps (because it has been implemented) a “tiktoken” library to determine the number of tokens for gpt! Also the model name is hard coded.

import sys
from tiktoken import encoding_for_model

def count_tokens(model, text):
    enc = encoding_for_model(model)
    tokens = enc.encode(text)
    return len(tokens)

if __name__ == "__main__":
    # Read model name and text from stdin
    model = sys.argv[1]  # E.g., "gpt-3.5-turbo"
    text = sys.stdin.read()
    print(count_tokens(model, text))

Copy after login

This works fine. For the same text given earlier, now I obtain the count of 366 tokens which is accurate, regarding all the websites I found and on which I set the model to GPT 3.5.

The thing is I want to write is, a code fully in “Golang”… and I want to be able to run it for all models (or almost all) I can find on Huggingface (such as ibm-granite/granite-3.1–8b-instruct) ?

This would be the part 2 of this article (WIP).

So far I’m exploring the following (great ?) Github repos;

Tokenizer: https://github.com/sugarme/tokenizer
tokenizers: https://github.com/daulet/tokenizers
And last but not least -> go-huggingface: https://github.com/gomlx/go-huggingface?tab=readme-ov-file

Conclusion

Thanks for reading and open to comments.

And till the 2nd app is out, stay tuned… ?

The above is the detailed content of Counting the number of Tokens sent to a LLM in Go (part 1). For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)

1 months ago By 尊渡假赌尊渡假赌尊渡假赌

R.E.P.O. Best Graphic Settings

1 months ago By 尊渡假赌尊渡假赌尊渡假赌

Assassin's Creed Shadows: Seashell Riddle Solution

3 weeks ago By DDD

What's New in Windows 11 KB5054979 & How to Fix Update Issues

2 weeks ago By DDD

Will R.E.P.O. Have Crossplay?

1 months ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Hot Topics

Where is the login entrance for gmail email?

7555

CakePHP Tutorial

1384

What is the format of the account name of steam

win11 activation key permanent

nyt connections hints and answers

Related knowledge

What are the vulnerabilities of Debian OpenSSL Apr 02, 2025 am 07:30 AM

OpenSSL, as an open source library widely used in secure communications, provides encryption algorithms, keys and certificate management functions. However, there are some known security vulnerabilities in its historical version, some of which are extremely harmful. This article will focus on common vulnerabilities and response measures for OpenSSL in Debian systems. DebianOpenSSL known vulnerabilities: OpenSSL has experienced several serious vulnerabilities, such as: Heart Bleeding Vulnerability (CVE-2014-0160): This vulnerability affects OpenSSL 1.0.1 to 1.0.1f and 1.0.2 to 1.0.2 beta versions. An attacker can use this vulnerability to unauthorized read sensitive information on the server, including encryption keys, etc.

How do you use the pprof tool to analyze Go performance? Mar 21, 2025 pm 06:37 PM

The article explains how to use the pprof tool for analyzing Go performance, including enabling profiling, collecting data, and identifying common bottlenecks like CPU and memory issues.Character count: 159

How do you write unit tests in Go? Mar 21, 2025 pm 06:34 PM

The article discusses writing unit tests in Go, covering best practices, mocking techniques, and tools for efficient test management.

What libraries are used for floating point number operations in Go? Apr 02, 2025 pm 02:06 PM

The library used for floating-point number operation in Go language introduces how to ensure the accuracy is...

What is the problem with Queue thread in Go's crawler Colly? Apr 02, 2025 pm 02:09 PM

Queue threading problem in Go crawler Colly explores the problem of using the Colly crawler library in Go language, developers often encounter problems with threads and request queues. �...

What is the go fmt command and why is it important? Mar 20, 2025 pm 04:21 PM

The article discusses the go fmt command in Go programming, which formats code to adhere to official style guidelines. It highlights the importance of go fmt for maintaining code consistency, readability, and reducing style debates. Best practices fo

PostgreSQL monitoring method under Debian Apr 02, 2025 am 07:27 AM

This article introduces a variety of methods and tools to monitor PostgreSQL databases under the Debian system, helping you to fully grasp database performance monitoring. 1. Use PostgreSQL to build-in monitoring view PostgreSQL itself provides multiple views for monitoring database activities: pg_stat_activity: displays database activities in real time, including connections, queries, transactions and other information. pg_stat_replication: Monitors replication status, especially suitable for stream replication clusters. pg_stat_database: Provides database statistics, such as database size, transaction commit/rollback times and other key indicators. 2. Use log analysis tool pgBadg

Transforming from front-end to back-end development, is it more promising to learn Java or Golang? Apr 02, 2025 am 09:12 AM

Backend learning path: The exploration journey from front-end to back-end As a back-end beginner who transforms from front-end development, you already have the foundation of nodejs,...

See all articles