
Ignore lines containing pattern in long text file in Go

Feb 13, 2024 pm 01:57 PM
go language


php Editor Apple: In Go, we often need to process large text files. Sometimes we want to drop every line that contains a specific pattern and keep all the others. A bufio.Scanner combined with bytes.Contains does this line by line and keeps the code concise and readable, but on long files it is noticeably slower than copying the file unfiltered. Let's look at how to ignore lines containing a pattern in a long text file in Go, and at how to do it faster.

Question content

I'm trying to implement a function that ignores lines containing a pattern in a long text file (guaranteed ASCII) in Go.

My functions withoutIgnore and withIgnore both take a filename as input and return a *bytes.Buffer, which can then be written to an io.Writer.

withIgnore takes an additional pattern argument and excludes lines containing that pattern from the file. The function works, but benchmarking shows that it is about 5 times slower than withoutIgnore. Is there any way to improve it?

package main

import (
    "bufio"
    "bytes"
    "io"
    "log"
    "os"
)

// withoutIgnore reads the whole file into a buffer without filtering.
func withoutIgnore(f string) (*bytes.Buffer, error) {
    rfd, err := os.Open(f)
    if err != nil {
        return nil, err
    }
    defer rfd.Close()

    inputBuffer := make([]byte, 1048576)
    opBuffer := new(bytes.Buffer)

    for {
        bytesRead, err := rfd.Read(inputBuffer)

        // A Read may return data together with an error, so store the data first.
        // bytes.Buffer.Write never returns an error.
        opBuffer.Write(inputBuffer[:bytesRead])

        if err == io.EOF {
            return opBuffer, nil
        }

        if err != nil {
            return nil, err
        }
    }
}

// withIgnore reads the file line by line and skips lines containing pattern.
func withIgnore(f, pattern string) (*bytes.Buffer, error) {
    rfd, err := os.Open(f)
    if err != nil {
        return nil, err
    }
    defer rfd.Close()

    scanner := bufio.NewScanner(rfd)
    buffer := new(bytes.Buffer)
    for scanner.Scan() {
        if !bytes.Contains(scanner.Bytes(), []byte(pattern)) {
            _, err := buffer.WriteString(scanner.Text() + "\n")
            if err != nil {
                return nil, err
            }
        }
    }

    // Scan stops silently on errors such as an over-long line, so check Err.
    if err := scanner.Err(); err != nil {
        return nil, err
    }

    return buffer, nil
}

func main() {
    // buff, err := withoutIgnore("base64dump.log")
    buff, err := withIgnore("base64dump.log", "audit")
    if err != nil {
        log.Fatal(err)
    }

    _, err = buff.WriteTo(os.Stdout)
    if err != nil {
        log.Fatal(err)
    }
}

Benchmarks

package main

import "testing"

func BenchmarkTestWithoutIgnore(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _, err := withoutIgnore("base64dump.log")
        if err != nil {
            b.Fatal(err)
        }
    }
}

func BenchmarkTestWithIgnore(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _, err := withIgnore("base64dump.log", "audit")
        if err != nil {
            b.Fatal(err)
        }
    }
}

The input file base64dump.log can be generated from the command line with:

base64 /dev/urandom | head -c 10000000 > base64dump.log

Solution

Since ASCII is guaranteed, the filter can work directly at the byte level.

However, if you check each byte for a newline character while reading the input, and then search each line again for the pattern, you spend several operations on every single input byte.

If, on the other hand, you read a block of input and run an optimized search for the pattern over the whole block, without checking each input byte individually, you can minimize the number of operations per input byte.

The Boyer-Moore string search algorithm is one example of such an optimized search. Go's built-in bytes.Index function is also optimized. The speed achieved will of course depend on the input data and the actual pattern. For the input specified in the question, using bytes.Index improved performance significantly when measured.
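To make the idea concrete, here is a minimal sketch of searching a whole block with bytes.Index and only then looking for the newlines around the hit. The function name enclosingLine is hypothetical and not part of the answer's code; the sketch assumes the pattern is non-empty and contains no newline.

```go
package main

import (
	"bytes"
	"fmt"
)

// enclosingLine returns the half-open range [start, end) of the line in block
// that contains the first occurrence of pattern, including the trailing
// newline if there is one. ok is false when pattern does not occur.
func enclosingLine(block, pattern []byte) (start, end int, ok bool) {
	hit := bytes.Index(block, pattern)
	if hit < 0 {
		return 0, 0, false
	}
	// LastIndexByte returns -1 when no earlier newline exists, so +1 yields 0.
	start = bytes.LastIndexByte(block[:hit], '\n') + 1
	end = len(block)
	if nl := bytes.IndexByte(block[hit:], '\n'); nl >= 0 {
		end = hit + nl + 1
	}
	return start, end, true
}

func main() {
	block := []byte("first line\nsecond AUDIT line\nthird line\n")
	if start, end, ok := enclosingLine(block, []byte("AUDIT")); ok {
		fmt.Printf("%q\n", block[start:end]) // prints "second AUDIT line\n"
	}
}
```

Only the bytes around a hit are inspected one by one; the rest of the block is left to the optimized search.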

Procedure

  • Read a block whose size is significantly larger than the maximum line length. A value >= 64 KB should probably be good; the tests used the 1 MB from the question.
  • A block usually does not end with a newline character, so search backward from the end of the block for the last newline, limit the pattern search to the slice up to that point, and keep the remaining data for the next pass.
  • The last block does not necessarily end with a newline character.
  • Use the high-performance Go function bytes.Index to find where the pattern occurs in the block.
  • From the position found, search for the preceding and the following newline characters.
  • Output the block up to the beginning of the matching line.
  • Continue searching from the end of the line in which the pattern appears.
  • If the search finds no further occurrence, output the rest of the block.
  • Read the next block and repeat these steps until the end of the file is reached.
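The in-block steps above (find the pattern, locate the enclosing line, output everything before it, continue after it) can be sketched for a single in-memory block as follows. This is a minimal sketch rather than the answer's code: dropMatchingLines is a hypothetical name, and it assumes the block already contains only complete lines and that the pattern is non-empty and contains no newline.

```go
package main

import (
	"bytes"
	"fmt"
)

// dropMatchingLines returns a copy of block without the lines that contain
// pattern. The last line may lack a trailing newline.
func dropMatchingLines(block, pattern []byte) []byte {
	out := make([]byte, 0, len(block))
	s := 0 // start of the region not yet copied; always at a line start
	for s < len(block) {
		hit := bytes.Index(block[s:], pattern)
		if hit < 0 {
			break
		}
		hit += s
		// Start of the matching line: one past the previous newline.
		lineStart := s + bytes.LastIndexByte(block[s:hit], '\n') + 1
		// End of the matching line: one past the next newline, or end of block.
		lineEnd := len(block)
		if nl := bytes.IndexByte(block[hit:], '\n'); nl >= 0 {
			lineEnd = hit + nl + 1
		}
		out = append(out, block[s:lineStart]...)
		s = lineEnd
	}
	// No further occurrence: copy the rest of the block.
	return append(out, block[s:]...)
}

func main() {
	block := []byte("keep 1\ndrop AUDIT here\nkeep 2\nAUDIT again\nkeep 3")
	fmt.Printf("%q\n", dropMatchingLines(block, []byte("AUDIT"))) // prints "keep 1\nkeep 2\nkeep 3"
}
```

Handling the partial line at the end of a block and reading the next block are left out here; they are what turns this into a streaming filter.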

Note

A read operation may return less data than the block size, so it makes sense to repeat the read operation until a full block has been read or the end of the input is reached.
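As an alternative to a hand-written loop, the standard library's io.ReadFull repeats Read until the buffer is full. The sketch below is an assumption about how one might use it, not the approach taken in the answer; fillChunk and chunksOf are hypothetical helpers, and iotest.OneByteReader is used only to simulate a reader that returns short reads. It assumes a chunk size greater than zero.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
	"testing/iotest"
)

// fillChunk reads into chunk until it is full or the reader is exhausted.
// It returns the number of bytes read and whether the end of input was reached.
func fillChunk(r io.Reader, chunk []byte) (n int, eof bool, err error) {
	n, err = io.ReadFull(r, chunk)
	// io.ReadFull reports io.EOF when nothing was read and
	// io.ErrUnexpectedEOF when the input ended before chunk was full.
	if errors.Is(err, io.EOF) || errors.Is(err, io.ErrUnexpectedEOF) {
		return n, true, nil
	}
	return n, false, err
}

// chunksOf splits s into chunks of at most size bytes (size must be > 0)
// while the underlying reader delivers only one byte per Read call.
func chunksOf(s string, size int) ([]string, error) {
	r := iotest.OneByteReader(strings.NewReader(s))
	chunk := make([]byte, size)
	var chunks []string
	for {
		n, eof, err := fillChunk(r, chunk)
		if err != nil {
			return nil, err
		}
		if n > 0 {
			chunks = append(chunks, string(chunk[:n]))
		}
		if eof {
			return chunks, nil
		}
	}
}

func main() {
	chunks, err := chunksOf("hello world", 8)
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	fmt.Printf("%q\n", chunks) // prints ["hello wo" "rld"]
}
```

Even though every Read call delivers a single byte, each chunk except the last comes back full.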

Benchmark

Optimized code is usually much more complex, but it also performs significantly better, as the following results show.

BenchmarkTestWithoutIgnore-8         270       4137267 ns/op
BenchmarkTestWithIgnore-8             54      22403931 ns/op
BenchmarkTestFilter-8                150       7947454 ns/op

Here, the optimized code (BenchmarkTestFilter-8) is only about 1.9 times slower than reading the file without filtering, while the original approach (BenchmarkTestWithIgnore-8) is about 5.4 times slower than the unfiltered baseline.

Put another way, the optimized code is about 2.8 times faster than the unoptimized code.

Code

Here is the code for your own testing:

// filterFile opens f and returns its content without the lines containing pattern.
func filterFile(f, pattern string) (*bytes.Buffer, error) {
    rfd, err := os.Open(f)
    if err != nil {
        return nil, err
    }
    defer rfd.Close()

    reader := bufio.NewReader(rfd)
    return filter(reader, []byte(pattern), 1024*1024)
}

// chunkSize must be larger than the longest line
// a reasonable size is probably >= 64k
func filter(reader io.Reader, pattern []byte, chunkSize int) (*bytes.Buffer, error) {
    buffer := new(bytes.Buffer)

    chunk := make([]byte, chunkSize)

    var remaining []byte
    for lastChunk := false; !lastChunk; {
        n, err := readChunk(reader, chunk, remaining, chunkSize)
        if err != nil {
            if err == io.EOF {
                lastChunk = true
            } else {
                return nil, err
            }
        }

        // Cut the chunk after its last newline and keep the partial
        // line that follows for the next pass.
        remaining = remaining[:0]
        if !lastChunk {
            for i := n - 1; i >= 0; i-- {
                if chunk[i] == '\n' {
                    remaining = append(remaining, chunk[i+1:n]...)
                    n = i + 1
                    break
                }
            }
        }

        s := 0 // start of the data not yet written; always at a line start
        for s < n {
            hit := bytes.Index(chunk[s:n], pattern)
            if hit < 0 {
                break
            }
            hit += s

            // Walk back to the first byte of the matching line.
            startOfLine := hit
            for startOfLine > s && chunk[startOfLine-1] != '\n' {
                startOfLine--
            }

            // Walk forward to one past the newline that ends the matching line.
            endOfLine := hit + len(pattern)
            for ; endOfLine < n; endOfLine++ {
                if chunk[endOfLine] == '\n' {
                    break
                }
            }
            endOfLine++

            // Output everything before the matching line, then skip the line.
            _, err = buffer.Write(chunk[s:startOfLine])
            if err != nil {
                return nil, err
            }
            s = endOfLine
        }

        // No further hit: output the rest of the chunk.
        if s < n {
            _, err = buffer.Write(chunk[s:n])
            if err != nil {
                return nil, err
            }
        }
    }

    return buffer, nil
}

// readChunk copies the leftover partial line to the start of chunk and then
// reads until the chunk is full or the reader returns an error such as io.EOF.
func readChunk(reader io.Reader, chunk, remaining []byte, chunkSize int) (int, error) {
    copy(chunk, remaining)
    r := len(remaining)
    for r < chunkSize {
        n, err := reader.Read(chunk[r:])
        r += n
        if err != nil {
            return r, err
        }
    }
    return r, nil
}

The benchmark section might look like this:

func BenchmarkTestFilter(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _, err := filterFile("base64dump.log", "AUDIT")
        if err != nil {
            b.Fatal(err)
        }
    }
}

The filter functionality is split in two: filterFile opens the file, and the actual work is done in func filter(reader io.Reader, pattern []byte, chunkSize int) (*bytes.Buffer, error).

Injecting the reader and chunkSize prepares the code for unit tests. The tests are missing here, but they are definitely recommended when working with indexes.
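One way such a test could look is to compare the output of a filter with a deliberately simple reference implementation, using small chunk sizes so that matches and line ends fall near chunk boundaries. Everything in this sketch is an assumption for illustration: filterFunc, referenceFilter, noFilter and checkFilter are hypothetical names, and in a real test the filter function from above would be passed to checkFilter in place of referenceFilter.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"strings"
)

// filterFunc matches the signature of the filter function discussed in the text.
type filterFunc func(r io.Reader, pattern []byte, chunkSize int) (*bytes.Buffer, error)

// referenceFilter is a simple, slow implementation used only as the expected result.
func referenceFilter(r io.Reader, pattern []byte, _ int) (*bytes.Buffer, error) {
	data, err := io.ReadAll(r)
	if err != nil {
		return nil, err
	}
	out := new(bytes.Buffer)
	// SplitAfter keeps the trailing newline on every line.
	for _, line := range bytes.SplitAfter(data, []byte("\n")) {
		if !bytes.Contains(line, pattern) {
			out.Write(line)
		}
	}
	return out, nil
}

// noFilter is a deliberately wrong implementation (it filters nothing),
// included only to show that checkFilter reports a mismatch.
func noFilter(r io.Reader, _ []byte, _ int) (*bytes.Buffer, error) {
	out := new(bytes.Buffer)
	_, err := out.ReadFrom(r)
	return out, err
}

// checkFilter runs f on input and compares its output with referenceFilter.
func checkFilter(f filterFunc, input, pattern string, chunkSize int) error {
	want, err := referenceFilter(strings.NewReader(input), []byte(pattern), chunkSize)
	if err != nil {
		return err
	}
	got, err := f(strings.NewReader(input), []byte(pattern), chunkSize)
	if err != nil {
		return err
	}
	if !bytes.Equal(got.Bytes(), want.Bytes()) {
		return fmt.Errorf("chunkSize %d: got %q, want %q", chunkSize, got.Bytes(), want.Bytes())
	}
	return nil
}

func main() {
	input := "keep 1\ndrop AUDIT\nkeep 2\nAUDIT at end"
	// Every chunk size must still be larger than the longest line.
	for _, size := range []int{16, 32, 1024} {
		if err := checkFilter(referenceFilter, input, "AUDIT", size); err != nil {
			fmt.Println("FAIL:", err)
			return
		}
	}
	fmt.Println("reference filter: ok")
	fmt.Println("noFilter:", checkFilter(noFilter, input, "AUDIT", 16))
}
```

Inputs worth covering include a match in the first and the last line, a file without a trailing newline, and empty lines at the start of a chunk.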

However, the point here is to find a way to significantly improve performance.
