How to Read UTF-16 Text Files in Go
Understanding the Problem
Many file formats store text as UTF-16, a Unicode encoding that represents each character with one or two 16-bit code units. When you read a UTF-16 file in Go, you must decode the bytes explicitly to obtain the actual text. Go strings and most of the standard library assume UTF-8, and reading a file performs no decoding at all, so converting raw UTF-16 bytes directly to a string gives incorrect results (for plain ASCII text, every other byte is a zero byte).
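To make the problem concrete, here is a minimal sketch that uses only the standard library. The sample bytes and the helper name decodeUTF16LE are illustrative assumptions, and the helper assumes little-endian input with no BOM.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// decodeUTF16LE is a hypothetical helper for illustration. It assumes
// little-endian input with no BOM; a trailing odd byte is ignored.
func decodeUTF16LE(raw []byte) string {
	units := make([]uint16, 0, len(raw)/2)
	for i := 0; i+1 < len(raw); i += 2 {
		units = append(units, binary.LittleEndian.Uint16(raw[i:]))
	}
	// utf16.Decode combines surrogate pairs into single runes.
	return string(utf16.Decode(units))
}

func main() {
	// "Hi" encoded as UTF-16LE: each ASCII letter is followed by a zero byte.
	raw := []byte{0x48, 0x00, 0x69, 0x00}

	// A direct conversion keeps the zero bytes, so the string is 4 bytes long.
	fmt.Printf("%q (len %d)\n", string(raw), len(string(raw)))

	// Decoding the 16-bit code units first yields the intended text.
	fmt.Printf("%q\n", decodeUTF16LE(raw))
}
```

The direct conversion is not an error as far as Go is concerned, which is why the mistake is easy to miss: the program runs, but the text contains embedded zero bytes.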
Decoding UTF-16 Files
To read a UTF-16 file correctly, you need to decode its bytes explicitly. The golang.org/x/text/encoding/unicode package (an extension module, not part of the standard library) provides the unicode.UTF16 encoding for this purpose. Here is an example:
    package main

    import (
    	"fmt"
    	"os"
    	"strings"

    	"golang.org/x/text/encoding/unicode"
    )

    func main() {
    	// Read the whole file into a []byte.
    	raw, err := os.ReadFile("test.txt")
    	if err != nil {
    		fmt.Printf("error opening file: %v\n", err)
    		os.Exit(1)
    	}

    	// Build a UTF-16 decoder. UseBOM lets a leading byte order mark
    	// override the big-endian default and strips it from the output.
    	decoder := unicode.UTF16(unicode.BigEndian, unicode.UseBOM).NewDecoder()

    	// Decode the UTF-16 bytes into UTF-8.
    	decoded, err := decoder.Bytes(raw)
    	if err != nil {
    		fmt.Printf("error decoding file: %v\n", err)
    		os.Exit(1)
    	}

    	// Normalize Windows-style line endings (CR+LF) to LF.
    	final := strings.ReplaceAll(string(decoded), "\r\n", "\n")

    	// Print the final text.
    	fmt.Println(final)
    }

This code uses unicode.UTF16(unicode.BigEndian, unicode.UseBOM) to create a decoder that assumes big-endian byte order unless the file starts with a Byte Order Mark (BOM). The BOM indicates the byte order of the file; with UseBOM the decoder reads it, switches byte order if needed, and removes it from the output. With unicode.IgnoreBOM, by contrast, the BOM is not interpreted: a little-endian file would be decoded with the wrong byte order, and a big-endian BOM would come through as a stray U+FEFF character at the start of the text.
The decoded bytes are then converted to a string with string(). Finally, Windows-style line endings (CR+LF) are normalized to LF using strings.ReplaceAll().
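The role of the BOM can be illustrated with a small standard-library sketch. It is not how golang.org/x/text is implemented; the helper name decodeWithBOM and the big-endian fallback are assumptions made for this example.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// decodeWithBOM is a hypothetical helper. It inspects the first two bytes
// for a byte order mark, strips it if present, and falls back to
// big-endian (an assumed default) when there is none.
func decodeWithBOM(raw []byte) string {
	var order binary.ByteOrder = binary.BigEndian
	if len(raw) >= 2 {
		switch {
		case raw[0] == 0xFE && raw[1] == 0xFF: // big-endian BOM
			raw = raw[2:]
		case raw[0] == 0xFF && raw[1] == 0xFE: // little-endian BOM
			order = binary.LittleEndian
			raw = raw[2:]
		}
	}

	// Assemble 16-bit code units in the detected byte order.
	units := make([]uint16, 0, len(raw)/2)
	for i := 0; i+1 < len(raw); i += 2 {
		units = append(units, order.Uint16(raw[i:]))
	}
	return string(utf16.Decode(units))
}

func main() {
	le := []byte{0xFF, 0xFE, 'G', 0x00, 'o', 0x00} // little-endian with BOM
	be := []byte{0xFE, 0xFF, 0x00, 'G', 0x00, 'o'} // big-endian with BOM
	fmt.Println(decodeWithBOM(le), decodeWithBOM(be))
}
```

Both inputs decode to the same text because the first two bytes select the byte order before any code unit is assembled.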
Using a Scanner for UTF-16 Files
If you need to read the file line by line, you can wrap the decoder in a bufio.Scanner. The NewScannerUTF16 function below is a helper defined in this example, not part of the golang.org/x/text package. Here is an example:
    package main

    import (
    	"bufio"
    	"fmt"
    	"io"
    	"os"

    	"golang.org/x/text/encoding/unicode"
    	"golang.org/x/text/transform"
    )

    // NewScannerUTF16 returns a line scanner over r that decodes UTF-16
    // input to UTF-8 as it is read.
    func NewScannerUTF16(r io.Reader) *bufio.Scanner {
    	// Build a UTF-16 decoder. UseBOM lets a leading byte order mark
    	// override the big-endian default.
    	decoder := unicode.UTF16(unicode.BigEndian, unicode.UseBOM).NewDecoder()

    	// Wrap the reader so bytes are decoded on the fly, then scan it.
    	return bufio.NewScanner(transform.NewReader(r, decoder))
    }

    func main() {
    	// Open the file without loading it into memory.
    	file, err := os.Open("test.txt")
    	if err != nil {
    		fmt.Printf("error opening file: %v\n", err)
    		os.Exit(1)
    	}
    	defer file.Close()

    	// Read the file line by line.
    	scanner := NewScannerUTF16(file)
    	for scanner.Scan() {
    		fmt.Println(scanner.Text())
    	}
    	if err := scanner.Err(); err != nil {
    		fmt.Printf("error reading file: %v\n", err)
    	}
    }

This code uses bufio.NewScanner() to create a scanner that reads from transform.NewReader, which decodes the UTF-16 bytes as they are read from the open file. Because the file is streamed rather than loaded with os.ReadFile, you can iterate over its lines without reading the entire file into memory.