How Does Go Handle Invalid Byte Sequences When Converting to Strings?
When converting a byte slice ([]byte) to a string in Go, you need to decide how to handle byte sequences that are not valid UTF-8.
Solution:
1. UTF-8 Validity Check:
As suggested by Tim Cooper, you can use the utf8.Valid function from the unicode/utf8 package to check whether a byte slice consists entirely of valid UTF-8. If utf8.Valid returns false, the slice contains at least one invalid byte sequence.
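As a minimal sketch of this check, the program below wraps utf8.Valid in a helper that refuses to convert invalid input. The names toValidString and errInvalidUTF8 are hypothetical and exist only for this illustration; utf8.Valid itself is part of the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8 is a hypothetical sentinel error used only in this sketch.
var errInvalidUTF8 = errors.New("invalid UTF-8")

// toValidString is a hypothetical helper: it converts b to a string only
// when b consists entirely of valid UTF-8, and returns an error otherwise.
func toValidString(b []byte) (string, error) {
	if !utf8.Valid(b) {
		return "", errInvalidUTF8
	}
	return string(b), nil
}

func main() {
	inputs := [][]byte{
		[]byte("héllo"), // valid UTF-8
		{0xff},          // 0xff never appears in valid UTF-8
	}
	for _, in := range inputs {
		s, err := toValidString(in)
		if err != nil {
			fmt.Printf("%v: %v\n", in, err)
			continue
		}
		fmt.Printf("%q is valid\n", s)
	}
}
```

Returning an error rather than panicking leaves the decision to the caller, which suits input that arrives from files or the network.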
2. Non-UTF-8 Byte Handling:
Contrary to popular belief, non-UTF-8 bytes can still be stored in a Go string. This is because strings in Go are essentially read-only byte slices. They can contain invalid UTF-8, and those bytes can be accessed, printed, or converted back to a byte slice without issue.
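The round trip described above can be sketched as follows. The helper roundTrip is a hypothetical name introduced for this example; it shows that converting to a string and back leaves the bytes unchanged.

```go
package main

import (
	"bytes"
	"fmt"
)

// roundTrip is a hypothetical helper that converts b to a string and back.
func roundTrip(b []byte) []byte {
	s := string(b) // copies the bytes verbatim; no UTF-8 validation happens
	return []byte(s)
}

func main() {
	orig := []byte{'a', 0xff, 'b'} // 0xff is not valid UTF-8
	s := string(orig)

	fmt.Println(len(s))      // 3: len counts bytes, not runes
	fmt.Printf("%x\n", s[1]) // ff: indexing yields the raw byte
	fmt.Printf("%q\n", s)    // "a\xffb"

	fmt.Println(bytes.Equal(orig, roundTrip(orig))) // true
}
```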
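However, the language itself decodes UTF-8 in two situations: when a for ... range loop iterates over a string, which yields each code point as a rune, and when a string is converted to a rune slice with []rune(s). In both cases, invalid UTF-8 is replaced with U+FFFD, the Unicode replacement character.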
Note: These conversions never panic, so you only need to check for UTF-8 validity explicitly if it matters to your application (e.g., if U+FFFD is unacceptable and an error should be returned instead).
Sample Code:
The following code demonstrates how Go handles a byte slice containing invalid UTF-8:
    package main

    import "fmt"

    func main() {
        a := []byte{0xff} // Invalid UTF-8 byte
        s := string(a)
        fmt.Println(s) // �

        for _, r := range s { // Range loop replaces invalid UTF-8 with U+FFFD
            fmt.Println(r) // 65533
        }

        rs := []rune(s) // Conversion to runes decodes UTF-8 (U+FFFD)
        fmt.Println(rs) // [65533]
    }