


How to Handle Non-ASCII Characters in Go's Regular Expression Boundaries?
Oct 30, 2024, 02:24 AM
Golang Regular Expression Boundary and Non-ASCII Characters
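Go's regular expression word boundary (\b) is an ASCII word boundary: it matches at a position that has an ASCII word character ([0-9A-Za-z_]) on one side and a non-word character, or the start or end of the text, on the other. Because accented Latin letters such as "é" are not ASCII word characters, \b may not behave as expected when they are involved.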
The Problem
In Go, \b only treats ASCII letters, digits, and the underscore as word characters. For instance, the regex \b(vis)\b is intended to match "vis" only when it stands alone as a word. However, when "vis" sits inside a word with accented letters, such as "révisé", \b sees each "é" as a non-word character and reports a word boundary on both sides of "vis".
Consider the following Go code:
<code class="go">package main

import (
	"fmt"
	"regexp"
)

func main() {
	// \b is an ASCII-only word boundary in Go's regexp syntax.
	r := regexp.MustCompile(`\b(vis)\b`)
	fmt.Println(r.MatchString("re vis e")) // want true
	fmt.Println(r.MatchString("revise"))   // want false
	fmt.Println(r.MatchString("révisé"))   // want false, but prints true
}</code>
Running this code produces:
true false true
Notice that the last line is true: the pattern matches inside "révisé", even though "vis" is not a standalone word there.
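One way to see the cause is to test single characters against \w and to check where the match lands. The following is only an illustrative sketch, not part of the original example; the helper names isWordChar and visSpan are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// \w in Go's regexp syntax is the ASCII class [0-9A-Za-z_].
	wordCharRe = regexp.MustCompile(`^\w$`)
	asciiVisRe = regexp.MustCompile(`\b(vis)\b`)
)

// isWordChar (hypothetical helper) reports whether s is exactly one
// character that \w accepts.
func isWordChar(s string) bool {
	return wordCharRe.MatchString(s)
}

// visSpan (hypothetical helper) returns the byte offsets of the first
// match of \b(vis)\b in s, or nil if there is none.
func visSpan(s string) []int {
	return asciiVisRe.FindStringIndex(s)
}

func main() {
	fmt.Println(isWordChar("e"))   // true
	fmt.Println(isWordChar("é"))   // false
	fmt.Println(visSpan("révisé")) // [3 6]
}
```

The offsets are byte positions: "é" occupies two bytes in UTF-8, so "vis" starts at byte 3, directly after the first "é" that \b treats as a non-word character.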
The Solution
To handle cases with non-ASCII characters, you can define your own custom boundary pattern. One approach is to replace each \b with an explicit alternative, giving the following regex:
(?:\A|\s)(vis)(?:\s|\z)
This pattern means:
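- (?:\A|\s): Matches the start of the string or a whitespace character.
- (vis): Captures the word "vis".
- (?:\s|\z): Matches a whitespace character or the end of the string.
This custom boundary treats only whitespace and the ends of the string as word delimiters, so it gives the intended result for ASCII letters and for non-ASCII letters such as "é". Unlike \b, it does not treat punctuation as a boundary, and it consumes the surrounding whitespace as part of the match.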
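Because the surrounding whitespace becomes part of the overall match, the capturing group is what holds the word itself. A minimal sketch of reading it back, assuming a hypothetical helper named findVis:

```go
package main

import (
	"fmt"
	"regexp"
)

var spaceVisRe = regexp.MustCompile(`(?:\A|\s)(vis)(?:\s|\z)`)

// findVis (hypothetical helper) returns the whole match, the captured
// word, and whether a match was found.
func findVis(s string) (whole, word string, ok bool) {
	m := spaceVisRe.FindStringSubmatch(s)
	if m == nil {
		return "", "", false
	}
	// m[0] is the full match including delimiters; m[1] is group 1.
	return m[0], m[1], true
}

func main() {
	whole, word, ok := findVis("re vis e")
	fmt.Printf("%q %q %v\n", whole, word, ok) // " vis " "vis" true
}
```

If only a yes/no answer is needed, MatchString is enough and the consumed whitespace does not matter.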
By incorporating this custom pattern into the regex, you can obtain the desired result:
<code class="go">package main

import (
	"fmt"
	"regexp"
)

func main() {
	// Whitespace or the ends of the string act as the word delimiters.
	r := regexp.MustCompile(`(?:\A|\s)(vis)(?:\s|\z)`)
	fmt.Println(r.MatchString("vis"))      // Added this case
	fmt.Println(r.MatchString("re vis e"))
	fmt.Println(r.MatchString("revise"))
	fmt.Println(r.MatchString("révisé"))
}</code>
Running this code now gives:
true true false false
As you can see, "révisé" is correctly excluded as a match.
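If punctuation should also count as a boundary, a possible variant (not from the original solution) is to delimit the word with "anything that is not a Unicode letter, mark, digit, or underscore" instead of whitespace. The sketch below assumes that definition of a word character; the helper name hasVis is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// Assumption: a "word character" is a Unicode letter (\p{L}), combining
// mark (\p{M}), number (\p{N}), or underscore. Anything else delimits.
var unicodeVisRe = regexp.MustCompile(
	`(?:\A|[^\p{L}\p{M}\p{N}_])(vis)(?:[^\p{L}\p{M}\p{N}_]|\z)`)

// hasVis (hypothetical helper) reports whether s contains "vis" as a
// standalone word under the assumption above.
func hasVis(s string) bool {
	return unicodeVisRe.MatchString(s)
}

func main() {
	fmt.Println(hasVis("vis"))      // true
	fmt.Println(hasVis("re vis e")) // true
	fmt.Println(hasVis("revise"))   // false
	fmt.Println(hasVis("révisé"))   // false
	fmt.Println(hasVis("(vis),"))   // true
}
```

Like the whitespace version, this pattern consumes the delimiter characters, so when finding all matches, two occurrences separated by a single delimiter (for example "vis vis") may yield only the first one.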
