Golang Regular Expression Boundary and Non-ASCII Characters
Go's regular expression word boundary (\b) is ASCII-only: it matches between an ASCII word character ([0-9A-Za-z_]) and anything else, including the start or end of the text. As a result, it may not behave as expected when a word contains accented Latin characters such as é.
The Problem
In Go, \b treats only ASCII letters, digits, and the underscore as word characters. For instance, the regex \b(vis)\b is intended to match "vis" as a standalone word. However, in a word such as "révisé", the accented é is not an ASCII word character, so \b reports a boundary on both sides of "vis" and the regex matches inside the word.
Consider the following Go code:
<code class="go">package main

import (
	"fmt"
	"regexp"
)

func main() {
	r := regexp.MustCompile(`\b(vis)\b`)
	fmt.Println(r.MatchString("re vis e")) // expected true
	fmt.Println(r.MatchString("revise"))   // expected false
	fmt.Println(r.MatchString("révisé"))   // expected false, but prints true
}</code>
Running this code produces:
true false true
Notice that the last line incorrectly matches "révisé".
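The reason can be checked directly. The following is a minimal sketch, not part of the original example: the helper names isASCIIWordChar and visIndex are made up for illustration, and the é is written as the precomposed code point U+00E9 so that the byte offsets are predictable.

```go
package main

import (
	"fmt"
	"regexp"
)

// singleWordChar matches a string consisting of exactly one \w character.
// In Go, \w is the ASCII class [0-9A-Za-z_].
var singleWordChar = regexp.MustCompile(`^\w$`)

// visBoundary is the pattern from the article.
var visBoundary = regexp.MustCompile(`\b(vis)\b`)

// isASCIIWordChar (hypothetical helper) reports whether s is a single
// character that \w accepts.
func isASCIIWordChar(s string) bool {
	return singleWordChar.MatchString(s)
}

// visIndex (hypothetical helper) returns the byte offsets of the first
// \b(vis)\b match in s, or nil if there is none.
func visIndex(s string) []int {
	return visBoundary.FindStringIndex(s)
}

func main() {
	const word = "r\u00e9vis\u00e9" // "révisé" with precomposed é

	fmt.Println(isASCIIWordChar("e"))      // true
	fmt.Println(isASCIIWordChar("\u00e9")) // false: é is outside \w
	fmt.Println(visIndex(word))            // [3 6]: "vis" between the two é
	fmt.Println(visIndex("revise"))        // []: no boundary next to "vis"
}
```

Because é does not count as a word character, the positions just before "v" and just after "s" both satisfy \b, which is why the match succeeds.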
The Solution
To handle words that contain non-ASCII characters, you can define your own custom boundary pattern. One approach is to replace each \b with an explicit alternative, which gives the following regex:
(?:\A|\s)(vis)(?:\s|\z)
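This pattern means:
(?:\A|\s) is a non-capturing group that matches the start of the string or a whitespace character.
(vis) is capturing group 1, which matches the literal text "vis".
(?:\s|\z) is a non-capturing group that matches a whitespace character or the end of the string.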
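This custom boundary treats only whitespace and the start or end of the string as delimiters, so accented Latin characters such as é count as part of the word. It is stricter than \b in two ways: punctuation next to the word (as in "vis,") no longer counts as a boundary, and the surrounding whitespace is consumed as part of the match instead of being a zero-width assertion.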
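Since the delimiters are matched rather than merely asserted, the word itself is best read from capturing group 1. The sketch below illustrates this under that assumption; the helper name words is hypothetical. It also shows a side effect: two occurrences separated by a single space share that space, so only the first one is found.

```go
package main

import (
	"fmt"
	"regexp"
)

// visRe is the custom-boundary pattern from the article.
var visRe = regexp.MustCompile(`(?:\A|\s)(vis)(?:\s|\z)`)

// words (hypothetical helper) returns the group-1 text of every match,
// dropping the whitespace consumed by the non-capturing groups.
func words(s string) []string {
	var out []string
	for _, m := range visRe.FindAllStringSubmatch(s, -1) {
		out = append(out, m[1])
	}
	return out
}

func main() {
	// The full match includes the surrounding whitespace.
	fmt.Printf("%q\n", visRe.FindString("re vis e")) // " vis "

	// Group 1 holds just the word.
	fmt.Println(words("re vis e")) // [vis]

	// The space between the two words is consumed by the first match,
	// so the second "vis" has no delimiter left in front of it.
	fmt.Println(len(words("vis vis"))) // 1
}
```

If overlapping delimiters matter for your input, this pattern alone is not enough and the text may need to be scanned in another way, for example by splitting on whitespace first.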
By incorporating this custom pattern into the regex, you can obtain the desired result:
<code class="go">package main

import (
	"fmt"
	"regexp"
)

func main() {
	r := regexp.MustCompile(`(?:\A|\s)(vis)(?:\s|\z)`)
	fmt.Println(r.MatchString("vis"))      // added this case
	fmt.Println(r.MatchString("re vis e"))
	fmt.Println(r.MatchString("revise"))
	fmt.Println(r.MatchString("révisé"))
}</code>
Running this code now gives:
true true false false
As you can see, "révisé" is correctly excluded as a match.
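If punctuation should still count as a boundary, one possible variation is to use Unicode character classes in place of \s. This is a sketch of an alternative, not the approach described above: the helper wholeWord and the choice of \p{L}, \p{M}, \p{N}, and underscore as "word" characters are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// wholeWord (hypothetical helper) compiles a pattern that matches word only
// when it is not adjacent to a Unicode letter, combining mark, digit, or
// underscore. The word is available as capturing group 1.
func wholeWord(word string) *regexp.Regexp {
	// Any character that is NOT part of a word under this definition.
	const delim = `[^\p{L}\p{M}\p{N}_]`
	return regexp.MustCompile(
		`(?:\A|` + delim + `)(` + regexp.QuoteMeta(word) + `)(?:` + delim + `|\z)`,
	)
}

func main() {
	re := wholeWord("vis")
	inputs := []string{
		"vis",              // matches: whole string
		"re vis e",         // matches: surrounded by spaces
		"la vis, ici",      // matches: comma counts as a delimiter
		"revise",           // no match: ASCII letters on both sides
		"r\u00e9vis\u00e9", // no match: é is a Unicode letter
	}
	for _, s := range inputs {
		fmt.Printf("%q: %v\n", s, re.MatchString(s))
	}
}
```

Like the whitespace version, this pattern consumes the delimiter characters, so the same caution about adjacent occurrences applies.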