How to Handle Non-ASCII Characters in Go's Regular Expression Boundaries?

Susan Sarandon
Release: 2024-10-30 02:24:02

Golang Regular Expression Boundary and Non-ASCII Characters

Go's regular expression word boundary (\b) is ASCII-only: it matches between a character in [0-9A-Za-z_] and any other character, or at the start or end of the text. Non-ASCII letters such as "é" therefore count as non-word characters, so \b can match in the middle of a word that contains accented Latin letters.

The Problem

In Go, \b treats only ASCII letters, digits, and the underscore as word characters. The regex \b(vis)\b is meant to match "vis" only when it stands alone as a word. In "révisé", however, "vis" sits between two "é" characters, which \b treats as non-word characters, so a boundary is found on both sides and the regex matches.

Consider the following Go code:

<code class="go">package main

import (
    "fmt"
    "regexp"
)

func main() {
    // \b is an ASCII-only word boundary in Go's regexp package.
    r := regexp.MustCompile(`\b(vis)\b`)
    fmt.Println(r.MatchString("re vis e")) // true: "vis" is delimited by spaces
    fmt.Println(r.MatchString("revise"))   // false: "vis" is surrounded by ASCII letters
    fmt.Println(r.MatchString("révisé"))   // true, although false is expected
}</code>

Running this code produces:

true
false
true

The last line reports a match for "révisé", even though "vis" is not a separate word there.
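One way to see why this happens is to test a single character against \w, which \b is based on, and against the Unicode letter class \p{L}. The following is only an illustrative sketch; the variable names are made up for this example, and "é" is written as the escape \u00e9 so that the precomposed character is used.

```go
package main

import (
	"fmt"
	"regexp"
)

// asciiWord matches a string made of exactly one character that \w accepts,
// which is also what \b treats as a word character.
var asciiWord = regexp.MustCompile(`^\w$`)

// unicodeLetter matches a string made of exactly one Unicode letter.
var unicodeLetter = regexp.MustCompile(`^\p{L}$`)

func main() {
	for _, s := range []string{"e", "\u00e9"} {
		fmt.Printf("%q: \\w=%v \\p{L}=%v\n",
			s, asciiWord.MatchString(s), unicodeLetter.MatchString(s))
	}
}
```

For "e" both checks report true; for "é" the \w check reports false while \p{L} reports true, which is why \b sees a boundary next to "é".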

The Solution

To handle non-ASCII characters, you can define your own boundary pattern. One approach is to replace \b with explicit alternatives for the start of the string, whitespace, and the end of the string:

(?:\A|\s)(vis)(?:\s|\z)

This pattern means:

  • (?:\A|\s): Matches the start of the string or a whitespace character.
  • (vis): Captures the word "vis".
  • (?:\s|\z): Matches a whitespace character or the end of the string.

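This custom boundary does not depend on which characters count as word characters, so non-ASCII letters such as "é" no longer create a boundary. It is stricter than \b, though: the word must be delimited by whitespace or by the start or end of the string, so "vis" followed by punctuation (for example "vis,") does not match, and the matched text includes the surrounding whitespace.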
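Because the two non-capturing groups consume the whitespace around the word, the full match is wider than the word itself. A minimal sketch of reading the word from capture group 1 might look like this; extractWord is a hypothetical helper name, not part of the regexp package.

```go
package main

import (
	"fmt"
	"regexp"
)

// wsBounded is the whitespace-delimited pattern described in the text.
var wsBounded = regexp.MustCompile(`(?:\A|\s)(vis)(?:\s|\z)`)

// extractWord (hypothetical helper) returns the text captured by group 1,
// or "" when the pattern does not match.
func extractWord(s string) string {
	m := wsBounded.FindStringSubmatch(s)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	m := wsBounded.FindStringSubmatch("re vis e")
	// m[0] is the full match, including the spaces; m[1] is the word alone.
	fmt.Printf("%q %q\n", m[0], m[1])
	fmt.Printf("%q\n", extractWord("revise"))
}
```

Here the full match for "re vis e" is " vis " with both spaces, while group 1 holds just "vis"; for "revise" the helper returns an empty string.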

By incorporating this custom pattern into the regex, you can obtain the desired result:

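<code class="go">package main

import (
    "fmt"
    "regexp"
)

func main() {
    // The word must be delimited by whitespace or by the start/end of the string.
    r := regexp.MustCompile(`(?:\A|\s)(vis)(?:\s|\z)`)
    fmt.Println(r.MatchString("vis"))      // the word on its own
    fmt.Println(r.MatchString("re vis e")) // delimited by spaces
    fmt.Println(r.MatchString("revise"))   // surrounded by ASCII letters
    fmt.Println(r.MatchString("révisé"))   // surrounded by non-ASCII letters
}</code>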

Running this code now gives:

true
true
false
false

As you can see, "révisé" is correctly excluded as a match.
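If the word may also be followed by punctuation, a Unicode-aware variant of the same idea is possible. The sketch below is one option rather than the only one: it assumes that "word characters" means Unicode letters (\p{L}), combining marks (\p{M}), digits (\p{N}), and the underscore, and wholeWord is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"regexp"
)

// nonWord matches one character that is not a Unicode letter, combining
// mark, digit, or underscore (an assumed definition of "word character").
const nonWord = `[^\p{L}\p{M}\p{N}_]`

// wholeWord (hypothetical helper) builds a regex that matches word only when
// it is delimited by non-word characters or by the start/end of the string.
func wholeWord(word string) *regexp.Regexp {
	return regexp.MustCompile(
		`(?:\A|` + nonWord + `)(` + regexp.QuoteMeta(word) + `)(?:` + nonWord + `|\z)`)
}

func main() {
	r := wholeWord("vis")
	for _, s := range []string{"vis", "re vis e", "vis, ok", "revise", "r\u00e9vis\u00e9"} {
		fmt.Printf("%q: %v\n", s, r.MatchString(s))
	}
}
```

With this pattern "vis", "re vis e", and "vis, ok" match, while "revise" and "révisé" do not. As with the whitespace version, the delimiters are part of the full match, so the word itself is in capture group 1.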
