PHP PCRE regular expression analysis

little bottle

Released: 2023-04-06 11:30:02

This article analyzes how PHP's PCRE regular expression functions handle UTF-8 text. It may serve as a useful reference for anyone interested in the topic.

1. Preface

A previous post analyzed character sets, so that topic is not repeated here. What matters for this article is that many PHP functions work with UTF-8, one of the Unicode encodings, by default. So without further ado, let's get straight to the point.

2. PHP function mb_split analysis

<?php
$preg_strings = '测、试、一、下';
$preg_str = mb_split('、', $preg_strings);
print_r($preg_str);

Print result:

Array
(
    [0] => 测
    [1] => 试
    [2] => 一
    [3] => 下
)

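mb_split is multibyte-aware: it parses both the pattern and the subject string in the current regex encoding, which is UTF-8 here. $preg_strings is therefore split on whole characters, and the delimiter 、 (U+3001) is matched as a single code point rather than as separate bytes. Note that the pattern passed to mb_split is a regular expression written without delimiters.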
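
Because the pattern is a regular expression, the behaviour described above can be sketched a little further. The following is a minimal example, assuming the mbstring extension is loaded and the source file is saved as UTF-8; the explicit mb_regex_encoding() call and the variable names $parts and $mixed are illustrative and not part of the original code.

```php
<?php
// Assumption: the mbstring extension is available.
// Setting the regex encoding explicitly removes any dependence on php.ini.
mb_regex_encoding('UTF-8');

// The pattern is written without delimiters.
$parts = mb_split('、', '测、试、一、下');
print_r($parts);

// A character class splits on either the ideographic comma or an ASCII comma.
$mixed = mb_split('[、,]', '测、试,一');
print_r($mixed);
```

Setting the encoding explicitly is a defensive choice: the result then does not depend on the server configuration.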

3. PHP function preg_split analysis

Split the string "测试一下" ("test it") into individual characters

<?php
$strings = '测试一下';
$mb_arr = preg_split('//u', $strings, -1, PREG_SPLIT_NO_EMPTY);
print_r($mb_arr);

The print result is as follows:

Array
(
    [0] => 测
    [1] => 试
    [2] => 一
    [3] => 下
)
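
Why the empty pattern // yields one character per element becomes clearer when the same split is run with and without the u modifier. The sketch below assumes the source file is saved as UTF-8; the variable names $chars and $bytes are illustrative.

```php
<?php
$s = '测试一下';   // 4 characters, 3 bytes each in UTF-8

// With u, the empty pattern matches between code points.
$chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);

// Without u, the empty pattern matches between bytes,
// so every element is a single (invalid on its own) byte.
$bytes = preg_split('//', $s, -1, PREG_SPLIT_NO_EMPTY);

echo count($chars), ' ', count($bytes), PHP_EOL;   // prints "4 12"
```

The byte-level result is rarely what is wanted for Chinese text, which is why the u modifier is essential in the example above.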

4. The /u modifier in PCRE

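In PHP, the delimiter of a regular expression can be #, %, /, and so on: any character that is not alphanumeric, a backslash, or whitespace.

A regular expression is sometimes followed by modifiers after the closing delimiter. So what do they mean?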

For example:

%[\x{4e00}-\x{9fa5}]+%u

The trailing modifier u tells PCRE to treat both the pattern and the subject string as UTF-8.
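
To make the effect of the delimiter and of the u modifier concrete, here is a small sketch. It assumes the source file is saved as UTF-8; the variable names are illustrative.

```php
<?php
$char = '测';   // one character, three bytes in UTF-8

// Three different delimiters, same pattern, same u modifier.
$a = preg_match('/^.$/u', $char);
$b = preg_match('#^.$#u', $char);
$c = preg_match('%^.$%u', $char);

// Without u, "." matches a single byte, so a three-byte string
// cannot satisfy ^.$ and the match fails.
$d = preg_match('/^.$/', $char);

var_dump($a, $b, $c, $d);   // int(1) int(1) int(1) int(0)
```

The choice of delimiter does not change the result; it is usually picked so that the pattern itself needs less escaping.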

Example 1:

<?php
$strings = '测试一下';
$is_true = preg_match_all('%[\x{4e00}-\x{9fa5}]+%u', $strings, $match);
// $is_true holds the number of full matches (1 here); $match holds the matched text.
print_r($match);

The print result is as follows:

Array
(
    [0] => Array
        (
            [0] => 测试一下
        )
)

What does [\x{4e00}-\x{9fa5}] mean here?

In a PCRE pattern, \x{...} denotes a character by its hexadecimal code point.

The Unicode code points of common Chinese characters lie in the range 4E00 - 9FFF (hexadecimal), the CJK Unified Ideographs block.

So the match can be written as a range inside a character class []: [\x{4E00}-\x{9FFF}]

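For typical Chinese text, these two patterns have the same effect. They differ only for the code points U+9FA6 to U+9FFF, which the second range includes and the first does not.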
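
The comparison of the two ranges can be checked directly. The following is a minimal sketch, assuming PHP 7 or later (for the "\u{...}" string escape) and a UTF-8 source file; the variable names and the \p{Han} alternative are additions for illustration and do not come from the original article.

```php
<?php
$narrow = '/^[\x{4e00}-\x{9fa5}]+$/u';
$wide   = '/^[\x{4e00}-\x{9fff}]+$/u';

$text = '测试一下';
$edge = "\u{9FA6}";   // first code point beyond the narrower range

$r1 = preg_match($narrow, $text);   // 1: ordinary text matches both ranges
$r2 = preg_match($wide, $text);     // 1
$r3 = preg_match($narrow, $edge);   // 0: outside 4E00-9FA5
$r4 = preg_match($wide, $edge);     // 1: inside 4E00-9FFF

// Alternative (not in the original): the Unicode script property,
// which avoids hard-coding a code point range.
$r5 = preg_match('/^\p{Han}+$/u', $text);   // 1

var_dump($r1, $r2, $r3, $r4, $r5);
```

Which form to use is a judgment call: the explicit range documents exactly which block is accepted, while \p{Han} also covers Han characters outside that block.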