Recently, my boss gave me a small exercise in duplicate-data checking: extract the Chinese fields from a file that contains them, store them, and do it all in PHP. Along the way I ran into the problem of matching Chinese with PHP regular expressions. I searched a lot on the Internet, but what I found was confusing and none of it was accurate. After modifying and testing my own code, I am writing down the extraction function first.
The first thing to note is the encoding of double-byte characters. We may also run into the same encoding issues with Korean and Japanese in the future, and they are handled the same way as Chinese.
1. GBK (GB2312/GB18030)
In Notepad++, we can first test whether our regular expression is written correctly. The first expression I tested was [\u4e00-\u9fa5]+, where the + sign means one or more matching characters. The result was as expected. So, can this pattern be used in a script?
Let's test it. We call preg_match_all('/[\u4e00-\u9fa5]+/', $subject, $matches), and this is the result: Compilation failed: PCRE does not support \L, \l, \N{name}, \U, or \u at offset 2. Confusing, isn't it? What is the reason for this?
After consulting a lot of material, I found that the PCRE in the message above is the Perl-compatible regular expression library behind PHP's preg functions, and that it does not accept the \u escape. What PHP offers instead is the u modifier (PCRE_UTF8). This modifier enables an extra feature in PCRE that is incompatible with Perl: pattern strings are treated as UTF-8. It is available since PHP 4.1.0 under Unix and since PHP 4.2.3 under win32. PHP regular expressions also write hexadecimal values differently: in PHP, \x is used to represent hexadecimal data. With that in mind, we revise the code, and the detection function becomes:
$match=array();
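A minimal sketch of how such a function might look for GBK-encoded input follows. The function name `extract_chinese_gbk`, the sample string, and the byte ranges are assumptions made for illustration, not the original code. The pattern matches any GBK double-byte character (lead byte 0x81-0xFE, trail byte 0x40-0xFE except 0x7F), so it also picks up full-width punctuation; a narrower range would be needed to keep only Chinese characters.

```php
<?php
// Hypothetical helper: collect runs of GBK double-byte characters.
function extract_chinese_gbk(string $subject): array
{
    $match = array();
    // No "u" modifier here: the subject is GBK, not UTF-8, so the
    // pattern works on raw bytes written with \x escapes.
    preg_match_all('/(?:[\x81-\xFE][\x40-\x7E\x80-\xFE])+/', $subject, $match);
    return $match[0];
}

// Two Chinese words written as GBK byte escapes, so the example does
// not depend on the encoding of this source file.
$subject = "id=1,\xD6\xD0\xCE\xC4,abc,\xB2\xE2\xCA\xD4";
$found = extract_chinese_gbk($subject);

foreach ($found as $bytes) {
    echo bin2hex($bytes), "\n"; // prints d6d0cec4, then b2e2cad4
}
```

Printing the hex form keeps the output readable on a terminal that is not set to GBK.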
The input file is:
(input file screenshot)
The following is the output file content after extracting the Chinese:
(output file screenshot)
The result is in line with what was expected.
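For comparison, here is a hedged sketch of the same idea when the data is UTF-8 rather than GBK. It assumes a valid UTF-8 subject (the sample string is made up) and shows both the failing \u form and the \x{...} form combined with the u modifier. The variable names are illustrative only.

```php
<?php
// Two Chinese words written as UTF-8 byte escapes (assumed sample data).
$subject = "id=1,\xE4\xB8\xAD\xE6\x96\x87,abc,\xE6\xB5\x8B\xE8\xAF\x95";

// PCRE rejects the \u escape, so the pattern does not compile and the
// call returns false; @ only hides the warning for this demonstration.
$bad = @preg_match_all('/[\u4e00-\u9fa5]+/', $subject, $ignored);
var_dump($bad); // bool(false)

// \x{...} is the hexadecimal code point form PCRE accepts, and the
// "u" modifier makes pattern and subject be treated as UTF-8.
$ok = preg_match_all('/[\x{4e00}-\x{9fa5}]+/u', $subject, $matches);
var_dump($ok); // int(2)

foreach ($matches[0] as $word) {
    echo bin2hex($word), "\n"; // prints e4b8ade69687, then e6b58be8af95
}
```

The range 4e00-9fa5 is the one used earlier in this post; it covers the common CJK ideographs but not every Chinese character.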