Internationalized string comparison objects in PHP

PHP ships with a rich set of internationalization functions, including several that are very useful but not widely known. This article covers one such group: the locale-aware string sorting and comparison features.

Sort

Normally, when we sort an array of strings, PHP compares them byte by byte, which for single-byte text is the order of the ASCII table. That works fine for English, but for Chinese the result follows the UTF-8 encoding order and looks arbitrary to a reader.

$arr = ['我','是','硬','核','项', '目', '经', '理'];
sort($arr);
var_dump( $arr );
// array(8) {
//     [0]=>
//     string(3) "我"
//     [1]=>
//     string(3) "是"
//     [2]=>
//     string(3) "核"
//     [3]=>
//     string(3) "理"
//     [4]=>
//     string(3) "目"
//     [5]=>
//     string(3) "硬"
//     [6]=>
//     string(3) "经"
//     [7]=>
//     string(3) "项"
//   }

By convention, Chinese characters are sorted by their pinyin. To do this, people often write their own sorting algorithm or look for a suitable Composer package. In fact, PHP already provides an object for this kind of problem.

$coll = new Collator( 'zh_CN' );
$coll->sort($arr);
var_dump( $arr );
// array(8) {
//     [0]=>
//     string(3) "核"
//     [1]=>
//     string(3) "经"
//     [2]=>
//     string(3) "理"
//     [3]=>
//     string(3) "目"
//     [4]=>
//     string(3) "是"
//     [5]=>
//     string(3) "我"
//     [6]=>
//     string(3) "项"
//     [7]=>
//     string(3) "硬"
//   }

That object is the Collator class. It takes a locale when it is instantiated; here we pass zh_CN, the Chinese (China) locale. Its sort() method then sorts the Chinese characters by pinyin.

$coll->sort($arr, Collator::SORT_NUMERIC );
var_dump( $arr );
// array(8) {
//     [0]=>
//     string(3) "核"
//     [1]=>
//     string(3) "经"
//     [2]=>
//     string(3) "理"
//     [3]=>
//     string(3) "目"
//     [4]=>
//     string(3) "是"
//     [5]=>
//     string(3) "我"
//     [6]=>
//     string(3) "项"
//     [7]=>
//     string(3) "硬"
//   }
$coll->sort($arr, Collator::SORT_STRING );
var_dump( $arr );
// array(8) {
//     [0]=>
//     string(3) "核"
//     [1]=>
//     string(3) "经"
//     [2]=>
//     string(3) "理"
//     [3]=>
//     string(3) "目"
//     [4]=>
//     string(3) "是"
//     [5]=>
//     string(3) "我"
//     [6]=>
//     string(3) "项"
//     [7]=>
//     string(3) "硬"
//   }

The sort() method of the Collator object also accepts a second parameter, a flag that specifies whether values are compared as strings (Collator::SORT_STRING) or as numbers (Collator::SORT_NUMERIC). For purely Chinese content it makes no difference.

Besides sort(), there is also an asort() method. It works like the ordinary asort() function, preserving the association between keys and values, but compares according to the locale.

$arr = [
    'a' => '100',
    'b' => '7',
    'c' => '50'
];
$coll->asort($arr, Collator::SORT_NUMERIC );
var_dump( $arr );
// array(3) {
//     ["b"]=>
//     string(1) "7"
//     ["c"]=>
//     string(2) "50"
//     ["a"]=>
//     string(3) "100"
//   }
$coll->asort($arr, Collator::SORT_STRING );
var_dump( $arr );
// array(3) {
//     ["a"]=>
//     string(3) "100"
//     ["c"]=>
//     string(2) "50"
//     ["b"]=>
//     string(1) "7"
//   }
$arr = [
    '中' => '100',
    '的' => '7',
    '文' => '50'
];
$coll->asort($arr, Collator::SORT_NUMERIC );
var_dump( $arr );
// array (
//     '的' => '7',
//     '文' => '50',
//     '中' => '100',
//   )
$coll->asort($arr, Collator::SORT_STRING );
var_dump( $arr );
// array (
//     '中' => '100',
//     '文' => '50',
//     '的' => '7',
//   )

The asort() method sorts by value while keeping each key attached to its value, so SORT_STRING and SORT_NUMERIC produce visibly different results here. With SORT_NUMERIC the values are ordered by their numeric content (7, 50, 100); with SORT_STRING they are ordered as strings ("100", "50", "7"). The keys, whether Latin or Chinese, do not affect the order.

Both sort() and asort() behave essentially like PHP's built-in sort() and asort() functions; the difference is that they are locale-aware.

In addition, the Collator object also provides a sortWithSortKeys() method, which is not available in ordinary PHP sorting functions.

$arr = ['我','是','硬','核','项', '目', '经', '理'];
$coll->sortWithSortKeys($arr);
var_dump( $arr );
// array (
//     0 => '核',
//     1 => '经',
//     2 => '理',
//     3 => '目',
//     4 => '是',
//     5 => '我',
//     6 => '项',
//     7 => '硬',
//   )

It is similar to the sort() method, but uses ucol_getSortKey() to generate the ICU sort key, which is faster on large arrays.
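To make the idea of sorting by sort keys more concrete, here is a minimal sketch that does by hand roughly what sortWithSortKeys() is described as doing: compute each key once with getSortKey(), then order the strings by comparing the raw keys with strcmp(). It assumes the intl extension is loaded and PHP 7.4 or later (for arrow functions); the variable names are only illustrative, and this is not the extension's actual implementation.

```php
<?php
// Sketch: sort by precomputed ICU sort keys (assumes ext-intl is available).
$coll  = new Collator('zh_CN');
$words = ['我', '是', '硬', '核', '项', '目', '经', '理'];

// Compute each sort key once. Keys are binary strings that can be
// compared byte by byte.
$keys = [];
foreach ($words as $i => $word) {
    $keys[$i] = $coll->getSortKey($word);
}

// Order the indexes by comparing the raw keys with strcmp().
$order = array_keys($words);
usort($order, fn (int $a, int $b): int => strcmp($keys[$a], $keys[$b]));

$sortedByKeys = array_map(fn (int $i): string => $words[$i], $order);

// For comparison: let the built-in method sort a copy of the same data.
$expected = $words;
$coll->sortWithSortKeys($expected);

var_dump($sortedByKeys === $expected);
```

The benefit on large arrays comes from computing each key only once, instead of running a full collation comparison for every pair the sort algorithm visits.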

ICU stands for International Components for Unicode. It is a set of libraries providing Unicode and globalization support, such as collation, formatting, and text conversion, and it underlies the internationalization features of many systems and programming languages, including PHP's intl extension.

Comparison

Next comes string comparison. With a plain byte comparison, "a" is greater than "A", because in the ASCII table "A" is 65 and "a" is 97. That is only the default behaviour. When we compare with the Collator object, the comparison follows the collation order defined in ICU's locale data. For Chinese, that is basically pinyin order.

var_dump($coll->compare('Hello', 'hello')); // int(1)
var_dump($coll->compare('你好', '您好')); // int(-1)

The compare() method performs the comparison. It returns 0 if the two strings are equal, 1 if the first string is greater than the second, and -1 otherwise. From the code we can see that "Hello" is greater than "hello", and that "你好" is less than "您好" (because the pinyin of "您", nin, has an extra n compared with "你", ni).
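Because compare() returns -1, 0 or 1, it can also serve as a callback for usort(), which is handy when the strings to sort are fields inside larger records. The following is a small sketch under that assumption; the names and ages are made-up sample data, and the intl extension and PHP 7.4 or later are assumed.

```php
<?php
// Sketch: sort records by a Chinese name field using Collator::compare().
$coll = new Collator('zh_CN');

// Hypothetical sample data, for illustration only.
$people = [
    ['name' => '张伟', 'age' => 31],
    ['name' => '李娜', 'age' => 27],
    ['name' => '王芳', 'age' => 45],
];

// compare() returns false on failure, so cast the result to int
// to always hand usort() an integer.
usort(
    $people,
    fn (array $a, array $b): int => (int) $coll->compare($a['name'], $b['name'])
);

// Pinyin order: li, wang, zhang.
print_r(array_column($people, 'name'));
```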

Attribute settings

Some attributes can also be set on the Collator object.

$coll->setAttribute(Collator::CASE_FIRST, Collator::UPPER_FIRST);
var_dump($coll->getAttribute(Collator::CASE_FIRST)); // int(25)
var_dump($coll->compare('Hello', 'hello')); // int(-1)
$coll->setAttribute(Collator::CASE_FIRST, Collator::LOWER_FIRST);
var_dump($coll->getAttribute(Collator::CASE_FIRST)); // int(24)
var_dump($coll->compare('Hello', 'hello')); // int(1)
$coll->setAttribute(Collator::CASE_FIRST, Collator::OFF);
var_dump($coll->getAttribute(Collator::CASE_FIRST)); // int(16)
var_dump($coll->compare('Hello', 'hello')); // int(1)

Here we set the CASE_FIRST attribute on the object. Its value can be uppercase first, lowercase first, or off (the locale default). For English text this affects both sorting and comparison results.
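The same attribute also changes what sort() returns. As a small sketch (assuming the intl extension and an en_US collator; the letters are arbitrary sample data), sorting a mix of upper- and lowercase letters under both settings looks like this:

```php
<?php
// Sketch: effect of CASE_FIRST on Collator::sort().
$coll    = new Collator('en_US');
$letters = ['b', 'A', 'a', 'B'];

// Uppercase letters sort before their lowercase counterparts.
$coll->setAttribute(Collator::CASE_FIRST, Collator::UPPER_FIRST);
$upperFirst = $letters;
$coll->sort($upperFirst);
echo implode(' ', $upperFirst), PHP_EOL; // A a B b

// Lowercase letters sort before their uppercase counterparts.
$coll->setAttribute(Collator::CASE_FIRST, Collator::LOWER_FIRST);
$lowerFirst = $letters;
$coll->sort($lowerFirst);
echo implode(' ', $lowerFirst), PHP_EOL; // a A b B
```

In both cases the letter itself decides the order first; case only breaks ties between otherwise identical letters.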

In addition, we can also obtain the current regional language information through a method.

var_dump($coll->getLocale(Locale::VALID_LOCALE)); // string(10) "zh_Hans_CN"
var_dump($coll->getLocale(Locale::ACTUAL_LOCALE)); // string(2) "zh"

The two constants request the valid locale (the most specific locale for which ICU has data) and the actual locale (the locale the collation rules were actually loaded from).

Sort information

We can also look at the sort keys themselves, that is, the binary keys Collator generates for each string.

var_dump(bin2hex($coll->getSortKey('Hello'))); // string(20) "b6b0bebec4010901dc08"
var_dump(bin2hex($coll->getSortKey('hello'))); // string(18) "b6b0bebec401090109"
var_dump(bin2hex($coll->getSortKey('你好'))); // string(16) "7b9b657301060106"
var_dump(bin2hex($coll->getSortKey('您好'))); // string(16) "7c33657301060106"
$coll = collator_create( 'en_US' );
var_dump($coll->compare('Hello', 'hello')); // int(1)
var_dump($coll->compare('你好', '您好')); // int(-1)
var_dump($coll->getLocale(Locale::VALID_LOCALE)); // string(5) "en_US"
var_dump($coll->getLocale(Locale::ACTUAL_LOCALE)); // string(4) "root"
var_dump(bin2hex($coll->getSortKey('Hello'))); // string(20) "3832404046010901dc08"
var_dump(bin2hex($coll->getSortKey('hello'))); // string(18) "383240404601090109"
var_dump(bin2hex($coll->getSortKey('你好'))); // string(20) "fb0b8efb649401060106"
var_dump(bin2hex($coll->getSortKey('您好'))); // string(20) "fba5f8fb649401060106"

As we can see, getSortKey() produces different sort keys under different locales. The keys are binary strings, shown here in hexadecimal through bin2hex(), and they bear no relation to the characters' ASCII or UTF-8 byte values.

Error message

$coll = new Collator( 'en_US' );
$coll->compare( 'y', 'k' );
var_dump($coll->getErrorCode()); // int(0)
var_dump($coll->getErrorMessage()); // string(12) "U_ZERO_ERROR"

Use getErrorCode() to get the error code and getErrorMessage() to get the error message. U_ZERO_ERROR is ICU's status code for "no error": its value is 0, so the output above simply means that the last operation succeeded.
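In practice the error code is mostly useful as a check after an operation. Here is a minimal sketch of such a check, assuming the intl extension is loaded; it uses the intl functions intl_is_failure() and intl_error_name() to interpret the code, and the echoed messages are only illustrative.

```php
<?php
// Sketch: check the collator's status after an operation.
$coll   = new Collator('en_US');
$result = $coll->compare('y', 'k');
$code   = $coll->getErrorCode();

if ($result === false || intl_is_failure($code)) {
    // A failure status: report the message kept on the object.
    echo 'Comparison failed: ', $coll->getErrorMessage(), PHP_EOL;
} else {
    // Zero or a warning status: the result can be used.
    echo 'Status: ', intl_error_name($code), ', result: ', $result, PHP_EOL;
}
```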

Strength of Sorting Rules

The Collator object also has a collation strength setting, but in my tests it had no visible effect.

$arr  = array( 'a', 'à' ,'A');
$coll = new Collator( 'de_DE' );
$coll->sort($arr);
var_dump($coll->getStrength()); // int(2)
var_dump( $arr );
// array(3) {
//     [0]=>
//     string(1) "a"
//     [1]=>
//     string(1) "A"
//     [2]=>
//     string(2) "à"
//   }
$coll->setStrength(Collator::IDENTICAL);
var_dump($coll->getStrength()); // int(15)
$coll->sort($arr);
var_dump( $arr );
$coll->setStrength(Collator::QUATERNARY);
var_dump($coll->getStrength()); // int(3)
$coll->sort($arr);
var_dump( $arr );
$coll->setStrength(Collator::PRIMARY);
var_dump($coll->getStrength()); // int(0)
$coll->sort($arr );
var_dump( $arr );
$coll->setStrength(Collator::TERTIARY);
var_dump($coll->getStrength()); // int(2)
$coll->sort($arr );
var_dump( $arr );
$coll->setStrength(Collator::SECONDARY);
var_dump($coll->getStrength()); // int(1)
$coll->sort($arr );
var_dump( $arr );

In the official documentation's test code, different strengths produce different sort orders, but in my own tests the results were all the same. A likely reason is the test data: 'a', 'A' and 'à' differ only in accent and case, so at weaker strengths the collator treats them as equal, and since the array is already sorted by the first sort() call, the later calls have nothing to reorder. If you know more about this, please leave a comment so we can all learn together!
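One way to see the strength setting at work is to call compare() directly instead of sort(). The sketch below assumes the intl extension and a de_DE collator with default attributes; it compares 'a' with 'A' (a case difference) and 'a' with 'à' (an accent difference) at three strength levels. The $results array is just an illustrative helper.

```php
<?php
// Sketch: how collation strength changes what counts as a difference.
$coll = new Collator('de_DE');

$levels = [
    'PRIMARY'   => Collator::PRIMARY,   // base letters only
    'SECONDARY' => Collator::SECONDARY, // base letters and accents
    'TERTIARY'  => Collator::TERTIARY,  // base letters, accents and case
];

$results = [];
foreach ($levels as $name => $level) {
    $coll->setStrength($level);
    $results[$name] = [
        'case'   => $coll->compare('a', 'A'),
        'accent' => $coll->compare('a', 'à'),
    ];
    printf(
        "%-9s a vs A: %2d   a vs à: %2d\n",
        $name,
        $results[$name]['case'],
        $results[$name]['accent']
    );
}
```

At PRIMARY strength both comparisons return 0, at SECONDARY only the accent counts, and at TERTIARY both case and accent count, which is why sorting these three strings again at a weaker strength has nothing to rearrange.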

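Summary

An interesting object, isn't it? It also supports a procedural function style, and the sample code includes calls written that way. Overall, sorting and comparing by pinyin should find plenty of use in real-world development, so give it a try!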
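For reference, here is a short sketch of the procedural style, using the intl functions collator_create(), collator_sort() and collator_compare() with the same data as earlier in the article. It assumes the intl extension is loaded.

```php
<?php
// Sketch: the same operations in procedural style.
$coll = collator_create('zh_CN');

$arr = ['我', '是', '硬', '核', '项', '目', '经', '理'];
collator_sort($coll, $arr);           // same effect as $coll->sort($arr)
echo implode(' ', $arr), PHP_EOL;

$cmp = collator_compare($coll, '你好', '您好'); // same as $coll->compare(...)
var_dump($cmp);
```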

Test code:
https://github.com/zhangyue0503/dev-blog/blob/master/php/202011/source/3.PHP中国际化的字符串比较对象.php
Reference documentation:
https://www.php.net/manual/zh/class.collator.php
