Bringing Unicode to PHP with Portable UTF-8

Core points

Although PHP accepts multi-byte variable names and can carry Unicode strings, the language lacks comprehensive Unicode support because its standard string functions treat strings as sequences of single-byte characters. This limitation affects nearly every string operation, including substring extraction, string length calculation, and string splitting.
Portable UTF-8 is a userland library that brings Unicode support to PHP applications. It is built on top of mbstring and iconv, provides about 60 Unicode-aware string manipulation, testing, and validation functions, and uses UTF-8 as its primary character encoding. The library is fully portable and works with any installation of PHP 4.2 or later.
Portable UTF-8 provides functions for common Unicode string tasks: validating UTF-8 input, removing invalid bytes, encoding text as HTML entities to help prevent XSS attacks, trimming whitespace, removing duplicate whitespace, creating URL slugs that contain UTF-8 characters, and enforcing limits on input length in characters. In a Unicode-enabled application, the focus shifts from bytes and byte lengths to characters and character lengths.

PHP allows multi-byte variable names (e.g. $a∩b, $Ʃxy and $Δx), extensions such as mbstring can handle Unicode strings, and the utf8_encode() and utf8_decode() functions convert strings between UTF-8 and ISO-8859-1. Even so, PHP is widely regarded as lacking Unicode support. This article explains what that lack of support means in practice and demonstrates a library that brings Unicode support to PHP applications: Portable UTF-8.

Unicode support in PHP

PHP's lack of Unicode/multi-byte support means that the standard string functions treat strings as sequences of single-byte characters. In fact, the official PHP manual defines a string as a "series of characters, where a character is the same as a byte". PHP supports only 8-bit characters, while Unicode (and many other character sets) may need several bytes to represent one character. This limitation affects almost every string operation, including (but not limited to) substring extraction, string length calculation, splitting, and shuffling.

Work on this problem began in early 2005, but in 2010 the effort to bring native Unicode support to PHP was halted and put on hold for a variety of reasons. Since native Unicode support may take years to arrive (if it arrives at all), developers must rely on available extensions such as mbstring and iconv to fill the gap. These extensions are not Unicode-centric; they also convert between non-Unicode encodings. They do make Unicode string processing easier, but they have drawbacks. They offer only limited Unicode string processing capabilities, and none of them is enabled by default, so a server administrator must explicitly enable them before PHP applications can use them. Shared hosting providers often make things worse by installing only one of the extensions, which makes it hard for developers to count on an always-available API for their Unicode needs.

Still, the good news is that PHP can output Unicode text. PHP does not care whether we send English text encoded in ASCII or text in a language whose characters are encoded in multiple bytes. Knowing this, all PHP developers need is an API that makes Unicode-aware string manipulation comfortable.
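
To make the byte-versus-character problem concrete, here is a minimal sketch using only core PHP and the mbstring extension (it does not use Portable UTF-8). It assumes PHP 7.0 or later for the "\u{...}" escape and that mbstring is loaded.

```php
<?php
// "naïve": 5 characters, but 6 bytes in UTF-8 ("ï" is encoded as 0xC3 0xAF).
$word = "na\u{00EF}ve";

$byteCount = strlen($word);             // counts bytes
$charCount = mb_strlen($word, 'UTF-8'); // counts characters

// Byte-based substr() cuts "ï" in half and leaves an invalid sequence...
$brokenPrefix = substr($word, 0, 3);
// ...while the character-aware variant keeps the character whole.
$intactPrefix = mb_substr($word, 0, 3, 'UTF-8');

printf("bytes: %d, characters: %d\n", $byteCount, $charCount); // bytes: 6, characters: 5
var_dump(mb_check_encoding($brokenPrefix, 'UTF-8'));           // bool(false)
var_dump($intactPrefix === "na\u{00EF}");                      // bool(true)
```

The same mismatch appears in every byte-oriented function, which is why a character-aware API is needed.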

Portable UTF-8

A more recent solution is to create userland libraries written in PHP. Even when the server or the language itself lacks support, such a library can easily be bundled with the application to guarantee that Unicode support is present. Many open source applications already ship their own library of this kind, and many more use free third-party libraries; Portable UTF-8 is one of them.

Portable UTF-8 is a free, lightweight library built on top of mbstring and iconv. It extends the functionality of these two extensions with about 60 Unicode-aware string manipulation, testing, and validation functions, and it provides UTF-8-aware counterparts for nearly all of PHP's common string functions. As the name implies, Portable UTF-8 uses UTF-8 as its primary character encoding. The library uses the available extensions (mbstring and iconv) for speed and smooths over some inconsistencies that arise when they are used directly; if neither extension is present on the server, it falls back to UTF-8 routines written in pure PHP. Portable UTF-8 is fully portable and works with any installation of PHP 4.2 or later.
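
The "use the extension if present, otherwise fall back to pure PHP" design can be sketched as follows. This is only an illustration of the idea, not the library's actual code: char_length() and pure_php_char_length() are hypothetical names, and the sketch assumes PHP 7.0 or later.

```php
<?php
// Hypothetical helper (not part of Portable UTF-8): counts the characters
// in a UTF-8 string, preferring a native extension when one is available.
function char_length(string $text): int
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($text, 'UTF-8');
    }
    if (function_exists('iconv_strlen')) {
        $length = iconv_strlen($text, 'UTF-8');
        if ($length !== false) {
            return $length;
        }
    }
    return pure_php_char_length($text);
}

// Pure-PHP fallback: in valid UTF-8, every character contains exactly one
// byte that is not a continuation byte (10xxxxxx), so counting those bytes
// counts the characters. The pattern deliberately runs without the /u flag.
function pure_php_char_length(string $text): int
{
    return (int) preg_match_all('/[\x00-\x7F\xC0-\xFF]/', $text);
}

var_dump(char_length("na\u{00EF}ve")); // int(5)
```

The fallback assumes the input is already valid UTF-8; invalid input would need to be cleaned first.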

String handling with Portable UTF-8

Text editors with poor Unicode support can corrupt text when they read it, and text copied from such an editor and pasted into a web form can be a source of invalid UTF-8 for the application. When processing user-submitted input, always verify that the input matches what the application expects. To detect whether text is valid UTF-8, use the library's is_utf8() function.

if (is_utf8($_POST['title'])) {
    // do something...
}
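
How is_utf8() works internally is not shown in this article. As a rough idea of what such a check involves, here is a sketch built on core PHP only; looks_like_utf8() is a hypothetical name, and PHP 7.0 or later is assumed.

```php
<?php
// Hypothetical stand-in for a UTF-8 validity check (not the library's code).
function looks_like_utf8(string $text): bool
{
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($text, 'UTF-8');
    }
    // With the /u flag, preg_match() returns false when the subject is not
    // valid UTF-8; the empty pattern matches any valid string.
    return preg_match('//u', $text) === 1;
}

var_dump(looks_like_utf8("caf\u{00E9}")); // bool(true)
var_dump(looks_like_utf8("caf\xE9"));     // bool(false): ISO-8859-1 byte, not UTF-8
```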

Recovering characters from invalid bytes is impossible, so removing bytes that are not recognized as valid UTF-8 characters may be your only choice. The utf8_clean() function can be used to remove invalid bytes.

$title = utf8_clean($_POST['title']);
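
For comparison, one way to drop invalid bytes with mbstring alone is sketched below. It is an assumption about how such cleaning can be done, not a description of utf8_clean(); strip_invalid_utf8() is a hypothetical name, and PHP 7.0 or later with mbstring is assumed.

```php
<?php
// Hypothetical helper (not the library's implementation): re-encodes the
// text from UTF-8 to UTF-8 so that mbstring filters out invalid sequences.
function strip_invalid_utf8(string $text): string
{
    $previous = mb_substitute_character();  // remember the current setting
    mb_substitute_character('none');        // drop invalid bytes instead of writing "?"
    $clean = mb_convert_encoding($text, 'UTF-8', 'UTF-8');
    mb_substitute_character($previous);     // restore the setting
    return $clean;
}

$title = strip_invalid_utf8("caf\u{00E9} \xFF menu");
var_dump(mb_check_encoding($title, 'UTF-8')); // bool(true)
```

Changing the substitute character is a global mbstring setting, which is why the sketch restores the previous value before returning.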

Each Unicode character can be encoded as the corresponding HTML entity, and you may want to encode the text in this way to help prevent XSS attacks before outputting it to the browser.

echo utf8_html_encode($title);
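
A similar effect can be approximated with core PHP functions, as in the sketch below. It is not the library's implementation: html_encode_sketch() is a hypothetical name, and the choice to escape markup characters first and then turn non-ASCII characters into numeric entities is an assumption made for illustration (PHP 7.0 or later with mbstring).

```php
<?php
// Hypothetical helper: escapes HTML special characters, then converts every
// non-ASCII character into a numeric HTML entity.
function html_encode_sketch(string $text): string
{
    // ENT_SUBSTITUTE replaces invalid UTF-8 with U+FFFD instead of returning ''.
    $escaped = htmlspecialchars($text, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
    // Map: code points 0x80..0x10FFFF, offset 0, mask 0x1FFFFF.
    return mb_encode_numericentity($escaped, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');
}

echo html_encode_sketch("<b>\u{00E9}</b>"), "\n"; // &lt;b&gt;&#233;&lt;/b&gt;
```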

Whitespace is usually trimmed from the beginning and end of a string. Unicode defines about 20 whitespace characters, and some ASCII control characters should also be treated as candidates for trimming.

$title = utf8_trim($title);
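
The exact set of characters utf8_trim() removes is not listed here. The sketch below shows the general idea with a Unicode-aware regular expression; trim_unicode() is a hypothetical name, and the choice of the \p{Z} (separators) and \p{Cc} (control characters) classes is an assumption (PHP 7.0 or later).

```php
<?php
// Hypothetical helper: trims Unicode separators and control characters
// from both ends of a UTF-8 string.
function trim_unicode(string $text): string
{
    $trimmed = preg_replace('/^[\p{Z}\p{Cc}]+|[\p{Z}\p{Cc}]+\z/u', '', $text);
    // preg_replace() returns null on failure (for example, invalid UTF-8).
    return $trimmed === null ? $text : $trimmed;
}

// Ideographic space (U+3000) in front, no-break space (U+00A0) and a tab behind.
var_dump(trim_unicode("\u{3000} Tokyo\u{00A0}\t")); // string(5) "Tokyo"
```

PHP's built-in trim() would leave U+3000 and U+00A0 in place, because by default it only strips a handful of ASCII characters.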

Repeated whitespace characters may also appear in the middle of a string and should be removed. The following shows how to combine utf8_remove_duplicates() and utf8_ws():

$title = utf8_remove_duplicates($title, utf8_ws());
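
As a rough equivalent in core PHP, consecutive repeats of the same whitespace character can be collapsed with a back-reference. collapse_repeated_whitespace() is a hypothetical name, and whether the library treats mixed runs the same way is not something this sketch claims (PHP 7.0 or later).

```php
<?php
// Hypothetical helper: collapses runs of the SAME separator or control
// character into a single occurrence.
function collapse_repeated_whitespace(string $text): string
{
    $collapsed = preg_replace('/([\p{Z}\p{Cc}])\1+/u', '$1', $text);
    // preg_replace() returns null on failure (for example, invalid UTF-8).
    return $collapsed === null ? $text : $collapsed;
}

var_dump(collapse_repeated_whitespace('a   b') === 'a b');                             // bool(true)
var_dump(collapse_repeated_whitespace("a\u{3000}\u{3000}b") === "a\u{3000}b");         // bool(true)
```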

The traditional way to create URL slugs for SEO purposes is to transliterate the text and strip every non-ASCII character from the slug, which makes the URL less valuable than it could be. Since URLs can carry UTF-8 encoded characters, we can skip the stripping and transliteration and create rich slugs that contain characters from any language:

$slug = utf8_url_slug($title, 30); // limit the slug to 30 characters
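
The idea behind a Unicode-preserving slug can be sketched in a few lines of core PHP. unicode_slug() is a hypothetical name, and the rules it applies (lowercase, keep letters, marks, and digits, join with hyphens) are assumptions rather than the behavior of utf8_url_slug(). PHP 7.0 or later with mbstring is assumed.

```php
<?php
// Hypothetical helper: builds a slug that keeps letters and digits from any
// script and limits the result to $maxChars characters (not bytes).
function unicode_slug(string $title, int $maxChars): string
{
    $slug = mb_strtolower($title, 'UTF-8');
    // Replace every run of characters that are not letters, combining marks,
    // or digits with a single hyphen.
    $slug = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '-', $slug);
    if ($slug === null) {
        return ''; // invalid UTF-8 input
    }
    $slug = mb_substr(trim($slug, '-'), 0, $maxChars, 'UTF-8');
    return trim($slug, '-'); // truncation may leave a trailing hyphen
}

var_dump(unicode_slug("Caf\u{00E9} au Lait!", 30)); // "café-au-lait"
```

The non-ASCII characters are percent-encoded on the wire, but browsers generally display them in readable form.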

From input validation through to saving data in a database, a Unicode-enabled application works with characters and character lengths, not bytes and byte lengths. This shift in focus calls for an interface that understands the difference. Limiting the length of the input in characters is a common requirement, so if the input is longer than 60 characters, we take a substring.

if (utf8_strlen($title) > 60) {
    $title = utf8_substr($title, 0, 60);
}

Or:

if (!utf8_fits_inside($title, 60)) {
    $title = utf8_substr($title, 0, 60);
}

There are three different ways to access individual characters with the Portable UTF-8 library. First, utf8_access() returns the character at a given zero-based position.

echo 'The sixth character is: ' . utf8_access($string, 5);

Second, utf8_chr_map() iterates over the string and passes each character to a callback function.

utf8_chr_map('some_callback', $string);

Third, we can split the string into an array of characters with utf8_split() and process each array element as a single character.

array_map('some_callback', utf8_split($string));
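
For readers who want to see what character-level access looks like without the library, the sketch below does the same kinds of things with mbstring and a /u regular expression. It assumes PHP 7.0 or later; the variable names are illustrative only.

```php
<?php
$string = "na\u{00EF}ve"; // "naïve"

// Access one character by its zero-based character position.
$third = mb_substr($string, 2, 1, 'UTF-8');

// Split into an array of characters; the /u flag makes the empty pattern
// match between characters rather than between bytes.
$chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

// Apply a callback to every character.
$upper = array_map(function ($char) {
    return mb_strtoupper($char, 'UTF-8');
}, $chars);

var_dump($third === "\u{00EF}");                 // bool(true)
var_dump(count($chars));                         // int(5)
var_dump(implode('', $upper) === "NA\u{00CF}VE"); // bool(true)
```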

Handling Unicode may also require us to find the minimum or maximum code point in a string, split strings, handle byte order marks, convert case, shuffle, replace, and so on. Portable UTF-8 supports all of these.
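
As one example of these tasks, finding the lowest and highest code point in a string can be sketched with core PHP as follows. code_point_range() is a hypothetical name and not a Portable UTF-8 function; mb_ord() requires PHP 7.2 or later with mbstring.

```php
<?php
// Hypothetical helper: returns [min, max] code points of a UTF-8 string,
// or [null, null] for an empty or invalid string.
function code_point_range(string $text): array
{
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false || $chars === []) {
        return [null, null];
    }
    $codePoints = array_map(function (string $char): int {
        return mb_ord($char, 'UTF-8');
    }, $chars);
    return [min($codePoints), max($codePoints)];
}

// "a" is U+0061, "é" is U+00E9, "€" is U+20AC.
list($min, $max) = code_point_range("a\u{00E9}\u{20AC}");
printf("U+%04X .. U+%04X\n", $min, $max); // U+0061 .. U+20AC
```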

Conclusion

Development of PHP 6 has stopped, which delays the long-awaited native Unicode support that is crucial for building multilingual applications. In the meantime, server-side extensions and userland libraries such as Portable UTF-8 play an important role in helping developers build a better, more standardized web that meets local needs.
