How to Convert UTF-8 Characters to UCS-2 Code Points in PHP?

Linda Hamilton
Release: 2024-11-03 02:09:29

Converting UTF-8 Characters to UCS-2 Code Points

This article shows how to extract the UCS-2 code point of each character in a UTF-8 string. It walks through the decoding step by step and gives an implementation that works in PHP 4 and PHP 5.

Understanding UTF-8

UTF-8 encodes each Unicode character in one to four bytes. The high bits of the leading byte give the length of the sequence, and every byte after it is a continuation byte of the form 10xxxxxx:

  • 0xxxxxxx: 1-byte character
  • 110xxxxx: 2-byte character
  • 1110xxxx: 3-byte character
  • 11110xxx: 4-byte character
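
The patterns above can be tested with bit masks. The following is a minimal sketch; `utf8_sequence_length` is a hypothetical helper name that does not appear in the original article, and returning 0 for a byte that cannot start a sequence is an assumption made here for illustration.

<code class="php">// Hypothetical helper: number of bytes in the UTF-8 sequence that starts with $byte.
function utf8_sequence_length($byte) {
    $b = ord($byte);
    if (($b & 0x80) == 0x00) { return 1; } // 0xxxxxxx
    if (($b & 0xE0) == 0xC0) { return 2; } // 110xxxxx
    if (($b & 0xF0) == 0xE0) { return 3; } // 1110xxxx
    if (($b & 0xF8) == 0xF0) { return 4; } // 11110xxx
    return 0; // 10xxxxxx continuation byte, or a byte that is never valid as a lead
}</code>

For example, `utf8_sequence_length("\xE2")` returns 3, because 0xE2 is 11100010 in binary.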

Converting to UCS-2

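UCS-2 is a fixed-width 16-bit encoding that covers only the Basic Multilingual Plane (U+0000 to U+FFFF). It is the predecessor of UTF-16, not a synonym for it: UTF-16 adds surrogate pairs to reach code points above U+FFFF, which UCS-2 cannot represent. Decoding a UTF-8 sequence to a UCS-2 code point depends on the length of the sequence: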

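  • 1-byte character: The code point is the byte value itself.
  • 2-byte character: Take the low 5 bits of the first byte (mask 0x1F), shift them left by 6 bits, and bitwise OR the result with the low 6 bits of the second byte (mask 0x3F).
  • 3-byte character: Take the low 4 bits of the first byte (mask 0x0F) shifted left by 12 bits, the low 6 bits of the second byte shifted left by 6 bits, and the low 6 bits of the third byte, then bitwise OR the three values.
  • 4-byte character: The code point lies above U+FFFF and has no UCS-2 representation.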
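
As a worked sketch of the two- and three-byte cases, the arithmetic below decodes "é" (bytes 0xC3 0xA9) and "€" (bytes 0xE2 0x82 0xAC). The length-marker bits of the lead byte and the 10 prefix of each continuation byte are masked off before the values are combined. The variable names are illustrative only.

<code class="php">// "é" is encoded in UTF-8 as the two bytes 0xC3 0xA9.
$two = ((0xC3 & 0x1F) << 6) | (0xA9 & 0x3F);

// "€" is encoded in UTF-8 as the three bytes 0xE2 0x82 0xAC.
$three = ((0xE2 & 0x0F) << 12) | ((0x82 & 0x3F) << 6) | (0xAC & 0x3F);

printf("U+%04X U+%04X\n", $two, $three); // prints "U+00E9 U+20AC"</code>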

Implementation in PHP 4/5

For PHP versions 4 or 5, you can implement a function to perform this conversion:

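<code class="php">// Decode one UTF-8 encoded character (1 to 3 bytes) to its UCS-2 code point.
// Assumes $utf8 holds a complete sequence; continuation bytes are not validated.
function utf8_char_to_ucs2($utf8) {
    $lead = ord($utf8[0]);
    if (!($lead & 0x80)) {
        // 0xxxxxxx: ASCII, the byte is the code point
        return $lead;
    } elseif (($lead & 0xE0) == 0xC0) {
        // 110xxxxx 10xxxxxx: 5 + 6 payload bits
        return (($lead & 0x1F) << 6) | (ord($utf8[1]) & 0x3F);
    } elseif (($lead & 0xF0) == 0xE0) {
        // 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 payload bits
        return (($lead & 0x0F) << 12) | ((ord($utf8[1]) & 0x3F) << 6) | (ord($utf8[2]) & 0x3F);
    } else {
        // Stray continuation byte, or a 4-byte sequence beyond the UCS-2 range
        return null;
    }
}</code>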
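
The function above trusts its input. If the string may be truncated or malformed, a stricter variant could check the continuation bytes and reject overlong forms and encoded surrogates. The sketch below is one possible way to do that; the name `utf8_char_to_ucs2_strict` and the choice to return null on any invalid input are assumptions, not part of the original article.

<code class="php">// Hypothetical stricter variant: returns null for any malformed or out-of-range input.
function utf8_char_to_ucs2_strict($utf8) {
    $len = strlen($utf8);
    if ($len == 0) {
        return null;
    }
    $b0 = ord($utf8[0]);
    if ($b0 < 0x80) {
        return $b0;
    }
    if (($b0 & 0xE0) == 0xC0 && $len >= 2 && (ord($utf8[1]) & 0xC0) == 0x80) {
        $cp = (($b0 & 0x1F) << 6) | (ord($utf8[1]) & 0x3F);
        return $cp >= 0x80 ? $cp : null; // values below 0x80 are overlong forms
    }
    if (($b0 & 0xF0) == 0xE0 && $len >= 3
        && (ord($utf8[1]) & 0xC0) == 0x80 && (ord($utf8[2]) & 0xC0) == 0x80) {
        $cp = (($b0 & 0x0F) << 12) | ((ord($utf8[1]) & 0x3F) << 6) | (ord($utf8[2]) & 0x3F);
        // Reject overlong forms and the surrogate range U+D800 to U+DFFF
        return ($cp >= 0x800 && ($cp < 0xD800 || $cp > 0xDFFF)) ? $cp : null;
    }
    return null; // continuation byte as lead, truncated input, or 4-byte sequence
}</code>

With this variant, `utf8_char_to_ucs2_strict("\xC3")` returns null because the second byte is missing, where the unchecked version would read past the end of the string.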

Example Usage

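<code class="php">$utf8 = "hello";
$i = 0;
while ($i < strlen($utf8)) {
    // $utf8[$i] is a single byte, so step by the sequence length (assumes valid UTF-8)
    $lead = ord($utf8[$i]);
    $len = $lead < 0x80 ? 1 : (($lead & 0xE0) == 0xC0 ? 2 : (($lead & 0xF0) == 0xE0 ? 3 : 4));
    $char = substr($utf8, $i, $len);
    printf("Code point for '%s': %d\n", $char, utf8_char_to_ucs2($char));
    $i += $len;
}</code>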
Copy after login

This will output:

Code point for 'h': 104
Code point for 'e': 101
Code point for 'l': 108
Code point for 'l': 108
Code point for 'o': 111
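
To collect the code points of a whole string rather than print them, the same decoding can be wrapped in a loop that advances by the length of each sequence. This is a self-contained sketch; `utf8_string_to_ucs2` is a hypothetical name, and returning null for the entire string on the first undecodable byte is a design assumption.

<code class="php">// Hypothetical helper: array of UCS-2 code points for a UTF-8 string, or null on failure.
function utf8_string_to_ucs2($utf8) {
    $codepoints = array();
    $len = strlen($utf8);
    $i = 0;
    while ($i < $len) {
        $b0 = ord($utf8[$i]);
        if ($b0 < 0x80) {
            $codepoints[] = $b0;
            $i += 1;
        } elseif (($b0 & 0xE0) == 0xC0 && $i + 1 < $len) {
            $codepoints[] = (($b0 & 0x1F) << 6) | (ord($utf8[$i + 1]) & 0x3F);
            $i += 2;
        } elseif (($b0 & 0xF0) == 0xE0 && $i + 2 < $len) {
            $codepoints[] = (($b0 & 0x0F) << 12)
                | ((ord($utf8[$i + 1]) & 0x3F) << 6)
                | (ord($utf8[$i + 2]) & 0x3F);
            $i += 3;
        } else {
            return null; // 4-byte sequence, stray continuation byte, or truncated input
        }
    }
    return $codepoints;
}

// "h", "é" (0xC3 0xA9), "€" (0xE2 0x82 0xAC)
echo implode(' ', utf8_string_to_ucs2("h\xC3\xA9\xE2\x82\xAC")), "\n"; // prints "104 233 8364"</code>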

source:php.cn