How to Extract UCS-2 Code Points from UTF-8 Characters in PHP?

DDD

Release: 2024-10-31 18:00:15

Determining UCS-2 Code Points for UTF-8 Characters in PHP

The goal is to extract the UCS-2 code point of each character in a UTF-8 string. PHP's ord() only returns the value of a single byte, so decoding multi-byte characters requires a custom function.

UTF-8 represents each character as a sequence of 1 to 4 bytes, depending on its Unicode code point. The x bits carry the code point; the fixed leading bits mark the byte's role. The bit patterns and the code point range for each length are as follows:

0xxxxxxx: 1 byte (U+0000 to U+007F)
110xxxxx 10xxxxxx: 2 bytes (U+0080 to U+07FF)
1110xxxx 10xxxxxx 10xxxxxx: 3 bytes (U+0800 to U+FFFF)
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 4 bytes (U+10000 to U+10FFFF)
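
The byte lengths above can be checked directly in PHP. This is a small illustrative sketch; the sample characters are chosen here and are not from the original article. They are written as code point escapes, which assumes PHP 7.0 or later.

```php
<?php
// Sample characters written as code point escapes ("\u{...}" needs PHP 7.0+),
// so the result does not depend on the encoding of this source file.
$samples = [
    'U+0041'  => "\u{41}",    // A
    'U+00F1'  => "\u{F1}",    // ñ
    'U+20AC'  => "\u{20AC}",  // euro sign
    'U+1F600' => "\u{1F600}", // emoji, outside the Basic Multilingual Plane
];

foreach ($samples as $label => $char) {
    // strlen() counts bytes, not characters; bin2hex() shows the raw bytes.
    printf("%s: %d byte(s), hex %s\n", $label, strlen($char), bin2hex($char));
}
```

Each sample falls into a different row of the list: one, two, three and four bytes respectively.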

To determine the number of bytes per character, examine the leading bits of the first byte:

0: 1 byte character
110: 2 byte character
1110: 3 byte character
11110: 4 byte character
10: Continuation byte (cannot start a character)
11111: Invalid byte
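
The classification above maps naturally onto bit masks. The helper below is a minimal sketch; the name utf8_sequence_length is hypothetical and does not appear in the original article. It only inspects the prefix bits, so it does not reject overlong or out-of-range lead bytes such as 0xC0.

```php
<?php
// Hypothetical helper: returns the sequence length announced by a lead byte,
// or 0 for a continuation byte or an invalid byte.
function utf8_sequence_length(int $byte): int
{
    if (($byte & 0x80) === 0x00) { return 1; } // 0xxxxxxx
    if (($byte & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($byte & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($byte & 0xF8) === 0xF0) { return 4; } // 11110xxx
    return 0; // 10xxxxxx (continuation) or 11111xxx (invalid)
}

// 0xC3 is the first byte of "ñ"; 0xB1 is its continuation byte.
printf("%d %d\n", utf8_sequence_length(0xC3), utf8_sequence_length(0xB1)); // prints "2 0"
```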

Once the number of bytes is known, the code point is assembled with bit manipulation: mask off the length prefix of the first byte, then shift in the low six bits of each continuation byte.
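
As a worked example of that bit manipulation, the sketch below decodes the two bytes of "ñ" (0xC3 0xB1) by hand. The variable names are illustrative only.

```php
<?php
// "ñ" (U+00F1) is encoded in UTF-8 as the two bytes 0xC3 0xB1.
$lead = 0xC3;         // 110 00011
$continuation = 0xB1; // 10 110001

// Keep the 5 payload bits of the lead byte and make room for 6 more bits.
$high = ($lead & 0x1F) << 6; // 0b00011 << 6 = 192
// Keep the 6 payload bits of the continuation byte.
$low = $continuation & 0x3F; // 0b110001 = 49

$codePoint = $high | $low;
printf("U+%04X (%d)\n", $codePoint, $codePoint); // prints "U+00F1 (241)"
```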

Custom PHP Function:

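Based on the above analysis, the following function takes a single UTF-8 character as input and returns its code point, or -1 if the first byte is invalid. Note that UCS-2 only covers U+0000 to U+FFFF, so the code point of a 4-byte sequence cannot be represented in UCS-2: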

<code class="php">function get_ucs2_codepoint($char)
{
    // Initialize the code point
    $codePoint = 0;

    // Get the first byte
    $firstByte = ord($char);

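    // Determine the number of bytes from the lead byte
    if ($firstByte < 0x80) {
        $bytes = 1; // 0xxxxxxx
    } elseif ($firstByte < 0xC0) {
        // 10xxxxxx: continuation byte, cannot start a character
        return -1;
    } elseif ($firstByte < 0xE0) {
        $bytes = 2; // 110xxxxx
    } elseif ($firstByte < 0xF0) {
        $bytes = 3; // 1110xxxx
    } elseif ($firstByte < 0xF8) {
        $bytes = 4; // 11110xxx
    } else {
        // 11111xxx: invalid byte
        return -1;
    }

    // Reject input that is shorter than the lead byte announces
    if (strlen($char) < $bytes) {
        return -1;
    }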

    // Shift and extract code point
    switch ($bytes) {
        case 1:
            $codePoint = $firstByte;
            break;
        case 2:
            $codePoint = ($firstByte & 0x1F) << 6;
            $codePoint |= ord($char[1]) & 0x3F;
            break;
        case 3:
            $codePoint = ($firstByte & 0x0F) << 12;
            $codePoint |= (ord($char[1]) & 0x3F) << 6;
            $codePoint |= ord($char[2]) & 0x3F;
            break;
        case 4:
            // The result lies above U+FFFF, outside the range of UCS-2
            $codePoint = ($firstByte & 0x07) << 18;
            $codePoint |= (ord($char[1]) & 0x3F) << 12;
            $codePoint |= (ord($char[2]) & 0x3F) << 6;
            $codePoint |= ord($char[3]) & 0x3F;
            break;
    }

    return $codePoint;
}</code>

Example Usage:

To use the function, simply provide a UTF-8 character as input:

<code class="php">$char = "ñ";
$codePoint = get_ucs2_codepoint($char);
echo "UCS-2 code point: $codePoint\n";</code>

Output:

UCS-2 code point: 241
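
The function handles one character at a time, while the stated goal is to process a whole string. The sketch below shows one possible way to do that with core PHP only. Both helper names (decode_utf8_char and ucs2_codepoints) are hypothetical, and returning null for code points above U+FFFF is a design choice made here, not something the original article specifies.

```php
<?php
// Hypothetical helper: decodes one UTF-8 character to its code point,
// assuming the input is a single, well-formed character.
function decode_utf8_char(string $char): int
{
    $first = ord($char[0]);
    if ($first < 0x80) {
        return $first;                // 1 byte: the byte is the code point
    } elseif ($first < 0xE0) {
        $codePoint = $first & 0x1F;   // 2 bytes: 5 payload bits
    } elseif ($first < 0xF0) {
        $codePoint = $first & 0x0F;   // 3 bytes: 4 payload bits
    } else {
        $codePoint = $first & 0x07;   // 4 bytes: 3 payload bits
    }
    // Append six payload bits from each continuation byte.
    for ($i = 1, $n = strlen($char); $i < $n; $i++) {
        $codePoint = ($codePoint << 6) | (ord($char[$i]) & 0x3F);
    }
    return $codePoint;
}

// Hypothetical helper: returns one entry per character of $text; the entry is
// the code point, or null when the character does not fit in UCS-2.
function ucs2_codepoints(string $text): array
{
    // The /u modifier makes PCRE treat the subject as UTF-8, so the empty
    // pattern splits between characters rather than between bytes.
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8.');
    }
    $result = [];
    foreach ($chars as $char) {
        $codePoint = decode_utf8_char($char);
        $result[] = $codePoint <= 0xFFFF ? $codePoint : null;
    }
    return $result;
}

// "a", "ñ", the euro sign, and an emoji outside the Basic Multilingual Plane.
print_r(ucs2_codepoints("a\u{F1}\u{20AC}\u{1F600}"));
```

For this input the array holds 97, 241, 8364 and null. If the mbstring extension is available, the built-in mb_ord() (PHP 7.2+) should return the same code point for a single character and may be preferable to hand-written decoding.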
