How to Extract UCS-2 Code Points from UTF-8 Characters in PHP?

DDD
Release: 2024-10-31 18:00:15
Original
236 people have browsed it

How to Extract UCS-2 Code Points from UTF-8 Characters in PHP?

Determining UCS-2 Code Points for UTF-8 Characters in PHP

The task at hand is to extract the UCS-2 code points for characters within a given UTF-8 string. To accomplish this, a custom PHP function can be defined.

Firstly, it's important to understand the UTF-8 encoding scheme. Each character is represented by a sequence of 1 to 4 bytes, depending on its Unicode code point. The ranges for each byte size are as follows:

  • 0xxxxxxx: 1 byte
  • 110xxxxx 10xxxxxx: 2 bytes
  • 1110xxxx 10xxxxxx 10xxxxxx: 3 bytes
  • 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: 4 bytes

To determine the number of bytes per character, examine the first byte:

  • 0: 1 byte character
  • 110: 2 byte character
  • 1110: 3 byte character
  • 11110: 4 byte character
  • 10: Continuation byte
  • 11111: Invalid character

Once the number of bytes is determined, bit manipulation can be used to extract the code point.

Custom PHP Function:

Based on the above analysis, here's a custom PHP function that takes a single UTF-8 character as input and returns its UCS-2 code point:

<code class="php">function get_ucs2_codepoint($char)
{
    // Initialize the code point
    $codePoint = 0;

    // Get the first byte
    $firstByte = ord($char);

    // Determine the number of bytes
    if ($firstByte < 128) {
        $bytes = 1;
    } elseif ($firstByte < 192) {
        $bytes = 2;
    } elseif ($firstByte < 224) {
        $bytes = 3;
    } elseif ($firstByte < 240) {
        $bytes = 4;
    } else {
        // Invalid character
        return -1;
    }

    // Shift and extract code point
    switch ($bytes) {
        case 1:
            $codePoint = $firstByte;
            break;
        case 2:
            $codePoint = ($firstByte & 0x1F) << 6;
            $codePoint |= ord($char[1]) & 0x3F;
            break;
        case 3:
            $codePoint = ($firstByte & 0x0F) << 12;
            $codePoint |= (ord($char[1]) & 0x3F) << 6;
            $codePoint |= ord($char[2]) & 0x3F;
            break;
        case 4:
            $codePoint = ($firstByte & 0x07) << 18;
            $codePoint |= (ord($char[1]) & 0x3F) << 12;
            $codePoint |= (ord($char[2]) & 0x3F) << 6;
            $codePoint |= ord($char[3]) & 0x3F;
            break;
    }

    return $codePoint;
}</code>
Copy after login

Example Usage:

To use the function, simply provide a UTF-8 character as input:

<code class="php">$char = "ñ";
$codePoint = get_ucs2_codepoint($char);
echo "UCS-2 code point: $codePoint\n";</code>
Copy after login

Output:

UCS-2 code point: 241
Copy after login

The above is the detailed content of How to Extract UCS-2 Code Points from UTF-8 Characters in PHP?. For more information, please follow other related articles on the PHP Chinese website!

source:php.cn
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Popular Tutorials
More>
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template
About us Disclaimer Sitemap
php.cn:Public welfare online PHP training,Help PHP learners grow quickly!