Getting UCS-2 Code Points for UTF-8 Strings in PHP 4 or 5
To obtain UCS-2 code points for a UTF-8 string, you can rely on extensions that ship with PHP. The iconv extension, for example, can convert the string to UCS-2, after which every pair of bytes holds one code point.
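As a rough illustration of the iconv route, the sketch below converts the string to big-endian UCS-2 and then reads it back as 16-bit integers. The helper name `utf8_to_ucs2_codepoints` is made up for this example, and the code assumes the iconv extension is loaded and that its underlying library accepts the encoding name `UCS-2BE`. How iconv reacts to characters above U+FFFF varies by implementation.

```php
// Hypothetical helper name; it is not part of PHP or of the original article.
function utf8_to_ucs2_codepoints($str) {
    // Convert to big-endian UCS-2: exactly two bytes per character.
    $ucs2 = iconv('UTF-8', 'UCS-2BE', $str);
    if ($ucs2 === false) {
        return false; // invalid UTF-8, or a character UCS-2 cannot hold
    }
    if ($ucs2 === '') {
        return array(); // nothing to unpack
    }
    // 'n*' reads consecutive unsigned 16-bit big-endian integers.
    // unpack() returns a 1-based array, so re-index it from 0.
    return array_values(unpack('n*', $ucs2));
}
```

Called with `"A\xC3\xA9"` (the letters A and é), this should yield the code points 65 and 233.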
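If you prefer a custom solution, you need to understand the UTF-8 format. Each code point is stored as 1 to 4 bytes, depending on its value. The following ranges apply:
U+0000 to U+007F: 1 byte
U+0080 to U+07FF: 2 bytes
U+0800 to U+FFFF: 3 bytes
U+10000 to U+10FFFF: 4 bytes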
To determine the number of bytes in a character, examine the leading bits of its first byte. A 0 prefix indicates a 1-byte character, 110 a 2-byte character, 1110 a 3-byte character, and 11110 a 4-byte character. Every following (continuation) byte starts with 10.
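The prefix check can be sketched as a small function that masks the leading bits of the first byte. The name `utf8_sequence_length` and the convention of returning 0 for a byte that cannot start a character are assumptions made for this example.

```php
// Hypothetical helper: returns the UTF-8 sequence length announced by a lead byte.
// $lead is the integer value of the first byte, e.g. ord($str[0]).
function utf8_sequence_length($lead) {
    if (($lead & 0x80) === 0x00) {
        return 1; // 0xxxxxxx
    }
    if (($lead & 0xE0) === 0xC0) {
        return 2; // 110xxxxx
    }
    if (($lead & 0xF0) === 0xE0) {
        return 3; // 1110xxxx
    }
    if (($lead & 0xF8) === 0xF0) {
        return 4; // 11110xxx
    }
    return 0; // 10xxxxxx continuation byte or an invalid lead byte
}
```

For instance, `utf8_sequence_length(ord("\xE2"))` gives 3, because 0xE2 is 11100010 in binary.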
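Once you know the character's size, you can recover the code point with bitwise operations: mask off the prefix bits of each byte and shift the remaining payload bits together. Note that UCS-2 cannot represent characters above U+FFFF, so 4-byte sequences have no UCS-2 equivalent.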
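To make the masking and shifting concrete, here is a worked example for one 3-byte character, the euro sign (U+20AC), whose UTF-8 bytes are E2 82 AC. The variable names are chosen only for illustration.

```php
$char = "\xE2\x82\xAC"; // UTF-8 bytes of the euro sign

$high = (ord($char[0]) & 0x0F) << 12; // 1110xxxx: keep 4 bits -> 0x2000
$mid  = (ord($char[1]) & 0x3F) << 6;  // 10xxxxxx: keep 6 bits -> 0x0080
$low  =  ord($char[2]) & 0x3F;        // 10xxxxxx: keep 6 bits -> 0x002C

$codepoint = $high | $mid | $low;     // 0x20AC
printf("U+%04X\n", $codepoint);       // prints U+20AC
```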
For reference, here's a PHP 4 or 5 function that returns the code point of a single UTF-8 encoded character:
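<code class="php">function get_ucs2_codepoint($char) {
    // Expects $char to hold exactly one UTF-8 encoded character.
    $byte = ord($char);
    if ($byte < 0x80) {
        // 0xxxxxxx: 1-byte (ASCII) character
        return $byte;
    } elseif ($byte >= 0xC0 && $byte < 0xE0) {
        // 110xxxxx 10xxxxxx: 5 + 6 payload bits
        return (($byte & 0x1F) << 6) | (ord($char[1]) & 0x3F);
    } elseif ($byte >= 0xE0 && $byte < 0xF0) {
        // 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 payload bits
        return (($byte & 0x0F) << 12) | ((ord($char[1]) & 0x3F) << 6) | (ord($char[2]) & 0x3F);
    } else {
        // 4-byte sequences and stray continuation bytes (0x80 to 0xBF)
        return 0; // UCS-2 cannot handle code points this high
    }
}</code>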
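Remember, this function doesn't handle all Unicode characters, only those representable in UCS-2. If you need the full Unicode range, convert to an encoding that covers it, such as UTF-16 with surrogate pairs or UCS-4 (UTF-32), for example through the iconv or mbstring extensions. PHP 6, which was expected to bring native Unicode support, was never released.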
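Since the title asks about whole strings rather than single characters, the following sketch walks a UTF-8 string byte by byte and collects one code point per character. The name `utf8_to_ucs2_array` is hypothetical, and two behaviours are assumptions of this example rather than anything the article prescribes: characters above U+FFFF and invalid lead bytes are skipped, and continuation bytes are not validated.

```php
// Hypothetical helper; skipping unrepresentable input is a choice made for this sketch.
function utf8_to_ucs2_array($str) {
    $out = array();
    $len = strlen($str);
    $i = 0;
    while ($i < $len) {
        $b = ord($str[$i]);
        if ($b < 0x80) {                  // 0xxxxxxx: 1 byte
            $n = 1;
            $cp = $b;
        } elseif (($b & 0xE0) === 0xC0) { // 110xxxxx: 2 bytes
            $n = 2;
            $cp = $b & 0x1F;
        } elseif (($b & 0xF0) === 0xE0) { // 1110xxxx: 3 bytes
            $n = 3;
            $cp = $b & 0x0F;
        } elseif (($b & 0xF8) === 0xF0) { // 11110xxx: above U+FFFF, skip it
            $i += 4;
            continue;
        } else {                          // stray continuation or invalid byte
            $i += 1;
            continue;
        }
        if ($i + $n > $len) {
            break; // truncated sequence at the end of the string
        }
        // Append 6 payload bits from each continuation byte.
        for ($k = 1; $k < $n; $k++) {
            $cp = ($cp << 6) | (ord($str[$i + $k]) & 0x3F);
        }
        $out[] = $cp;
        $i += $n;
    }
    return $out;
}
```

With the input `"A\xC3\xA9\xE2\x82\xAC"` (A, é and the euro sign), the result is the array 65, 233, 8364.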