How to convert between Unicode and UTF-8 encoding in PHP

WBOY
Release: 2016-07-13 09:45:44


I recently needed to convert Unicode encodings, so I looked through PHP's library functions, but I couldn't find a function that encodes and decodes Unicode strings. Well, if you can't find one, just implement it yourself.

The difference between Unicode and Utf-8 encoding

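Unicode is a character set that assigns a number (a code point) to every character, and UTF-8 is one of the encodings used to store those code points. The fixed-length, double-byte form that is often loosely called "Unicode" (UCS-2, or UTF-16 for characters in the BMP) uses two bytes per character, whereas UTF-8 is variable-length. A Chinese character therefore occupies two bytes in the double-byte form and three bytes in UTF-8, one byte more.
As originally defined, a UTF-8 encoded character can be up to 6 bytes long (RFC 3629 later limited it to 4 bytes, which covers every code point up to U+10FFFF), but 16-bit BMP (Basic Multilingual Plane) characters need at most 3 bytes. Let's take a look at the UTF-8 encoding table: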

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The x positions are filled in with the binary representation of the character's code point, with the rightmost x holding the least significant bit. Only the shortest multi-byte sequence that can represent a given code point is used. Note that in a multi-byte sequence, the number of leading 1s in the first byte equals the number of bytes in the whole sequence. The first row starts with 0 for compatibility with ASCII and is one byte long; the second row is a two-byte sequence; the third row is three bytes, which is where Chinese characters fall; and so on. (Personal opinion: we can simply treat the number of leading 1s as the number of bytes.)
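The "count the leading 1s" rule can be sketched as a small function. This is only an illustration: the name utf8_sequence_length is made up for this example, it assumes PHP 7 or later, and it follows the modern 4-byte limit rather than the 5- and 6-byte rows of the table.

```php
<?php
// Hypothetical helper (not from the original article): returns how many
// bytes a UTF-8 sequence occupies, judged only from its first byte.
function utf8_sequence_length(string $bytes): int {
    $b = ord($bytes[0]);
    if ($b < 0x80)            { return 1; } // 0xxxxxxx: ASCII
    if (($b & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($b & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($b & 0xF8) === 0xF0) { return 4; } // 11110xxx
    return 0; // continuation byte (10xxxxxx) or an unsupported lead byte
}

echo utf8_sequence_length('A'), "\n";            // 1
echo utf8_sequence_length("\xE4\xBD\xA0"), "\n"; // 3 (the UTF-8 bytes of 你)
```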

How to convert Unicode to Utf-8

To convert Unicode to UTF-8, you first need to know how the two differ. Let's look at how a Unicode code point is turned into UTF-8. If the code point is less than 0x80 (128), it is an ASCII character, occupies one byte, and needs no conversion, because UTF-8 is compatible with ASCII. Take the Chinese character 你, whose Unicode code point is U+4F60 (\u4F60 in escaped form). In binary that is 100111101100000. To convert it, take bits from the binary value from low to high, six at a time, and place them in the x positions of the format. The leading bits of each byte are filled in according to the format, and a byte that would otherwise have fewer than 8 bits is padded with 0. The result is shown below.

unicode: 100111101100000 4F60

utf-8: 11100100,10111101,10100000 E4BDA0

From the above you can see directly how Unicode maps to UTF-8. Once you know the UTF-8 format, you can also perform the inverse operation: take the bits out of their positions in the format and reassemble them into the Unicode code point. Both directions can be done with bit shifts. For 你, the value is at least 0x800 and less than 0x10000, so it is stored in three bytes. For the first (highest) byte, shift the code point right by 12 bits and OR (|) the result with the three-byte prefix 11100000 (0xE0). For the second byte, shift the code point right by 6 bits, which leaves the bits belonging to the first and second bytes; AND (&) that with 111111 (0x3F) to keep only the second byte's six bits, then OR (|) it with 10000000 (0x80). The third byte needs no shift: take the low six bits directly (& with 111111 (0x3F)) and OR (|) them with 10000000 (0x80).
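The shift-and-mask steps described above extend naturally to the one-, two-, and four-byte rows of the table. The following is a minimal sketch under a few assumptions: the function name codepoint_to_utf8 is hypothetical, PHP 7 or later is assumed, the input is an integer code point, and no validation is done (surrogates and values above U+10FFFF are not rejected).

```php
<?php
// Hypothetical helper: encodes one Unicode code point (an integer) as UTF-8.
// No validation is performed; this only illustrates the bit operations.
function codepoint_to_utf8(int $code): string {
    if ($code < 0x80) {
        return chr($code);                          // 0xxxxxxx
    }
    if ($code < 0x800) {
        return chr(0xC0 | ($code >> 6))             // 110xxxxx
             . chr(0x80 | ($code & 0x3F));          // 10xxxxxx
    }
    if ($code < 0x10000) {
        return chr(0xE0 | ($code >> 12))            // 1110xxxx
             . chr(0x80 | (($code >> 6) & 0x3F))    // 10xxxxxx
             . chr(0x80 | ($code & 0x3F));          // 10xxxxxx
    }
    return chr(0xF0 | ($code >> 18))                // 11110xxx
         . chr(0x80 | (($code >> 12) & 0x3F))       // 10xxxxxx
         . chr(0x80 | (($code >> 6) & 0x3F))        // 10xxxxxx
         . chr(0x80 | ($code & 0x3F));              // 10xxxxxx
}

echo bin2hex(codepoint_to_utf8(0x4F60)), "\n"; // e4bda0
```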

How to convert Utf-8 back to Unicode

The conversion from UTF-8 back to Unicode is also done with shifts: extract the bits from the x positions of the UTF-8 format. In the example above, 你 is three bytes, so each byte is processed in turn, from high to low. In UTF-8, 你 is 11100100,10111101,10100000. Start with the first byte, 11100100, and take out "0100" by ANDing (&) it with 1111 (0x0F). Because the sequence has three bytes and each of the two following bytes contributes six bits, these four bits belong above bit 12, so shift the result left by 12 bits to get 0100,000000,000000. For the second byte, take out "111101" by ANDing (&) 10111101 with 111111 (0x3F), shift the result left by 6 bits, and OR (|) it with the result from the first byte, giving 0100,111101,000000. Finally, AND (&) the last byte with 111111 (0x3F) and OR (|) it with the previous result to get 0100,111101,100000, which is 0x4F60.
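The same extraction can be written as a loop that handles any sequence length from one to four bytes. This is a sketch, not a full validator: utf8_to_codepoint is a made-up name, PHP 7 or later is assumed, and continuation bytes are not checked for the 10xxxxxx prefix.

```php
<?php
// Hypothetical helper: decodes the first UTF-8 character in $char and
// returns its code point as an integer.
function utf8_to_codepoint(string $char): int {
    if ($char === '') {
        throw new InvalidArgumentException('Empty input');
    }
    $b0 = ord($char[0]);
    if ($b0 < 0x80) {
        return $b0; // ASCII, nothing to extract
    }
    // Pick the sequence length and the data bits of the lead byte.
    if (($b0 & 0xE0) === 0xC0)     { $len = 2; $code = $b0 & 0x1F; }
    elseif (($b0 & 0xF0) === 0xE0) { $len = 3; $code = $b0 & 0x0F; }
    elseif (($b0 & 0xF8) === 0xF0) { $len = 4; $code = $b0 & 0x07; }
    else {
        throw new InvalidArgumentException('Invalid UTF-8 lead byte');
    }
    if (strlen($char) < $len) {
        throw new InvalidArgumentException('Truncated UTF-8 sequence');
    }
    // Each continuation byte contributes its low six bits.
    for ($i = 1; $i < $len; $i++) {
        $code = ($code << 6) | (ord($char[$i]) & 0x3F);
    }
    return $code;
}

echo dechex(utf8_to_codepoint("\xE4\xBD\xA0")), "\n"; // 4f60
```

Shifting the accumulated value left by six bits on every step produces the same result as shifting each byte into its final position individually.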

PHP code implementation:

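/**
 * Convert a UTF-8 character to its Unicode code point
 * @param string $utf8_str A single three-byte UTF-8 character
 * @return string The Unicode code point in hexadecimal
 */
function utf8_str_to_unicode($utf8_str) {
  // A three-byte lead byte is 1110xxxx, so only its low four bits carry data
  $unicode = (ord($utf8_str[0]) & 0x0F) << 12;
  $unicode |= (ord($utf8_str[1]) & 0x3F) << 6;
  $unicode |= (ord($utf8_str[2]) & 0x3F);
  return dechex($unicode);
}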

/**
 * Convert a Unicode code point to a UTF-8 character
 * @param string $unicode_str The Unicode code point in hexadecimal (0x800 to 0xFFFF)
 * @return string The UTF-8 character
 */
function unicode_str_to_utf8($unicode_str) {
  // Note: $code must be an integer so that the bitwise operations work correctly
  $code = intval(hexdec($unicode_str));
  $ord_1 = 0xe0 | ($code >> 12);
  $ord_2 = 0x80 | (($code >> 6) & 0x3f);
  $ord_3 = 0x80 | ($code & 0x3f);
  return chr($ord_1) . chr($ord_2) . chr($ord_3);
}

Testing it:

$utf8_str = '我';

// This is the Unicode code point of the Chinese character 你
$unicode_str = '4f60';

// Outputs 6211
echo utf8_str_to_unicode($utf8_str) . "<br/>";

// Outputs the Chinese character 你
echo unicode_str_to_utf8($unicode_str);


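The conversions above were tested on Chinese characters (non-ASCII) and only support converting one character at a time [one complete UTF-8 character or one complete Unicode character]. I hope this helps with your learning.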
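One possible way around the single-character limitation is to walk through the string byte by byte, use the lead byte to decide how long each character is, and decode each one in turn. The sketch below is only an illustration: utf8_string_to_codepoints is a hypothetical name, PHP 7 or later is assumed, and malformed continuation bytes are not detected.

```php
<?php
// Hypothetical helper: returns the hexadecimal code point of every
// character in a UTF-8 string, in order.
function utf8_string_to_codepoints(string $str): array {
    $result = [];
    $n = strlen($str);
    $i = 0;
    while ($i < $n) {
        $b = ord($str[$i]);
        // The lead byte tells us the length and its own data bits.
        if ($b < 0x80)                { $len = 1; $code = $b; }
        elseif (($b & 0xE0) === 0xC0) { $len = 2; $code = $b & 0x1F; }
        elseif (($b & 0xF0) === 0xE0) { $len = 3; $code = $b & 0x0F; }
        elseif (($b & 0xF8) === 0xF0) { $len = 4; $code = $b & 0x07; }
        else {
            throw new InvalidArgumentException("Invalid lead byte at offset $i");
        }
        if ($i + $len > $n) {
            throw new InvalidArgumentException("Truncated sequence at offset $i");
        }
        // Append six bits from each continuation byte.
        for ($k = 1; $k < $len; $k++) {
            $code = ($code << 6) | (ord($str[$i + $k]) & 0x3F);
        }
        $result[] = dechex($code);
        $i += $len;
    }
    return $result;
}

// "\xE4\xBD\xA0\xE6\x88\x91A" is the UTF-8 byte string for 你我A
echo implode(' ', utf8_string_to_codepoints("\xE4\xBD\xA0\xE6\x88\x91A")), "\n"; // 4f60 6211 41
```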
