How to convert between Unicode and UTF-8 encoding in PHP

WBOY

Jul 13, 2016 am 09:45 AM

php unicode utf8

I recently needed to convert Unicode encodings, so I looked through PHP's library functions, but I couldn't find one that encodes and decodes Unicode strings. Well, if you can't find it, implement it yourself.

The difference between Unicode and UTF-8 encoding

Unicode is a character set, and UTF-8 is one of its encodings. The "Unicode encoding" discussed here is the fixed-length, two-byte representation of a code point (as in UCS-2), while UTF-8 is variable-length. For a Chinese character, the two-byte form takes one byte less than UTF-8: two bytes versus three bytes in UTF-8.
UTF-8 as originally designed allows sequences of up to 6 bytes, although the current standard (RFC 3629) limits it to 4 bytes, which is enough for every code point up to U+10FFFF. Characters in the 16-bit BMP (Basic Multilingual Plane) need at most 3 bytes. Let's take a look at the UTF-8 encoding table:

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The x positions are filled with the bits of the binary representation of the character's code point. The further right an x is, the less significant the bit it holds. Only the shortest multi-byte sequence that can represent a code point may be used. Note that in a multi-byte sequence, the number of leading 1 bits in the first byte equals the number of bytes in the whole sequence. The first row starts with 0 for compatibility with ASCII and takes one byte; the second row is a two-byte sequence; the third row is three bytes, as used for Chinese characters; and so on. (Personal opinion: you can simply read the number of leading 1s as the number of bytes.)
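The leading-1s rule can be sketched as a small lookup on the first byte. The helper name `utf8_sequence_length` is hypothetical and not part of the original article, and the sketch assumes the modern 1 to 4 byte limit, so the 5- and 6-byte rows of the table are treated as invalid.

```php
<?php
// Hypothetical helper (not from the article): number of bytes in a UTF-8
// sequence, judged from the leading bits of its first byte.
// Assumption: only the 1- to 4-byte forms are accepted.
function utf8_sequence_length(int $lead_byte): int {
    if (($lead_byte & 0x80) === 0x00) return 1; // 0xxxxxxx (ASCII)
    if (($lead_byte & 0xE0) === 0xC0) return 2; // 110xxxxx
    if (($lead_byte & 0xF0) === 0xE0) return 3; // 1110xxxx
    if (($lead_byte & 0xF8) === 0xF0) return 4; // 11110xxx
    return 0; // continuation byte (10xxxxxx) or unsupported lead byte
}

echo utf8_sequence_length(ord('A')), "\n";            // 1
echo utf8_sequence_length(ord("\xE4\xBD\xA0")), "\n"; // 3 (first byte of 你)
```

`ord()` only looks at the first byte of the string it is given, which is exactly the byte this rule needs.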

How to convert Unicode to Utf-8

To convert Unicode to UTF-8 you first need to know how they differ, so let's look at how a Unicode code point is turned into UTF-8. A code point below 0x80 (128) is an ASCII character; it occupies one byte and needs no conversion, because UTF-8 is compatible with ASCII. Now take the Chinese character 你 ("you"), whose Unicode code point is U+4F60, or 100111101100000 in binary. To convert it, take bits from the Unicode binary value from low to high, six at a time, and place them in the x positions of the UTF-8 format, as shown below. The leading bits of each byte are fixed by the format, and any x positions left over in the first byte are filled with 0.

unicode: 100111101100000 4F60

utf-8: 11100100,10111101,10100000 E4BDA0

From the above you can see the correspondence between Unicode and UTF-8 directly. Once you know the UTF-8 format, you can also perform the inverse operation: take the bits out of their positions in the format and reassemble them into the Unicode code point. Both directions can be done with bit shifts. In the conversion of 你 above, the value is at least 0x800 and less than 0x10000, so it is stored in three bytes. For the first (highest) byte, shift the code point right by 12 bits and OR (|) the result with 11100000 (0xE0), the lead-byte pattern of the three-byte format. For the second byte, shift the code point right by 6 bits, which leaves the bits belonging to the first and second bytes; AND (&) with 111111 (0x3F) to keep only the low six bits, then OR (|) with 10000000 (0x80). The third byte needs no shift: take the last six bits directly (& with 111111 (0x3F)) and OR (|) them with 10000000 (0x80).
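As a rough illustration, the three steps can be written out for U+4F60 with each byte in its own variable. This is only a walk-through of the three-byte case; the variable names are chosen for this sketch.

```php
<?php
// Walk-through: encode U+4F60 (你) as three UTF-8 bytes.
$code = 0x4F60;

$byte1 = 0xE0 | ($code >> 12);         // 1110xxxx: top 4 bits of the code point
$byte2 = 0x80 | (($code >> 6) & 0x3F); // 10xxxxxx: middle 6 bits
$byte3 = 0x80 | ($code & 0x3F);        // 10xxxxxx: low 6 bits

printf("%08b %08b %08b\n", $byte1, $byte2, $byte3); // 11100100 10111101 10100000
printf("%02X%02X%02X\n", $byte1, $byte2, $byte3);   // E4BDA0

// chr() turns each byte value into a one-byte string.
$utf8 = chr($byte1) . chr($byte2) . chr($byte3);
```

The printed bytes match the `utf-8:` line shown earlier for this character.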

How to convert Utf-8 back to Unicode

The conversion from UTF-8 back to Unicode is also done by shifting: extract the bits from the corresponding positions of the UTF-8 format. In the example above, 你 is three bytes, so each byte must be processed, from the high byte to the low byte. In UTF-8, 你 is 11100100,10111101,10100000. Starting with the first byte, 11100100, we need to take out "0100", which is simply an AND (&) with 1111 (0x0F). (Masking with 11111 (0x1F) gives the same result here, because the fifth bit of a three-byte lead byte is always 0.) Since there are three bytes and each of the following bytes contributes six bits, these bits belong above bit 12, so the result is shifted left by 12 bits, giving 0100,000000,000000. From the second byte we need "111101", so AND (&) the second byte 10111101 with 111111 (0x3F), shift the result left by 6 bits, and OR (|) it with the result from the first byte, giving 0100,111101,000000. In the same way, the last byte is ANDed (&) with 111111 (0x3F) and ORed (|) with the previous result, giving 0100,111101,100000.
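The decoding steps can be traced the same way. The sketch below masks the lead byte with 0x0F, which keeps exactly the four payload bits of a 1110xxxx byte; for valid three-byte input a wider 0x1F mask produces the same value. The byte string is written with hex escapes so the example does not depend on the file's own encoding.

```php
<?php
// Walk-through: decode the three UTF-8 bytes of 你 back to U+4F60.
$utf8 = "\xE4\xBD\xA0";

$code  = (ord($utf8[0]) & 0x0F) << 12; // 0100 000000 000000
$code |= (ord($utf8[1]) & 0x3F) << 6;  // 0100 111101 000000
$code |= (ord($utf8[2]) & 0x3F);       // 0100 111101 100000

printf("%016b\n", $code);  // 0100111101100000
printf("U+%04X\n", $code); // U+4F60
```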

PHP code implementation:

/**
 * Convert a UTF-8 character to its Unicode code point.
 * Only handles a single three-byte character (e.g. a Chinese character).
 * @param string $utf8_str A single UTF-8 character
 * @return string          The Unicode code point as a hex string
 */
function utf8_str_to_unicode($utf8_str) {
  // 0x0F keeps the four payload bits of the 1110xxxx lead byte
  $unicode = (ord($utf8_str[0]) & 0x0F) << 12;
  $unicode |= (ord($utf8_str[1]) & 0x3F) << 6;
  $unicode |= (ord($utf8_str[2]) & 0x3F);
  return dechex($unicode);
}

/**
 * Convert a Unicode code point to a UTF-8 character.
 * Only handles code points that need three bytes (0x800 to 0xFFFF).
 * @param string $unicode_str The Unicode code point as a hex string
 * @return string             The UTF-8 character
 */
function unicode_to_utf8($unicode_str) {
  // The code must be an integer so that the bitwise operations work correctly
  $code = intval(hexdec($unicode_str));
  $ord_1 = 0xE0 | ($code >> 12);
  $ord_2 = 0x80 | (($code >> 6) & 0x3F);
  $ord_3 = 0x80 | ($code & 0x3F);
  return chr($ord_1) . chr($ord_2) . chr($ord_3);
}

Testing it:

$utf8_str = '我';

// This is the Unicode code point of the Chinese character "你"
$unicode_str = '4f60';

// Outputs 6211
echo utf8_str_to_unicode($utf8_str) . "<br/>";

// Outputs the Chinese character "你"
echo unicode_to_utf8($unicode_str);

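The conversions above were tested on Chinese characters (non-ASCII), and they only support converting a single character at a time (one complete UTF-8 character or one complete Unicode code point). I hope this helps with your studies.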
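To lift the three-byte limitation, one possible generalisation picks the byte count from the ranges in the encoding table. The functions `codepoint_to_utf8` and `utf8_to_codepoint` are hypothetical names for this sketch, not part of the original code; they take and return integers rather than hex strings, and they assume well-formed input of at most four bytes, with no validation.

```php
<?php
// Hypothetical generalisation (not the article's code): encode a code point
// up to U+10FFFF using the 1- to 4-byte rows of the table.
function codepoint_to_utf8(int $code): string {
    if ($code < 0x80) {
        return chr($code);
    }
    if ($code < 0x800) {
        return chr(0xC0 | ($code >> 6))
             . chr(0x80 | ($code & 0x3F));
    }
    if ($code < 0x10000) {
        return chr(0xE0 | ($code >> 12))
             . chr(0x80 | (($code >> 6) & 0x3F))
             . chr(0x80 | ($code & 0x3F));
    }
    return chr(0xF0 | ($code >> 18))
         . chr(0x80 | (($code >> 12) & 0x3F))
         . chr(0x80 | (($code >> 6) & 0x3F))
         . chr(0x80 | ($code & 0x3F));
}

// Decode one well-formed UTF-8 character; input is not validated.
function utf8_to_codepoint(string $char): int {
    $lead = ord($char[0]);
    if ($lead < 0x80) {
        return $lead;                    // 0xxxxxxx: ASCII, nothing to do
    }
    if (($lead & 0xE0) === 0xC0) {
        $code = $lead & 0x1F; $extra = 1; // 110xxxxx
    } elseif (($lead & 0xF0) === 0xE0) {
        $code = $lead & 0x0F; $extra = 2; // 1110xxxx
    } else {
        $code = $lead & 0x07; $extra = 3; // 11110xxx (assumed)
    }
    // Each continuation byte contributes its low six bits.
    for ($i = 1; $i <= $extra; $i++) {
        $code = ($code << 6) | (ord($char[$i]) & 0x3F);
    }
    return $code;
}

echo bin2hex(codepoint_to_utf8(0x4F60)), "\n";           // e4bda0
printf("%X\n", utf8_to_codepoint("\xE6\x88\x91"));      // 6211 (我)
```

Production code would also need to reject overlong forms, stray continuation bytes, and surrogate code points, which this sketch leaves out.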
