Unicode is a character encoding scheme that sets a unified and unique binary encoding for each character in each language to achieve cross-language and cross-platform text conversion and processing requirements
Unicode meaning
Unicode provides a unique number for each character, no matter what platform, no matter what program, no matter what language. It was officially announced in 1994 and is an industry standard in the computer field, including character sets, encoding schemes, etc. Unicode was created to solve the limitations of traditional character encoding schemes. It sets a unified and unique binary encoding for each character in each language to achieve cross-language and cross-platform text conversion and processing requirements.
The Development of Unicode Encoding
When designing computers, 8 bits are used as a byte. Therefore, one byte can represent up to 256 characters. In the early days, for Western countries that used English, one byte could store uppercase and lowercase English letters, mathematics, and some symbols, so one byte was used to make the code table (ASCII). Later, computers were spread to other countries, and many countries used their own languages, such as Chinese, Japanese, Korean... The languages were complicated. In order to solve this problem, each country formulated its own code table. China formulated GB2312 in 1980 In the Chinese character encoding character set, there are many more Chinese characters than English. One byte is obviously not enough, so 2 bytes are used for encoding. However, although the character encodings defined by different countries can be used, they are often incompatible between different countries. If the computer wants to handle multiple language environments (using Chinese or other languages), it may not be able to support multiple language environments at the same time. In order to unify the encoding of all texts, Unicode was created to unify all languages into one set of encodings so that there would be no garbled characters.
Unicode encoding representation
When representing Unicode characters, U is usually used followed by a set of hexadecimal digits Represents a character, encoding from U 0000 to U FFFF, supporting more than 60,000 characters in total. Characters other than BMP
need to be represented using 5-digit or 6-digit hexadecimal.
Currently Unicode characters are divided into 17 groups, 0x0000 to 0x10FFFF. Each group is called a plane. Each plane has 65536 code points, a total of 1114112.
Unicode is like a table. All characters are written into the table. Each character corresponds to a number, called a code point. This number is generally not used directly. It is usually used
Use different encoding methods
UTF-8, UTF-16, and UTF-32 are encoding schemes for converting numbers into program data. UTF is the abbreviation of "UnicodeTransformation Format", which can be translated into
Unicode character set conversion format, that is, how to convert numbers defined by Unicode into program data
Decimal |
Unicode encoding |
UTF-8 byte stream |
0-127 bits | 0x000000-0x00007F | 0xxxxxxx(7 digits) |
128-2047 digits |
0x000080-0x0007FF | 110xxxxx 10xxxxxx (11 digits) |
2048-65535 digits | 0x000800-0x00FFFF | 1110xxxx 10xxxxxx 10xxxxxx (16 digits) |
65536-1114111 bits | 0x010000-0x10FFFF | ##11110xxx 10xxxxxx 10xxxxxx 10xxxxxx(21 bits)
The above is the detailed content of what is unicode. For more information, please follow other related articles on the PHP Chinese website!