I have never paid much attention to character encodings, so I did not know much about Unicode and UTF-8. Recently I happened to read an article about UTF-8 and found its explanation very complicated, so I decided to write a simpler, easier-to-understand one.
First of all, let’s explain some of the encoding schemes commonly used today:
1. In mainland China, the most commonly used encoding is GB18030, alongside GBK and GB2312. The relationship between these encodings is as follows:
The earliest Chinese character encoding was GB2312, which includes 6,763 Chinese characters and 682 other symbols. It was extended in 1995 as GBK 1.0, with a total of 21,886 symbols. Later, the GB18030 encoding was introduced, which includes a total of 27,484 Chinese characters as well as Tibetan, Mongolian, Uyghur and other major ethnic minority scripts. The Windows platform is now required to support GB18030.
In the order GB2312, GBK, GB18030, each encoding is a superset of the one before it, so the three are backward compatible: the same Chinese character has the same code in all three.
2. Taiwan, Hong Kong and other regions: BIG5 encoding
3. Japan: Shift_JIS (SJIS) encoding
If the various text encodings are like regional dialects, then Unicode is a common language jointly developed by countries around the world.
In this language there are no more encoding conflicts, and content in any language can be displayed on the same screen. This is the biggest benefit of Unicode.
So how is Unicode encoded? The original idea was very simple: encode all the text in the world using 2 bytes. You may ask: 2 bytes can represent at most 65,536 codes, is that enough?
Most of the Chinese characters used in Korea and Japan came from China, and their shapes are essentially the same. For example, "文" is the same Chinese character in GBK and SJIS, only its code differs. By unifying such characters under a single code, 2 bytes are enough to hold most of the text of the world's languages.
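As a small illustration of this point, here is a sketch in Python. It assumes Python's built-in "gbk" and "shift_jis" codecs as stand-ins for the two legacy encodings (the codec names are Python's own, not something this article defines). It shows the same character getting different bytes in each legacy encoding but one shared Unicode code point.

```python
# Encode the same character with two legacy encodings and compare with Unicode.
ch = "文"

gbk_bytes = ch.encode("gbk")          # Simplified Chinese legacy encoding
sjis_bytes = ch.encode("shift_jis")   # Japanese legacy encoding
code_point = ord(ch)                  # the single, unified Unicode code point

# Print the two legacy byte sequences and the code point side by side.
print(gbk_bytes.hex(), sjis_bytes.hex(), f"U+{code_point:04X}")
```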
The formal name of the character set behind Unicode (defined in ISO/IEC 10646) is "Universal Multiple-Octet Coded Character Set", or UCS for short.
The most commonly used form is UCS-2, a 2-byte encoding; UCS-4 was developed in case 2 bytes turned out not to be enough. The range covered by UCS-2 is called the Basic Multilingual Plane (BMP).
Converting UCS-2 to UCS-4 is simply a matter of adding 2 zero bytes in front.
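The zero-padding rule can be checked with a short Python sketch. It relies on two assumptions: for characters in the Basic Multilingual Plane, Python's "utf-16-be" codec produces the same bytes as big-endian UCS-2, and "utf-32-be" produces big-endian UCS-4. The byte order is my choice for the illustration.

```python
# Compare the 2-byte and 4-byte forms of one BMP character (big-endian).
ch = "你"

ucs2 = ch.encode("utf-16-be")    # same bytes as big-endian UCS-2 for BMP characters
ucs4 = ch.encode("utf-32-be")    # big-endian UCS-4

# Prepending two zero bytes to the UCS-2 form should give the UCS-4 form.
padded = b"\x00\x00" + ucs2

print(ucs2.hex(), ucs4.hex(), padded == ucs4)
```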
UCS-4 is mainly used to reach the supplementary (auxiliary) planes, for example the second supplementary plane (Plane 2) in Unicode 4.0, which is divided into the following ranges:
20000-20FFF - 21000-21FFF - 22000-22FFF - 23000-23FFF - 24000-24FFF - 25000-25FFF - 26000-26FFF - 27000-27FFF - 28000-28FFF - 29000-29FFF - 2A000-2AFFF - 2F000-2FFFF
In total, 16 supplementary planes were added, expanding the original 65,536 codes to more than 1 million (17 × 65,536 = 1,114,112).
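The plane arithmetic can be sketched in a few lines of Python. The helper name plane_of is hypothetical, introduced only for this illustration: a plane holds 65,536 code points, so dividing a code point by 0x10000 gives its plane number.

```python
PLANE_SIZE = 0x10000  # 65,536 code points per plane


def plane_of(code_point: int) -> int:
    """Return the plane number a code point belongs to (0 is the BMP)."""
    return code_point // PLANE_SIZE


# The BMP plus 16 supplementary planes gives 17 planes in total.
total_code_points = 17 * PLANE_SIZE

print(plane_of(0x4F60), plane_of(0x20000), total_code_points)
```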
So now that the encoding has been unified, how does it stay compatible with the existing text encodings of each country?
This is where the codepage comes in.
What is a codepage? A codepage is the mapping table between a national text encoding and Unicode. For example, the mapping table between Simplified Chinese and Unicode is CP936.
The following are several commonly used codepages:
codepage=936 Simplified Chinese GBK
codepage=950 Traditional Chinese BIG5
codepage=437 United States/Canada English
codepage=932 Japanese
codepage=949 Korean
codepage=866 Russian
codepage=65001 Unicode UTF-8
The last one, 65001, is as far as I understand only a virtual mapping table; in practice it is just an algorithm.
Take a random line from the 936 table, for example:
0x9993 0x6ABD #CJK UNIFIED IDEOGRAPH
The first value is the GBK code, and the second is the Unicode code point.
By looking things up in this table, you can easily convert between GBK and Unicode.
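The table lookup can be imitated in Python. This is a sketch that assumes Python's built-in "cp936" codec as a stand-in for the mapping table; it is not necessarily identical to the table file quoted above. The function names gbk_to_unicode and unicode_to_gbk are hypothetical helpers for this example.

```python
def gbk_to_unicode(gbk_code: int) -> int:
    """Look up a 2-byte GBK code and return the Unicode code point."""
    text = gbk_code.to_bytes(2, "big").decode("cp936")
    return ord(text)


def unicode_to_gbk(code_point: int) -> int:
    """Look up a Unicode code point and return the 2-byte GBK code."""
    return int.from_bytes(chr(code_point).encode("cp936"), "big")


# 0xC4E3 is the GBK code of the character with code point U+4F60.
print(hex(gbk_to_unicode(0xC4E3)))
# The row quoted above starts with GBK code 0x9993.
print(hex(gbk_to_unicode(0x9993)))
```

Converting in both directions with the same table should always bring you back to the code you started from.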
Now that we understand Unicode, what is UTF-8, and why does it exist?
To convert ASCII to UCS-2, you just insert a 0x00 byte before each character's code. Text encoded this way contains bytes that look like special characters, such as '\0' or '/', which cause serious errors in UNIX file names and in some C functions. UCS-2 is therefore not suitable as an external encoding for Unicode.
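Here is a small Python sketch of the problem. It assumes "utf-16-be" as a stand-in for big-endian UCS-2, and U+4E2F is simply a character I picked because its low byte happens to be 0x2F.

```python
ascii_text = "A"
han = chr(0x4E2F)  # a CJK ideograph whose UCS-2 low byte is 0x2F

ucs2_ascii = ascii_text.encode("utf-16-be")
ucs2_han = han.encode("utf-16-be")

# A zero byte ends a C string early; 0x2F is the path separator '/'.
has_nul = 0x00 in ucs2_ascii
has_slash = 0x2F in ucs2_han

print(ucs2_ascii, has_nul)
print(ucs2_han, has_slash)
```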
Thus, UTF-8 was born. So how is UTF-8 encoded, and how does it solve the problem of UCS-2?
Example:
E4 BD A0    11100100 10111101 10100000
This is the UTF-8 encoding of the character "你" ("you").
4F 60    01001111 01100000
This is the Unicode encoding of "你".
Decomposed according to the UTF-8 encoding rules, the three UTF-8 bytes become: xxxx0100 xx111101 xx100000
Splice together the digits other than the x's and you get the Unicode encoding of "你".
Note the three leading 1 bits of the first UTF-8 byte: they indicate that the whole UTF-8 sequence consists of 3 bytes.
After UTF-8 encoding, sensitive bytes no longer appear, because in every byte of a multi-byte character the highest bit is always 1.
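The splicing step can be written out with bit masks. This Python sketch handles only the 3-byte case from the example (bytes E4 BD A0, code point U+4F60) and is not a general decoder.

```python
utf8 = bytes([0xE4, 0xBD, 0xA0])
b0, b1, b2 = utf8

# Strip the marker bits: 1110xxxx keeps 4 bits, each 10xxxxxx keeps 6 bits.
code_point = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

# Every byte of the sequence has its highest bit set, so none can be '\0' or '/'.
all_high_bits_set = all(b & 0x80 for b in utf8)

print(f"U+{code_point:04X}", all_high_bits_set)
```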
The following is the conversion table between Unicode and UTF-8 (this is the original definition with up to 6 bytes; the current standard, RFC 3629, stops at the 4-byte form and at U+10FFFF):
U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
To convert a Unicode code point to UTF-8, find the row for its range and fill the bits of the code point into the x positions, starting from the lowest bit on the right. The result is the UTF-8 encoding.
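The table can be turned into a short encoder. This is a minimal Python sketch that follows the table exactly as written, including the 5- and 6-byte rows that modern UTF-8 no longer allows; it also does not reject surrogate code points, which a real encoder would. The function name utf8_encode and the ROWS layout are my own, for illustration only.

```python
# (upper limit of the range, marker bits of the first byte, continuation bytes)
ROWS = [
    (0x7FF, 0xC0, 1),
    (0xFFFF, 0xE0, 2),
    (0x1FFFFF, 0xF0, 3),
    (0x3FFFFFF, 0xF8, 4),
    (0x7FFFFFFF, 0xFC, 5),
]


def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by filling its bits into the x positions."""
    if code_point < 0:
        raise ValueError("code point must not be negative")
    if code_point < 0x80:
        return bytes([code_point])  # first row: 0xxxxxxx
    for limit, marker, n_cont in ROWS:
        if code_point <= limit:
            # The first byte takes the marker plus the highest remaining bits.
            lead = marker | (code_point >> (6 * n_cont))
            # Each continuation byte is 10xxxxxx with the next 6 bits.
            tail = [
                0x80 | ((code_point >> (6 * i)) & 0x3F)
                for i in range(n_cont - 1, -1, -1)
            ]
            return bytes([lead] + tail)
    raise ValueError("code point is beyond U-7FFFFFFF")


print(utf8_encode(0x4F60).hex())
```

For code points that modern UTF-8 accepts, the result should match what Python's own str.encode("utf-8") produces, which makes the sketch easy to check.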