Detailed explanation of various PHP character encodings and the circumstances in which each should be used

A character set is a collection of characters. There are many character sets, and each contains a different number of characters. Common ones include the ASCII, GB2312, BIG5, GB 18030, and Unicode character sets. For a computer to process text in these character sets accurately, each character must be given a character encoding so that the computer can recognize and store it.

Chinese has a very large number of characters and two written forms with different conventions: Simplified Chinese and Traditional Chinese. Computers were originally designed around single-byte English characters, so encoding Chinese characters is the technical basis for information exchange in Chinese. This article walks through several typical character sets in chronological order, including several representative Chinese ones, and looks at the origin, characteristics, and technical features of each.

ASCII character set

1. Origin of the name

ASCII (American Standard Code for Information Interchange) is a computer encoding system based on the Latin (Roman) alphabet.

2. Features

It is mainly used to display modern English and other Western European languages. It is the most common single-byte encoding system today and is equivalent to the international standard ISO 646.

3. Content included

Control characters: carriage return, backspace, line feed, etc.

Displayable characters: upper- and lower-case English letters, Arabic numerals, and Western punctuation and symbols

4. Technical features

Each character is represented by 7 bits, for a total of 128 characters.

5. ASCII extended character set

A 7-bit encoding can represent only 128 characters. To cover more of the characters commonly used in Europe, ASCII was extended. The extended ASCII character set uses 8 bits per character, for a total of 256 characters.

The characters added by extended ASCII include box-drawing (table) symbols, mathematical symbols, Greek letters, and special Latin letters.
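The 7-bit versus 8-bit distinction above can be checked byte by byte. The following is a minimal sketch; the function name is_seven_bit_ascii is a hypothetical helper, not a PHP built-in.

```php
<?php
// Hypothetical helper: returns true when every byte of $bytes fits in
// 7 bits (0x00-0x7F), i.e. the string is plain ASCII as described above.
function is_seven_bit_ascii(string $bytes): bool
{
    for ($i = 0, $n = strlen($bytes); $i < $n; $i++) {
        if (ord($bytes[$i]) > 0x7F) {
            return false; // high bit set: the byte is in the extended 8-bit range
        }
    }
    return true;
}

var_dump(is_seven_bit_ascii("Hello"));   // bool(true)
var_dump(is_seven_bit_ascii("caf\xE9")); // bool(false): 0xE9 is above 0x7F
```

A byte above 0x7F only tells you that the text is not pure ASCII. Which extended or multi-byte encoding it belongs to has to be known from elsewhere.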

GB2312 character set

1. Origin of the name

GB2312, also known as the GB2312-80 character set, has the full name "Chinese Coded Character Set for Information Exchange: Basic Set". It was issued by the former China State Administration of Standards and took effect on May 1, 1981.

2. Features

GB2312 is China's national standard character set for Simplified Chinese. The Chinese characters it contains cover 99.75% of usage by frequency, which basically meets the needs of computer processing of Chinese. It is widely used in mainland China and Singapore.

3. Content included

GB2312 contains 7,445 graphic characters in total: 6,763 Chinese characters (3,755 level-1 characters and 3,008 level-2 characters) and 682 full-width non-Chinese characters, including general symbols, serial numbers, digits, Latin letters, Japanese hiragana and katakana, Greek letters, Russian Cyrillic letters, Hanyu Pinyin symbols, and Zhuyin (Bopomofo) phonetic letters.

4. Technical features

(1) Zone representation

GB2312 arranges its characters into zones, and each zone holds 94 positions for Chinese characters or symbols. This representation is also called the zone-position code (区位码).

The zones are allocated as follows: zones 01-09 hold special symbols; zones 16-55 hold level-1 Chinese characters, sorted by pinyin; zones 56-87 hold level-2 Chinese characters, sorted by radical and stroke count; zones 10-15 and 88-94 are unassigned.
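The zone allocation above amounts to a simple range lookup. A minimal sketch follows; gb2312_zone_category and its category labels are hypothetical names chosen for illustration.

```php
<?php
// Hypothetical helper: maps a GB2312 zone number (1-94) to the category
// given in the allocation described above.
function gb2312_zone_category(int $zone): string
{
    if ($zone < 1 || $zone > 94) {
        throw new InvalidArgumentException("zone must be 1-94, got $zone");
    }
    if ($zone <= 9) {
        return 'symbols';
    }
    if ($zone >= 16 && $zone <= 55) {
        return 'level-1 hanzi (pinyin order)';
    }
    if ($zone >= 56 && $zone <= 87) {
        return 'level-2 hanzi (radical/stroke order)';
    }
    return 'unassigned'; // zones 10-15 and 88-94
}

echo gb2312_zone_category(16), "\n"; // level-1 hanzi (pinyin order)
```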

(2) Double-byte representation

Each character is stored in two bytes. By convention the first byte is called the "high byte" and the second the "low byte".

The high byte ranges over 0xA1-0xF7 (the zone number 01-87 plus 0xA0), and the low byte ranges over 0xA1-0xFE (the position number 01-94 plus 0xA0).

5. Encoding example

Take "啊", the first Chinese character in the GB2312 character set, as an example. Its zone number is 16 and its position number is 01, so its zone-position code is 1601. In a program, 0xA0 is added to both the zone number and the position number to obtain the internal code 0xB0A1: 0xB0 = 0xA0 + 16 and 0xA1 = 0xA0 + 1.
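The arithmetic in this example (zone 16, position 01) can be sketched in PHP using only chr() and bin2hex(). The function name gb2312_from_zone_position is a hypothetical helper, not part of PHP.

```php
<?php
// Hypothetical helper: builds the two-byte GB2312 code from a zone number
// and a position number by adding 0xA0 to each, as in the formula above.
function gb2312_from_zone_position(int $zone, int $position): string
{
    if ($zone < 1 || $zone > 94 || $position < 1 || $position > 94) {
        throw new InvalidArgumentException('zone and position must each be 1-94');
    }
    return chr(0xA0 + $zone) . chr(0xA0 + $position);
}

// Zone 16, position 01 gives the bytes 0xB0 0xA1.
echo strtoupper(bin2hex(gb2312_from_zone_position(16, 1))), "\n"; // B0A1
```

The result is a raw GB2312 byte string. Printing it as text only looks right on a terminal or page that is itself set to GB2312.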

BIG5 character set

1. Origin of the name

Also known as the Big Five code (大五码), it was created in 1984 by Taiwan's Institute for Information Industry together with five software companies: Acer, MiTAC, JiaJia (佳佳), Zero One, and FIC. Hence the name Big5.

Big5 was created because vendors in Taiwan at the time had each launched their own encodings, such as the ETen (倚天) code, IBM PS55, and the Wang (王安) code, which were incompatible with one another. In addition, the Taiwan government had not yet released an official Chinese character encoding, and the GB2312 encoding used in mainland China does not include Traditional Chinese characters.

2. Features

The Big5 character set contains 13,053 Chinese characters in total and is used in Taiwan, China. Curiously, it encodes two characters twice: "兀" (0xA461 and 0xC94A) and "嗀" (0xDCD1 and 0xDDFC).

3. Character encoding method

Big5 uses a double-byte storage scheme: each character is encoded in two bytes. The first byte is called the "high byte" and the second the "low byte". The high byte ranges over 0xA1-0xF9, and the low byte ranges over 0x40-0x7E and 0xA1-0xFE.

The code ranges map to character types as follows: 0xA140-0xA3BF holds punctuation marks, Greek letters, and special symbols, and within it 0xA259-0xA261 holds the characters for units of measurement that are read as two syllables: 兙兛兞兝兡兣嗧瓩糎; 0xA440-0xC67E holds the frequently used Chinese characters, sorted first by stroke count and then by radical; 0xC940-0xF9D5 holds the less frequently used Chinese characters, sorted the same way.
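The byte ranges and code ranges above can be combined into a small classifier. This is a sketch under the ranges stated in this section only; is_big5_pair, big5_category, and the category labels are hypothetical names.

```php
<?php
// Hypothetical helper: checks the high/low byte ranges given above.
function is_big5_pair(int $high, int $low): bool
{
    $highOk = $high >= 0xA1 && $high <= 0xF9;
    $lowOk  = ($low >= 0x40 && $low <= 0x7E) || ($low >= 0xA1 && $low <= 0xFE);
    return $highOk && $lowOk;
}

// Hypothetical helper: classifies a two-byte code such as 0xA461.
function big5_category(int $code): string
{
    $high = ($code >> 8) & 0xFF;
    $low  = $code & 0xFF;
    if ($code < 0 || $code > 0xFFFF || !is_big5_pair($high, $low)) {
        return 'invalid';
    }
    if ($code >= 0xA140 && $code <= 0xA3BF) {
        return 'symbols';
    }
    if ($code >= 0xA440 && $code <= 0xC67E) {
        return 'frequently used hanzi';
    }
    if ($code >= 0xC940 && $code <= 0xF9D5) {
        return 'less frequently used hanzi';
    }
    return 'other'; // valid byte pair outside the three ranges listed above
}

echo big5_category(0xA461), "\n"; // frequently used hanzi
echo big5_category(0xC94A), "\n"; // less frequently used hanzi
```

The two codes in the usage lines are the duplicate encodings of "兀" mentioned earlier, which is why the same character lands in two different categories.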

4. Limitations of Big5

Although Big5 contains more than 10,000 characters, it does not cover many characters in everyday circulation that are used in personal names, place names, dialects, chemistry, and biology. It also does not include Japanese hiragana and katakana.

For example, Taiwan treats "着" as a variant of "著", so "着" is not included. Some radical characters from the Kangxi Dictionary (such as "亠", "疒", and "辵") and characters common in personal names (such as "堃", "煊", "栢", and "喆") are also missing from Big5.

GB18030 character set

1. Origin of the name

The full title of GB 18030 is GB 18030-2000 "Chinese Coded Character Set for Information Exchange: Extension of the Basic Set". It is the national standard for Chinese character encoding that the government issued on March 17, 2000. Software released on the Chinese market after August 31, 2001 must comply with this standard.

2. Features

The GB 18030 standard was drawn up with broad participation and review, and was promoted jointly by well-known information technology companies in China and abroad, the Ministry of Information Industry, and the former State Administration of Quality and Technical Supervision.

GB 18030 solves the problem of encoding a large character set made up of Chinese characters, Japanese kana, Korean, and the scripts of China's ethnic minorities. Its encoding space exceeds 1.5 million code positions, and it contains 27,484 Chinese characters. It meets the needs of information exchange in East Asia (mainland China, Hong Kong, Taiwan, Japan, and South Korea) for multiple languages, a large character repertoire, multiple uses, and a unified encoding format. It is compatible with Unicode 3.0 and includes the "CJK Unified Ideographs Extension A" block added to Unicode. It is also compatible with the earlier national encoding standards GB2312 and GB13000.1.

3. Encoding method

GB 18030 encodes characters in one, two, or four bytes. The single-byte part uses 0x00 to 0x7F (the same codes as ASCII). In the double-byte part, the first byte ranges from 0x81 to 0xFE and the second byte ranges over 0x40 to 0x7E and 0x80 to 0xFE. The four-byte part extends the double-byte encoding by using 0x30 to 0x39, which are not used in GB/T 11383, as the second byte. The resulting four-byte codes range from 0x81308130 to 0xFE39FE39: the first and third bytes range from 0x81 to 0xFE, and the second and fourth bytes range from 0x30 to 0x39.
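Because the byte ranges above do not overlap, the length of a GB 18030 sequence can be decided from its first two bytes. The following is a minimal sketch based only on the ranges stated in this section; gb18030_sequence_length is a hypothetical helper name, and returning 0 for an invalid or truncated sequence is an assumption made for illustration.

```php
<?php
// Hypothetical helper: returns 1, 2 or 4 for the GB 18030 sequence that
// starts at $offset, or 0 if the bytes match none of the ranges above.
function gb18030_sequence_length(string $bytes, int $offset = 0): int
{
    $n = strlen($bytes);
    if ($offset < 0 || $offset >= $n) {
        return 0;
    }
    $b1 = ord($bytes[$offset]);
    if ($b1 <= 0x7F) {
        return 1; // single-byte part, same as ASCII
    }
    if ($b1 < 0x81 || $b1 > 0xFE || $offset + 1 >= $n) {
        return 0;
    }
    $b2 = ord($bytes[$offset + 1]);
    if (($b2 >= 0x40 && $b2 <= 0x7E) || ($b2 >= 0x80 && $b2 <= 0xFE)) {
        return 2; // double-byte part
    }
    if ($b2 >= 0x30 && $b2 <= 0x39 && $offset + 3 < $n) {
        $b3 = ord($bytes[$offset + 2]);
        $b4 = ord($bytes[$offset + 3]);
        if ($b3 >= 0x81 && $b3 <= 0xFE && $b4 >= 0x30 && $b4 <= 0x39) {
            return 4; // four-byte part
        }
    }
    return 0;
}

echo gb18030_sequence_length("\x81\x30\x81\x30"), "\n"; // 4
```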

4. Content included

The double-byte part mainly contains all 20,902 CJK Chinese characters of GB13000.1, the related punctuation marks, 13 ideographic description characters, 80 supplementary Chinese characters and radicals/components, and the double-byte euro sign. The four-byte part contains the remaining characters of GB 13000.1 not covered by the double-byte part, including CJK Unified Ideographs Extension A.

Unicode character set

1. Origin of the name

The Unicode character set corresponds to the Universal Multiple-Octet Coded Character Set (UCS). It is a character encoding system developed by an organization called the Unicode Consortium to support the exchange, processing, and display of written text in the world's languages. Development began in 1990 and the encoding was officially announced in 1994. At the time of writing, the latest version was Unicode 4.1.0, released on March 31, 2005.

2. Features

Unicode is a character encoding used on computers. It sets a unified and unique binary encoding for each character in each language to meet the requirements for cross-language and cross-platform text conversion and processing.

3. Encoding method

The Unicode standard always writes code points as hexadecimal numbers prefixed with "U+". For example, the code of the letter "A" is 0041 in hexadecimal and the code of the character "€" is 20AC in hexadecimal, so the code of "A" is written as "U+0041".
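The "U+" notation is simply the code point printed in upper-case hexadecimal, padded to at least four digits. A minimal sketch with sprintf() follows; format_code_point is a hypothetical helper name.

```php
<?php
// Hypothetical helper: formats an integer code point in "U+" notation.
function format_code_point(int $codePoint): string
{
    return sprintf('U+%04X', $codePoint);
}

echo format_code_point(0x41), "\n";   // U+0041
echo format_code_point(0x20AC), "\n"; // U+20AC
```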

4. UTF-8 encoding

UTF-8 is one of the ways of using Unicode. UTF stands for Unicode Transformation Format, that is, a way of converting Unicode code points into a particular byte format.

UTF-8 makes it easy to send text in different languages between computers over a network, and it lets Unicode text, which would otherwise need two or more bytes per character, pass correctly through existing systems built for single-byte processing.

UTF-8 stores Unicode characters in a variable number of bytes. ASCII letters are still stored in 1 byte; accented characters, Greek letters, and Cyrillic letters are stored in 2 bytes; commonly used Chinese characters need 3 bytes; and supplementary-plane characters use 4 bytes.
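The variable-length rule above follows from how UTF-8 packs the bits of a code point into bytes. The sketch below encodes a single code point by hand so the 1, 2, 3, and 4 byte cases are visible; utf8_from_code_point is a hypothetical helper written for illustration, not a PHP built-in.

```php
<?php
// Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
function utf8_from_code_point(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('not a valid Unicode scalar value');
    }
    if ($cp < 0x80) {        // 1 byte: ASCII
        return chr($cp);
    }
    if ($cp < 0x800) {       // 2 bytes: accented Latin, Greek, Cyrillic, ...
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {     // 3 bytes: commonly used Chinese characters, ...
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18)) // 4 bytes: supplementary planes
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

echo strlen(utf8_from_code_point(0x41)), "\n";    // 1
echo strlen(utf8_from_code_point(0x03B1)), "\n";  // 2
echo strlen(utf8_from_code_point(0x4E2D)), "\n";  // 3
echo strlen(utf8_from_code_point(0x1F600)), "\n"; // 4
```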

5. UTF-16 and UTF-32 encoding

UTF-32, UTF-16, and UTF-8 are the character encoding schemes for the Unicode coded character set. UTF-16 encodes a Unicode code point as a sequence of one or two unsigned 16-bit code units; UTF-32 represents every Unicode code point as a single 32-bit integer with the same value.
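The "one or two 16-bit code units" rule can be sketched with pack(). Big-endian byte order is assumed here for readability, and both function names are hypothetical helpers.

```php
<?php
// Hypothetical helper: encodes one code point as UTF-16 (big-endian).
function utf16be_from_code_point(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('not a valid Unicode scalar value');
    }
    if ($cp < 0x10000) {
        return pack('n', $cp); // one 16-bit code unit
    }
    // Two code units (a surrogate pair) for supplementary-plane characters.
    $v    = $cp - 0x10000;
    $high = 0xD800 | ($v >> 10);
    $low  = 0xDC00 | ($v & 0x3FF);
    return pack('nn', $high, $low);
}

// Hypothetical helper: UTF-32 (big-endian) is the code point as a 32-bit integer.
function utf32be_from_code_point(int $cp): string
{
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException('not a valid Unicode scalar value');
    }
    return pack('N', $cp);
}

echo bin2hex(utf16be_from_code_point(0x4E2D)), "\n";  // 4e2d
echo bin2hex(utf16be_from_code_point(0x1F600)), "\n"; // d83dde00
echo bin2hex(utf32be_from_code_point(0x1F600)), "\n"; // 0001f600
```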

Solutions to garbled code problems in various PHP applications

1) Use the <meta> tag to set the page encoding

The tag <meta http-equiv="Content-Type" content="text/html; charset=xxx"> declares which character set the client's browser should use to display the page. Here xxx can be GB2312, GBK, UTF-8 (note that MySQL spells it UTF8), and so on. Most pages can use this tag to tell the browser which encoding to use, which avoids encoding errors and garbled characters. Sometimes, however, the tag seems to have no effect: whatever xxx is, the browser always uses the same encoding. The reason is explained later.

Note that the <meta> tag is part of the HTML content. It is only a declaration that travels with the HTML the server sends to the browser.
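As a minimal sketch of the charset declaration this section discusses, a PHP page can print the <meta> tag as part of its HTML output. The helper name charset_meta_tag is hypothetical, and UTF-8 is used only as an example value for xxx.

```php
<?php
// Hypothetical helper: builds the charset <meta> declaration for a page.
function charset_meta_tag(string $charset): string
{
    return '<meta http-equiv="Content-Type" content="text/html; charset='
        . htmlspecialchars($charset, ENT_QUOTES)
        . '">';
}

// The tag belongs inside <head>, before any non-ASCII text in the page.
echo "<head>\n", charset_meta_tag('UTF-8'), "\n</head>\n";
```

The declared charset should match the encoding the file is actually saved in, otherwise the declaration itself causes garbled text.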

2) header("content-type:text/html; charset=xxx");

The header() function sends the string in the brackets as an HTTP header. With the content shown above, it does basically the same job as the <meta> tag, and the two strings even look alike. The difference is that with header() the browser always uses the xxx encoding you asked for, so this function is very useful. Why? To answer that we need to look at the difference between HTTP headers and HTML content:

The HTTP header is a string that the server sends, using the HTTP protocol, before it sends the HTML content to the browser, whereas the <meta> tag is part of the HTML content. The content sent by header() therefore reaches the browser first; put informally, header() has higher priority. If a PHP page contains both header("content-type:text/html; charset=xxx") and <meta http-equiv="Content-Type" content="text/html; charset=xxx">, the browser honors only the former, the HTTP header, and ignores the meta tag. Of course, header() can be used only in PHP pages.

One question remains: why does the former always work while the latter sometimes does not? That is why Apache comes next.
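A minimal sketch of the header() approach follows. Because HTTP headers travel before the body, header() has to run before the script prints anything. The helper name content_type_header is hypothetical, and UTF-8 is only an example value for xxx.

```php
<?php
// Hypothetical helper: builds the Content-Type header line for HTML output.
function content_type_header(string $charset): string
{
    return 'Content-Type: text/html; charset=' . $charset;
}

// header() only works while no output has been sent yet.
if (!headers_sent()) {
    header(content_type_header('UTF-8'));
}

// Body output starts only after the header has been queued.
echo "<p>Hello, 你好</p>\n";
```

This sketch assumes the source file itself is saved as UTF-8, so that the bytes of the Chinese text match the declared charset.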

3) AddDefaultCharset

The conf folder under the Apache root directory contains httpd.conf, the main Apache configuration file.

Open httpd.conf in a text editor. Around line 708 (the line number varies between versions) there is a line AddDefaultCharset xxx, where xxx is an encoding name. This line sets the character set in the HTTP header of every web page served by the server to the default xxx. Having this line is equivalent to adding header("content-type:text/html; charset=xxx") to every file. This explains why the browser may keep using gb2312 even though the <meta> tag says utf-8.

If a page calls header("content-type:text/html; charset=xxx"), the server default is replaced by the character set you specify, so this function always works. If you put a "#" in front of AddDefaultCharset xxx to comment the line out, and the page does not call header("content-type..."), then the meta tag takes effect.

The priority order of the above, from highest to lowest, is:

header("content-type:text/html; charset=xxx")

AddDefaultCharset xxx

<meta http-equiv="Content-Type" content="text/html; charset=xxx">
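The priority order described in this section (PHP header() first, then Apache's AddDefaultCharset, then the HTML meta tag) can be modelled as a simple fallback chain. This is only an illustrative model of the rule as the article states it, not how a browser or server is implemented; effective_charset and its parameters are hypothetical names.

```php
<?php
// Hypothetical model: each argument is the charset declared by that
// mechanism, or null when the mechanism is not used.
function effective_charset(?string $phpHeader, ?string $apacheDefault, ?string $metaTag): ?string
{
    // The first non-null value wins, in the priority order given above.
    return $phpHeader ?? $apacheDefault ?? $metaTag;
}

// The server default is gb2312 and the page's meta tag says utf-8:
var_dump(effective_charset(null, 'gb2312', 'utf-8'));    // string(6) "gb2312"
// Adding header() to the page overrides the server default:
var_dump(effective_charset('utf-8', 'gb2312', 'utf-8')); // string(5) "utf-8"
```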

If you are a web programmer, it is recommended that you add header("content-type:text/html; charset=xxx") to each of your pages. That way the page displays correctly on any server, which makes it more portable.

4) default_charset setting in php.ini

The line default_charset = "gb2312" in php.ini defines PHP's default character set. It is generally recommended to comment this line out so that the browser chooses the encoding from the charset declared in the page header instead of having one forced on it. That way a single server can serve pages in multiple languages.
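When php.ini cannot be edited, the same setting can be inspected and overridden per script with ini_get() and ini_set(). The following is a minimal sketch, with UTF-8 chosen only as an example value.

```php
<?php
// Read the value PHP picked up from its configuration.
$configured = ini_get('default_charset');

// Override it for the current script only; php.ini itself is not modified.
ini_set('default_charset', 'UTF-8');

echo 'configured default_charset: ', var_export($configured, true), "\n";
echo 'default_charset for this script: ', ini_get('default_charset'), "\n";
```

An explicit header("content-type:text/html; charset=xxx") call in the page still states the charset directly, so the two approaches can be combined.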
