Newbies must pay attention to HTML's language encoding charset (must read)-HTML Tutorial-php.cn


云罗郡主

Released: 2018-10-10 15:25:30


This article explains why beginners must pay attention to the character encoding (charset) of their HTML pages. It is intended as a reference for anyone who needs it, and I hope you find it helpful.


Pay attention to the importance of HTML language encoding

1. Importance of encoding

A wrong or missing encoding declaration can make a web page appear garbled to visitors (for example, in IE), and it can also cause CSS compatibility problems.

2. Position of the encoding declaration

Generally, the page encoding is declared between <head> and </head> in the HTML page.

3. HTML encoding declaration

You can change the declared encoding of a web page by changing the utf-8 in charset=utf-8. The file itself must also be saved in the encoding it declares.
When writing a CSS file, we usually also put @charset "utf-8"; at the top of the file to declare the encoding of that CSS file. The HTML source and the CSS files should use the same encoding. If they do not, the result can be CSS compatibility problems, garbled text, and a broken page layout.
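The consistency rule above can be checked mechanically. The following is a minimal Python sketch, not a definitive tool: the page, the stylesheet, and the helper names declared_html_charset and declared_css_charset are all hypothetical and exist only for illustration. It reads the charset each file declares and encodes the file with that same encoding.

```python
import re
from typing import Optional

# Hypothetical page and stylesheet; the contents are illustrative only.
HTML_SOURCE = """<!DOCTYPE html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>编码示例</title>
<link rel="stylesheet" href="style.css" />
</head>
<body><p>你好</p></body>
</html>
"""

CSS_SOURCE = '@charset "utf-8";\nbody { font-family: "宋体", sans-serif; }\n'


def declared_html_charset(html: str) -> Optional[str]:
    """Return the first charset=... value found in the HTML, lowercased."""
    match = re.search(r'charset\s*=\s*["\']?([\w-]+)', html, re.IGNORECASE)
    return match.group(1).lower() if match else None


def declared_css_charset(css: str) -> Optional[str]:
    """Return the charset named by a leading @charset rule, lowercased."""
    match = re.match(r'@charset\s+"([\w-]+)"\s*;', css)
    return match.group(1).lower() if match else None


html_charset = declared_html_charset(HTML_SOURCE)
css_charset = declared_css_charset(CSS_SOURCE)

# The bytes saved to disk should use the encoding that the file declares.
html_bytes = HTML_SOURCE.encode(html_charset)
css_bytes = CSS_SOURCE.encode(css_charset)

print("HTML declares:", html_charset)
print("CSS declares: ", css_charset)
print("Consistent:   ", html_charset == css_charset)
```

The regular expressions are deliberately simple and only cover the two declaration forms mentioned in this section; a real checker would need an HTML parser.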

4. Commonly used html encoding types

The two encodings most commonly used in China are utf-8 and gb2312. Between them they generally cover the encoding needs of domestic web pages. The same two encodings are also used by the programs and databases that process the pages and store their data.

5. UTF-8 has the following characteristics:

UCS characters U+0000 to U+007F (ASCII) are encoded as the bytes 0x00 to 0x7F (ASCII compatible). This means that a file containing only 7-bit ASCII characters is identical in ASCII and in UTF-8.
All UCS characters above U+007F are encoded as a sequence of several bytes, each of which has its most significant bit set. Therefore, an ASCII byte (0x00-0x7F) can never appear as part of any other character.
The first byte of a multibyte sequence representing a non-ASCII character is always in the range 0xC0 to 0xFD, and indicates how many bytes the character contains. The remaining bytes of the sequence are in the range 0x80 to 0xBF. This makes resynchronization very easy and makes the encoding stateless and robust against missing bytes.
All possible 2^31 UCS codes can be encoded.
Under this original definition, a UTF-8 encoded character can theoretically be up to 6 bytes long, whereas 16-bit BMP characters need at most 3 bytes. (RFC 3629 later restricted UTF-8 to at most 4 bytes.)
The sorting order of big-endian UCS-4 byte strings is preserved.
Bytes 0xFE and 0xFF are never used in UTF-8 encoding.
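Several of these properties can be observed directly. The Python sketch below is illustrative only: the helper name utf8_length_from_lead is hypothetical, and it follows the original 1-to-6-byte lead-byte ranges listed above rather than the stricter modern limit.

```python
def utf8_length_from_lead(byte: int) -> int:
    """Return the sequence length announced by a UTF-8 lead byte
    (original 1-6 byte scheme, lead bytes 0xC0-0xFD)."""
    if byte < 0x80:
        return 1  # 0xxxxxxx: plain ASCII
    if 0xC0 <= byte <= 0xDF:
        return 2  # 110xxxxx
    if 0xE0 <= byte <= 0xEF:
        return 3  # 1110xxxx
    if 0xF0 <= byte <= 0xF7:
        return 4  # 11110xxx
    if 0xF8 <= byte <= 0xFB:
        return 5  # 111110xx
    if 0xFC <= byte <= 0xFD:
        return 6  # 1111110x
    # 0x80-0xBF are continuation bytes; 0xFE and 0xFF never occur.
    raise ValueError(f"0x{byte:02X} is not a valid lead byte")


samples = ["A", "é", "中", "😀"]
for ch in samples:
    encoded = ch.encode("utf-8")
    # The lead byte announces the total length of the sequence.
    assert utf8_length_from_lead(encoded[0]) == len(encoded)
    # Every byte after the first is a continuation byte in 0x80-0xBF.
    assert all(0x80 <= b <= 0xBF for b in encoded[1:])
    # Bytes 0xFE and 0xFF never appear.
    assert 0xFE not in encoded and 0xFF not in encoded
    hex_bytes = " ".join(f"{b:02X}" for b in encoded)
    print(f"U+{ord(ch):04X} -> {hex_bytes}")

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_compatible = "charset".encode("ascii") == "charset".encode("utf-8")

# The last BMP code point, U+FFFF, still fits in 3 bytes.
max_bmp_length = len("\uffff".encode("utf-8"))
```

Because lead bytes and continuation bytes occupy disjoint ranges, a decoder that starts in the middle of a stream can skip bytes in 0x80-0xBF until it reaches the next character boundary.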

6. GB2312 has the following characteristics

The GB2312 standard includes a total of 6763 Chinese characters: 3755 first-level characters and 3008 second-level characters. In addition, GB2312 includes 682 full-width characters, among them Latin letters, Greek letters, Japanese hiragana and katakana, and Russian Cyrillic letters.

GB2312 basically meets the needs of computer processing of Chinese characters: the characters it contains cover 99.75% of usage by frequency. In GB2312, the characters are divided into areas, and each area contains 94 positions for Chinese characters or symbols. This area-plus-position representation is also called the location code.

Areas 01-09 are special symbols.

Areas 16-55 are first-level Chinese characters, sorted by pinyin.

Areas 56-87 are second-level Chinese characters, sorted by radical/stroke.

Areas 10-15 and 88-94 are not coded.

For example, the character "啊" is the first Chinese character in GB2312, and its location code is 1601. Programs that use GB2312 usually store it in EUC form to stay compatible with ASCII. Each Chinese character or symbol is represented by two bytes: the first is called the "high byte" and the second the "low byte". The high byte uses 0xA1-0xF7 (0xA0 plus the area number, 01-87), and the low byte uses 0xA1-0xFE (0xA0 plus the position number, 01-94). For example, "啊" is stored as 0xB0A1 in most programs (compare with the location code: 0xB0 = 0xA0 + 16, 0xA1 = 0xA0 + 1).

Therefore, in decimal, the high byte of a Chinese character in GB2312 runs from 176 to 247 (areas 16-87), and the low byte from 161 to 254. These 72 areas provide 72 × 94 = 6768 codes, but only 6763 characters are stored: the five codes with high byte 215 and low byte 250-254 have no Chinese character assigned, so 6768 - 5 = 6763.
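The arithmetic above can be reproduced with Python's built-in gb2312 codec. This is a minimal sketch; the helper names location_code and euc_bytes are hypothetical and only illustrate the add-0xA0 rule.

```python
def location_code(ch: str) -> str:
    """Return the four-digit GB2312 location code (area + position)
    of a single two-byte GB2312 character."""
    high, low = ch.encode("gb2312")  # EUC form: two bytes per character
    area = high - 0xA0               # area number, 01-94
    position = low - 0xA0            # position number, 01-94
    return f"{area:02d}{position:02d}"


def euc_bytes(code: str) -> bytes:
    """Convert a four-digit location code back to its two EUC bytes."""
    area, position = int(code[:2]), int(code[2:])
    return bytes([0xA0 + area, 0xA0 + position])


print(location_code("啊"))                 # location code of the first hanzi
print(euc_bytes("1601").hex().upper())     # its stored byte pair

# Character count: areas 16-87 hold hanzi, and five codes are unassigned.
hanzi_areas = 87 - 16 + 1
total_hanzi = hanzi_areas * 94 - 5
print(hanzi_areas, total_hanzi)
```

The result agrees with the level counts given earlier in this section: 3755 + 3008 = 6763.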

GB2312 can be thought of as the common national encoding for Simplified Chinese in China.

7. Recommended encoding for charset

UTF-8 is the recommended choice. It can be understood as a universal encoding: both Simplified and Traditional Chinese can use it, so pages for Mainland China and for Taiwan can share the same encoding.
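The difference in coverage is easy to see in practice. The Python sketch below uses a hypothetical helper, can_encode, and two sample strings chosen for illustration: the Simplified and Traditional spellings of "Taiwan".

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


simplified = "台湾"    # Simplified Chinese spelling
traditional = "臺灣"   # Traditional Chinese spelling

for label, text in (("simplified", simplified), ("traditional", traditional)):
    print(label,
          "utf-8:", can_encode(text, "utf-8"),
          "gb2312:", can_encode(text, "gb2312"))
```

UTF-8 accepts both strings, while gb2312 only accepts the Simplified one, which is why a single UTF-8 page can serve both audiences.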

8. Web page compatibility errors caused by encoding

If encodings are mixed, the web page will appear garbled, which is what is usually meant by an encoding incompatibility. In particular, Chinese comments in a CSS file whose encoding does not match can lead to CSS compatibility problems.
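The garbling described here is simply the result of decoding bytes with the wrong encoding. A minimal Python sketch, using an arbitrary sample string chosen for illustration:

```python
text = "中文网页"

utf8_bytes = text.encode("utf-8")
gb_bytes = text.encode("gb2312")

# Case 1: a UTF-8 file read as GB2312. Byte pairs that happen to be valid
# decode to the wrong characters; invalid ones become U+FFFD.
garbled = utf8_bytes.decode("gb2312", errors="replace")
print("UTF-8 bytes read as GB2312:", garbled)

# Case 2: a GB2312 file read as UTF-8. The bytes are not valid UTF-8,
# so a strict decoder rejects them outright.
try:
    gb_bytes.decode("utf-8")
    gb_as_utf8_ok = True
except UnicodeDecodeError as exc:
    gb_as_utf8_ok = False
    print("GB2312 bytes read as UTF-8 failed:", exc.reason)

# Decoding with the matching encoding always restores the original text.
restored = utf8_bytes.decode("utf-8")
```

A browser does not raise an error in either case; it shows replacement or wrong characters instead, which is the garbled page the visitor sees.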
