In daily front-end development we constantly deal with HTML, JavaScript, CSS and other languages. Like a natural language, a computer language has its own alphabet, grammar, lexicon and encoding rules.
Here I will briefly go over the encoding issues that often come up in day-to-day HTML and JavaScript work.
In computers, the information we store is represented by binary codes. Converting back and forth between the symbols shown on screen, such as English letters and Chinese characters, and the binary codes used for storage is what encoding is about.
There are two basic concepts that need to be explained first: charset and character encoding.
charset, the character set, is a table of mapping relationships between symbols and numbers; it is what decides that 107 is the 'k' in koubei and 21475 is the '口' in Koubei. Different tables define different mappings, for example ASCII, GB2312 and Unicode. Through this mapping table of numbers and characters we can turn a number stored in binary into a particular character.
character encoding, the encoding method, decides how such a number is actually written out. For example, for the number 21475 that stands for '口': should it be represented as \u53e3, or as 口 directly? That is determined by the character encoding.
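To make the distinction concrete, here is a small JavaScript sketch; note that TextEncoder is a much newer API than anything else in this article, so treat it purely as an illustration:
// The character set answers: which number is this symbol?
'口'.charCodeAt(0); // 21475, the Unicode number of '口'
// The character encoding answers: which bytes are used to store that number?
new TextEncoder().encode('口'); // Uint8Array of the three UTF-8 bytes 0xE5, 0x8F, 0xA3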
For strings like 'koubei.com', made up of the characters Americans commonly use, they developed a character set called ASCII, whose full name is American Standard Code for Information Interchange. It uses the 128 numbers 0-127 (2 to the 7th power, 0x00-0x7F) to represent the 128 commonly used characters such as 123abc. That takes 7 bits; add a leading bit used as a sign bit, for things like representing negative numbers in complement form, and you get the 8 bits that make up a byte. The Americans were a bit stingy back then: if a byte had been designed with 16 or 32 bits from the start, the world would have fewer problems today. But at the time they probably thought 8 bits was plenty, since 128 different characters was already enough!
Since computers were invented by Americans, they saved themselves trouble by encoding only the symbols they themselves use, and that worked nicely for them. But once computers went international, problems appeared. Take China as an example: there are tens of thousands of Chinese characters. What do we do?
The existing system of 8 bits to a byte is the foundation; it cannot be torn up and changed to 16 bits or the like, because the change would be far too big. The only other path is to use several bytes to represent a single character, which is MBCS (Multi-Byte Character Set).
With the MBCS idea we can represent many more characters. For example, with 2 bytes we have 16 bits, and in theory 2 to the 16th power, or 65536, different characters. But how are those codes assigned to characters? For instance, the Unicode number of the '口' in Koubei is 21475; who decided that? The character set, the charset just introduced. ASCII is the most basic character set; on top of it we have MBCS charsets such as GB2312 and Big5 for Simplified and Traditional Chinese. Finally, an organization called the Unicode Consortium set out to create a character set covering all characters (UCS, the Universal Character Set) together with a standard for the corresponding encoding methods, namely Unicode. Starting in 1991 it released the Unicode international standard (ISBN 0-321-18578-1), and the International Organization for Standardization also took part in its formulation, as ISO/IEC 10646: the Universal Character Set. In short, Unicode is a character standard that covers essentially every symbol in use on Earth, and it is used more and more widely. The ECMA standard also stipulates that JavaScript uses Unicode for its internal characters, which means that JavaScript variable names, function names and so on are allowed to be in Chinese!
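As a toy illustration of that last point (legal JavaScript, though not necessarily a naming style to imitate):
var 口碑 = 'koubei.com'; // a variable named with Chinese characters
function 网址(站点) { return 'http://' + 站点; } // a function and its parameter named in Chinese
alert(网址(口碑)); // get 'http://koubei.com'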
Developers in China probably run into conversions between gbk, gb2312 and utf-8 most often. Strictly speaking that phrasing is not accurate: gbk and gb2312 are character sets (charsets), while utf-8 is an encoding method (character encoding), namely one way of encoding the UCS character set defined by the Unicode standard. Because web pages that use the Unicode character set are mostly encoded in UTF-8, people often put them side by side, which is actually inaccurate.
With Unicode we have a master key, valid at least until human civilization meets aliens, so everyone should use it. The most widely used Unicode encoding today is UTF-8 (8-bit UCS/Unicode Transformation Format), which has two particularly good properties:
1. It encodes the UCS character set, which is used universally around the world.
2. It is a variable-length character encoding that is compatible with ASCII.
The second point is a big advantage: systems that previously used pure ASCII keep working without any extra storage (if a fixed-length encoding stipulated that every character occupies 2 bytes, the space taken by ASCII characters would double).
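A quick way to see the storage difference, again assuming a modern environment with TextEncoder (not something the original examples relied on):
new TextEncoder().encode('koubei.com').length; // 10 bytes: pure ASCII costs exactly one byte per character in UTF-8
new TextEncoder().encode('口碑').length; // 6 bytes: each of these two Chinese characters needs 3 bytes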
To explain UTF-8 clearly, it is easiest to introduce a table:
U-00000000 – U-0000007F: 0xxxxxxx
U-00000080 – U-000007FF: 110xxxxx 10xxxxxx
U-00000800 – U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 – U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 – U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 – U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
To understand this table, we only need to look at the first two lines.
U-00000000 – U-0000007F: 0xxxxxxx
The first line means: if the binary value of a UTF-8 encoded byte is 0xxxxxxx, that is, it starts with 0 and lies between 0 and 127 in decimal, then it is a single byte representing one character, with exactly the same meaning as in ASCII. All other UTF-8 bytes start with 1 (1xxxxxxx, greater than 127), and those characters need at least 2 bytes to represent one symbol. So the first bit of a byte is a switch telling you whether the character is an ASCII code; this is the compatibility mentioned above. In the English definition, these are the first two properties of UTF-8 encoding:
UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
All UCS characters > U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
Then let’s look at the second line:
U-00000080 – U-000007FF: 110xxxxx 10xxxxxx
Look at the first byte: 110xxxxx. Its meaning is: I am not an ASCII code (because the first bit is not 0); I am the first byte of a multi-byte character (the second bit is 1); the character I help represent is made up of 2 bytes (the third bit is 0); and from the fourth bit onward is where the character's information is stored.
Now look at the second byte: 10xxxxxx. Its meaning is: I am not an ASCII code (because the first bit is not 0); I am not the first byte of a multi-byte character (the second bit is 0); and from the third bit onward is where the character's information is stored.
From this example we can conclude that in UTF-8, within a long run of consecutive binary bytes, one symbol may be represented by 2 to 6 bytes. Compared with ASCII, where one byte represents one symbol, we need room to store two extra pieces of information. First, the starting position of a symbol, a "starter": in biological terms, the start codon AUG in protein translation. Second, the number of bytes the symbol uses (strictly speaking, if every symbol has a starter the length is redundant, but carrying the length adds fault tolerance when some bytes are lost). The solution: the second bit of a byte tells whether that byte is the starting byte of a character (the first bit has already been used up: 0 means ASCII, 1 means non-ASCII), so the first byte of a multi-byte symbol must look like 11xxxxxx, a binary number between 192 and 255. Then, from the third bit onward, the length is encoded: a third bit of 0 means the symbol takes 2 bytes, and each additional 1 starting from the third bit adds one more byte to the character's length. UTF-8 defines characters of up to 6 bytes, which needs 4 more 1s than the 2-byte starter 110xxxxx, so that starter is 1111110x, as shown in the table above.
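A minimal sketch of this "starter" logic in JavaScript (my own illustration, not code from any standard or library): given the first byte of a UTF-8 sequence, the leading bits tell you how many bytes the whole character occupies.
function utf8SequenceLength(firstByte) {
  if ((firstByte & 0x80) === 0) return 1; // 0xxxxxxx: plain ASCII, a single byte
  if ((firstByte & 0xC0) === 0x80) return 0; // 10xxxxxx: a continuation byte, never a starter
  var length = 0;
  // Count the leading 1 bits: 110xxxxx -> 2 bytes, 1110xxxx -> 3 bytes, ... up to 1111110x -> 6 bytes
  for (var mask = 0x80; (firstByte & mask) !== 0; mask >>= 1) {
    length++;
  }
  return length;
}
utf8SequenceLength(0x61); // 1, the ASCII letter 'a'
utf8SequenceLength(0xE5); // 3, the first byte of the sequence for '口'
utf8SequenceLength(0x8F); // 0, a continuation byte cannot start a character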
Look again at the English definition in the standard; it expresses the same meaning:
The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.
The real information bits (that is, the actual number from the charset) are placed, in binary and in order, onto the 'x' positions of the table above. Take the Chinese characters that we Chinese programmers deal with most: their range is U-00000800 – U-0000FFFF, and from the table you can see that UTF-8 represents this range with three bytes (this is why Chinese characters encoded in UTF-8 take more storage than the same gb2312-charset characters encoded with EUC-CN, which uses 2 bytes per character). Again using the '口' of Koubei as an example, its number in Unicode looks like this:
口: 21475 == 0x53e3 == binary 101001111100011
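Packing those bits into the 1110xxxx 10xxxxxx 10xxxxxx pattern from the table gives the actual UTF-8 bytes. A hand-rolled sketch for this one character (a real program would of course rely on the platform's encoder):
var code = 0x53e3; // 21475, the Unicode number of '口'
// Split the bits into groups of 4, 6 and 6, then prepend the markers from the table
var byte1 = 0xE0 | (code >> 12); // 1110xxxx -> 11100101 -> 0xE5
var byte2 = 0x80 | ((code >> 6) & 0x3F); // 10xxxxxx -> 10001111 -> 0x8F
var byte3 = 0x80 | (code & 0x3F); // 10xxxxxx -> 10100011 -> 0xA3
[byte1, byte2, byte3].map(function (b) { return b.toString(16); }); // ['e5', '8f', 'a3']
So '口' is stored in UTF-8 as the three bytes E5 8F A3.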
In JavaScript, run this code (use the Firebug console, or edit an HTML file and put the following code between a pair of script tags):
alert('\u53e3'); // get '口'
alert(escape('口')); // get '%u53E3'
alert(String.fromCharCode(21475)); // get '口'
alert('口'.charCodeAt(0)); // get 21475
alert(encodeURI('口')); // get '%E5%8F%A3'
As you can see, a string literal can produce the character '口' from its hexadecimal Unicode code in \u form, and the String.fromCharCode method accepts the decimal Unicode code and returns the character '口'.
The second alert got '%u53E3', a non-standard Unicode flavor of URI percent-encoding. This usage has been rejected by the W3C and is not included in any RFC; the ECMA-262 standard does specify this behavior of escape, but presumably only for the time being.
What is more interesting is the '%E5%8F%A3' produced by the fifth alert. What is this, and how did it come about?
This is percent-encoding, which is commonly used in URIs and is specified in RFC 3986.
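Connecting the dots: '口' is stored as the three UTF-8 bytes E5 8F A3, and percent-encoding simply writes each byte as '%' followed by its hexadecimal value, which is exactly what the fifth alert showed:
encodeURI('口'); // get '%E5%8F%A3', the UTF-8 bytes of '口' in percent-encoded form
decodeURIComponent('%E5%8F%A3'); // get '口' back again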