What every JavaScript developer should know about Unicode-JS Tutorial-php.cn

What every JavaScript developer should know about Unicode

高洛峰

Release: 2016-10-15 11:52:36


Table of contents:

1 The idea behind Unicode

2 Basic concepts of Unicode

2.1 Characters and code points

2.2 Unicode plane

2.3 Code units

2.4 Surrogate pairs

2.5 Combining characters

3 Unicode in JavaScript

3.1 Escape sequence

3.2 String comparison

3.3 String length

3.4 Character positioning

3.5 Regular expression matching

4 Conclusion

1. The idea behind Unicode

Let's start with the most basic question: how are you able to read and understand this article? The answer is simple: you know the meaning of the letters and of the words they form.

And how do you understand the meaning of these letters? The answer is again simple: you (the reader) and I (the author) agree on the association between the graphical symbols (what is displayed on the screen) and the letters of the language (that is, their meaning).

For computers the principle is similar, with one difference: a computer does not understand the meaning of letters. It treats them only as specific sequences of bits.

Let us imagine a scenario: the computer of User1 sends the message 'hello' to the computer of User2.

The computer doesn’t know what these letters mean. So User1's computer converts the message 'hello' into the sequence of numbers 0x68 0x65 0x6C 0x6C 0x6F, where each letter corresponds to a number: h corresponds to 0x68, e corresponds to 0x65, and so on.

It then sends these numbers to User2's computer.

When User2's computer receives the sequence 0x68 0x65 0x6C 0x6C 0x6F, it uses the same correspondence between letters and numbers to reconstruct the message, and 'hello' is displayed correctly.
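The exchange above can be sketched in a few lines of JavaScript. This is only an illustration: the variable names are made up for the example, and no real network transfer takes place; the array of numbers simply stands in for what would be sent.

```javascript
// Sender side: turn each letter of the message into its agreed-upon number.
const message = 'hello';
const numbers = Array.from(message, (letter) => letter.charCodeAt(0));

// Show the numbers in the 0x.. notation used in the text.
const asHex = numbers.map(
  (n) => '0x' + n.toString(16).toUpperCase().padStart(2, '0')
);
console.log(asHex.join(' ')); // 0x68 0x65 0x6C 0x6C 0x6F

// Receiver side: apply the same correspondence in reverse to rebuild the message.
const received = String.fromCharCode(...numbers);
console.log(received); // hello
```

Both sides succeed only because they share the same letter-to-number table.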

This agreement between different computers on the correspondence between letters and numbers is what Unicode standardizes.

According to Unicode, h is an abstract character named LATIN SMALL LETTER H. This abstract character corresponds to the number 0x68, which is a code point labeled U+0068. These concepts are explained in the next chapter.

The role of Unicode is to provide a list of abstract characters (a character set) and to assign each character a unique identifier, its code point (a coded character set).

2. Basic concepts of Unicode

The www.unicode.org website states:

Unicode provides a unique number for every character,

no matter what the platform,

no matter what the program,

no matter what the language.

Unicode is a universal character set that defines the characters of most of the world's writing systems and assigns each character a unique number (its code point).

Unicode includes the characters of most modern languages, punctuation marks, diacritical marks, mathematical symbols, technical symbols, arrows, emoji, and more.

The first version, Unicode 1.0, was released in October 1991 and contained 7,161 characters. At the time of writing, the latest version, 9.0 (released in June 2016), encodes 128,172 characters.

The universality and openness of Unicode solve a long-standing problem: vendors used to implement many different character sets and encodings, which were difficult to handle.

Creating an application that supports all character sets and encodings is very complex, and the encoding you choose may not even support all the languages you need.

If you think Unicode is hard, just think how much harder it would be to program without it.

I still remember the days when I picked character sets and encodings at random to read the contents of files. It was pure luck!

2.1 Characters and code points

Abstract characters (i.e. text characters) are information units used to organize, manage or represent text data.

Characters in Unicode are an abstract concept. Each abstract character has a name, such as LATIN SMALL LETTER A. The graphical representation (glyph) of this abstract character is a. (Translator's note: a glyph is the rendered image of a character.)

A code point is the number assigned to an abstract character.

A code point is written in the form U+<hex>, where U+ is the prefix denoting Unicode and <hex> is a hexadecimal number. For example, U+0041 and U+2603 are both code points.

The value range of code points is from U+0000 to U+10FFFF.

Remember that a code point is a simple number. Keep this in mind when thinking about Unicode.

A code point is like an index into an array of characters.

The magic of Unicode lies in associating code points with abstract characters. For example, the abstract character corresponding to U+0041 is named LATIN CAPITAL LETTER A (shown as A), and the abstract character corresponding to U+2603 is named SNOWMAN (shown as ☃).
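As a small sketch of the idea that a code point is just a number, the standard String.prototype.codePointAt and String.fromCodePoint methods convert between a character and its code point. The formatCodePoint helper is a hypothetical name introduced only for this example; it prints a number in the U+<hex> notation.

```javascript
// Hypothetical helper: format a number in the U+<hex> notation,
// padded to at least 4 hexadecimal digits.
function formatCodePoint(codePoint) {
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

// From character to code point.
console.log(formatCodePoint('A'.codePointAt(0))); // U+0041
console.log(formatCodePoint('☃'.codePointAt(0))); // U+2603

// From code point back to character.
console.log(String.fromCodePoint(0x0041)); // A
console.log(String.fromCodePoint(0x2603)); // ☃
```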

Note that not all code points have an associated abstract character. There are 1,114,112 code points available, but only 128,237 of them are assigned.

2.2 Unicode plane

A plane is a range of 65,536 (0x10000) consecutive Unicode code points, from U+n0000 to U+nFFFF, where n ranges from 0x0 to 0x10.

These planes divide Unicode code points into 17 equal-sized sets:

Plane 0 contains code points from U+0000 to U+FFFF

Plane 1 contains code points from U+10000 to U+1FFFF

...

Plane 16 contains code points from U+100000 to U+10FFFF
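Because every plane holds exactly 0x10000 code points, the plane of a code point can be computed by integer division. The following is a minimal sketch; planeOf, PLANE_SIZE, and PLANE_COUNT are names assumed for this example and are not part of any JavaScript API.

```javascript
const PLANE_SIZE = 0x10000; // 65,536 code points per plane
const PLANE_COUNT = 17;     // planes 0 through 16

// Hypothetical helper: the plane number is the code point
// divided by the plane size, rounded down.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a Unicode code point: ' + codePoint);
  }
  return Math.floor(codePoint / PLANE_SIZE);
}

console.log(planeOf(0x0041));   // 0
console.log(planeOf(0x1FFFF));  // 1
console.log(planeOf(0x100000)); // 16

// 17 planes of 65,536 code points cover the whole range U+0000 to U+10FFFF.
console.log(PLANE_SIZE * PLANE_COUNT); // 1114112
```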


Basic Multilingual Plane

Plane 0 is special and is called the Basic Multilingual Plane, or BMP for short. It contains the characters of most modern languages (Basic Latin, Cyrillic, Greek, etc.) and a large number of symbols.

As mentioned above, the code points of the Basic Multilingual Plane range from U+0000 to U+FFFF, so a BMP code point has at most 4 hexadecimal digits.

Most of the time developers deal with characters in the BMP, which contains the characters needed in most cases.

Some characters in the BMP:

e corresponds to code point U+0065. Abstract character name: LATIN SMALL LETTER E

| corresponds to code point U+007C. Abstract character name: VERTICAL BAR

■ corresponds to code point U+25A0. Abstract character name: BLACK SQUARE

☂ corresponds to code point U+2602. Abstract character name: UMBRELLA

Astral planes

The 16 planes after the BMP (Plane 1, Plane 2, ..., Plane 16) are called astral planes or supplementary planes.

The code points of the astral planes are called astral code points. These code points range from U+10000 to U+10FFFF.

Astral code points have 5 or 6 hexadecimal digits: U+ddddd or U+dddddd.
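The range and digit counts above can be checked with a short sketch. The isAstral and hexDigits helpers are hypothetical names used only for illustration; the sample values are the boundaries of the BMP and of the astral range.

```javascript
// Hypothetical helper: a code point is astral when it lies beyond the BMP.
function isAstral(codePoint) {
  return codePoint >= 0x10000 && codePoint <= 0x10FFFF;
}

// Number of hexadecimal digits needed to write the code point.
function hexDigits(codePoint) {
  return codePoint.toString(16).length;
}

console.log(isAstral(0xFFFF), hexDigits(0xFFFF));     // false 4
console.log(isAstral(0x10000), hexDigits(0x10000));   // true 5
console.log(isAstral(0xFFFFF), hexDigits(0xFFFFF));   // true 5
console.log(isAstral(0x100000), hexDigits(0x100000)); // true 6
console.log(isAstral(0x10FFFF), hexDigits(0x10FFFF)); // true 6
```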

Let’s take a look at some characters in the astral plane: