Revisiting Unicode and UTF-8 encoding
Until today (just now, to be precise) I didn't realize that UTF-8 encoding and Unicode encoding are two different things 囧
They are related, though. First, the differences:
UTF-8 has a variable length: 1, 2, or 3 bytes for the characters covered here (characters beyond FFFF take 4 bytes)
Unicode, in its UCS-2 form, has a fixed length of 2 bytes
UTF-8 can be converted to and from Unicode
The relationship between Unicode and UTF-8
Unicode (hex)    UTF-8 (binary)
0000 - 007F      0xxxxxxx
0080 - 07FF      110xxxxx 10xxxxxx
0800 - FFFF      1110xxxx 10xxxxxx 10xxxxxx
The table above can be read in two ways. Most obviously, it gives the correspondence between Unicode ranges and UTF-8 byte formats. It also shows how Unicode and UTF-8 are converted to each other:
Let's start with the conversion from UTF-8 to Unicode
Match the UTF-8 bytes against the three formats above. Once the format is known, remove the fixed bits (the non-x positions in the table), then group the remaining bits into 8-bit groups from right to left. If the leftmost group has fewer than 8 bits, pad it with zeros on the left to make up 2 bytes (16 bits). These 16 bits are the Unicode encoding that corresponds to the UTF-8 character. Take a look at the following examples:
The sample text is saved as UTF-8, and you can use WinHex to view its hexadecimal representation
The code is as follows:
汉 => E6B189 => 11100110 10110001 10001001 => 01101100 01001001 => 6C49
字 => E5AD97 => 11100101 10101101 10010111 => 01011011 01010111 => 5B57
#The following is the result of running in the Chrome console
'\u6C49'
"汉"
'\u5B57'
"字"
#At this point, converting from UTF-8 to Unicode is straightforward. Here is the pseudocode of the conversion, using 字 as the example
Read one byte: 11100101
Determine the format of the UTF-8 character: it matches the third format, 3 bytes
Read the remaining 2 bytes to get 11100101 10101101 10010111
Remove the fixed bits according to the format: 0101 101101 010111
Regroup into 8-bit groups from the right, adding zeros on the left if there are fewer than 16 bits: 01011011 01010111 => 5B57
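The steps above can be sketched in JavaScript roughly as follows. This is a minimal sketch, not a full decoder: the helper name `decodeUtf8Char` is made up for illustration, it handles only the three formats in the table (up to FFFF), and it does not check that the continuation bytes really start with `10`.

```javascript
// Hypothetical helper: decode one UTF-8 sequence starting at bytes[offset].
// Covers only the three formats in the table (U+0000 to U+FFFF).
function decodeUtf8Char(bytes, offset = 0) {
  const b0 = bytes[offset];
  if ((b0 & 0x80) === 0x00) {
    // 0xxxxxxx: 7 payload bits
    return { codePoint: b0, length: 1 };
  }
  if ((b0 & 0xE0) === 0xC0) {
    // 110xxxxx 10xxxxxx: 5 + 6 payload bits
    const b1 = bytes[offset + 1];
    return { codePoint: ((b0 & 0x1F) << 6) | (b1 & 0x3F), length: 2 };
  }
  if ((b0 & 0xF0) === 0xE0) {
    // 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 payload bits
    const b1 = bytes[offset + 1];
    const b2 = bytes[offset + 2];
    return {
      codePoint: ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F),
      length: 3,
    };
  }
  throw new Error("lead byte does not match any of the three formats");
}

console.log(decodeUtf8Char([0xE6, 0xB1, 0x89]).codePoint.toString(16)); // 6c49
console.log(decodeUtf8Char([0xE5, 0xAD, 0x97]).codePoint.toString(16)); // 5b57
```

The masks (`0x1F`, `0x0F`, `0x3F`) strip the fixed bits, and the shifts do the regrouping that the pseudocode performs by hand.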
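Now look at the conversion from Unicode to UTF-8. It is the same table read in the other direction: pick the format from the range the Unicode value falls in, then fill the x positions with the value's bits from right to left, padding with zeros on the left.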
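As a rough illustration of the Unicode to UTF-8 direction, here is a small JavaScript sketch based only on the table above. The names `encodeUtf8Char` and `toHex` are made up for this example, and the function deliberately rejects anything above FFFF, since the table stops there.

```javascript
// Hypothetical helper: encode one Unicode value (U+0000 to U+FFFF) as UTF-8 bytes.
function encodeUtf8Char(codePoint) {
  if (codePoint < 0 || codePoint > 0xFFFF) {
    throw new RangeError("only values from 0000 to FFFF are handled in this sketch");
  }
  if (codePoint <= 0x7F) {
    // 0xxxxxxx
    return [codePoint];
  }
  if (codePoint <= 0x7FF) {
    // 110xxxxx 10xxxxxx
    return [0xC0 | (codePoint >> 6), 0x80 | (codePoint & 0x3F)];
  }
  // 1110xxxx 10xxxxxx 10xxxxxx
  return [
    0xE0 | (codePoint >> 12),
    0x80 | ((codePoint >> 6) & 0x3F),
    0x80 | (codePoint & 0x3F),
  ];
}

// Format a byte array as an uppercase hex string.
const toHex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join("");

console.log(toHex(encodeUtf8Char(0x6C49))); // E6B189
console.log(toHex(encodeUtf8Char(0x5B57))); // E5AD97
```

Each `0x80 | (... & 0x3F)` takes the next 6 bits of the value and adds the fixed `10` prefix. The lead byte gets whatever bits are left plus its own fixed prefix.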
The problem
Now for what prompted all this. The front end accepts many words as input, and each word may take at most 30 bytes when encoded as UTF-8, so the length is validated on both the front end and the back end. JavaScript uses Unicode encoding while the back-end program uses UTF-8, so the current solution is as follows:
Front end
function utf8_bytes(str) {
    var len = 0, unicode;
    for (var i = 0; i < str.length; i++) {
        // charCodeAt returns one 16-bit code unit (0x0000 to 0xFFFF)
        unicode = str.charCodeAt(i);
        if (unicode < 0x0080) {
            len++;
        } else if (unicode < 0x0800) {
            len += 2;
        } else if (unicode <= 0xFFFF) {
            len += 3;
        } else {
            throw "characters must be UCS-2!!";
        }
    }
    return len;
}
// Example
utf8_bytes('asdasdas')  // 8
utf8_bytes('yrt燕睿涛')  // 12
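One way to sanity-check a hand-written counter like the one above is to compare it with the platform's own encoder. The sketch below assumes an environment where the standard `TextEncoder` API is available (modern browsers and Node.js). The name `utf8BytesNative` is made up for this example.

```javascript
// Hypothetical helper: count UTF-8 bytes using the built-in encoder.
function utf8BytesNative(str) {
  // TextEncoder always encodes to UTF-8 and returns a Uint8Array.
  return new TextEncoder().encode(str).length;
}

console.log(utf8BytesNative("asdasdas")); // 8
console.log(utf8BytesNative("汉字"));      // 6
console.log(utf8BytesNative("😀"));        // 4
```

The last line shows where the two approaches differ. A character outside the FFFF range is stored in a JavaScript string as two 16-bit code units, so a counter that adds 3 per `charCodeAt` value would report 6 for it, while the actual UTF-8 encoding is 4 bytes.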
Back end