It seems this problem can't be settled in the comments, so I'm writing it up in my answer.
First of all, the internal representation of String in JavaScript is always UTF-16, and length is always counted in UTF-16 code units. In short, length reflects the number of code units (which, for ordinary Chinese text, equals the number of characters), never the byte size!
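You can check the difference between code units and bytes directly. A minimal sketch (TextEncoder is used here only to count the UTF-8 bytes):

```js
const s = "张三";
console.log(s.length); // 2: two UTF-16 code units, one per character
// The UTF-8 encoding of the same string takes 6 bytes (3 per character):
console.log(new TextEncoder().encode(s).length); // 6
// length reflects the in-memory UTF-16 string, never the file's bytes on disk
```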
So why did someone test the same string "张三" and get 2 when the file was encoded in UTF-8, but 3 when it was encoded in GBK? Because the browser failed to identify the file's encoding and misdecoded the source. The 3 is just the output of a program fed the wrong input.
Several concepts are tangled together here. Let's first discuss JS written inline in an HTML page, and then the case where an external JS file is loaded via src.
Suppose test.html itself is encoded in GBK. When the browser loads test.html, how does it know the encoding?
Roughly, it has the following options:
1. The charset value in the HTTP header, e.g. Content-Type: text/html;charset=gbk. In PHP, for example, you can call header("Content-Type: text/html;charset=gbk") to tell the browser which encoding to use.
2. If the HTTP header carries no charset, check whether the HTML head contains <meta http-equiv="Content-Type" content="text/html;charset=gbk"/> or <meta charset="GBK"/>. If so, parse the HTML with the encoding specified there. Note that any non-ASCII text appearing before this tag may already have been parsed as mojibake:
```html
<head>
<title>可能显示成乱码</title><!-- "may be displayed as mojibake" -->
<meta charset="GBK"/><!-- too late: the title above has already been parsed -->
</head>
```
3. If no charset was found by the steps above, a BOM can still settle it. With no BOM either, the browser can only guess. How exactly does it guess? Why would we even want to know? When we write code, can we really afford to let the browser play guessing games with the encoding?
What if the browser ultimately fails to determine the HTML file's encoding correctly? Then it's mojibake! For text in the HTML, the garbled characters are easy to see. But in JS, it is much harder to notice, like this:
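Here is a sketch of the failure mode (the file is saved as UTF-8, but nothing declares its encoding, so a browser may guess GBK and misdecode it):

```html
<p>张三</p> <!-- shows up garbled on the page: easy to notice -->
<script>
  alert("张三".length); // silently alerts 3 instead of 2: much harder to notice
</script>
```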
You might ask: why does it output 3 in this case? Stop asking!! Wrong input, wrong output! Even a misdecoded page that happens to output 2 is displaying mojibake and is still wrong; it isn't worth the space to explain further.
Okay, want to produce an output of 3 on purpose? It's easy: save the following code to test.html using UTF-8 encoding:
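A minimal sketch of such a file (the point is that the declared encoding lies about the actual one):

```html
<!-- save this file as UTF-8, despite what the meta tag claims -->
<meta charset="GBK"/>
<script>
  alert("张三".length); // alerts 3: the six UTF-8 bytes of "张三" decode as three GBK characters
</script>
```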
This way the browser trusts the GBK you declared in the meta charset, but the file is actually UTF-8 encoded, so everything comes out garbled!
Okay, let’s talk about the case where src refers to external JS.
Assume the external file test.js is encoded in UTF-8 while test.html is encoded in GBK, and the JS file is referenced like this: <script src="test.js" charset="UTF8"></script>. How does the browser determine the encoding of the JS file? Along similar lines:
1. The charset value in the HTTP header returned with test.js, e.g. Content-Type: text/javascript;charset=gbk. Of course, under the file:// protocol there is no such header.
2. The charset attribute on the <script> tag.
3. If neither of the above is present, the JS is assumed to use the same encoding as the current HTML page, i.e. test.html; how that encoding is determined was covered earlier.
If the HTML's encoding differs from that of the external JS, and neither a <script charset="XXX"> attribute nor an HTTP header charset is given, the JS source is misdecoded and you get the garbled effect described earlier.
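A sketch of the mismatch and its fix, using the same file names as above:

```html
<!-- test.html is saved as GBK; test.js is saved as UTF-8 -->

<!-- Broken: no charset anywhere, so test.js is decoded as GBK like the page,
     and any string literals in it turn into mojibake -->
<script src="test.js"></script>

<!-- Fixed: declare the script's encoding explicitly on the tag -->
<script src="test.js" charset="UTF-8"></script>
```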
Ugh, I'm rambling. Plenty of articles have already discussed this; just go read them!
For example, these two, if you can stomach the length: http://ued.taobao.org/blog/2011/08/encode-war/ and http://tgideas.qq.com/webplat/info/news_version3/804/808/811/m579/201307/218730.shtml. And here is what the W3C says about script charset: http://www.w3.org/TR/html5/scripting-1.html#attr-script-charset
"Too long; didn't read" is not a good habit!!!
Finally, the best practice is of course: always encode everything in UTF-8, and declare the charset for external JS.
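Concretely, that best practice looks something like this (a minimal sketch):

```html
<!-- save every file as UTF-8, and say so everywhere -->
<meta charset="utf-8"/>
<script src="test.js" charset="UTF-8"></script>
```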
To understand this problem, let's first go back to the definition of the String.length property. Look it up on MDN: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length
This property returns the number of code units in the string. UTF-16, the string format used by JavaScript, uses a single 16-bit code unit to represent the most common characters, but needs to use two code units for less commonly-used characters, so it's possible for the value returned by length to not match the actual number of characters in the string.
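A quick illustration of "code units" versus "characters" (plain standard JavaScript):

```js
"张三".length;    // 2: Chinese characters sit in the BMP, one code unit each
"𠮷".length;      // 2: a single character outside the BMP needs a surrogate pair
[..."𠮷"].length; // 1: spreading iterates by code points, counting actual characters
```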
MDN clearly states that JS measures a string's length in UTF-16 code units. If the encoding of your current file (readable via document.charset) is declared and decoded correctly as utf-8, you get the expected result. If your file is actually gb2312 and the browser decodes it with the wrong encoding, unexpected errors follow. The solution is to declare the encoding correctly, utf-8 (many thanks to @Jex for the detailed explanation of this part): <meta charset="utf-8" />
Regarding how to correctly calculate the length of Chinese strings, you may also refer to this page: http://www.puritys.me/docs-blog/article-107-String-Length-%E4%B8%AD%E6%96%87%E5%AD%97%E4%B8%B2%E9%95%B7%E5%BA%A6.html