It seems this problem can't be settled in the comments, so I'm writing it up in my answer.
First of all, the internal representation of String in JavaScript is always UTF-16, and length is always counted in UTF-16 code units. In short, length reflects the number of code units (which, for ordinary Chinese text, equals the number of characters), never the byte size!
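You can check the difference between code units and bytes directly. A minimal sketch (TextEncoder is used here only to count the UTF-8 bytes):

```js
const s = "张三";
console.log(s.length); // 2: two UTF-16 code units, one per character
// The UTF-8 encoding of the same string takes 6 bytes (3 per character):
console.log(new TextEncoder().encode(s).length); // 6
// length reflects the in-memory UTF-16 string, never the file's bytes on disk
```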
So why did someone test the same string "张三" and get 2 when the file was encoded in UTF-8, but 3 when it was encoded in GBK? Because the browser failed to identify the file's encoding and misdecoded the source. The 3 is just the output of a program fed the wrong input.
Several concepts are tangled together here. Let's first discuss JS written inline in an HTML page, and then the case where an external JS file is loaded via src.
Suppose test.html itself is encoded in GBK. When the browser loads test.html, how does it know the encoding?
Roughly, it has the following options:
1. The charset value in the HTTP header, e.g. Content-Type: text/html;charset=gbk. In PHP, for example, you can call header("Content-Type: text/html;charset=gbk") to tell the browser which encoding to use.
2. If the HTTP header carries no charset, check whether the HTML head contains <meta http-equiv="Content-Type" content="text/html;charset=gbk"/> or <meta charset="GBK"/>. If so, parse the HTML with the encoding specified there. Note that any non-ASCII text appearing before this tag may already have been parsed as mojibake:
```html
<head>
<title>可能显示成乱码</title><!-- "may be displayed as mojibake" -->
<meta charset="GBK"/><!-- too late: the title above has already been parsed -->
</head>
```
3. If no charset was found by the steps above, a BOM can still settle it. With no BOM either, the browser can only guess. How exactly does it guess? Why would we even want to know? When we write code, can we really afford to let the browser play guessing games with the encoding?
What if the browser ultimately fails to determine the HTML file's encoding correctly? Then it's mojibake! For text in the HTML, the garbled characters are easy to see. But in JS, it is much harder to notice, like this:
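Here is a sketch of the failure mode (the file is saved as UTF-8, but nothing declares its encoding, so a browser may guess GBK and misdecode it):

```html
<p>张三</p> <!-- shows up garbled on the page: easy to notice -->
<script>
  alert("张三".length); // silently alerts 3 instead of 2: much harder to notice
</script>
```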
You might ask: why does it output 3 in this case? Stop asking!! Wrong input, wrong output! Even a misdecoded page that happens to output 2 is displaying mojibake and is still wrong; it isn't worth the space to explain further.
Okay, want to produce an output of 3 on purpose? It's easy: save the following code to test.html using UTF-8 encoding:
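A minimal sketch of such a file (the point is that the declared encoding lies about the actual one):

```html
<!-- save this file as UTF-8, despite what the meta tag claims -->
<meta charset="GBK"/>
<script>
  alert("张三".length); // alerts 3: the six UTF-8 bytes of "张三" decode as three GBK characters
</script>
```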
This way the browser trusts the GBK you declared in the meta charset, but the file is actually UTF-8 encoded, so everything comes out garbled!
Okay, let’s talk about the case where src refers to external JS.
Assume the external file test.js is encoded in UTF-8 while test.html is encoded in GBK, and the JS file is referenced like this: <script src="test.js" charset="UTF8"></script>. How does the browser determine the encoding of the JS file? Along similar lines:
1. The charset value in the HTTP header returned with test.js, e.g. Content-Type: text/javascript;charset=gbk. Of course, under the file:// protocol there is no such header.
2. The charset attribute on the <script> tag.
3. If neither of the above is present, the JS is assumed to use the same encoding as the current HTML page, i.e. test.html; how that encoding is determined was covered earlier.
If the HTML's encoding differs from that of the external JS, and neither a <script charset="XXX"> attribute nor an HTTP header charset is given, the JS source is misdecoded and you get the garbled effect described earlier.
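A sketch of the mismatch and its fix, using the same file names as above:

```html
<!-- test.html is saved as GBK; test.js is saved as UTF-8 -->

<!-- Broken: no charset anywhere, so test.js is decoded as GBK like the page,
     and any string literals in it turn into mojibake -->
<script src="test.js"></script>

<!-- Fixed: declare the script's encoding explicitly on the tag -->
<script src="test.js" charset="UTF-8"></script>
```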
Ugh, I'm rambling. Plenty of articles have already discussed this; just go read them!
For example, these two, if you can stomach the length: http://ued.taobao.org/blog/2011/08/encode-war/ and http://tgideas.qq.com/webplat/info/news_version3/804/808/811/m579/201307/218730.shtml. And here is what the W3C says about script charset: http://www.w3.org/TR/html5/scripting-1.html#attr-script-charset
"Too long; didn't read" is not a good habit!!!
Finally, the best practice is of course: always encode everything in UTF-8, and declare the charset for external JS.
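Concretely, that best practice looks something like this (a minimal sketch):

```html
<!-- save every file as UTF-8, and say so everywhere -->
<meta charset="utf-8"/>
<script src="test.js" charset="UTF-8"></script>
```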
To understand this problem, let's first go back to the definition of the String.length property. Look it up on MDN: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/length
This property returns the number of code units in the string. UTF-16, the string format used by JavaScript, uses a single 16-bit code unit to represent the most common characters, but needs to use two code units for less commonly-used characters, so it's possible for the value returned by length to not match the actual number of characters in the string.
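A quick illustration of "code units" versus "characters" (plain standard JavaScript):

```js
"张三".length;    // 2: Chinese characters sit in the BMP, one code unit each
"𠮷".length;      // 2: a single character outside the BMP needs a surrogate pair
[..."𠮷"].length; // 1: spreading iterates by code points, counting actual characters
```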
MDN clearly states that JS measures a string's length in UTF-16 code units. If the encoding of your current file (readable via document.charset) is declared and decoded correctly as utf-8, you get the expected result. If your file is actually gb2312 and the browser decodes it with the wrong encoding, unexpected errors follow. The solution is to declare the encoding correctly, utf-8 (many thanks to @Jex for the detailed explanation of this part): <meta charset="utf-8" />
Regarding how to correctly calculate the length of Chinese strings, you may also refer to this page: http://www.puritys.me/docs-blog/article-107-String-Length-%E4%B8%AD%E6%96%87%E5%AD%97%E4%B8%B2%E9%95%B7%E5%BA%A6.html