As the table shows, each browser parses a page differently when the page does not declare its encoding by any means. Of course, for the simplest page it makes no difference which encoding is used (provided it is a superset of ASCII), but this is enough to show how important it is to set the encoding correctly.
Encoding Declaration
HTML4 and HTML5 each devote a chapter to how the encoding is declared; the relevant chapters of both specifications are worth reading.
First of all, what is an encoding? An encoding tells the browser (or user agent) which algorithm to use to decode the byte stream into the truly correct content. In the HTML standard, encodings can be referred to by aliases. These aliases come from the IANA registry, and only encodings that appear in that list are recognized by browsers. Therefore, if UTF-8 is written as UTF8, the browser may ignore the declaration entirely. In addition, encoding aliases are case-insensitive.
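To make the alias rule concrete, here is a minimal sketch in Python. The table KNOWN_ALIASES and the function resolve_alias are hypothetical names invented for this illustration, and the table is only a tiny excerpt of the kind of entries found in the IANA list. It only shows the two points made above: matching ignores case, and a label missing from the list is not recognized.

```python
# Hypothetical, heavily abbreviated alias table for illustration only.
# The real IANA registry contains far more entries than this.
KNOWN_ALIASES = {
    "utf-8": "UTF-8",
    "iso-8859-1": "ISO-8859-1",
    "latin1": "ISO-8859-1",
    "gbk": "GBK",
    "gb2312": "GB2312",
}


def resolve_alias(label):
    """Return the canonical name for a label, or None if it is not listed."""
    # Aliases are case-insensitive, so compare in lower case.
    return KNOWN_ALIASES.get(label.lower())


print(resolve_alias("Utf-8"))  # listed, matched regardless of case
print(resolve_alias("UTF8"))   # not in this table, so it is not recognized
```

A browser that follows the list strictly would behave like the second call and fall back to whatever it would have done without a declaration.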
In HTML4, there are three ways to specify the encoding of a page. In order of priority, they are:
The charset parameter of the Content-Type field in the HTTP header.
A declaration using the <meta http-equiv="Content-Type"> tag.
For some external resources, such as JS files loaded by the <script> tag, a declaration through the charset attribute on the tag.
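The priority order above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names charset_from_content_type and choose_encoding are made up for this example, and returning None when nothing is declared is a placeholder, since what a browser does in that case differs from browser to browser.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # Returns the charset in lower case, or None if no charset is present.
    return msg.get_content_charset()


def choose_encoding(http_charset=None, meta_charset=None, attr_charset=None):
    """Pick the first declared encoding, following the HTML4 priority order."""
    # 1. HTTP header, 2. <meta http-equiv>, 3. charset attribute on the tag.
    for candidate in (http_charset, meta_charset, attr_charset):
        if candidate:
            return candidate
    # Nothing declared: browsers guess here, each in its own way (assumption:
    # this sketch simply reports that no declaration was found).
    return None


http = charset_from_content_type("text/html; charset=UTF-8")
print(choose_encoding(http_charset=http, meta_charset="gbk"))
```

In this example the HTTP header wins over the <meta> declaration, which is exactly why a misconfigured server can override an otherwise correct page.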
This much is uncontroversial. Note, however, that if the page declares its encoding through the <meta http-equiv="Content-Type"> tag, then when the browser encounters the tag and finds that the encoding it has been using does not match the declaration, it goes back to the beginning and parses the page again. Part of the page is therefore parsed twice, so if you declare the encoding with a tag, it is recommended to write the tag as early as possible. A best practice is to place it right after the <head> tag and before any other tag. Google PageSpeed also has guidance on this point.
Evolution of the Times
But as time went by, developers gradually noticed something. Just as with the simplest DOCTYPE declaration, browsers do not actually follow the standard strictly when they read the encoding from the <meta> tag. In short, because the encoding of the page must be determined before the tokenizer stage of HTML parsing, the browser cannot wait until the DOM tree is built, break down the structure of the <meta> tag as it would when analyzing the DOM tree, take out the http-equiv and content attributes, and only then determine the encoding.
In reality, the browser does a very simple thing to read the encoding defined by the <meta> tag:
Search the string (there is no concept of a tag here, only a string) for the substring "charset".
Read forward, skipping all whitespace characters, to find the first meaningful character c.
If c is not the character "=", return to the first step and continue searching.
If c is the character "=", continue.
Skip all whitespace characters, single quotes, and double quotes, then scan forward until reaching a single quote, double quote, whitespace character, end of tag, or similar terminating character, and take the string s scanned in between.
Parse s to obtain the encoding alias.
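The steps above can be sketched in Python roughly as follows. This is a simplified illustration, not the code of any real browser: the function name sniff_charset is made up for this example, and the exact set of terminating characters (whitespace, quotes, ">", ";", "/") is an assumption filling in the "etc." of the description.

```python
import re

WHITESPACE = " \t\r\n\f"
QUOTES = "\"'"
# Assumed set of characters that end the scanned value.
TERMINATORS = WHITESPACE + QUOTES + ">;/"
CHARSET_RE = re.compile("charset", re.IGNORECASE | re.ASCII)


def sniff_charset(text):
    """Return the label found after 'charset =', or None if there is none."""
    pos = 0
    while True:
        # Step 1: find the next occurrence of the substring "charset".
        match = CHARSET_RE.search(text, pos)
        if match is None:
            return None
        pos = match.end()
        # Step 2: skip whitespace to reach the first meaningful character c.
        while pos < len(text) and text[pos] in WHITESPACE:
            pos += 1
        # Step 3: if c is not "=", go back to searching from this position.
        if pos >= len(text) or text[pos] != "=":
            continue
        # Step 4: c is "=", so move past it.
        pos += 1
        # Step 5: skip whitespace and quotes, then scan up to a terminator.
        while pos < len(text) and text[pos] in WHITESPACE + QUOTES:
            pos += 1
        start = pos
        while pos < len(text) and text[pos] not in TERMINATORS:
            pos += 1
        # Step 6: the scanned string s is the candidate encoding alias.
        return text[start:pos] or None


print(sniff_charset('<meta http-equiv="Content-Type" content="text/html; charset=GBK">'))
print(sniff_charset("<meta charset = 'utf-8' />"))
```

Because the scan works on a plain string, it never needs to know which attribute, or even which tag, the word "charset" appears in.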
From the above algorithm, it is easy to see that the following forms all allow the browser to identify the encoding correctly:
<meta charset="utf-8" />
...and many other weird ways of writing.
So, as history moved on, the browser vendors finally sat down together one day to discuss this issue... In the end, they were surprised to find that their implementations were very similar (maybe they had simply learned from one another), so they decided to turn this method into a standard... Finally, after a long discussion, the widely loved encoding declaration of HTML5 was born. In HTML5, it is called the "meta charset element", and its simplest form is as follows: