Detailed introduction to encoding XML documents using UTF-8

黄舟
Release: 2017-03-25 16:39:48

Google's Sitemap service requires that all published sitemaps use Unicode's UTF-8 encoding. Google does not allow even other Unicode encodings such as UTF-16, let alone non-Unicode encodings such as ISO-8859-1. Technically, this means Google is using a nonstandard XML parser, because the XML Recommendation specifically requires that "all XML processors must accept the UTF-8 and UTF-16 encodings of Unicode 3.1". But is this really a big problem?

Everyone can use UTF-8

Universality is the first and most compelling reason to choose UTF-8. It can handle every script currently used in the world. A few gaps remain, but they are increasingly obscure and are gradually being filled in. Scripts that are not yet covered are usually not implemented in any other character set either, and even when they are, they cannot be used in XML. At best, such scripts are handled by font hacks grafted onto a single-byte character set like Latin-1. Real support for these rare scripts will probably come first from Unicode, and probably only from Unicode.

But this is just one reason to use Unicode. Why choose UTF-8 instead of UTF-16 or another Unicode encoding? One of the most immediate reasons is extensive tool support. Practically every major editor you might use for XML can handle UTF-8, including JEdit, BBEdit, Eclipse, emacs, and even Notepad. No other Unicode encoding has such broad support among both XML and non-XML tools.

In some of these editors, such as BBEdit and Eclipse, UTF-8 is not the default character set. It is time to change those defaults: every tool should ship with UTF-8 as its default encoding. Until that happens, we are stuck in a quagmire of non-interoperability whenever files cross borders, platforms, and languages. In the meantime, it is easy to change the default yourself. In Eclipse, for example, the General/Editors preference panel shown in Figure 1 lets you specify that all files use UTF-8. You may notice that Eclipse wants to default to MacRoman; if you accept that, your files will not compile when they are passed to a programmer using Microsoft® Windows® or to a computer outside the United States and Western Europe.

Figure 1. Changing the default character set of Eclipse

Of course, for UTF-8 to work, all files exchanged by developers must also use UTF-8, but that's not a problem. Unlike MacRoman, UTF-8 is not limited to a few scripts or platforms. Anyone can use UTF-8. The same cannot be said of MacRoman, Latin-1, SJIS, and the various other legacy national character sets.

UTF-8 also works well with tools that do not expect multibyte data. Other Unicode formats such as UTF-16 tend to contain many zero bytes. Many tools interpret these bytes as end-of-file or some other special delimiter, with unwanted, unexpected, and often unpleasant results. For example, if UTF-16 data is loaded unchanged into a C string, the string may be truncated at the second byte of the first ASCII character. UTF-8 files contain null bytes only where they actually represent the null character. Of course, you should not choose such a naive tool to process XML documents. However, documents in legacy systems often end up in strange places where no one really considered or understood the consequences of putting new wine in old bottles. UTF-8 is less likely than UTF-16 or any other Unicode encoding to cause problems for systems that are unaware of Unicode and XML.

What the experts say

XML is the first major standard to fully support UTF-8, but that was just the beginning. Various standards organizations are gradually recommending UTF-8. For example, URLs containing non-ASCII characters have been a long-standing problem on the Web: a URL containing non-ASCII characters that works on a PC won't work on a Mac, and vice versa. The World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF) recently resolved this issue by agreeing that all URLs must be encoded in UTF-8 and nothing else.
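To make the URL problem concrete, here is a minimal sketch in Python (chosen only for illustration; the path "/café" is a made-up example). It percent-encodes the same path once as UTF-8 and once as Latin-1, showing why two systems that disagree about the encoding produce different URLs for the same text.

```python
from urllib.parse import quote, unquote

path = "/caf\u00e9"  # "/café", a hypothetical path used only for illustration

# quote() percent-encodes non-ASCII characters; its default encoding is UTF-8.
utf8_url = quote(path)
# The same path under a legacy single-byte encoding yields a different URL.
latin1_url = quote(path, encoding="latin-1")

print(utf8_url)    # /caf%C3%A9
print(latin1_url)  # /caf%E9

# Decoding with the agreed encoding (UTF-8) recovers the original text.
restored = unquote(utf8_url)
```

If every party agrees on UTF-8, only the first form is ever produced, and the decoding step is unambiguous.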

The W3C and the IETF have become increasingly insistent about choosing UTF-8 first, last, and sometimes only. The W3C Character Model for the World Wide Web 1.0: Fundamentals states, "When a unique character encoding is required, the character encoding MUST be UTF-8, UTF-16 or UTF-32. US-ASCII is upwards-compatible with UTF-8 (an US-ASCII string is also a UTF-8 string, see [RFC 3629]), and UTF-8 is therefore appropriate if compatibility with US-ASCII is desired." In practice, compatibility with US-ASCII is so important that it is almost a requirement. The W3C wisely adds, "In other situations, such as for APIs, UTF-16 or UTF-32 may be more appropriate. Possible reasons for choosing one of these include efficiency of internal processing and interoperability with other processes."
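The upward compatibility with US-ASCII mentioned in the quotation can be checked directly. The following is a small sketch in Python (used here only for illustration): an ASCII-only string produces exactly the same bytes in US-ASCII and in UTF-8, while UTF-16 produces something an ASCII-only tool cannot read as-is.

```python
text = '<?xml version="1.0"?><doc/>'  # ASCII-only sample markup

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")  # little-endian, no byte order mark

# A US-ASCII string is byte-for-byte a UTF-8 string.
same_as_ascii = ascii_bytes == utf8_bytes

# UTF-16 doubles the size and interleaves zero bytes with the ASCII values.
utf16_has_zero_bytes = b"\x00" in utf16_bytes

# Bytes written by an ASCII-only tool can be read back as UTF-8 unchanged.
decoded = ascii_bytes.decode("utf-8")
```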

I accept the efficiency-of-internal-processing argument. For example, the Java™ language represents strings internally as UTF-16, which makes indexing into a string fast. However, Java code never exposes this internal representation to the programs it exchanges data with. Instead, for external data exchange, you use a java.io.Writer and specify the character set explicitly. When making that choice, UTF-8 is strongly recommended.
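The same principle, naming the encoding explicitly at the point where text leaves the program, can be sketched in Python rather than Java (a substitution made here purely for illustration). The helper names write_xml and read_xml and the file name greeting.xml are hypothetical.

```python
import os
import tempfile

xml_text = '<?xml version="1.0" encoding="UTF-8"?>\n<greeting>你好</greeting>\n'

def write_xml(path, text):
    # Name the encoding explicitly instead of relying on the platform default.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)

def read_xml(path):
    # Read with the same explicit encoding; newline="" disables translation.
    with open(path, "r", encoding="utf-8", newline="") as f:
        return f.read()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "greeting.xml")  # hypothetical file name
    write_xml(path, xml_text)
    round_trip = read_xml(path)
    with open(path, "rb") as f:
        raw_bytes = f.read()  # the bytes actually stored on disk
```

Because the encoding is stated in both directions, the result does not depend on the locale of the machine that runs the code.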

The IETF is even more explicit. The IETF Charset Policy [RFC 2277] states in no uncertain terms:

Protocols MUST be able to use the UTF-8 charset, which consists of the ISO 10646 coded character set combined with the UTF-8 character encoding scheme, as defined in [10646] Annex R (published in Amendment 2), for all text.

Protocols MAY specify, in addition, how to use other charsets or other character encoding schemes for ISO 10646, such as UTF-16, but lack of an ability to use UTF-8 is a violation of this policy; such a violation would need a variance procedure ([BCP9] section 9) with clear and solid justification in the protocol specification document before being entered into or advanced upon the standards track.

For existing protocols or protocols that move data from existing datastores, support of other charsets, or even using a default other than UTF-8, may be a requirement. This is acceptable, but UTF-8 support MUST be possible.

The point: supporting legacy protocols and files may require accepting character sets and encodings other than UTF-8 for some time to come, but I would do so only with great reluctance. Every new protocol, application, and document should use UTF-8.

Chinese, Japanese and Korean

A common misconception is that UTF-8 is a compression format. It is not. Characters in the ASCII range take up only half as much space in UTF-8 as they do in some other Unicode encodings, notably UTF-16. However, some characters take up to 50% more space in UTF-8, especially Chinese, Japanese, and Korean (CJK) ideographs.

But even CJK XML encoded in UTF-8 may actually be smaller than the same document in UTF-16. For example, Chinese XML documents contain a large number of ASCII characters, such as <, >, &, =, ", ', and spaces. These characters are smaller in UTF-8 than in UTF-16. The exact shrinkage or expansion factor varies from one document to another, but in either case the difference is unlikely to be significant.
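The arithmetic behind this claim can be sketched in Python (used here only for illustration). The tiny Chinese XML fragment below is invented for the example; utf-16-le is used so that no byte order mark inflates the count.

```python
def sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

pure_cjk = "树林森"                    # three ideographs, no markup
cjk_xml = '<书 标题="树">林</书>'       # 6 ideographs plus 9 ASCII characters

pure_utf8, pure_utf16 = sizes(pure_cjk)
xml_utf8, xml_utf16 = sizes(cjk_xml)

print(pure_utf8, pure_utf16)  # 9 6   -> UTF-8 is 50% larger for pure CJK text
print(xml_utf8, xml_utf16)    # 27 30 -> UTF-8 is smaller once markup is included
```

Each ideograph costs three bytes in UTF-8 against two in UTF-16, but each ASCII markup character costs one byte against two, so the balance depends on how much markup surrounds the text.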

Finally, it is worth noting that ideographic scripts such as Chinese and Japanese tend to use fewer characters than alphabetic scripts such as Latin and Cyrillic. Because these scripts have so many characters, each one needs three or more bytes; in return, the same word or sentence can be written with fewer characters than in English or Russian. For example, the Japanese ideograph for "tree" is 木 (which looks rather like a tree). It requires three bytes in UTF-8, while the English word "tree" has four letters and requires four bytes. The Japanese ideograph for "grove" is 林 (two trees close together). It still requires three bytes in UTF-8, while the English word "grove" has five letters and requires five bytes. The Japanese ideograph 森 (three trees) still requires only three bytes, while the corresponding English word "forest" requires six.
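The byte counts in this comparison are easy to verify. A short Python sketch (for illustration only) that measures each ideograph against its English equivalent:

```python
# Pairs of (Japanese ideograph, English word) from the comparison above.
pairs = [("木", "tree"), ("林", "grove"), ("森", "forest")]

# Map each pair to its UTF-8 sizes in bytes: (ideograph size, English size).
utf8_sizes = {
    english: (len(ideograph.encode("utf-8")), len(english.encode("utf-8")))
    for ideograph, english in pairs
}

for english, (cjk_bytes, english_bytes) in utf8_sizes.items():
    print(english, cjk_bytes, english_bytes)
```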

If compression is what you really need, zip or gzip the XML. Compressed UTF-8 will probably be close in size to compressed UTF-16, whatever the initial difference: whichever encoding is larger to begin with simply has more redundancy for the compression algorithm to remove.
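One way to try this on your own data is sketched below in Python (for illustration only). The sample document is invented and deliberately repetitive, so the numbers it prints say little about real documents; substitute your own text for a meaningful comparison.

```python
import gzip

# An invented, highly repetitive sample; real documents will behave differently.
record = '<条目 名称="树木">森林里有很多树木</条目>\n'
document = "<列表>\n" + record * 500 + "</列表>\n"

utf8_raw = document.encode("utf-8")
utf16_raw = document.encode("utf-16-le")  # no byte order mark

utf8_gz = gzip.compress(utf8_raw)
utf16_gz = gzip.compress(utf16_raw)

print("raw bytes:    ", len(utf8_raw), len(utf16_raw))
print("gzipped bytes:", len(utf8_gz), len(utf16_gz))

# Compression is lossless: both encodings decompress to the original text.
restored_utf8 = gzip.decompress(utf8_gz).decode("utf-8")
restored_utf16 = gzip.decompress(utf16_gz).decode("utf-16-le")
```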

Robustness

The real advantage of UTF-8 lies in its design: it is a more robust and more easily interpreted format than any other text encoding devised before or since. First, unlike UTF-16, UTF-8 has no endianness problem. Big-endian and little-endian UTF-8 are identical, because UTF-8 is defined in terms of 8-bit bytes rather than 16-bit words. There is no endianness ambiguity to resolve with a byte order mark or other heuristics.
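The endianness difference can be seen by encoding a single character, as in this small Python sketch (for illustration only). UTF-16 has two possible byte orders, and reading one as the other silently yields a different character; UTF-8 has only one byte sequence.

```python
import codecs

ch = "木"  # U+6728

utf16_be = ch.encode("utf-16-be")  # bytes 0x67 0x28
utf16_le = ch.encode("utf-16-le")  # bytes 0x28 0x67
utf8 = ch.encode("utf-8")          # bytes 0xE6 0x9C 0xA8, whatever the platform

# Reading big-endian data as little-endian gives a different character (U+2867).
misread = utf16_be.decode("utf-16-le")

# The generic "utf-16" codec prepends a byte order mark to signal the order.
with_bom = ch.encode("utf-16")
has_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```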

An even more important feature of UTF-8 is statelessness. Every byte in a UTF-8 stream or sequence is unambiguous. In UTF-8, you always know where you are. That is, given a single byte, you can immediately tell whether it is a single-byte character, the first byte of a two-byte character, the second byte of a two-byte character, or the second, third, or fourth byte of a three- or four-byte character (there are other possibilities, of course, but you get the idea). In UTF-16, by contrast, you cannot tell whether the byte 0x41 is the letter "A". Sometimes it is, and sometimes it isn't. You have to track enough state to know where you are in the stream. If a single byte is lost, all the data that follows is corrupted. In UTF-8, missing or corrupted bytes are easy to detect and do not affect the rest of the data.
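The following Python sketch (for illustration only) shows both halves of this argument. The classify function is a simplified, hypothetical helper that labels a byte from its high bits alone; it ignores a few lead-byte values that are never legal in practice. The second part drops one byte from the same text in each encoding and decodes what is left.

```python
def classify(byte):
    """Describe the role of one byte in a UTF-8 stream (simplified sketch)."""
    if byte < 0x80:
        return "single-byte character"
    if byte < 0xC0:
        return "continuation byte"
    if byte < 0xE0:
        return "lead byte of a 2-byte character"
    if byte < 0xF0:
        return "lead byte of a 3-byte character"
    if byte < 0xF8:
        return "lead byte of a 4-byte character"
    return "never valid in UTF-8"

def drop_byte(data, index):
    """Simulate the loss of a single byte in transit."""
    return data[:index] + data[index + 1:]

roles = [classify(b) for b in "木".encode("utf-8")]

# In UTF-16 the byte 0x41 is not always "A": here it is half of U+0141.
not_an_a = b"\x01\x41".decode("utf-16-be")

text = "A木B"
# UTF-8: lose a byte inside the ideograph; the neighbours survive.
damaged_utf8 = drop_byte(text.encode("utf-8"), 2).decode("utf-8", errors="replace")
# UTF-16: lose the first byte; every later code unit is misaligned.
damaged_utf16 = drop_byte(text.encode("utf-16-le"), 0).decode("utf-16-le", errors="replace")
```

In the UTF-8 case the decoder flags the damaged character and resumes at the next lead byte, so "A" and "B" are still readable. In the UTF-16 case neither survives.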

UTF-8 is not a panacea. Applications that need random access to specific positions in a document may run faster with a fixed-width encoding such as UCS-2 or UTF-32. (Once surrogate pairs are taken into account, UTF-16 is a variable-width encoding.) However, XML processing is not that kind of application. The XML specification effectively requires parsers to start at the first byte of a document and continue to the last, and all existing parsers work this way. Faster random access would not help XML processing; it may be a good reason to choose a different encoding for a database or some other system, but it does not apply to XML.

Conclusion

In an increasingly international world, where linguistic and political boundaries grow blurrier every day, locale-dependent character sets are no longer workable. Unicode is the only character set that can interoperate across many geographies. UTF-8 is the best Unicode encoding available:

It has the broadest tool support, including the best compatibility with legacy ASCII systems.

It is simple and efficient to process.

It is robust against data corruption.

It is platform independent.

It is time to stop arguing about character sets and encodings: choose UTF-8 and end the dispute.
