
Detailed introduction to encoding XML documents using UTF-8

Mar 25, 2017, 04:39 PM

Google's Sitemap service requires that all published site maps use Unicode's UTF-8 encoding. Google doesn't even allow other Unicode encodings like UTF-16, let alone non-Unicode encodings like ISO-8859-1. Technically this means that Google is using a non-standard XML parser, since the XML Recommendation specifically requires that "all XML processors must accept the UTF-8 and UTF-16 encodings of Unicode 3.1", but is this really a big problem?
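A quick way to see whether a file would satisfy a UTF-8-only consumer is to try a strict UTF-8 decode of its bytes. The following is a minimal sketch; the function name `is_valid_utf8` and the sample document are made up for illustration and have nothing to do with Google's actual tooling.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Hypothetical sample document containing one non-ASCII character.
SAMPLE = '<?xml version="1.0"?><urlset><loc>http://example.com/café</loc></urlset>'

print(is_valid_utf8(SAMPLE.encode("utf-8")))    # the encoding the service wants
print(is_valid_utf8(SAMPLE.encode("utf-16")))   # starts with a byte-order mark
print(is_valid_utf8(SAMPLE.encode("latin-1")))  # lone 0xE9 byte for the accent
```

Note that a pure-ASCII file passes this check whatever legacy encoding it was saved in, because ASCII bytes are valid UTF-8; the check only catches files that actually contain non-ASCII data.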

Everyone can use UTF-8

Universality is the first and most compelling reason to choose UTF-8. It can handle practically every script in use in the world today. A few gaps remain, but they are increasingly obscure and are steadily being filled in. The scripts that are not covered have usually not been implemented in any other character set either, and even where they have, they are not available in XML. At best, such scripts are handled through font hacks grafted onto a single-byte character set like Latin-1. Real support for such rare scripts will probably arrive first in Unicode, and probably only in Unicode.

But this is just one reason to use Unicode. Why choose UTF-8 rather than UTF-16 or another Unicode encoding? One of the most immediate reasons is broad tool support. Practically every major editor you might use for XML handles UTF-8, including JEdit, BBEdit, Eclipse, emacs, and even Notepad. No other Unicode encoding enjoys such broad support among XML and non-XML tools.

For some of these editors, such as BBEdit and Eclipse, UTF-8 is not the default character set. It is time to change those defaults: all tools should ship with UTF-8 as the default encoding. Until they do, we will remain stuck in a quagmire of non-interoperability whenever files cross borders, platforms, and languages. In the meantime, it is easy to change the default yourself. In Eclipse, for example, the General/Editors preference panel shown in Figure 1 lets you specify that all files use UTF-8. You may notice that Eclipse wants the default to be MacRoman, but if you accept that, your files will not compile when passed to programmers using Microsoft® Windows® or any computer outside the United States and Western Europe.

Figure 1. Changing the default character set of Eclipse

Of course, for UTF-8 to work, all files that developers exchange must also use UTF-8, but that's not a problem. Unlike MacRoman, UTF-8 is not limited to a few scripts or platforms. Anyone can use UTF-8. The same cannot be said of MacRoman, Latin-1, SJIS, and the various other legacy national character sets.

UTF-8 also works better with tools that don't expect multibyte data. Other Unicode formats such as UTF-16 tend to contain many zero bytes. Many tools interpret these bytes as end-of-file or some other special delimiter, with undesirable, unexpected, and often unpleasant results. For example, if UTF-16 data is naively loaded into a C string, the string may be truncated at the second byte of the first ASCII character. UTF-8 files contain null bytes only where they are actually meant to represent null. Of course, you should not choose such a naive tool to process XML documents. However, documents have a habit of ending up in strange places in legacy systems, where no one really considered or understood the consequences of putting new wine in old bottles. UTF-8 is less likely than UTF-16 or other Unicode encodings to cause problems for systems that are unaware of Unicode and XML.

What the experts say

XML was the first major standard to fully endorse UTF-8, but that was just the beginning. Standards bodies are increasingly recommending UTF-8. For example, URLs containing non-ASCII characters have been a long-standing problem on the Web. A URL containing non-ASCII characters that works on a PC fails on a Mac, and vice versa. The World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF) recently resolved this issue by agreeing that all URLs must be encoded in UTF-8 and no other encoding.
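To make the URL rule concrete, here is a small sketch of what "encoded in UTF-8" looks like in practice: each non-ASCII character is turned into its UTF-8 bytes, and each byte is percent-escaped. It uses Python's standard `urllib.parse` module; the helper name `encode_path` and the sample path are invented for illustration.

```python
from urllib.parse import quote, unquote


def encode_path(path: str) -> str:
    """Percent-encode a URL path using the UTF-8 bytes of each character."""
    return quote(path, safe="/", encoding="utf-8")


encoded = encode_path("/trees/林.xml")
print(encoded)                             # ASCII-only form of the path
print(unquote(encoded, encoding="utf-8"))  # decodes back to the original
```

Because both sides agree on UTF-8, the escaped form means the same thing on every platform, which is exactly what the PC versus Mac mismatch lacked.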

The W3C and the IETF have recently become more insistent on choosing UTF-8 first, last, and sometimes only. The W3C Character Model for the World Wide Web 1.0: Fundamentals states, "When a unique character encoding is required, the character encoding MUST be UTF-8, UTF-16 or UTF-32. US-ASCII is upwards-compatible with UTF-8 (an US-ASCII string is also a UTF-8 string, see [RFC 3629]), and UTF-8 is therefore appropriate if compatibility with US-ASCII is desired." In practice, compatibility with US-ASCII is so useful that it is almost a requirement. The W3C goes on to explain, "In other situations, such as for APIs, UTF-16 or UTF-32 may be more appropriate. Possible reasons for choosing one of these include efficiency of internal processing and interoperability with other processes."
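The upward compatibility that the W3C describes is easy to check for yourself. This short sketch (the function name `ascii_is_utf8` is made up for the example) encodes the same ASCII-only text both ways and compares the bytes.

```python
def ascii_is_utf8(text: str) -> bool:
    """True if the US-ASCII and UTF-8 encodings of text are byte-identical."""
    return text.encode("ascii") == text.encode("utf-8")


markup = '<?xml version="1.0"?><doc attr="value"/>'
print(ascii_is_utf8(markup))
# UTF-16 does not share this property: the same text yields different bytes.
print(markup.encode("utf-16-le") == markup.encode("ascii"))
```

The function raises `UnicodeEncodeError` if the text contains anything outside US-ASCII, since such text has no ASCII encoding to compare against.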

I can accept the argument from efficiency of internal processing. For example, the Java™ language represents strings internally as UTF-16, which makes indexing into strings faster. However, Java code never exposes this internal representation to the programs it exchanges data with. Instead, for external data exchange, you use a java.io.Writer and specify the character set explicitly. When making that choice, UTF-8 is strongly preferred.
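The same discipline applies outside Java: whatever a language uses internally, name the encoding explicitly at the point where text leaves the program. The following is a rough Python analogue of the java.io.Writer advice, not the Java API itself; `write_xml_utf8`, `roundtrip`, and the file name are hypothetical names chosen for the sketch.

```python
import os
import tempfile


def write_xml_utf8(path: str, body: str) -> None:
    """Write an XML document, stating UTF-8 both to open() and in the declaration."""
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write(body)


def roundtrip(body: str) -> bytes:
    """Write body to a temporary file and return the raw bytes that landed on disk."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.xml")
        write_xml_utf8(path, body)
        with open(path, "rb") as raw:
            return raw.read()


print(roundtrip("<w>林</w>"))
```

Leaving out the `encoding` argument would fall back to a platform-dependent default, which is the same trap as an editor that defaults to MacRoman.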

The IETF is even more explicit. The IETF Charset Policy [RFC 2277] states in no uncertain terms:

Protocols MUST be able to use the UTF-8 charset, which consists of the ISO 10646 coded character set combined with the UTF-8 character encoding scheme, as defined in [10646] Annex R (published in Amendment 2), for all text.

Protocols MAY specify, in addition, how to use other charsets or other character encoding schemes for ISO 10646, such as UTF-16, but lack of an ability to use UTF-8 is a violation of this policy; such a violation would need a variance procedure ([BCP9] section 9) with clear and solid justification in the protocol specification document before being entered into or advanced upon the standards track.

For existing protocols or protocols that move data from existing datastores, support of other charsets, or even using a default other than UTF-8, may be a requirement. This is acceptable, but UTF-8 support MUST be possible.

The bottom line: support for legacy protocols and files may require accepting character sets and encodings other than UTF-8 for some time to come, but I would tread very carefully if that were the case. Every new protocol, application, and document should use UTF-8.

Chinese, Japanese and Korean

A common misconception is that UTF-8 is a compression format. It is not. In UTF-8, ASCII characters take up only half the space they do in other Unicode encodings, notably UTF-16. However, some characters take up to 50% more space in UTF-8, particularly the ideographs used in Chinese, Japanese, and Korean (CJK).

But even CJK XML may turn out smaller in UTF-8 than in UTF-16. For example, Chinese XML documents contain a large number of ASCII characters, such as <, >, &, =, ", ', and spaces, all of which are smaller in UTF-8 than in UTF-16. The exact compression or expansion factor varies from document to document, but either way the difference is unlikely to be significant.
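The effect of markup on the totals can be measured directly. This is a small sketch with an invented one-element document; `encoded_sizes` is a hypothetical helper, and UTF-16 is measured without a byte-order mark so that only the characters themselves are counted.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of text in UTF-8 and in UTF-16 (little-endian, no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


bare_text = "森林"
marked_up = '<title lang="zh">森林</title>'

print(encoded_sizes(bare_text))   # the CJK characters alone favor UTF-16
print(encoded_sizes(marked_up))   # the ASCII markup tips the balance to UTF-8
```

Real documents sit somewhere between these two extremes, depending on how much of the content is markup and how much is CJK text.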

Finally, it is worth mentioning that ideographic scripts such as Chinese and Japanese tend to use fewer characters than alphabetic scripts such as Latin and Cyrillic. The sheer number of distinct characters means that three or more bytes per character are needed to represent these scripts fully, but it also means that the same word or sentence can be written with fewer characters than in English or Russian. For example, the Japanese ideograph for "tree" is "木" (it looks a little like a tree) and requires three bytes in UTF-8, while the English word "tree" has four letters and requires four bytes. The Japanese ideograph for "grove" is "林" (two trees side by side). It still requires three bytes in UTF-8, while the English word "grove" has five letters and requires five bytes. The Japanese ideograph "森" (three trees) still requires only three bytes, while the corresponding English word "forest" requires six.
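The tree, grove, and forest comparison can be reproduced in a few lines. A minimal sketch, in which the helper name `utf8_len` and the `PAIRS` list are just conveniences for the example:

```python
def utf8_len(text: str) -> int:
    """Number of bytes text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# (ideograph, English word) pairs from the paragraph above.
PAIRS = [("木", "tree"), ("林", "grove"), ("森", "forest")]

for ideograph, word in PAIRS:
    print(ideograph, utf8_len(ideograph), word, utf8_len(word))
```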

If compression is really needed, use zip or gzip. After compression, UTF-8 and UTF-16 are likely to be close in size, regardless of the initial difference: whichever encoding is larger to begin with has more redundancy for the compression algorithm to remove.
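One way to test this claim on your own documents is to gzip the same text in both encodings and compare. The sketch below uses Python's standard `gzip` module; the helper name `raw_and_gzipped` and the repetitive sample document are invented, so treat the printed numbers as one data point rather than a general result.

```python
import gzip


def raw_and_gzipped(text: str, encoding: str) -> tuple:
    """Return (raw byte count, gzip-compressed byte count) for text in encoding."""
    raw = text.encode(encoding)
    return len(raw), len(gzip.compress(raw))


# Hypothetical, highly repetitive document mixing ASCII markup and CJK text.
DOCUMENT = '<item id="1">森林</item>\n' * 200

for name in ("utf-8", "utf-16-le"):
    print(name, raw_and_gzipped(DOCUMENT, name))
```

Running the same comparison on a representative sample of real data is more informative than any rule of thumb about which encoding compresses better.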

Robustness

The real advantage lies in the design: UTF-8 is a more robust and more easily interpreted format than any other text encoding devised before or since. First of all, unlike UTF-16, UTF-8 has no endianness problem. UTF-8 is represented the same way on big-endian and little-endian platforms, because it is defined in terms of 8-bit bytes rather than 16-bit words. UTF-8 has no endianness ambiguity that must be resolved through byte-order marks or other heuristics.
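Looking at the raw bytes makes the endianness point concrete, and it also shows where the zero bytes in UTF-16 come from. In this sketch, `byte_views` and `c_string_view` are hypothetical helpers; the second one only imitates a naive C routine that stops at the first zero byte, it does not call any C code.

```python
def byte_views(text: str) -> dict:
    """Hex dump of text in UTF-8 and in both UTF-16 byte orders."""
    encodings = ("utf-8", "utf-16-be", "utf-16-le")
    return {name: text.encode(name).hex() for name in encodings}


def c_string_view(raw: bytes) -> bytes:
    """Imitate a naive C string routine: everything before the first zero byte."""
    return raw.split(b"\x00", 1)[0]


print(byte_views("A"))   # UTF-16 pads the ASCII letter with a zero byte
print(byte_views("林"))  # the two UTF-16 forms are byte-swapped; UTF-8 has one form
print(c_string_view("AB".encode("utf-16-le")))  # cut off after the first letter
print(c_string_view("AB".encode("utf-8")))      # survives intact
```

A reader of UTF-16 has to learn the byte order from somewhere, whereas the UTF-8 bytes mean the same thing on every machine.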

A more important feature of UTF-8 is that it is stateless. Every byte in a UTF-8 stream or sequence is unambiguous. In UTF-8, you always know where you are: given a single byte, you can immediately tell whether it is a single-byte character, the first byte of a two-byte character, the second byte of a two-byte character, or the second, third, or fourth byte of a three- or four-byte character. (There are other possibilities, of course, but you get the idea.) In UTF-16, you cannot tell whether the byte 0x41 is the letter "A". Sometimes it is, and sometimes it isn't. You have to track enough state to know where you are in the stream. If a single byte is lost, all subsequent data is corrupted. In UTF-8, missing or corrupted bytes are easy to detect and do not affect the rest of the data.
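Both halves of this argument can be sketched in code: classifying a byte in isolation, and recovering after a byte goes missing. The function names `classify` and `drop_byte` are invented for the example, and the byte ranges follow the UTF-8 definition in RFC 3629.

```python
def classify(byte: int) -> str:
    """Say what role a single byte plays in UTF-8, with no surrounding context."""
    if byte <= 0x7F:
        return "single"        # complete one-byte (ASCII) character
    if byte <= 0xBF:
        return "continuation"  # second, third, or fourth byte of a character
    if 0xC2 <= byte <= 0xDF:
        return "lead-2"        # first byte of a two-byte character
    if 0xE0 <= byte <= 0xEF:
        return "lead-3"        # first byte of a three-byte character
    if 0xF0 <= byte <= 0xF4:
        return "lead-4"        # first byte of a four-byte character
    return "invalid"           # 0xC0, 0xC1, and 0xF5-0xFF never occur in UTF-8


def drop_byte(text: str, encoding: str, index: int) -> str:
    """Encode text, lose one byte, and decode whatever is left."""
    raw = text.encode(encoding)
    damaged = raw[:index] + raw[index + 1:]
    return damaged.decode(encoding, errors="replace")


print([classify(b) for b in "A林".encode("utf-8")])
print(drop_byte("木林森", "utf-8", 1))      # only the first character is damaged
print(drop_byte("木林森", "utf-16-le", 0))  # every following character is shifted
```

In the UTF-8 case the decoder resynchronizes at the next lead byte, so the second and third characters come through unchanged; in the UTF-16 case every later 16-bit unit is read from the wrong offset.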

UTF-8 is not a panacea. Applications that require random access to specific positions in a document may run faster with a fixed-width encoding such as UCS-2 or UTF-32. (Once surrogate pairs are taken into account, UTF-16 is a variable-width encoding.) However, XML processing is not such an application. The XML specification in effect requires parsers to read an XML document from the first byte through to the last, and all existing parsers do so. Faster random access doesn't help XML processing, and while it might be a good reason to use a different encoding in a database or other system, it doesn't apply to XML.
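Sequential processing is easy to observe with a streaming parser. This sketch feeds a UTF-8 document to Python's standard `xml.etree.ElementTree.XMLPullParser` in small chunks, even one byte at a time, so that multibyte characters are split across chunks; `stream_texts`, the `w` element, and the sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def stream_texts(data: bytes, chunk_size: int = 1) -> list:
    """Parse data front to back in chunks and collect the text of each <w>."""
    parser = ET.XMLPullParser(events=("end",))
    texts = []

    def drain() -> None:
        # Collect whatever elements the parser has finished so far.
        for _event, elem in parser.read_events():
            if elem.tag == "w":
                texts.append(elem.text)

    for start in range(0, len(data), chunk_size):
        parser.feed(data[start:start + chunk_size])
        drain()
    parser.close()
    drain()  # events queued at close time can still be read
    return texts


DOC = ('<?xml version="1.0" encoding="UTF-8"?>'
       '<words><w>木</w><w>林</w><w>森</w></words>').encode("utf-8")

print(stream_texts(DOC))                # one byte per chunk
print(stream_texts(DOC, chunk_size=7))  # arbitrary chunk boundaries
```

The parser never needs to jump to an arbitrary character offset, which is why the random-access advantage of fixed-width encodings does not matter here.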

Conclusion

In an increasingly international world, where linguistic and political boundaries are blurring, region-specific character sets are no longer tenable. Unicode is the only character set that can interoperate across so many regions. UTF-8 is the best Unicode encoding available:

Extensive tool support, including best-in-class compatibility with legacy ASCII systems.

It is simple and efficient to handle.

It is resistant to data corruption.

Platform independent.

It’s time to stop arguing about character sets and encodings, choose UTF-8 and end the dispute.
