
How to solve java Chinese garbled characters

Nov 26, 2016 am 09:55 AM
java

As computers spread around the world, each country designed its own character encodings to fit its own language and script. The result was a jumble of incompatible encodings, in which the same binary value could be interpreted as different symbols. Unicode was created to solve this incompatibility.
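The effect is easy to reproduce in Java. The sketch below is only an illustration: the class and method names are made up for this example, and the three bytes E4 B8 A5 are chosen because they are the UTF-8 encoding of a Chinese character (U+4E25). Decoded as UTF-8 they give one character; decoded as ISO 8859-1 they give three unrelated Western European symbols, which is the classic garbled text.

```java
import java.nio.charset.StandardCharsets;

public class SameBytesDifferentText {
    // Interpret the bytes as UTF-8.
    static String asUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Interpret the very same bytes as ISO 8859-1 (Latin-1).
    static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // Undo a wrong Latin-1 decoding: get the original bytes back,
    // then decode them with the charset that was really used.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1),
                          StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xE4, (byte) 0xB8, (byte) 0xA5};
        String right = asUtf8(bytes);
        String wrong = asLatin1(bytes);
        System.out.println(right.length());              // 1 character
        System.out.println(wrong.length());              // 3 characters
        System.out.println(repair(wrong).equals(right)); // true
    }
}
```

The repair step works here only because ISO 8859-1 maps every byte value to a character, so no bytes are lost by the wrong decoding. With other wrongly chosen charsets the damage may not be reversible.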

Unicode

Unicode (also translated as Unified Code, Universal Code, or Single Code) was created to overcome the limitations of traditional character encoding schemes. It assigns a single, unique binary code to every character in every language, so that text can be converted and processed across languages and platforms. You can think of Unicode as a "large character container" that holds all the symbols in the world, each with its own unique code, which removes the root cause of garbled characters. In short, Unicode is one encoding for all symbols [2].

Unicode developed alongside the Universal Character Set standard and is also published in book form. It is an industry standard that organizes and encodes most of the world's writing systems so that computers can present and process text in a simpler way. Unicode is still being revised and so far includes more than 100,000 characters. It is widely accepted across the industry and widely used in the internationalization and localization of software.

Traditional encodings share a common problem: they cannot support multilingual environments, which is unacceptable in the open environment of the Internet. Today almost all computer systems support the basic Latin alphabet, and each additionally supports its own set of other encodings. To stay compatible with them, Unicode reserves its first 256 code points for the characters defined by ISO 8859-1, so converting existing Western European text needs no special handling. In addition, a large number of identical characters are encoded more than once under different codes, so that old, complicated encodings can be converted to and from Unicode without losing any information [1].
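The ISO 8859-1 compatibility can be checked directly. The following sketch (the class and method names are invented for illustration) decodes each of the 256 possible byte values as ISO 8859-1 and confirms that the resulting character has the same numeric value as the byte.

```java
import java.nio.charset.StandardCharsets;

public class Latin1Prefix {
    // Returns true if every ISO 8859-1 byte value decodes to the Unicode
    // character with the same number (0 through 255).
    static boolean latin1MatchesUnicode() {
        for (int b = 0; b < 256; b++) {
            String s = new String(new byte[] {(byte) b}, StandardCharsets.ISO_8859_1);
            if (s.length() != 1 || s.charAt(0) != b) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(latin1MatchesUnicode()); // true
    }
}
```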

Implementation method

The Unicode code of a character is fixed, but in actual transmission the way that code is stored differs, because platforms are designed differently and because of the need to save space. An implementation of Unicode is called a Unicode Transformation Format (UTF for short) [1].

Unicode is a character set with three main implementations: UTF-8, UTF-16, and UTF-32. UTF-8 is the current mainstream implementation, and UTF-16 and UTF-32 are used comparatively rarely, so the rest of this article focuses on UTF-8.
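To see how the three implementations differ in size, the sketch below encodes a single Chinese character (U+4E25) with each of them and counts the bytes. The class and method names are illustrative only. UTF-32BE is looked up by name because it is not one of the constants in `StandardCharsets` on older Java versions; it ships with common JDKs, but that availability is an assumption here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UtfLengths {
    // Number of bytes the text occupies in the given charset.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "\u4E25"; // one Chinese character
        // Assumption: the running JDK provides a charset named UTF-32BE.
        Charset utf32 = Charset.forName("UTF-32BE");
        System.out.println(byteLength(text, StandardCharsets.UTF_8));    // 3
        System.out.println(byteLength(text, StandardCharsets.UTF_16BE)); // 2
        System.out.println(byteLength(text, utf32));                     // 4
    }
}
```

The explicit big-endian variants are used so that no byte order mark is added to the count.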

UCS

To understand Unicode it helps to know about UCS. UCS (Universal Character Set) is the standard character set defined by ISO 10646 (ISO/IEC 10646). It is designed as a superset of other character sets and guarantees round-trip compatibility with them: if you convert a text string to UCS and then back to its original encoding, no information is lost.

UCS not only assigns a code to each character, but also gives it an official name. Hexadecimal numbers representing a UCS or Unicode value are usually preceded by "U+", for example "U+0041" represents the character "A".
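Java exposes both pieces of information. A minimal sketch, with an illustrative class name and helper: `codePointAt` returns the numeric code, and `Character.getName` returns the official name.

```java
public class CodePointInfo {
    // Format a code point in the conventional U+XXXX notation.
    static String uPlus(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        int cp = "A".codePointAt(0);
        System.out.println(uPlus(cp));             // U+0041
        System.out.println(Character.getName(cp)); // LATIN CAPITAL LETTER A
    }
}
```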

Little endian & Big endian

Because platforms are designed differently, they may disagree about how to read the bytes of a character, in particular about byte order, so the same byte stream can be interpreted as different content. For example, suppose the hexadecimal value of a character is 4E59, stored as the two bytes 4E and 59. A platform that treats the first byte as the low-order byte parses the stream as 594E and finds the character "奎". A platform that treats the first byte as the high-order byte parses it as 4E59 and finds the character "乙". In other words, an "乙" saved on one platform turns into "奎" on a platform that assumes the opposite byte order. To avoid this confusion, Unicode encodings distinguish two byte orders: big endian, where the high-order byte comes first, and little endian, where the low-order byte comes first. This raises a question: how does the computer know which byte order a given file uses?
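The mix-up can be reproduced in Java by decoding the same two bytes, 4E 59, with both byte orders. This is a small sketch with an illustrative class name; it prints code points rather than glyphs so the output does not depend on the console encoding.

```java
import java.nio.charset.StandardCharsets;

public class ByteOrderDemo {
    // Treat the first byte as the high-order byte.
    static String decodeBigEndian(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16BE);
    }

    // Treat the first byte as the low-order byte.
    static String decodeLittleEndian(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16LE);
    }

    public static void main(String[] args) {
        byte[] bytes = {0x4E, 0x59};
        System.out.printf("U+%04X%n", decodeBigEndian(bytes).codePointAt(0));    // U+4E59
        System.out.printf("U+%04X%n", decodeLittleEndian(bytes).codePointAt(0)); // U+594E
    }
}
```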

The Unicode specification defines a character that can be placed at the very start of a file to indicate the byte order. The character is named "ZERO WIDTH NO-BREAK SPACE" and has the code FEFF. It occupies exactly two bytes, and FF is one greater than FE.

If the first two bytes of a text file are FE FF, the file uses big-endian order; if the first two bytes are FF FE, the file uses little-endian order.
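A reader can apply this rule by inspecting the first two bytes. The sketch below is a simplified illustration (the class and method names are hypothetical, and it deliberately ignores other encodings whose markers may start with the same bytes). The `main` method relies on the fact that Java's `UTF_16` charset writes a big-endian byte order mark when encoding.

```java
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    // Returns "UTF-16BE" or "UTF-16LE" based on the leading byte order
    // mark, or null when the data does not start with either mark.
    static String detectUtf16Order(byte[] data) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return "UTF-16BE";
            if (b0 == 0xFF && b1 == 0xFE) return "UTF-16LE";
        }
        return null;
    }

    public static void main(String[] args) {
        // Encoding with UTF_16 yields FE FF followed by big-endian data.
        byte[] withMark = "\u4E59".getBytes(StandardCharsets.UTF_16);
        System.out.println(detectUtf16Order(withMark)); // UTF-16BE

        byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, 0x59, 0x4E};
        System.out.println(detectUtf16Order(littleEndian)); // UTF-16LE
    }
}
```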

UTF-8

UTF-8 is a variable-length character encoding for Unicode. It uses 1 to 4 bytes per symbol, with the length depending on the symbol, and it can represent any character in the Unicode standard. Its single-byte encodings are identical to ASCII, so existing software that processes ASCII text can keep working with little or no modification. As a result, UTF-8 has gradually become the preferred encoding for email, web pages, and other applications that store or transmit text.

The UTF-8 encoding rules are as follows:

1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. For English letters, therefore, the UTF-8 encoding is the same as the ASCII code.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0; the first two bits of every following byte are set to 10. All remaining bits hold the symbol's Unicode code.

The conversion table is as follows:

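Unicode code range (hexadecimal) | UTF-8 encoding (binary)
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx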

With the conversion table, UTF-8 is easy to interpret: if the first bit of a character's first byte is 0, that byte is a complete character on its own; if the first bit is 1, the number of consecutive leading 1s is the number of bytes the character occupies.
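This reading rule translates into a few bit masks. The sketch below (class and method names are illustrative) takes the first byte of a character and returns how many bytes the character occupies, or -1 for a byte that cannot start a character.

```java
public class Utf8LeadByte {
    // Returns the length in bytes (1 to 4) of the UTF-8 sequence that
    // starts with the given byte, or -1 if the byte is a continuation
    // byte (10xxxxxx) or otherwise cannot start a sequence.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0x41)); // 1
        System.out.println(sequenceLength((byte) 0xE4)); // 3
        System.out.println(sequenceLength((byte) 0xB8)); // -1
    }
}
```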

Take the Chinese character "严" as an example of how UTF-8 encoding works [3].

The Unicode code of "严" is 4E25 (binary 100111000100101). In the conversion table, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of "严" needs three bytes in the format "1110xxxx 10xxxxxx 10xxxxxx". Starting from the last binary digit of "严", fill in the x positions from back to front and pad any remaining positions with 0. The result is "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.
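The bit-filling procedure can be written out by hand and compared with Java's built-in encoder. This is a teaching sketch, not a replacement for `String.getBytes`: the class and method names are invented, and it assumes the input is a valid code point (it does not reject surrogates or values above 10FFFF).

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ManualUtf8 {
    // Encode one code point by filling the x positions of the matching
    // template from the low-order bits upward.
    static byte[] encode(int cp) {
        if (cp < 0x80) {            // 0xxxxxxx
            return new byte[] {(byte) cp};
        } else if (cp < 0x800) {    // 110xxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),
                (byte) (0x80 | (cp & 0x3F))};
        } else if (cp < 0x10000) {  // 1110xxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        } else {                    // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),
                (byte) (0x80 | ((cp >> 12) & 0x3F)),
                (byte) (0x80 | ((cp >> 6) & 0x3F)),
                (byte) (0x80 | (cp & 0x3F))};
        }
    }

    // Render bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] manual = encode(0x4E25);
        byte[] builtIn = "\u4E25".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(manual));                     // E4 B8 A5
        System.out.println(Arrays.equals(manual, builtIn));  // true
    }
}
```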

Conversion between Unicode and UTF-8

As the example shows, the Unicode code of "严" is 4E25 while its UTF-8 encoding is E4B8A5. The two differ, so a program has to convert between them. On Windows, the simplest and most intuitive way to do this is Notepad.

In Notepad's Save As dialog, the "Encoding (E)" drop-down at the bottom offers four options: ANSI, Unicode, Unicode big endian, and UTF-8.

ANSI: Notepad's default encoding. It means ASCII for English files and GB2312 for Simplified Chinese files. Note that different ANSI encodings are incompatible with each other, so when information is exchanged internationally, text in two different languages cannot be stored in the same ANSI-encoded file.

Unicode: the UCS-2 encoding, which stores each character's Unicode code directly in two bytes, in little-endian order.

Unicode big endian: the UCS-2 encoding in big-endian order.

UTF-8: the encoding described in the UTF-8 section above.

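Save a file that contains only the character "严" in each of the four encodings, open each file in a hex viewer, and you get the following results: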

ANSI: two bytes, "D1 CF", which is exactly the GB2312 encoding of "严".

Unicode: four bytes, "FF FE 25 4E". "FF FE" marks little-endian storage, and the actual encoding is "25 4E".

Unicode big endian: four bytes, "FE FF 4E 25". "FE FF" marks big-endian storage, and the actual encoding is "4E 25".

UTF-8: six bytes, "EF BB BF E4 B8 A5". The first three bytes, "EF BB BF", indicate that the file is UTF-8, and the last three, "E4 B8 A5", are the encoding of "严" itself, stored in the same order as the encoding.
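The same four byte sequences can be produced from Java, which is a convenient way to check what a program will write before chasing garbled output. In this sketch the class and helper names are illustrative, the marker bytes are prepended by hand to mirror the dumps above, and the GB2312 line is guarded because that charset is an optional one that a given Java runtime may not include.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingDump {
    // Render bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode the text and prepend the given marker bytes (may be none).
    static String dump(String text, Charset charset, int... marker) {
        byte[] body = text.getBytes(charset);
        byte[] out = new byte[marker.length + body.length];
        for (int i = 0; i < marker.length; i++) {
            out[i] = (byte) marker[i];
        }
        System.arraycopy(body, 0, out, marker.length, body.length);
        return hex(out);
    }

    public static void main(String[] args) {
        String yan = "\u4E25";
        if (Charset.isSupported("GB2312")) { // optional charset
            System.out.println("ANSI (GB2312):      " + dump(yan, Charset.forName("GB2312")));
        }
        System.out.println("Unicode:            " + dump(yan, StandardCharsets.UTF_16LE, 0xFF, 0xFE));
        System.out.println("Unicode big endian: " + dump(yan, StandardCharsets.UTF_16BE, 0xFE, 0xFF));
        System.out.println("UTF-8:              " + dump(yan, StandardCharsets.UTF_8, 0xEF, 0xBB, 0xBF));
    }
}
```

When reading such files back, the decoder must be given the same charset that was used to write them, for example by passing the charset explicitly to `new String(bytes, charset)` or to an `InputStreamReader`, rather than relying on the platform default.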

