Basic introduction: a detailed explanation of Java character sets

黄舟
Release: 2016-12-17 11:01:02
  1. Overview

    This article covers the following topics: encoding basics, Java, system software, URLs, tool software, and so on.

    The description below uses the two-character word "中文" ("Chinese") as its running example. Looking it up in the code tables gives the GB2312 encoding "d6d0 cec4", the Unicode code points "4e2d 6587", and the UTF-8 encoding "e4b8ad e69687". Note that these two characters have no iso8859-1 encoding, but their bytes can still be "carried" by iso8859-1.

    2. Basic knowledge of encoding

    ASCII and its single-byte extension iso8859-1 are among the earliest encodings. To make it possible to represent more languages, many other standard encodings gradually emerged. The important ones are as follows.

    2.1. iso8859-1

    iso8859-1 is a single-byte encoding, so it can represent at most 256 characters (values 0-255). It is used for English and other Western European languages. For example, the letter 'a' is encoded as 0x61 = 97.

    Clearly, the range of characters that iso8859-1 can represent is very narrow, and it cannot represent Chinese characters. However, because it is a single-byte encoding and the byte is the computer's most basic unit of representation, iso8859-1 is still used in many situations, and many protocols use it by default. For example, the word "中文" does not exist in iso8859-1. In gb2312 it is the two characters "d6d0 cec4"; when those bytes are read as iso8859-1 they are split into four single-byte characters, "d6 d0 ce c4" (in fact, storage is byte-oriented anyway). In UTF-8 the word is the six bytes "e4 b8 ad e6 96 87". Obviously, this kind of representation only makes sense on top of another encoding.
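
    The "carrier" role described above can be sketched in a few lines. This is only an illustration: the class name Latin1CarrierDemo and the helper roundTrip are made up for this example, and the four input bytes are the GB2312 values quoted in the text.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1CarrierDemo {
    /** Decodes bytes as ISO-8859-1 and encodes the result again. */
    static byte[] roundTrip(byte[] raw) {
        String carrier = new String(raw, StandardCharsets.ISO_8859_1);
        return carrier.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The four GB2312 bytes quoted in the text.
        byte[] gbBytes = {(byte) 0xd6, (byte) 0xd0, (byte) 0xce, (byte) 0xc4};

        // Each byte becomes one char with the same numeric value.
        String carrier = new String(gbBytes, StandardCharsets.ISO_8859_1);
        System.out.println(carrier.length()); // 4
        for (char c : carrier.toCharArray()) {
            System.out.printf("%04x ", (int) c); // 00d6 00d0 00ce 00c4
        }
        System.out.println();

        // Encoding back yields the original bytes, so nothing was lost.
        System.out.println(Arrays.equals(gbBytes, roundTrip(gbBytes))); // true
    }
}
```

    Because every byte value from 0 to 255 maps to exactly one character, the round trip is lossless for any byte sequence, which is why the bytes of another encoding can pass through unharmed.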

    2.2. GB2312/GBK

    These are the Chinese national standard encodings for Han characters (hanzi). Each Chinese character takes two bytes, while English letters keep their single-byte ASCII values, so the encodings are compatible with ASCII (the 0x00-0x7F half of iso8859-1). gbk can represent both traditional and simplified characters, whereas gb2312 can only represent simplified characters. gbk is backward compatible with gb2312.
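
    A small sketch of these properties follows. It assumes the JDK in use ships the "GBK" and "GB2312" charsets (standard desktop and server JDKs normally do, but they are not among the charsets every Java platform is required to support). The class name GbkWidthDemo and the helper gbkLength are invented for the example, and U+9AD4 is simply one traditional-form character picked as a probe.

```java
import java.nio.charset.Charset;

class GbkWidthDemo {
    // Assumption: both charsets are available in this JDK.
    static final Charset GBK = Charset.forName("GBK");
    static final Charset GB2312 = Charset.forName("GB2312");

    /** Number of bytes the text occupies when encoded as GBK. */
    static int gbkLength(String text) {
        return text.getBytes(GBK).length;
    }

    public static void main(String[] args) {
        System.out.println(gbkLength("a"));            // 1: ASCII letters stay single-byte
        System.out.println(gbkLength("\u4e2d"));       // 2: one Chinese character
        System.out.println(gbkLength("a\u4e2d"));      // 3: mixed text is variable-length

        // Probe whether each charset can encode the traditional-form character U+9AD4.
        System.out.println(GB2312.newEncoder().canEncode('\u9ad4'));
        System.out.println(GBK.newEncoder().canEncode('\u9ad4'));
    }
}
```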

    2.3. unicode

    Unicode is the most unified character set and aims to represent the characters of all languages. Strictly speaking, Unicode assigns code points; the byte form this article has in mind is UTF-16, which stores each character, including English letters, in two bytes (or four bytes for characters outside the Basic Multilingual Plane). At the byte level it is therefore not compatible with iso8859-1 or with any of the other encodings discussed here. However, its first 256 code points match iso8859-1, so compared with iso8859-1 the big-endian form only adds a 0 byte in front; for example, the letter 'a' is "00 61".

    Fixed-length encodings are easy for computers to process (note that GB2312/GBK is not fixed-length), and unicode can represent all characters, so much software, Java included, uses unicode internally. Java's char and String are built on 16-bit UTF-16 code units.
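
    The two-byte and four-byte cases can be observed directly from Java. The following is a minimal sketch; the class name Utf16Demo and the helper hex are made up for this example, and U+1F600 is just a convenient character outside the Basic Multilingual Plane.

```java
import java.nio.charset.StandardCharsets;

class Utf16Demo {
    /** Formats bytes as space-separated lowercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The letter 'a' gains a leading zero byte in big-endian UTF-16.
        System.out.println(hex("a".getBytes(StandardCharsets.UTF_16BE)));      // 00 61
        System.out.println(hex("\u4e2d".getBytes(StandardCharsets.UTF_16BE))); // 4e 2d

        // A supplementary character needs two 16-bit units (a surrogate pair).
        String supplementary = new String(Character.toChars(0x1F600));
        System.out.println(supplementary.length());                                    // 2
        System.out.println(supplementary.codePointCount(0, supplementary.length()));   // 1
        System.out.println(hex(supplementary.getBytes(StandardCharsets.UTF_16BE)));    // d8 3d de 00
    }
}
```

    UTF_16BE is used instead of UTF_16 because the latter prepends a byte-order mark when encoding, which would add two bytes to every result.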

    2.4. UTF-8

    Two-byte unicode is not compatible with iso8859-1 and tends to take more space, because even English letters need two bytes each. That makes it inconvenient for transmission and storage, which is why the UTF encodings, UTF-8 above all, were created. UTF-8 is compatible with ASCII (but not with the 0x80-0xFF half of iso8859-1) and can still represent the characters of all languages. It is a variable-length encoding: the original design allowed 1 to 6 bytes per character, and the current standard (RFC 3629) limits it to 1 to 4 bytes. Its byte patterns also provide a simple validity check. In general, English letters take one byte and Chinese characters take three.

    Note that UTF-8 saves space only in comparison with two-byte unicode. If the text is known to be mostly Chinese characters, GB2312/GBK is the most economical choice at two bytes per character. On the other hand, even though UTF-8 spends three bytes on each Chinese character, a Chinese web page is usually smaller in UTF-8 than in two-byte unicode, because the page contains many English (ASCII) characters in its markup.
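
    The size trade-off can be checked with a short sketch. The markup string below is an invented sample, not taken from any real page, and the class name EncodedSizeDemo and helper size are likewise hypothetical. The example assumes the "GBK" charset is available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodedSizeDemo {
    // Invented sample: 21 ASCII markup characters around two Chinese characters.
    static final String MARKUP = "<p class=\"title\">\u4e2d\u6587</p>";

    /** Number of bytes the text occupies in the given charset. */
    static int size(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: available in this JDK
        String word = "\u4e2d\u6587";

        // Pure Chinese text: UTF-8 is the largest of the three.
        System.out.println(size(word, StandardCharsets.UTF_8));    // 6
        System.out.println(size(word, StandardCharsets.UTF_16BE)); // 4
        System.out.println(size(word, gbk));                       // 4

        // Markup-heavy text: UTF-8 beats two-byte Unicode.
        System.out.println(size(MARKUP, StandardCharsets.UTF_8));    // 27
        System.out.println(size(MARKUP, StandardCharsets.UTF_16BE)); // 46
        System.out.println(size(MARKUP, gbk));                       // 25
    }
}
```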

    3. Java processing of characters

    In Java applications, character set encoding comes up in many places. Some of them require the correct settings, and others require some explicit processing.

    3.1. getBytes(charset)

    This is a standard method of Java's String class. It encodes the characters of the string according to charset and returns the result as bytes. Note that strings are always stored in Java memory as unicode (UTF-16). For example, "中文" is normally (that is, when nothing has gone wrong) stored as "4e2d 6587". If charset is "gbk", it is encoded as "d6d0 cec4" and the bytes "d6 d0 ce c4" are returned. If charset is "utf8", the result is "e4 b8 ad e6 96 87". If it is "iso8859-1", the characters cannot be encoded, so "3f 3f" (two question marks) is returned.
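
    The three cases can be reproduced with a small sketch. The class name GetBytesDemo and the helpers hex and encodeHex are hypothetical, and the example assumes the "GBK" charset is available in the JDK. The string literal is written with Unicode escapes (U+4E2D U+6587) so that the result does not depend on the source file's own encoding.

```java
import java.nio.charset.Charset;

class GetBytesDemo {
    /** Formats bytes as space-separated lowercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Encodes text with the named charset and returns the bytes as hex. */
    static String encodeHex(String text, String charsetName) {
        return hex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String word = "\u4e2d\u6587";
        System.out.println(encodeHex(word, "GBK"));        // d6 d0 ce c4
        System.out.println(encodeHex(word, "UTF-8"));      // e4 b8 ad e6 96 87
        // Unmappable characters are replaced by '?' (0x3f).
        System.out.println(encodeHex(word, "ISO-8859-1")); // 3f 3f
    }
}
```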

    3.2. new String(bytes, charset)

    This is the other standard String operation, and it is the inverse of the previous one. It interprets a byte array according to charset and converts the result to unicode for storage. Continuing the getBytes example, the "gbk" and "utf8" byte sequences both decode back to the correct result "4e2d 6587" when the matching charset is used, but the iso8859-1 bytes become "003f 003f" (two question marks).

    Because utf8 can represent every character, new String(str.getBytes("utf8"), "utf8").equals(str) is true; the conversion is completely reversible.
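
    A minimal sketch of the reversible and lossy cases follows. The class name DecodeDemo and the helper roundTrips are invented for illustration, and the example assumes the "GBK" charset is available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    /** True if encoding and then decoding with the same charset restores the text. */
    static boolean roundTrips(String text, Charset charset) {
        return new String(text.getBytes(charset), charset).equals(text);
    }

    public static void main(String[] args) {
        String word = "\u4e2d\u6587"; // U+4E2D U+6587
        Charset gbk = Charset.forName("GBK"); // assumption: available in this JDK

        System.out.println(roundTrips(word, StandardCharsets.UTF_8));      // true
        System.out.println(roundTrips(word, gbk));                         // true
        System.out.println(roundTrips(word, StandardCharsets.ISO_8859_1)); // false

        // The lossy case: both characters were already replaced by '?' when encoding.
        String lossy = new String(word.getBytes(StandardCharsets.ISO_8859_1),
                                  StandardCharsets.ISO_8859_1);
        System.out.println(lossy); // ??
    }
}
```

    Note that the comparison uses equals(); the == operator on two String objects compares references, not contents.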

    3.3. setCharacterEncoding()

    This method sets the encoding of the HTTP request or response.

    For the request, it specifies the encoding of the submitted content. Once it is set, getParameter() returns the correct string directly. If it is not set, iso8859-1 is used by default and further processing is needed; see "Form input" below. Note that getParameter() must not be called before setCharacterEncoding(). The Javadoc states: "This method must be called prior to reading request parameters or reading input using getReader()." Moreover, the setting only applies to the POST method, not to GET. The likely reason is that the first call to getParameter() parses all of the submitted content using the encoding in effect at that moment, and later calls do not parse it again, so a later setCharacterEncoding() has no effect. When a form is submitted with GET, the content travels in the URL and has already been decoded before the servlet code runs, so setCharacterEncoding() naturally has no effect either.

    For the response, it specifies the encoding of the output content. The setting is also sent to the browser, telling it which encoding the output uses.
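
    The effect of the request encoding can be imitated without a servlet container. The sketch below does not use the servlet API at all: it decodes a percent-encoded form value with java.net.URLDecoder, once with the iso8859-1 default and once with gbk, as a rough model of what a container does when it parses parameters. The class name FormDecodingDemo and the helper decodeValue are hypothetical, and the "GBK" charset is assumed to be available.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class FormDecodingDemo {
    /** Decodes one percent-encoded form value using the given encoding name. */
    static String decodeValue(String percentEncoded, String encoding) {
        try {
            return URLDecoder.decode(percentEncoded, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported encoding: " + encoding, e);
        }
    }

    public static void main(String[] args) {
        // The gbk bytes d6 d0 ce c4 as a browser would percent-encode them.
        String submitted = "%D6%D0%CE%C4";

        // With the default encoding, each byte turns into its own character.
        String byDefault = decodeValue(submitted, "ISO-8859-1");
        System.out.println(byDefault.length()); // 4

        // With the encoding the browser actually used, the word is restored.
        String asGbk = decodeValue(submitted, "GBK");
        System.out.println(asGbk.equals("\u4e2d\u6587")); // true
    }
}
```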

    3.4. Processing flow

    A representative example is analyzed below to illustrate how Java handles encoding-related issues.

    3.4.1. Form input

    User input *(gbk: d6d0 cec4) -> browser *(gbk: d6d0 cec4) -> web server iso8859-1 (00d6 00d0 00ce 00c4) -> class. The value has to be processed in the class: getBytes("iso8859-1") gives d6 d0 ce c4, new String(bytes, "gbk") reads those bytes as d6d0 cec4, and the unicode representation in memory is 4e2d 6587.

    The encoding of the user's input depends on the encoding specified by the page and on the user's operating system, so it is not fixed. The example above uses gbk.

    Between the browser and the web server, the form can specify the character set used to submit its content; otherwise the encoding specified by the page is used. If parameters are typed directly into the URL after ?, the encoding is often that of the operating system itself, because the page is not involved. The example above again assumes gbk.

    The web server receives a byte stream. By default, getParameter() decodes it as iso8859-1, which gives an incorrect result that has to be corrected. If the encoding is set in advance with request.setCharacterEncoding(), the correct result is returned directly.

    It is good practice to specify the encoding in the page; otherwise you lose control over which encoding the browser uses to submit data.
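
    The correction step from the flow above can be sketched as follows. The class name ParameterRepairDemo and the helper reinterpret are made up for this example, the input string stands in for what getParameter() would return in the gbk scenario, and the "GBK" charset is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ParameterRepairDemo {
    /**
     * Recovers text that was wrongly decoded as ISO-8859-1:
     * get the original bytes back, then decode them with the real charset.
     */
    static String reinterpret(String misdecoded, Charset actual) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, actual);
    }

    public static void main(String[] args) {
        // What the class receives: 00d6 00d0 00ce 00c4.
        String received = "\u00d6\u00d0\u00ce\u00c4";

        String repaired = reinterpret(received, Charset.forName("GBK"));
        for (char c : repaired.toCharArray()) {
            System.out.printf("%04x ", (int) c); // 4e2d 6587
        }
        System.out.println();
    }
}
```

    This only works because the wrong decoding was iso8859-1, which preserves every byte. If the value had been decoded with a charset that replaces invalid sequences, the original bytes would no longer be recoverable.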

