Basic introduction: a detailed explanation of Java character sets

黄舟

Dec 17, 2016 am 11:01 AM

1. Overview

This article covers the following topics: the basics of character encoding, and how encodings are handled in Java, system software, URLs, tool software, and so on.

In the following description, we take the two-character word "中文" ("Chinese") as the example. Looking it up in the code tables gives the GB2312 encoding "d6d0 cec4", the Unicode code points "4e2d 6587", and the UTF-8 encoding "e4b8ad e69687". Note that these two characters have no iso8859-1 encoding, but the bytes of another encoding can still be "represented" as iso8859-1 characters, one byte per character.

2. Basic knowledge of encoding

One of the earliest encodings is iso8859-1, an 8-bit extension of ASCII. To make it possible to express other languages, many standard encodings gradually emerged. The important ones are as follows.

2.1. iso8859-1

It is a single-byte encoding whose character range is 0-255, used mainly for English and other Western European languages. For example, the letter 'a' is encoded as 0x61 = 97.

Clearly, the character range of iso8859-1 is very narrow and cannot represent Chinese characters. However, because it is a single-byte encoding, matching the computer's most basic storage unit, iso8859-1 is still used in many cases, and many protocols use it by default. For example, although the word "中文" does not exist in iso8859-1, in gb2312 it is the two characters "d6d0 cec4". Treated as iso8859-1, these are split into 4 single-byte characters: "d6 d0 ce c4" (in fact, storage is byte-oriented anyway). In UTF-8 it would be the 6 bytes "e4 b8 ad e6 96 87". Obviously, such a representation only carries the bytes; recovering the text requires knowing the underlying encoding.
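As a minimal sketch of this "carrying" behaviour (the class and method names are hypothetical, chosen only for illustration), the following decodes the four GB2312 bytes as iso8859-1 and then recovers them unchanged:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1Carrier {
    // Decode raw bytes as ISO-8859-1: every byte becomes one char with the same value.
    static String carry(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // Encode the carrier string back to bytes; nothing is lost on the way.
    static byte[] recover(String carrier) {
        return carrier.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // GB2312 bytes of the example word (U+4E2D U+6587).
        byte[] gbk = {(byte) 0xd6, (byte) 0xd0, (byte) 0xce, (byte) 0xc4};
        String carrier = carry(gbk);
        for (char c : carrier.toCharArray()) {
            System.out.printf("%04x ", (int) c);   // 00d6 00d0 00ce 00c4
        }
        System.out.println();
        System.out.println(Arrays.equals(gbk, recover(carrier)));   // true
    }
}
```

The carrier string is meaningless as text, but because iso8859-1 maps every byte value to exactly one character, it preserves the original bytes.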

2.2. GB2312/GBK

This is the Chinese national standard encoding for Hanzi (Chinese characters). Each Chinese character takes two bytes, while English letters keep their single-byte ASCII values, so ASCII text is unchanged. (Bytes in the range 0x80-0xff are used as parts of double-byte characters, so the encoding is compatible with ASCII rather than with the upper half of iso8859-1.) gbk can represent both traditional and simplified characters, while gb2312 can represent only simplified characters. gbk is backward compatible with gb2312.
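The mixed one-byte and two-byte widths can be checked with a short sketch. It assumes the JDK in use ships the "GBK" charset (standard full JDK builds do); the class and method names are hypothetical.

```java
import java.nio.charset.Charset;

class GbkWidth {
    // Assumption: the running JDK provides the GBK charset.
    static final Charset GBK = Charset.forName("GBK");

    // Number of bytes the text occupies once encoded as GBK.
    static int gbkLength(String text) {
        return text.getBytes(GBK).length;
    }

    public static void main(String[] args) {
        System.out.println(gbkLength("a"));              // 1: same byte as ASCII
        System.out.println(gbkLength("\u4e2d"));         // 2: one Chinese character
        System.out.println(gbkLength("a\u4e2d\u6587"));  // 5: 1 + 2 + 2
    }
}
```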

2.3. unicode

This is the most unified character set, and it can represent characters in all languages. The form discussed here is the double-byte encoding (UCS-2, or UTF-16, in which characters outside the Basic Multilingual Plane take four bytes), and it applies to English letters as well. It is therefore not byte-compatible with iso8859-1, nor with any other encoding. However, for the characters of iso8859-1 the Unicode value is the same number, so the Unicode encoding only adds a 0 byte in front; for example, the letter 'a' is "00 61".

Note that fixed-length encodings are easy for computers to process (GB2312/GBK is not a fixed-length encoding), and Unicode can represent all characters, so many programs use Unicode internally. Java, for example, stores strings as 16-bit UTF-16 code units.
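A small sketch of the double-byte form, using Java's big-endian UTF-16 charset as a stand-in for the "unicode encoding" described above (class and method names are hypothetical):

```java
import java.nio.charset.StandardCharsets;

class Utf16Units {
    // Render bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Big-endian UTF-16 bytes of the text, without a byte order mark.
    static String utf16Hex(String text) {
        return hex(text.getBytes(StandardCharsets.UTF_16BE));
    }

    public static void main(String[] args) {
        System.out.println(utf16Hex("a"));             // 00 61
        System.out.println(utf16Hex("\u4e2d\u6587"));  // 4e 2d 65 87
        // U+1F600 lies outside the Basic Multilingual Plane: two code units, four bytes.
        System.out.println(utf16Hex("\uD83D\uDE00"));  // d8 3d de 00
    }
}
```

The last line shows why UTF-16 is only "mostly" fixed-length: characters beyond U+FFFF occupy two 16-bit units.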

2.4. UTF-8

Unicode's double-byte encoding is not compatible with iso8859-1 and easily takes up more space, because even English letters need two bytes. This makes it inconvenient for transmission and storage, which is why UTF-8 was created. UTF-8 is compatible with ASCII (but not with the upper half of iso8859-1, whose characters take two bytes in UTF-8) and can represent characters in all languages. It is a variable-length encoding: each character takes 1 to 4 bytes under the current standard (the original design allowed up to 6). The bit patterns of its lead and continuation bytes also provide a simple validity check. Generally speaking, English letters take one byte and Chinese characters take three.
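The byte lengths and the built-in validity check can be illustrated with a short sketch. The class and method names are hypothetical; the decoder calls are standard java.nio.charset API.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class Utf8Shape {
    // Number of bytes the text occupies in UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // A fresh decoder reports malformed input, so decode() throws on invalid bytes.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("a"));             // 1
        System.out.println(utf8Length("\u00e9"));        // 2: an iso8859-1 letter above 0x7f
        System.out.println(utf8Length("\u4e2d"));        // 3
        System.out.println(utf8Length("\uD83D\uDE00"));  // 4
        byte[] gbk = {(byte) 0xd6, (byte) 0xd0, (byte) 0xce, (byte) 0xc4};
        System.out.println(isValidUtf8(gbk));            // false: GBK bytes are not well-formed UTF-8
    }
}
```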

Note that UTF-8 saves space only relative to the double-byte Unicode encoding. If the text is known to be Chinese, GB2312/GBK is undoubtedly the most economical. On the other hand, even though UTF-8 uses 3 bytes per Chinese character, a Chinese web page usually takes less space in UTF-8 than in double-byte Unicode, because the page contains a lot of English characters.
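To make the size comparison concrete, here is a small worked example on a made-up page fragment (the fragment, class name, and method name are illustrative assumptions, and the "GBK" charset is assumed to be available in the JDK):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PageSize {
    // Size in bytes of the text under the given charset.
    static int size(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        // Seven ASCII markup characters around the two-character example word.
        String page = "<p>\u4e2d\u6587</p>";
        System.out.println(size(page, Charset.forName("GBK")));     // 11 = 7 + 2*2
        System.out.println(size(page, StandardCharsets.UTF_8));     // 13 = 7 + 2*3
        System.out.println(size(page, StandardCharsets.UTF_16BE));  // 18 = 9*2
    }
}
```

The more markup a page contains relative to its Chinese text, the larger UTF-8's advantage over the double-byte form becomes.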

3. Java processing of characters

Java applications touch character set encoding in many places. Some of them only need the correct setting, while others require explicit conversion.

3.1. getBytes(charset)

This is a standard method of Java's String. It encodes the characters of the string according to charset and returns the result as bytes. Note that strings are always held in Java memory as Unicode (UTF-16). For example, "中文" is normally (that is, when nothing has gone wrong) stored as "4e2d 6587". If charset is "gbk", it is encoded as "d6d0 cec4" and the bytes "d6 d0 ce c4" are returned. If charset is "utf8", the result is "e4 b8 ad e6 96 87". If it is "iso8859-1", the characters cannot be encoded, so "3f 3f" (two question marks) is returned.
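The three cases can be reproduced with a short sketch. The class and helper names are hypothetical, the canonical charset names are used in place of the shorthand in the text, and "GBK" is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;

class GetBytesDemo {
    // Render bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encode the text with the named charset and show the resulting bytes.
    static String encodeHex(String text, String charsetName) {
        return hex(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        String word = "\u4e2d\u6587";   // the example word, U+4E2D U+6587
        System.out.println(encodeHex(word, "GBK"));         // d6 d0 ce c4
        System.out.println(encodeHex(word, "UTF-8"));       // e4 b8 ad e6 96 87
        System.out.println(encodeHex(word, "ISO-8859-1"));  // 3f 3f
    }
}
```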

3.2. new String(bytes, charset)

This is another standard String operation: a constructor that takes a byte array and a charset. It is the inverse of getBytes. It groups and interprets the bytes according to charset and stores the result as Unicode. Continuing the getBytes example above, decoding the "gbk" and "utf8" bytes with the matching charset gives the correct result "4e2d 6587", but the iso8859-1 bytes become "003f 003f" (two question marks).

Because utf8 can represent/encode all characters, new String(str.getBytes("utf8"), "utf8").equals(str) holds, so the conversion is completely reversible.
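A minimal sketch of decoding and of the round trip (hypothetical class and method names; "GBK" is assumed to be available in the JDK):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class NewStringDemo {
    // Decode bytes with the named charset into a Java string.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    // Encode with the charset and decode again; true if nothing was lost.
    static boolean roundTrips(String text, Charset charset) {
        return new String(text.getBytes(charset), charset).equals(text);
    }

    public static void main(String[] args) {
        String word = "\u4e2d\u6587";
        byte[] gbk = {(byte) 0xd6, (byte) 0xd0, (byte) 0xce, (byte) 0xc4};
        System.out.println(decode(gbk, "GBK").equals(word));              // true
        System.out.println(roundTrips(word, StandardCharsets.UTF_8));     // true
        // iso8859-1 cannot encode the word, so the round trip yields "??".
        System.out.println(roundTrips(word, StandardCharsets.ISO_8859_1)); // false
    }
}
```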

3.3. setCharacterEncoding()

This method sets the encoding of the HTTP request or response.

For a request, it specifies the encoding of the submitted content. Once it is set, getParameter() returns the correct string directly. If it is not set, iso8859-1 is used by default and further processing is required; see "Form input" below. Note that getParameter() must not be called before setCharacterEncoding(). The Javadoc states: "This method must be called prior to reading request parameters or reading input using getReader()." Moreover, the setting only takes effect for the POST method, not for GET. The likely reason is that on the first getParameter() call, Java parses all submitted content using the encoding in effect at that moment, and later getParameter() calls do not parse it again, so a later setCharacterEncoding() has no effect. For a form submitted with GET, the content is in the URL and has already been parsed with some encoding from the start, so setCharacterEncoding() naturally has no effect either.

For a response, it specifies the encoding of the output content. The setting is also sent to the browser, telling it which encoding the output uses.
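The servlet API is not part of the JDK, so the following is only a stand-in: it uses java.net.URLDecoder to mimic what a container does when it parses a percent-encoded form value with a given encoding. The class name, method name, and sample value are assumptions made for illustration, and "GBK" is assumed to be available in the JDK.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class FormDecodeSketch {
    // Parse one percent-encoded value using the given encoding, as a container might.
    static String parseValue(String percentEncoded, String encoding) {
        try {
            return URLDecoder.decode(percentEncoded, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
    }

    public static void main(String[] args) {
        // A gbk page submits the example word as these four percent-encoded bytes.
        String submitted = "%D6%D0%CE%C4";
        // Parsed with the default, the value is four meaningless Latin-1 characters.
        System.out.println(parseValue(submitted, "ISO-8859-1").length());            // 4
        // Parsed with the encoding the page really used, the word comes out intact.
        System.out.println(parseValue(submitted, "GBK").equals("\u4e2d\u6587"));     // true
    }
}
```

The encoding has to be chosen before parsing happens, which is the same constraint the Javadoc places on setCharacterEncoding().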

3.4. Processing flow

Two representative examples are analyzed below to illustrate how Java handles coding-related issues.

3.4.1. Form input

User input *(gbk: d6d0 cec4) → browser *(gbk: d6d0 cec4) → web server, decoded as iso8859-1 (00d6 00d0 00ce 00c4) → class. The string must then be repaired in the class: getBytes("iso8859-1") gives d6 d0 ce c4, new String(bytes, "gbk") reads them as d6d0 cec4, and the Unicode value in memory is 4e2d 6587.

The encoding of the user's input depends on the encoding specified by the page and also on the user's operating system, so it is uncertain. The example above uses gbk.

From the browser to the web server, the form can specify the character set used when submitting content; otherwise the encoding specified by the page is used. If the parameters are typed directly into the URL after ?, the encoding is often that of the operating system itself, because the page plays no part in this case. The example above still uses gbk.

What the web server receives is a byte stream. By default (getParameter()), it is decoded as iso8859-1. The result is incorrect, so it needs to be repaired. But if the encoding is set in advance (through request.setCharacterEncoding()), the correct result is obtained directly.

It is a good habit to specify the encoding in the page; otherwise you lose control over which encoding is used and may not be able to decode the input correctly.
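The repair step described in this flow can be sketched as follows. The class and method names are hypothetical, the input string stands in for what getParameter() would return without setCharacterEncoding(), and "GBK" is assumed to be available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FormRepair {
    // Undo a wrong iso8859-1 decode: get the raw bytes back, then decode them properly.
    static String repair(String misdecoded, String actualCharset) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName(actualCharset));
    }

    public static void main(String[] args) {
        // What the class sees after the server decoded the gbk bytes as iso8859-1.
        String received = "\u00d6\u00d0\u00ce\u00c4";
        String fixed = repair(received, "GBK");
        for (char c : fixed.toCharArray()) {
            System.out.printf("%04x ", (int) c);   // 4e2d 6587
        }
        System.out.println();
    }
}
```

This only works because the wrong decode used iso8859-1, which keeps every byte; a wrong decode with a lossy charset could not be undone this way.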
