What is the cause of Chinese garbled characters?-Common Problem-php.cn

What is the cause of Chinese garbled characters?

青灯夜游

Nov 09, 2022 am 11:14 AM

Cause of Chinese garbled characters: the decoding method is inconsistent with the encoding method. A common Chinese character takes 3 bytes when encoded in UTF-8 and 2 bytes when encoded in gbk, while an English character takes 1 byte in both UTF-8 and gbk.

The operating environment of this tutorial: Windows 7 system, Dell G3 computer.

Let’s first talk about what garbled characters are

Some readers may have had this thought: a string contains not only characters but also its encoding information. Take String str = "你好"; in Java as an example. I used to think that the string str carried its encoding with it, whether unicode, gbk, iso-8859-1, or something else. This understanding is wrong. Characters are just characters and carry no other information. The correct picture is this: the text you see in a file is the result of the system reading stored numeric data (bytes), decoding it into characters with some encoding, and finally displaying those characters. That is, when you double-click a text file, the system reads the bytes and decodes them for display; when you save a text file, the system encodes the characters with the encoding you selected and stores the resulting bytes. So garbled characters are also just characters, only strange ones; despite the name "garbled code", no "code" is involved.
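The point that a string carries no encoding can be checked with a small sketch. The class and method names below are made up for illustration, and the example assumes the JDK in use ships the GBK charset. The same text is encoded two ways, giving two different byte arrays, yet decoding each array with its own charset yields equal strings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: shows that a String holds characters, not an encoding.
class StringHasNoEncoding {

    // Returns true when the two byte arrays differ but the decoded strings are equal.
    static boolean sameTextFromDifferentBytes(String text) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available in this JDK
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        byte[] gbkBytes = text.getBytes(gbk);

        // Decode each byte array with the charset that produced it.
        String fromUtf8 = new String(utf8Bytes, StandardCharsets.UTF_8);
        String fromGbk = new String(gbkBytes, gbk);

        return !Arrays.equals(utf8Bytes, gbkBytes) && fromUtf8.equals(fromGbk);
    }

    public static void main(String[] args) {
        // The two-character Chinese greeting from the article, written with
        // Unicode escapes so the source file's own encoding does not matter.
        String text = "\u4f60\u597d";
        System.out.println(sameTextFromDifferentBytes(text)); // prints true
    }
}
```

The encoding only exists at the boundary where characters become bytes or bytes become characters; once decoded, the two strings are indistinguishable.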

Let’s talk about the reasons for garbled characters

The explanation we often see on the Internet is: garbled characters are caused by an inconsistency between the decoding method and the encoding method. There is nothing wrong with this sentence, but it only summarizes the phenomenon and does not help you understand it.

So the question we want to ask is: why does a mismatch between the decoding method and the encoding method produce garbled characters?

The three encodings utf-8, gbk, and iso-8859-1 are used as examples below.

     @Test
     public void testEncode() throws Exception {
        String str = "你好", en = "h?h";

        System.out.println("========中文字符utf-8======="); // header: Chinese characters, utf-8
        byte[] utf8 = str.getBytes("utf-8"); // encode explicitly as utf-8; a bare getBytes() uses the platform default charset
        for (byte b : utf8) {
            System.out.print(b + "\t");
        }

        System.out.println("\n"+"========英文字符utf-8======="); // header: English characters, utf-8
        byte[] utf8_en = en.getBytes("utf-8"); // encode explicitly as utf-8
        for (byte b : utf8_en) {
            System.out.print(b + "\t");
        }

        System.out.println("\n"+"========中文字符gbk========="); // header: Chinese characters, gbk
        byte[] gbk = str.getBytes("gbk");
        for (byte b : gbk) {
            System.out.print(b + "\t");
        }

        System.out.println("\n"+"========英文字符gbk========="); // header: English characters, gbk
        byte[] gbk_en = en.getBytes("gbk");
        for (byte b : gbk_en) {
            System.out.print(b + "\t");
        }

        String s = new String(utf8,"utf-8"); // decode with the matching charset
        String s1 = new String(utf8,"gbk");  // decode the same bytes with the wrong charset
        System.out.println("\n"+s + "====gbk:" + s1);
     }

Test the above method and the printed result is:

========中文字符utf-8=======
-28 -67  -96 -27  -91 -67  
========英文字符utf-8=======
104 63  104 
========中文字符gbk=========
-60 -29  -70 -61  
========英文字符gbk=========
104 63  104 
你好====gbk:浣犲ソ
------------------------------------------------------------------------------------

It can be concluded that:

A Chinese character encoded in utf-8 is converted into 3 bytes; encoded in gbk, it is converted into 2 bytes.
An English character encoded in utf-8 is converted into 1 byte; encoded in gbk, it is also converted into 1 byte.
Combining the last printed line with lines 29-31 of the code (the two new String(...) calls and the final println), we can see that decoding the byte array utf8 as utf-8 produces no garbled characters and yields the original "你好", while decoding it as gbk produces three garbled characters. Why three rather than two? The utf8 array holds 6 bytes and gbk reads 2 bytes per Chinese character, so 6/2=3.
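The 6/2=3 arithmetic above can be reproduced in isolation. The following is only a minimal sketch: the class and method names are hypothetical, and it assumes the GBK charset is available in the running JDK. It encodes the text as UTF-8 and then deliberately decodes those bytes as GBK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: decoding UTF-8 bytes with the wrong charset (GBK).
class MismatchedDecoding {

    // Encodes text as UTF-8, then decodes the bytes as GBK on purpose.
    static String decodeUtf8AsGbk(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("GBK")); // assumption: GBK is available
    }

    public static void main(String[] args) {
        String original = "\u4f60\u597d"; // the two-character greeting used in the article

        int utf8Length = original.getBytes(StandardCharsets.UTF_8).length;
        int gbkLength = original.getBytes(Charset.forName("GBK")).length;
        System.out.println(utf8Length); // prints 6 (3 bytes per character)
        System.out.println(gbkLength);  // prints 2... per character, 4 in total

        String garbled = decodeUtf8AsGbk(original);
        System.out.println(garbled.length()); // prints 3: six bytes read two at a time
    }
}
```

English text passes through the same mismatch unharmed, because utf-8 and gbk both store an English character as the same single byte.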

Next, let’s talk about iso-8859-1. This encoding serves English and other Western European languages, which means it cannot represent Chinese (to handle Chinese you must rely on another encoding that is compatible with iso-8859-1). When a string is encoded, characters that iso-8859-1 cannot represent are replaced with the English question mark '?', whose iso-8859-1 code is 63 (decimal). (In fact, in almost all encodings an English character is represented by a fixed single byte; the exception is the encoding Java calls "unicode", that is, UTF-16.)

     @Test
     public void testISO() throws Exception {
         String str = "你好";
         byte[] bs = str.getBytes("iso-8859-1"); // Chinese cannot be encoded, so each character becomes '?' (63)
         for (byte b : bs) {
             System.out.println(b);
         }
         System.out.println(new String(bs,"iso-8859-1"));
         System.out.println(new String(bs,"utf-8"));
         System.out.println(new String(bs,"gbk"));
         System.out.println(new String(bs,"unicode")); // "unicode" is Java's alias for UTF-16: the two bytes 0x3F 0x3F form one character
     }

Print results

63
63
??
??
??
㼿

Explanation: 63 is '?'. Every Chinese character is encoded as '?', so as soon as byte[] bs = "你好".getBytes("iso-8859-1"); has executed, the information is already lost.

After that, String str = new String(bs, "any charset"); can no longer produce "你好", no matter which charset is used; str is just two question marks, ??. This is why in tomcat we often see Chinese characters turn into a long string of ??????.
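One way to notice this kind of loss before it happens is to ask the charset whether it can encode the text at all. The sketch below is only an illustration under that assumption; the class and method names are invented for this example, while CharsetEncoder.canEncode is a standard JDK method.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: detecting and reproducing the lossy iso-8859-1 encoding.
class LossyEncoding {

    // True when every character of text can be represented in iso-8859-1.
    static boolean fitsIso88591(String text) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(text);
    }

    // Encodes to iso-8859-1 and decodes back with the same charset.
    static String roundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1); // unmappable characters become '?'
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String chinese = "\u4f60\u597d"; // the two-character greeting used in the article
        System.out.println(fitsIso88591(chinese)); // prints false
        System.out.println(roundTrip(chinese));    // prints ??
        System.out.println(fitsIso88591("h?h"));   // prints true
        System.out.println(roundTrip("h?h"));      // prints h?h
    }
}
```

Note that a genuine '?' in the input and a '?' produced by replacement are the same byte 63 afterwards, which is exactly why the original text cannot be recovered.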

In iso-8859-1, utf-8, and gbk, one byte represents one English character.

In the unicode encoding (UTF-16), a single byte cannot represent any character; every character takes two bytes (sometimes 4).
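These per-character byte counts can be checked directly. The sketch below uses invented names and assumes GBK is available in the JDK. It also uses UTF_16BE rather than UTF_16, because Java's plain UTF-16 encoder prepends a 2-byte byte-order mark that would inflate the counts.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: how many bytes one character takes in each charset.
class BytesPerCharacter {

    // Number of bytes produced when s is encoded with cs.
    static int byteCount(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available

        // One English character: a single byte everywhere except UTF-16.
        System.out.println(byteCount("h", StandardCharsets.ISO_8859_1)); // prints 1
        System.out.println(byteCount("h", StandardCharsets.UTF_8));      // prints 1
        System.out.println(byteCount("h", gbk));                         // prints 1
        System.out.println(byteCount("h", StandardCharsets.UTF_16BE));   // prints 2

        // A character outside the Basic Multilingual Plane (U+1F600), stored
        // as a surrogate pair, is one of the "sometimes 4" cases.
        System.out.println(byteCount("\uD83D\uDE00", StandardCharsets.UTF_16BE)); // prints 4
    }
}
```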

Having said all this, many people may ask why so many encodings are in use. Couldn't everything be unified on utf-8, which can represent all characters?

An encoding has to consider not only which characters it can represent, but also transmission and storage.

1. UTF-8 can indeed represent almost all known characters. But as shown earlier, UTF-8 needs 3 bytes for a Chinese character where gbk needs only 2, so Chinese text takes up more space, which is a disadvantage for transmission and storage (both operate on binary data).

2. One byte per character, as in iso-8859-1, is undoubtedly the most space-saving scheme. But the world contains far more than English characters; counting the characters of every region and country, the total is certainly greater than 2 to the 8th power (256), the most that one byte can distinguish.

So, combining these two points, many encodings naturally came into being.
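To put rough numbers on this trade-off, the sketch below compares the encoded size of 1000 Chinese characters and 1000 English characters. The sample texts, class name, and method names are all made up for illustration, and GBK availability in the JDK is assumed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: storage cost of the same text under different charsets.
class StorageCost {

    // Size in bytes of text once encoded with cs.
    static int encodedSize(String text, Charset cs) {
        return text.getBytes(cs).length;
    }

    // Builds a string made of unit repeated the given number of times.
    static String repeat(String unit, int times) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < times; i++) {
            sb.append(unit);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");           // assumption: GBK is available
        String chinese = repeat("\u4f60\u597d", 500);   // 1000 Chinese characters
        String english = repeat("hi", 500);             // 1000 English characters

        System.out.println(encodedSize(chinese, StandardCharsets.UTF_8)); // prints 3000
        System.out.println(encodedSize(chinese, gbk));                    // prints 2000
        System.out.println(encodedSize(english, StandardCharsets.UTF_8)); // prints 1000
        System.out.println(encodedSize(english, gbk));                    // prints 1000
    }
}
```

For purely Chinese text, gbk is a third smaller than utf-8 in this example, while for English text the two are identical.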

Understand the rules of various encoding methods: https://jingyan.baidu.com/article/020278118741e91bcd9ce566.html
