
How much do you know about character set encodings ASCII, Unicode and UTF-8? Character set encoding summary (collection)

Aug 31, 2018 am 11:22 AM
ascii unicode utf-8

How much do you know about the ASCII, Unicode and UTF-8 character encodings? This article introduces all three, explains the problems each one addresses and how to convert between them, and works through a concrete example.

1. ASCII code

Inside a computer, all information is ultimately stored as binary values. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states. A group of eight bits is called a byte. In other words, one byte can represent 256 different states, from 00000000 to 11111111, and each state can be assigned to one symbol, for 256 symbols in total.

In the 1960s, the United States formulated a set of character encodings that unified the relationship between English characters and binary bits. This was called ASCII and is still used today.

ASCII defines 128 character codes in total. For example, SPACE is 32 (binary 00100000), and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 non-printing control characters, codes 0-31 and 127) occupy only the low 7 bits of a byte; the highest bit is always set to 0.
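The two codes quoted above can be checked directly. The following is a minimal Python sketch (Python is used here only for illustration; the sample string is made up) that looks up the codes and confirms that ASCII text never sets the highest bit of a byte.

```python
# Look up ASCII codes with the built-in ord() function.
space_code = ord(" ")
a_code = ord("A")

# format(..., "08b") pads to a full byte so the leading 0 bit is visible.
space_bits = format(space_code, "08b")
a_bits = format(a_code, "08b")

# Every ASCII code is below 128, so the top bit of each byte is 0.
sample = "Hello, ASCII!".encode("ascii")
all_fit_in_7_bits = all(byte < 128 for byte in sample)

print(space_code, space_bits)
print(a_code, a_bits)
print(all_fit_in_7_bits)
```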

[Table: ASCII control characters]

[Table: ASCII displayable characters]

2. Non-ASCII encoding

128 symbols are enough to encode English, but not enough to represent other languages. French, for example, places accent marks above some letters, and these cannot be represented in ASCII. So some European countries decided to use the unused highest bit of the byte to encode new symbols. For example, the code for é in a French encoding is 130 (binary 10000010). The encodings used in these European countries can therefore represent up to 256 symbols.

However, this creates a new problem. Different countries have different letters, so even though they all use a 256-symbol encoding, the letters represented are different. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, however, the symbols represented by 0-127 are the same; only the range 128-255 differs.
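To see the conflict concretely, the sketch below decodes the single byte 130 under three legacy code pages. The article does not name specific encodings, so the choice of cp850 (Western European), cp862 (Hebrew) and cp866 (Russian) is an assumption made for illustration.

```python
raw = bytes([130])  # the single byte 0x82

# Code pages picked for illustration only (not named in the article).
western = raw.decode("cp850")   # Western European DOS code page
hebrew = raw.decode("cp862")    # Hebrew DOS code page
russian = raw.decode("cp866")   # Russian DOS code page

print(western, hebrew, russian)

# Bytes in the range 0-127 decode to the same character in all three.
low_range = {b"A".decode(cp) for cp in ("cp850", "cp862", "cp866")}
print(low_range)
```

The same byte yields three unrelated letters, while the byte for A is read identically everywhere.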

The scripts of Asian countries use even more symbols; Chinese alone has as many as 100,000 characters. One byte can represent only 256 symbols, which is nowhere near enough, so multiple bytes must be used for one symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per Chinese character, so in theory it can represent up to 256 x 256 = 65536 symbols.

Chinese encoding deserves an article of its own and is not covered in this note. The only point made here is that, although the GB family of encodings also uses multiple bytes per symbol, it has nothing to do with the Unicode and UTF-8 described later.

3. Unicode

As mentioned in the previous section, there are many encodings in the world, and the same binary value can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; if you interpret it with the wrong one, garbled characters appear. Why is email so often garbled? Because the sender and the recipient use different encodings.
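Garbling of this kind is easy to reproduce. In the sketch below, a "sender" writes the character 严 (U+4E25) as UTF-8 and a "receiver" wrongly assumes Latin-1; the choice of these two encodings is an assumption made for the demonstration.

```python
text = "\u4e25"                    # the Chinese character 严

data = text.encode("utf-8")        # the sender stores three UTF-8 bytes
wrong = data.decode("latin-1")     # the receiver guesses the wrong encoding
right = data.decode("utf-8")       # the receiver guesses correctly

print(repr(wrong))   # three unrelated Western characters
print(repr(right))   # the original character
```

The bytes are identical in both cases; only the interpretation differs.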

Imagine a single encoding that includes every symbol in the world and gives each one a unique code: the garbled-text problem would disappear. This is Unicode, which, as its name suggests, is an encoding for all symbols.

Unicode is of course a very large set, currently with room for more than 1 million symbols. Each symbol has its own code. For example, U+0639 represents the Arabic letter Ain, U+0041 represents the English capital letter A, and U+4E25 represents the Chinese character 严. For the full correspondence table, check unicode.org or a specialized Chinese character table.

4. Problems with Unicode

It should be noted that Unicode is only a symbol set. It specifies the binary code of each symbol, but it does not specify how that binary code should be stored.

For example, the Unicode code of the Chinese character 严 is hexadecimal 4E25, which is 15 binary digits long (100111000100101). In other words, representing this symbol requires at least 2 bytes. Symbols with larger codes may require 3 or 4 bytes.
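The arithmetic above can be reproduced in a few lines. This is a small illustrative sketch; the variable names are invented for the example.

```python
code_point = ord("\u4e25")            # the character 严

hex_form = f"U+{code_point:04X}"      # conventional U+XXXX notation
binary = format(code_point, "b")      # binary digits without padding
bits = code_point.bit_length()        # number of significant bits
min_bytes = (bits + 7) // 8           # whole bytes needed to hold them

print(hex_form, binary, bits, min_bytes)
```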

There are two serious problems here. The first is how to distinguish Unicode from ASCII: how does the computer know that three bytes represent one symbol rather than three separate symbols? The second is that, as we already know, one byte is enough to represent an English letter. If Unicode uniformly required three or four bytes per symbol, every English letter would be preceded by two or three bytes of zeros. That is a huge waste of storage, and text files would become two or three times larger, which is unacceptable.

These problems had two consequences: 1) Multiple storage methods for Unicode emerged, which means there are many different binary formats that can be used to represent Unicode. 2) Unicode could not be widely adopted for a long time, until the Internet appeared.

5. UTF-8

The spread of the Internet created strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (two or four bytes per character) and UTF-32 (four bytes per character), but these are rarely used on the Internet. To repeat the relationship: UTF-8 is one implementation of Unicode.

One of the most important features of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, and the byte length depends on the symbol.
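The difference between the three implementations shows up when the same characters are encoded each way. The sketch below is illustrative: the helper name `encoded_sizes` and the four sample characters (A, é, 严 and an emoji at U+1F600) are choices made for this example, not part of the article.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of bytes ch occupies in each encoding."""
    # The "-le" variants are used so that no byte order mark is counted.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(ch.encode(enc)) for enc in encodings}

for ch in ["A", "\u00e9", "\u4e25", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Only UTF-8 keeps English letters at one byte, which is one reason it suits text that is mostly ASCII.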

The encoding rules of UTF-8 are simple. There are only two:

1. For a single-byte symbol, the first bit of the byte is set to 0, and the following 7 bits are the Unicode code of the symbol. So for English letters, the UTF-8 encoding and the ASCII encoding are the same.

2. For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, the (n+1)th bit is set to 0, and the first two bits of each following byte are always set to 10. All remaining bits are filled with the Unicode code of the symbol.

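The following table summarizes the encoding rules. The letter x indicates the bits available for encoding.

Unicode code range (hexadecimal) | UTF-8 encoding (binary)
0000 0000 - 0000 007F | 0xxxxxxx
0000 0080 - 0000 07FF | 110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx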

Given the table above, interpreting UTF-8 is straightforward. If the first bit of a byte is 0, the byte is a complete character on its own; if the first bit is 1, the number of consecutive leading 1s indicates how many bytes the current character occupies.
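The interpretation rule can be sketched as a small function that inspects only the first byte of a character. The name `utf8_sequence_length` is hypothetical, and the function is a simplified illustration rather than a full validator.

```python
def utf8_sequence_length(lead_byte: int) -> int:
    """Return how many bytes the character starting with lead_byte occupies."""
    if (lead_byte & 0b1000_0000) == 0b0000_0000:   # 0xxxxxxx
        return 1
    if (lead_byte & 0b1110_0000) == 0b1100_0000:   # 110xxxxx
        return 2
    if (lead_byte & 0b1111_0000) == 0b1110_0000:   # 1110xxxx
        return 3
    if (lead_byte & 0b1111_1000) == 0b1111_0000:   # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte, not the start of a character.
    raise ValueError(f"0x{lead_byte:02X} is not a valid UTF-8 lead byte")

print(utf8_sequence_length(0x41))   # lead byte of "A"
print(utf8_sequence_length(0xE4))   # lead byte of a three-byte character
```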

Next, we take the Chinese character 严 as an example to demonstrate how UTF-8 encoding is carried out.

The Unicode code of 严 is 4E25 (100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800 - 0000 FFFF), so the UTF-8 encoding of 严 requires three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx. Then, starting from the last binary digit of 严, fill in the x positions of the format from back to front, and pad any remaining positions with 0. The result is that the UTF-8 encoding of 严 is 11100100 10111000 10100101, which in hexadecimal is E4B8A5.
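The fill-in procedure described above can be written out with shifts and masks. The function below is a minimal sketch (the name `utf8_encode` is invented here) that follows the four rows of the encoding rules; it is meant to mirror the manual steps and does not reject reserved ranges such as surrogates, which a production encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand."""
    low6 = 0b11_1111        # mask selecting six payload bits
    cont = 0b1000_0000      # continuation byte prefix: 10xxxxxx
    if code_point < 0x80:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0b1100_0000 | (code_point >> 6),
                      cont | (code_point & low6)])
    if code_point < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b1110_0000 | (code_point >> 12),
                      cont | ((code_point >> 6) & low6),
                      cont | (code_point & low6)])
    if code_point < 0x110000:                   # 11110xxx + three continuations
        return bytes([0b1111_0000 | (code_point >> 18),
                      cont | ((code_point >> 12) & low6),
                      cont | ((code_point >> 6) & low6),
                      cont | (code_point & low6)])
    raise ValueError("code point out of Unicode range")

encoded = utf8_encode(0x4E25)
print(encoded.hex().upper())
# Cross-check against Python's built-in encoder.
print(encoded == "\u4e25".encode("utf-8"))
```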

6. Conversion between Unicode and UTF-8

The example in the previous section shows that the Unicode code of 严 is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. Converting between them can be done by a program.

On Windows, one of the simplest ways to convert is the built-in Notepad program, notepad.exe. After opening a file, choose Save As from the File menu; the dialog box that appears has an Encoding drop-down list at the bottom.


There are four options: ANSI, Unicode, Unicode big endian and UTF-8.

  • ANSI is the default encoding. For English files it is ASCII, and for Simplified Chinese files it is GB2312 (only on the Simplified Chinese version of Windows; the Traditional Chinese version uses Big5).

  • Unicode here refers to the UCS-2 encoding used by notepad.exe, which stores each character's Unicode code directly in two bytes. This option uses the little endian format.

  • Unicode big endian is the counterpart of the previous option. The meaning of little endian and big endian is explained in the next section.

  • UTF-8 encoding, which is the encoding method mentioned in the previous section.

After selecting the "encoding method", click the "Save" button, and the encoding method of the file will be converted immediately.

7. Little endian and Big endian

As mentioned in the previous section, the UCS-2 format can store Unicode codes directly (for code points that do not exceed 0xFFFF). Taking the Chinese character 严 as an example, the Unicode code is 4E25 and needs two bytes of storage: one byte is 4E and the other is 25. Storing 4E first and 25 second is the Big endian method; storing 25 first and 4E second is the Little endian method.
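The two byte orders can be observed with Python's UTF-16 codecs. This sketch assumes UTF-16 as a stand-in for UCS-2, which is reasonable for this character because the two formats store code points up to 0xFFFF identically.

```python
text = "\u4e25"                      # the character 严, code 4E25

big = text.encode("utf-16-be")       # most significant byte first
little = text.encode("utf-16-le")    # least significant byte first

print(big.hex().upper())
print(little.hex().upper())

# Reading the bytes back with the matching byte order recovers the code.
from_big = int.from_bytes(big, "big")
from_little = int.from_bytes(little, "little")
print(hex(from_big), hex(from_little))
```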

These two odd names come from "Gulliver's Travels" by the British writer Swift. In the book, a civil war breaks out in Lilliput over whether eggs should be cracked from the big end (Big-endian) or from the little end (Little-endian). Over this dispute, six wars broke out, one emperor lost his life, and another lost his throne.

So storing the first byte first is "Big endian", and storing the second byte first is "Little endian".

So naturally, a question will arise: How does the computer know which way a certain file is encoded?

The Unicode specification defines a character that can be placed at the very start of a file to indicate the byte order. This character is named "zero width no-break space" (ZERO WIDTH NO-BREAK SPACE) and is represented by FEFF. This is exactly two bytes, and FF is one greater than FE.

If the first two bytes of a text file are FE FF, the file uses big endian order; if the first two bytes are FF FE, the file uses little endian order.
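A program can apply the same check to the first two bytes of a file. The sketch below uses the byte order mark constants from Python's standard `codecs` module; the function name `detect_byte_order` is hypothetical, and the check covers only the two-byte case described above.

```python
import codecs
from typing import Optional

def detect_byte_order(data: bytes) -> Optional[str]:
    """Guess the byte order of two-byte Unicode data from its first bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "little-endian"
    return None                                  # no mark present

print(detect_byte_order(bytes.fromhex("FEFF4E25")))
print(detect_byte_order(bytes.fromhex("FFFE254E")))
print(detect_byte_order(bytes.fromhex("E4B8A5")))
```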

8. Example

Here is an example.

Open Notepad (notepad.exe), create a new text file containing the single character 严, and save it four times, once each in the ANSI, Unicode, Unicode big endian and UTF-8 encodings.

Then, use the "hex function" in the text editing software UltraEdit to observe the internal encoding of the file.

  • ANSI: The file is two bytes, D1 CF, which is exactly the GB2312 encoding of 严. This also suggests that GB2312 is stored in big endian order.

  • Unicode: The file is four bytes, FF FE 25 4E, where FF FE indicates little endian storage and the actual code is 4E25.

  • Unicode big endian: The file is four bytes, FE FF 4E 25, where FE FF indicates big endian storage.

  • UTF-8: The file is six bytes, EF BB BF E4 B8 A5. The first three bytes, EF BB BF, indicate that this is UTF-8, and the last three, E4 B8 A5, are the actual encoding of 严; the storage order matches the encoding order.
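The four byte sequences listed above can be reproduced without Notepad. The sketch below is an approximation under stated assumptions: Python's `gb2312` codec stands in for ANSI on a Simplified Chinese system, the UTF-16 codecs stand in for Notepad's two "Unicode" options, and `utf-8-sig` is Python's name for UTF-8 with the leading EF BB BF mark.

```python
import codecs

text = "\u4e25"   # the character 严

ansi = text.encode("gb2312")
unicode_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
unicode_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
utf8_with_mark = text.encode("utf-8-sig")

for label, data in [("ANSI", ansi),
                    ("Unicode", unicode_le),
                    ("Unicode big endian", unicode_be),
                    ("UTF-8", utf8_with_mark)]:
    print(label, data.hex().upper())
```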

9. Further reading

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (the most basic knowledge about character sets)

  • Talk about Unicode encoding

  • RFC 3629: UTF-8, a transformation format of ISO 10646 (the UTF-8 specification)
