How many bytes do UTF-8 encoded Chinese characters occupy?

Feb 21, 2023, 11:40 AM

A commonly used Chinese character encoded in UTF-8 occupies 3 bytes, and so does a full-width Chinese punctuation mark. In UTF-16, which is often loosely called "Unicode encoding", a commonly used Chinese character (simplified or traditional) occupies 2 bytes. UTF-8 uses 1 to 4 bytes to encode each character. A US-ASCII character needs only 1 byte. Latin letters with diacritical marks, as well as Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and similar letters, need 2 bytes.

The operating environment of this tutorial: Windows 7 system, Dell G3 computer.

How many bytes do UTF-8 encoded Chinese characters occupy?

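In UTF-8 encoding: a commonly used Chinese character occupies three bytes, and a full-width Chinese punctuation mark occupies three bytes.

An English letter occupies one byte, and an English punctuation mark occupies one byte.

In "Unicode encoding", which here means the two-byte form of UTF-16: an English letter occupies two bytes, and a commonly used Chinese character (simplified or traditional) occupies two bytes. Chinese punctuation occupies two bytes, and English punctuation also occupies two bytes. Characters outside the Basic Multilingual Plane, such as rare Chinese characters, occupy four bytes in both UTF-8 and UTF-16.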
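These byte counts are easy to check directly. The following is a minimal sketch in Python; the sample characters and the helper name `byte_counts` are chosen here for illustration, and Python's `utf-16-le` codec is assumed to be what the text means by "Unicode encoding".

```python
# Sample characters, one per category mentioned in the text.
samples = {
    "English letter": "A",
    "English punctuation": ",",
    "Chinese character": "中",
    "Traditional Chinese character": "龍",
    "Chinese punctuation": "，",
}


def byte_counts(ch):
    """Return (UTF-8 length, UTF-16 length) in bytes for one character."""
    # "utf-16-le" is used so that no byte order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for label, ch in samples.items():
    utf8_len, utf16_len = byte_counts(ch)
    print(f"{label} {ch!r}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

Encoding with plain `"utf-16"` instead would add a 2-byte byte order mark to every result, which is why the little-endian variant is used for counting.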

UTF-8 uses 1~4 bytes to encode each character:

1. A US-ASCII character needs only 1 byte (Unicode range U+0000~U+007F).

2. Latin letters with diacritical marks, as well as Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and similar letters, need 2 bytes (Unicode range U+0080~U+07FF).

3. Most other commonly used characters, including Chinese, Japanese and Korean characters, Southeast Asian scripts and Middle Eastern scripts, need 3 bytes (Unicode range U+0800~U+FFFF).

4. Other, rarely used characters need 4 bytes (Unicode range U+10000~U+10FFFF).
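The four cases above amount to a lookup on the code point value. Here is a minimal Python sketch of that rule; the function name `utf8_length` is hypothetical, and the result is compared against what Python's own UTF-8 encoder produces.

```python
def utf8_length(code_point):
    """Return how many bytes UTF-8 uses for a Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if code_point <= 0x7F:      # US-ASCII
        return 1
    if code_point <= 0x7FF:     # Latin with diacritics, Greek, Cyrillic, ...
        return 2
    if code_point <= 0xFFFF:    # most Chinese, Japanese and Korean characters
        return 3
    return 4                    # rarely used characters above U+FFFF


# "\U00020000" is a rare Chinese character from CJK Extension B.
for ch in ["A", "é", "中", "\U00020000"]:
    predicted = utf8_length(ord(ch))
    actual = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}: predicted {predicted}, actual {actual}")
```

The last sample shows that "3 bytes per Chinese character" holds for commonly used characters only; rare characters above U+FFFF take 4 bytes.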

Extended knowledge:

UTF-8 (8-bit Universal Character Set/Unicode Transformation Format) is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, and its single-byte encodings are identical to ASCII, so software written to process ASCII text can continue to be used with little or no modification. For this reason it has gradually become the preferred encoding for email, web pages, and other applications that store or transmit text.
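The ASCII compatibility described above can be observed directly. The short Python sketch below uses variable names and sample strings chosen here for illustration.

```python
# Pure ASCII text produces exactly the same bytes in ASCII and in UTF-8.
text = "Hello, UTF-8!"
same = text.encode("ascii") == text.encode("utf-8")
print(same)

# In mixed text, the ASCII letters keep their one-byte values, while
# every byte of the Chinese character has its high bit set (>= 0x80),
# so it cannot be mistaken for an ASCII character.
mixed_bytes = "abc中".encode("utf-8")
non_ascii = [b for b in mixed_bytes if b >= 0x80]
print(mixed_bytes, len(non_ascii))
```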

Character set:

UTF-8 encoding rules: a single-byte character has a value in the range 0x00-0x7F. Longer sequences are laid out as follows according to their length.

Well-formed UTF-8 comes in 4 forms, namely UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4, where the number is the length of the sequence in bytes. Among them:

UTF-8 hexadecimal encoding table

UTF8-1: 0x00-0x7F
UTF8-2: 0xC2-0xDF 0x80-0xBF
UTF8-3: 0xE0      0xA0-0xBF 0x80-0xBF
        0xE1-0xEC 0x80-0xBF 0x80-0xBF
        0xED      0x80-0x9F 0x80-0xBF
        0xEE-0xEF 0x80-0xBF 0x80-0xBF
UTF8-4: 0xF0      0x90-0xBF 0x80-0xBF 0x80-0xBF
        0xF1-0xF3 0x80-0xBF 0x80-0xBF 0x80-0xBF
        0xF4      0x80-0x8F 0x80-0xBF 0x80-0xBF

Note: Each form may have several encoding ranges, one per row, and within a row a space separates the range allowed for each byte. For example, in the first encoding range of UTF8-3 the first byte must be 0xE0, the second byte must be in the range 0xA0-0xBF, and the third byte must be in the range 0x80-0xBF.
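The encoding table can be turned into a small checker. The following Python sketch tests whether a byte string is exactly one well-formed UTF-8 character; the names `TAIL`, `WELL_FORMED` and `is_well_formed_char` are hypothetical, and the ranges are the well-formed byte sequences defined in RFC 3629.

```python
# Allowed byte ranges for each well-formed UTF-8 sequence.
# Each entry holds one (low, high) range per byte of the sequence.
TAIL = (0x80, 0xBF)
WELL_FORMED = [
    ((0x00, 0x7F),),                            # UTF8-1
    ((0xC2, 0xDF), TAIL),                       # UTF8-2
    ((0xE0, 0xE0), (0xA0, 0xBF), TAIL),         # UTF8-3
    ((0xE1, 0xEC), TAIL, TAIL),
    ((0xED, 0xED), (0x80, 0x9F), TAIL),
    ((0xEE, 0xEF), TAIL, TAIL),
    ((0xF0, 0xF0), (0x90, 0xBF), TAIL, TAIL),   # UTF8-4
    ((0xF1, 0xF3), TAIL, TAIL, TAIL),
    ((0xF4, 0xF4), (0x80, 0x8F), TAIL, TAIL),
]


def is_well_formed_char(data):
    """Return True if `data` is exactly one well-formed UTF-8 character."""
    for pattern in WELL_FORMED:
        if len(data) == len(pattern) and all(
            low <= byte <= high for byte, (low, high) in zip(data, pattern)
        ):
            return True
    return False


print(is_well_formed_char("中".encode("utf-8")))  # E4 B8 AD matches the E1-EC row
print(is_well_formed_char(b"\xe0\x80\x80"))       # second byte below 0xA0
print(is_well_formed_char(b"\xed\xa0\x80"))       # second byte above 0x9F
```

The narrowed second-byte ranges after 0xE0, 0xED, 0xF0 and 0xF4 are what exclude overlong encodings, UTF-16 surrogate code points, and values above U+10FFFF.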


