How many bytes do UTF-8 encoded Chinese characters occupy?
A commonly used Chinese character encoded in UTF-8 occupies 3 bytes. In UTF-8, one Chinese character takes three bytes, and a Chinese (fullwidth) punctuation mark typically takes three bytes as well; in the two-byte encoding often labeled "Unicode" (UTF-16), one Chinese character (including traditional Chinese) takes two bytes. UTF-8 uses 1 to 4 bytes to encode each character. A US-ASCII character needs only 1 byte. Latin letters with diacritical marks, and Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and other letters, need 2 bytes.
The operating environment of this tutorial: Windows 7 system, Dell G3 computer.
How many bytes do UTF-8 encoded Chinese characters occupy?
In UTF-8 encoding: one Chinese character is equal to three bytes, and Chinese punctuation occupies three bytes.
One English character is equal to one byte, and English punctuation occupies one byte.
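"Unicode" encoding (here meaning the two-byte form, UTF-16): one English character is equal to two bytes, and one Chinese character (including traditional Chinese) is equal to two bytes. Chinese punctuation occupies two bytes, and English punctuation occupies two bytes. Rarely used characters outside the Basic Multilingual Plane occupy four bytes in UTF-16.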
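The sizes above can be checked directly. The following is a minimal Python sketch, not part of the original tutorial; the sample characters and the helper name `encoded_sizes` are chosen here for illustration, and "Unicode encoding" is assumed to mean UTF-16 without a byte order mark.

```python
# Compare the encoded size of single characters in UTF-8 and UTF-16.
# Assumption: "Unicode encoding" in the text means UTF-16 (no byte order mark).
samples = {
    "Chinese character": "中",
    "Traditional Chinese character": "龍",
    "Chinese punctuation": "，",   # fullwidth comma, U+FF0C
    "English letter": "A",
    "English punctuation": ",",
}


def encoded_sizes(ch):
    """Return (utf8_bytes, utf16_bytes) for the given string."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


for label, ch in samples.items():
    utf8_len, utf16_len = encoded_sizes(ch)
    print(f"{label} {ch!r}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

The `utf-16-le` codec is used instead of `utf-16` because the latter prepends a 2-byte byte order mark, which would inflate every count by two.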
UTF-8 uses 1~4 bytes to encode each character:
1. A US-ASCII character needs only 1 byte (Unicode range U+0000 to U+007F).
2. Latin letters with diacritical marks, and Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac and other letters, need 2 bytes (Unicode range U+0080 to U+07FF).
3. The remaining characters of the Basic Multilingual Plane (Unicode range U+0800 to U+FFFF), which cover most commonly used characters, including Chinese, Japanese and Korean characters and Southeast Asian scripts, need 3 bytes.
4. Characters in the supplementary planes (Unicode range U+10000 to U+10FFFF), such as rarely used Chinese characters and other rarely used scripts, need 4 bytes.
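The four ranges in this list translate into a simple lookup. The sketch below is illustrative only; the function name `utf8_length` is hypothetical, and the result is cross-checked against Python's built-in encoder for a few sample characters.

```python
def utf8_length(code_point):
    """Return how many bytes UTF-8 uses for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point <= 0x7F:
        return 1          # US-ASCII
    if code_point <= 0x7FF:
        return 2          # Latin with diacritics, Greek, Cyrillic, Hebrew, Arabic, ...
    if code_point <= 0xFFFF:
        return 3          # rest of the Basic Multilingual Plane, incl. common CJK
    return 4              # supplementary planes


# "\u00e9" is e with acute accent, "\u03a9" is Greek capital omega,
# "\U00020000" is a rarely used Chinese character (CJK Extension B).
for ch in ["A", "\u00e9", "\u03a9", "中", "。", "\U00020000"]:
    # Cross-check the range table against the built-in encoder.
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}: {utf8_length(ord(ch))} byte(s)")
```

Note that the last sample is a Chinese character that needs 4 bytes, so "3 bytes per Chinese character" holds for commonly used characters rather than for every one.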
Extended knowledge:
UTF-8 (8-bit Universal Character Set/Unicode Transformation Format) is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, and its single-byte encodings are identical to ASCII, so existing software that processes ASCII text can continue to be used with few or no modifications. As a result, it has gradually become the preferred encoding for email, web pages, and other applications that store or transmit text.
UTF-8 encoding rules: if a character is encoded in only one byte, its value is 0x00-0x7F. Longer sequences are restricted by length as follows.
UTF-8 is implemented by 4 encoding forms, namely UTF8-1 / UTF8-2 / UTF8-3 / UTF8-4. Among them:
| Form | First byte | Following bytes |
| --- | --- | --- |
| UTF8-1 | 0x00-0x7F | |
| UTF8-2 | 0xC2-0xDF | 0x80-0xBF |
| UTF8-3 | 0xE0 | 0xA0-0xBF 0x80-0xBF |
| UTF8-3 | 0xE1-0xEC | 0x80-0xBF 0x80-0xBF |
| UTF8-3 | 0xED | 0x80-0x9F 0x80-0xBF |
| UTF8-3 | 0xEE-0xEF | 0x80-0xBF 0x80-0xBF |
| UTF8-4 | 0xF0 | 0x90-0xBF 0x80-0xBF 0x80-0xBF |
| UTF8-4 | 0xF1-0xF3 | 0x80-0xBF 0x80-0xBF 0x80-0xBF |
| UTF8-4 | 0xF4 | 0x80-0x8F 0x80-0xBF 0x80-0xBF |
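The byte ranges listed for UTF8-1 through UTF8-4 are enough to check whether a byte string is well-formed UTF-8. The following is a minimal sketch under that assumption; the names `UTF8_PATTERNS`, `match_length` and `is_well_formed_utf8` are hypothetical, and real programs would normally rely on the decoder built into their language instead.

```python
# Each pattern lists the allowed (low, high) range for every byte of a sequence,
# copied from the UTF8-1 .. UTF8-4 byte ranges.
UTF8_PATTERNS = [
    ((0x00, 0x7F),),                                              # UTF8-1
    ((0xC2, 0xDF), (0x80, 0xBF)),                                 # UTF8-2
    ((0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)),                   # UTF8-3
    ((0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)),
    ((0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),     # UTF8-4
    ((0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)),
    ((0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)),
]


def match_length(data, start):
    """Length of the well-formed sequence starting at data[start], or 0 if none."""
    for pattern in UTF8_PATTERNS:
        chunk = data[start:start + len(pattern)]
        if len(chunk) == len(pattern) and all(
            low <= byte <= high for byte, (low, high) in zip(chunk, pattern)
        ):
            return len(pattern)
    return 0


def is_well_formed_utf8(data):
    """True if the whole byte string is a series of well-formed sequences."""
    i = 0
    while i < len(data):
        n = match_length(data, i)
        if n == 0:
            return False
        i += n
    return True


print(is_well_formed_utf8("中文".encode("utf-8")))   # valid 3-byte sequences
print(is_well_formed_utf8(b"\xc0\x80"))              # 0xC0 is never a valid first byte
```

The narrower second-byte ranges after 0xE0, 0xED, 0xF0 and 0xF4 are what exclude overlong encodings, surrogate code points, and values above U+10FFFF.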