My initial question was: what is the difference between text files and binary files? Why can the content of one be displayed, while the content of the other often cannot be displayed properly in a text editor?
A course note from the University of Maryland explains the difference clearly: a text file is a kind of binary file, and its underlying storage is also 0s and 1s. Text files are readable and portable, but the set of characters they can express is limited; binary files store data compactly and are not restricted by a character encoding. Text files can basically only store content made up of a limited set of characters such as digits, letters, and punctuation. Binary files have no such constraint and can store images, audio, video, and any other data.
Storing a number shows the difference vividly. To store the number 1234567890, a text file stores the ASCII code of each of its ten digits. In hexadecimal that is 31 32 33 34 35 36 37 38 39 30, occupying 10 bytes. The binary form of 1234567890 is "0100 1001 1001 0110 0000 0010 1101 0010", which occupies 4 bytes (32 bits, at 8 bits per byte); stored in a file, its hexadecimal representation (big endian) is: 49 96 02 D2.
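The comparison above can be reproduced with a short Python sketch. It assumes the binary form is a 4-byte unsigned integer in big-endian order, as in the example; `hex_dump` is a helper name invented here for illustration.

```python
import struct

number = 1234567890

# Text representation: one ASCII byte per decimal digit.
as_text = str(number).encode("ascii")

# Binary representation: a 4-byte unsigned integer, big endian (">I").
as_binary = struct.pack(">I", number)


def hex_dump(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs (illustrative helper)."""
    return " ".join(f"{b:02X}" for b in data)


print(hex_dump(as_text), len(as_text))      # 31 32 33 34 35 36 37 38 39 30 10
print(hex_dump(as_binary), len(as_binary))  # 49 96 02 D2 4
```

The same value takes 10 bytes as text and 4 bytes as a packed integer, which is the compactness difference described above.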
In short, text files store content as characters, while binary files store content as bytes. This is the most essential difference between the two. Several common conclusions follow from it: binary files are usually more compact than text files and take up less space; text files are more user-friendly and can be edited in a WYSIWYG way; binary files often need a dedicated program to open them.
Back to the text editor, where binary files often show up as garbage. Suppose a binary file stores the integer 1234 in four bytes, which in hexadecimal is: 00 00 04 D2. When a text editor opens the file and interprets it character by character, these bytes do not spell out displayable characters, so all it can show is gibberish. The cause is that the text editor cannot correctly parse the byte stream, and that is why binary files need to be opened with software that understands their format. A jpg file needs an image viewer; open it with a music player and nothing useful happens. Likewise, video files need a video player, and compressed archives need archive software.
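As a rough illustration of why the editor shows gibberish, the Python sketch below packs 1234 into the four bytes from the example (assuming a big-endian signed 32-bit integer) and then tries to read them as text. The variable names are illustrative only.

```python
import struct

# The four bytes from the example: 00 00 04 D2.
raw = struct.pack(">i", 1234)

# Strict ASCII decoding fails because 0xD2 is outside the 0x00-0x7F range.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# A lenient decoder substitutes U+FFFD for bytes it cannot interpret,
# and the remaining bytes are invisible control characters.
lenient = raw.decode("utf-8", errors="replace")

# A program that knows the real format recovers the number.
(value,) = struct.unpack(">i", raw)

print(ascii_ok, repr(lenient), value)
```

The bytes are perfectly valid data; they only look broken when read under the wrong assumption about their format.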
With the difference between text files and binary files clear, let's look at file formats. Windows identifies a file's format by its extension and launches the associated program to open it. On Unix-like systems the extension is optional, so how do you know what format a file is in?
Fortunately, there is the file command, which tells us what format a file is in. The extension is not what defines a file's format; the content is. Rename a.zip to a.txt, a.jpg, or a.mp3: no matter what the file is called, file will reveal its true form: Zip archive data, at least v1.0 to extract.
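Identifying a format from content usually comes down to checking well-known leading bytes ("magic numbers"). The Python sketch below is a heavily simplified imitation of that idea, not how the file command is actually implemented: `SIGNATURES`, `sniff_format`, and `make_zip_bytes` are hypothetical names, and the table holds only three common signatures.

```python
import io
import zipfile

# A tiny illustrative signature table; the real `file` command consults
# a far larger database of patterns.
SIGNATURES = [
    (b"PK\x03\x04", "Zip archive data"),
    (b"\xff\xd8\xff", "JPEG image data"),
    (b"\x89PNG\r\n\x1a\n", "PNG image data"),
]


def sniff_format(data: bytes) -> str:
    """Guess a format from the leading bytes, ignoring any file name."""
    for magic, description in SIGNATURES:
        if data.startswith(magic):
            return description
    return "unknown"


def make_zip_bytes() -> bytes:
    """Build a small zip archive in memory to test against."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("hello.txt", "hello")
    return buffer.getvalue()


print(sniff_format(make_zip_bytes()))
```

Because the check looks only at the bytes, renaming the archive to a.txt or a.mp3 would not change the answer.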
Having covered files, let's talk about the encoding of file content. ASCII has 128 characters (codes 0 to 127), and there is little to say about encoding them, since almost every encoding scheme is compatible with ASCII. Double-byte and multi-byte characters, encoding schemes, and byte order are what trouble programmers. A Chinese character takes two bytes in GBK; for encodings whose code units are wider than one byte (UTF-16, for example), the byte order must also be considered to determine the final stored form, and multi-byte values sent over a network are conventionally converted to network byte order (big endian) so that the receiver can parse them correctly. If developers are not familiar with character encoding and run into garbled characters during communication, debugging will be difficult.
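A small Python sketch of the byte-level picture: the two bytes GBK produces for one Chinese character, how byte order changes the stored form in an encoding with two-byte code units (UTF-16 is used here as the example), and what happens when the bytes are decoded with the wrong encoding. The choice of the character "中" and of latin-1 as the "wrong" encoding are assumptions made for illustration.

```python
char = "中"  # code point U+4E2D

# GBK: two bytes for this character.
gbk_bytes = char.encode("gbk")        # D6 D0

# UTF-16: the same 16-bit unit stored in either byte order.
utf16_be = char.encode("utf-16-be")   # 4E 2D (big endian)
utf16_le = char.encode("utf-16-le")   # 2D 4E (little endian)

# Decoding with the wrong encoding produces garbled text ...
garbled = gbk_bytes.decode("latin-1")
# ... while the matching encoding restores the original character.
restored = gbk_bytes.decode("gbk")

print(gbk_bytes.hex(" "), utf16_be.hex(" "), utf16_le.hex(" "))
print(repr(garbled), restored)
```

Sender and receiver have to agree on both the encoding and, where it applies, the byte order; otherwise the receiver sees something like the garbled value above.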
The UCS (Universal Multiple-Octet Coded Character Set) standard lets developers get away from the confusing multi-byte character sets. In UCS every character has a unique code point, and the character can be looked up from its code point. UCS-2 uses two bytes to represent a code point (UCS-4 uses four bytes), and each code point corresponds to one character. With two bytes it can represent 2^16 = 65,536 code points, which covers the characters commonly used around the world (UCS-4 can in theory hold about 2 billion characters, of which well over 100,000 are currently assigned). Note that UCS is only a standard that defines the one-to-one mapping between code points and characters; it does not define how they are stored in a computer.
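The code-point idea can be seen directly in Python, whose `ord` and `chr` built-ins convert between a character and its code point. The capacity figures below are just the arithmetic for a 16-bit space and for the 31-bit space originally defined for UCS-4; the variable names are illustrative.

```python
# Each character maps to exactly one code point, and back again.
for ch in ("A", "é", "中"):
    code_point = ord(ch)
    assert chr(code_point) == ch
    print(f"{ch!r} -> U+{code_point:04X}")

# Capacity of a 2-byte code space versus the 31-bit UCS-4 code space.
ucs2_capacity = 2 ** 16
ucs4_capacity = 2 ** 31
print(ucs2_capacity, ucs4_capacity)
```

Nothing here says how the code point is stored as bytes; that is the job of the encoding schemes discussed next.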
Defining how Unicode characters are stored is the job of UTF (Unicode Transformation Format). The most widely used schemes are UTF-16 and UTF-8. UTF-16 uses two bytes for each character in the U+0000~U+FFFF range (characters beyond that range take four bytes). UTF-16 is the native string encoding on the Windows, macOS, and Java platforms. Because each unit is two bytes, there are two variants: big-endian and little-endian. For files that contain only ASCII characters, UTF-16 wastes a lot of space (50% of the storage). The UTF-8 encoding scheme, designed by Ken Thompson (co-creator of Unix and creator of the B language) and Rob Pike (both later co-designers of the Go language), quickly became popular. UTF-8 is a byte-oriented stream, so it has no byte order problem and needs no BOM. UTF-8 is now the common standard on the web.
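The space and byte-order points can be checked with a few lines of Python. This is only a sketch using the standard codecs; the sample string "hello" is an arbitrary ASCII-only example.

```python
import codecs

text = "hello"  # ASCII-only sample

utf8 = text.encode("utf-8")          # 1 byte per character
utf16_le = text.encode("utf-16-le")  # 2 bytes per character, low byte first
utf16_be = text.encode("utf-16-be")  # 2 bytes per character, high byte first

print(len(utf8), len(utf16_le))                  # 5 10
print(utf16_le[:2].hex(" "), utf16_be[:2].hex(" "))  # 68 00 / 00 68

# A BOM at the start tells a UTF-16 reader which byte order was used.
with_bom = codecs.BOM_UTF16_LE + utf16_le
print(with_bom.decode("utf-16"))                 # hello
```

For ASCII text every second byte of the UTF-16 form is zero, which is where the 50% overhead comes from; the UTF-8 form is identical to plain ASCII.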
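The value range of UCS-2 is U+0000~U+FFFF, and its correspondence with UTF-8 is shown below (the last row, four bytes, covers code points beyond the UCS-2 range):
Code point range (HEX) | UTF-8 byte sequence (BINARY) |
---|---|
U+0000 ~ U+007F | 0xxxxxxx |
U+0080 ~ U+07FF | 110xxxxx 10xxxxxx |
U+0800 ~ U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx |
U+10000 ~ U+10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |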
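To make the bit patterns in the table concrete, here is a minimal hand-written encoder in Python. It is a sketch for illustration (real code should simply call `str.encode("utf-8")`): the function name `encode_utf8` is made up, the range boundaries are the standard UTF-8 ones (U+007F, U+07FF, U+FFFF, U+10FFFF), and it does not reject surrogate code points as a strict encoder would.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one code point by filling the x bits of the matching pattern."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point <= 0x7F:
        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare against Python's built-in encoder for one sample per row.
for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):
    assert encode_utf8(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encode_utf8(cp).hex(' ')}")
```

The leading byte announces how many bytes follow, and every continuation byte starts with the bits 10, which is why a UTF-8 stream can be parsed one byte at a time without any byte-order concerns.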