The difference between UTF-8 and UTF-8 without BOM

WBOY
Release: 2016-08-08 09:20:38

BOM stands for Byte Order Mark.

UCS defines a character called "ZERO WIDTH NO-BREAK SPACE", whose code point is FEFF. FFFE, by contrast, is not a character in UCS, so it should never appear in actual transmission. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself. If the receiver then reads FE FF, the byte stream is Big-Endian; if it reads FF FE, the byte stream is Little-Endian. This is why the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.
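The detection rule above can be sketched in a few lines of Python. This is only an illustration: the function name `utf16_byte_order` is made up for this example, and it assumes the data is already known to be UTF-16 (a UTF-32 Little-Endian stream also begins with FF FE).

```python
import codecs

def utf16_byte_order(data: bytes) -> str:
    """Guess the byte order of a UTF-16 stream from its first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big-endian"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little-endian"
    return "unknown (no BOM)"

# U+FEFF serialized in each byte order
big = "\ufeff".encode("utf-16-be")
little = "\ufeff".encode("utf-16-le")
print(big.hex(" "))     # fe ff
print(little.hex(" "))  # ff fe
print(utf16_byte_order(big + "hi".encode("utf-16-be")))     # big-endian
print(utf16_byte_order(little + "hi".encode("utf-16-le")))  # little-endian
```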

UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to mark the encoding. The UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" is EF BB BF. So if the receiver gets a byte stream that starts with EF BB BF, it knows the stream is UTF-8 encoded.
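As a quick check of the EF BB BF claim, here is a small Python sketch. The helper name `has_utf8_bom` is hypothetical; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library.

```python
import codecs

# Encode U+FEFF (ZERO WIDTH NO-BREAK SPACE) as UTF-8.
bom_utf8 = "\ufeff".encode("utf-8")
print(bom_utf8.hex(" "))  # ef bb bf

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte stream starts with the UTF-8 BOM."""
    return data.startswith(codecs.BOM_UTF8)

stream = codecs.BOM_UTF8 + "hello".encode("utf-8")
print(has_utf8_bom(stream))    # True
print(has_utf8_bom(b"hello"))  # False

# The plain "utf-8" codec keeps the BOM as a character;
# "utf-8-sig" drops it while decoding.
print(repr(stream.decode("utf-8")))      # '\ufeffhello'
print(repr(stream.decode("utf-8-sig")))  # 'hello'
```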

In a UTF-8 encoded file, the BOM occupies three bytes. If you save a text file as UTF-8 in Notepad, then open it in UE (UltraEdit) and switch to hex editing mode, you can see EF BB BF at the beginning. This is a handy way to identify UTF-8 files, and some software relies on the BOM to decide whether a file is UTF-8. Some programs even require the files they read to have a BOM. However, many programs still do not recognize the BOM.
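The same inspection can be done without a hex editor. The sketch below writes the same text twice, once with and once without a BOM, and looks at the first three bytes of each file. The helper names `write_sample` and `first_bytes` are invented for this example, and Python's `utf-8-sig` codec stands in for what Notepad does when it saves as UTF-8 with a BOM.

```python
import os
import tempfile

def write_sample(path: str, text: str, with_bom: bool) -> None:
    """Write text as UTF-8, optionally prefixed with a BOM."""
    # "utf-8-sig" is Python's codec name for UTF-8 with a leading BOM.
    encoding = "utf-8-sig" if with_bom else "utf-8"
    with open(path, "w", encoding=encoding) as f:
        f.write(text)

def first_bytes(path: str, n: int = 3) -> bytes:
    """Return the first n raw bytes of a file."""
    with open(path, "rb") as f:
        return f.read(n)

with tempfile.TemporaryDirectory() as tmp:
    with_bom = os.path.join(tmp, "with_bom.txt")
    without_bom = os.path.join(tmp, "without_bom.txt")
    write_sample(with_bom, "hello", with_bom=True)
    write_sample(without_bom, "hello", with_bom=False)

    print(first_bytes(with_bom).hex(" "))     # ef bb bf
    print(first_bytes(without_bom).hex(" "))  # 68 65 6c
    # The BOM accounts for exactly three extra bytes.
    print(os.path.getsize(with_bom) - os.path.getsize(without_bom))  # 3
```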

Early versions of Firefox did not allow a BOM in extension files, but Firefox 1.5 and later support it. I have now discovered that PHP does not handle the BOM either. PHP was not designed with the BOM in mind, so it does not skip the three BOM bytes at the start of a UTF-8 encoded file; it treats them as ordinary output.

I saw on the Bo-Blog wiki that Bo-Blog, which is also written in PHP, is troubled by the BOM too. The wiki mentions another problem: "Because of how cookies are sent, a file that has a BOM at its beginning cannot send cookies (PHP has already sent the response headers by the time the cookie is to be sent), so login and logout fail. Every feature that relies on cookies and sessions fails." This is probably also why a blank page appears in the WordPress admin panel: if any executed file contains a BOM, those three bytes are sent as output, and everything that depends on cookies and sessions breaks.
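Since a single file with a BOM is enough to cause this, it helps to find every affected file in a project. The following is a minimal sketch, not a tool from Bo-Blog or WordPress: the function name `find_bom_files` and the `*.php` default pattern are assumptions made for illustration.

```python
import codecs
from pathlib import Path
from typing import List

def find_bom_files(root: str, pattern: str = "*.php") -> List[Path]:
    """List files under root whose first three bytes are the UTF-8 BOM."""
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        with path.open("rb") as f:
            if f.read(3) == codecs.BOM_UTF8:
                hits.append(path)
    return hits

if __name__ == "__main__":
    # Scan the current directory; adjust the path for a real project.
    for hit in find_bom_files("."):
        print(hit)
```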

The solution: if the file contains only English characters (that is, characters in the ASCII range), simply save it as ASCII. In an editor such as UE, click File->Convert->UTF-8 to ASCII, or choose ASCII encoding in Save As. If the file uses DOS-style line endings, you can also open it in Notepad, click Save As, and choose ASCII encoding. If the file contains Chinese characters, use UE's Save As function and select "UTF-8 without BOM".
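For readers without UE at hand, the same fix can be scripted. This is a rough sketch under the assumption that the file is valid UTF-8; the function names are hypothetical. Removing the BOM from a file that contains only ASCII characters leaves a plain ASCII file, and removing it from a file with Chinese characters leaves "UTF-8 without BOM".

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

def is_pure_ascii(data: bytes) -> bool:
    """Return True if every byte is in the ASCII range."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True

def rewrite_without_bom(path: str) -> bool:
    """Strip the BOM from a file in place. Returns True if the file changed."""
    with open(path, "rb") as f:
        data = f.read()
    cleaned = strip_utf8_bom(data)
    if cleaned == data:
        return False
    with open(path, "wb") as f:
        f.write(cleaned)
    return True
```

It is worth keeping a backup before rewriting files in place.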

A BOM should not be added to UTF-8 files. It serves no purpose other than telling the editor that the file is UTF-8. In practice, an editor is fully capable of inferring a file's encoding from its byte patterns, since there are not that many encodings to choose from. Even when automatic detection fails, the editor should offer a setting for the encoding. So I think the BOM is redundant for UTF-8.

Only UTF-16 really needs a BOM. Because it stores Unicode code point values directly, two bytes per character in the BMP range, the reader has to know whether those two bytes are in big-endian or little-endian order.
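A short Python illustration of why byte order matters for UTF-16. The sample characters are chosen arbitrarily for this example.

```python
ch = "中"  # U+4E2D, a character inside the BMP

be = ch.encode("utf-16-be")
le = ch.encode("utf-16-le")
print(be.hex(" "))  # 4e 2d
print(le.hex(" "))  # 2d 4e

# Reading big-endian bytes as little-endian yields a different character.
wrong = be.decode("utf-16-le")
print(hex(ord(wrong)))  # 0x2d4e

# Outside the BMP, UTF-16 uses a surrogate pair: four bytes, not two.
emoji = "\U0001F600"
print(emoji.encode("utf-16-be").hex(" "))  # d8 3d de 00
```

Without a BOM or some outside agreement, the two bytes 4E 2D can be read either way, and both readings are valid characters.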

Actually, I think it was foolish to introduce big and little endianness into the UTF encodings at all. I do not know what the standards committees were thinking. Endianness matters because of how CPUs process data: a big-endian CPU must convert little-endian data, and that conversion costs efficiency. But in practical applications, who cares about endianness? Letting byte order leak into text encoding only shows that the people who wrote the standards were too rigid. For UTF-16, I think that if the whole world agreed on one byte order, there would be no need for a BOM to mark it.

That said, PHP does not support UTF-16 encoded source files. The $ symbol, for example, takes two bytes in UTF-16, so the PHP parser cannot read it. I do not know whether PHP 6 will support this once Unicode is introduced into its internal processing.
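The byte-level difference is easy to see in Python. This sketch only compares encodings; it does not model the PHP parser itself, and the sample source line is made up.

```python
src = "<?php $a = 1;"

utf8 = src.encode("utf-8")
utf16 = src.encode("utf-16-le")

print("$".encode("utf-8"))      # b'$'
print("$".encode("utf-16-le"))  # b'$\x00'

# A byte-oriented parser looks for the literal bytes "<?php".
print(b"<?php" in utf8)   # True
print(b"<?php" in utf16)  # False: a zero byte follows every ASCII character
print(len(utf8), len(utf16))  # 13 26
```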

Encoding sounds simple but is actually very complicated. Many programs use layered encoding settings. MySQL, for example, distinguishes client->connection->storage on the way in and storage->connection->result on the way out, and storage is further divided into system, database, table, and column levels. I sometimes wonder whether it really needs to be this complicated. Who actually uses these MySQL features? Unless two clients have to work in different encoding environments, there is no need for a separate client encoding. In most cases, binary in and binary out is enough.

source:php.cn