macos - What is the encoding mechanism of files?

Question

For example, I have an f.txt file on my Mac. The system uses UTF-8, and the file contains the bytes "\xE6\x97\xA5", which is the Chinese character "日" in UTF-8. I then open f.txt in UltraEdit and save it as the following files: the actual stored content of f1.txt is "\xE6\x97\xA5", let UltraEdit decode it...

迷茫 · Answer

Take vim for example

When vim opens a text file, it reads the file according to some encoding A, converts it internally to some encoding B, and converts it to another encoding C when saving. Other text editors work similarly, although they may not expose these settings or do as much automatically as vim does.
Encoding B: it has no effect on the file itself. It only concerns display: it is the encoding vim uses when interacting with the operating system.

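Encoding A: set with set fileencodings=ucs-bom,utf-8,gbk,cp936,latin-1. vim tests the file against these encodings in the order listed. Some byte sequences cannot occur in a given encoding, so if vim finds such a sequence it rules that encoding out and moves on to the next one; otherwise it accepts that encoding. Because any byte sequence is valid latin-1, putting latin-1 first means every file will be displayed as latin-1.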
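The ordered check described above can be sketched in Python. This is only an illustration of the idea, not how vim implements it: detect_encoding is a hypothetical helper, and the ucs-bom entry is left out because it is a vim setting rather than a Python codec name.

```python
def detect_encoding(data: bytes, candidates=("utf-8", "gbk", "latin-1")) -> str:
    """Return the first candidate encoding that decodes `data` without error."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # this byte sequence cannot occur in `name`; try the next one
        return name
    raise ValueError("no candidate encoding matched")


utf8_bytes = "\u65e5".encode("utf-8")  # the character 日 stored as UTF-8
gbk_bytes = "\u65e5".encode("gbk")     # the same character stored as GBK

print(detect_encoding(utf8_bytes))  # utf-8
print(detect_encoding(gbk_bytes))   # gbk
# With latin-1 first, nothing else is ever tried, because every byte is valid latin-1.
print(detect_encoding(utf8_bytes, ("latin-1", "utf-8", "gbk")))  # latin-1
```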

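An ordinary file is just bytes and carries no marker for its character encoding. Unicode, however, has a special character called the zero-width no-break space (U+FEFF), while U+FFFE is not a valid character. The Unicode standard therefore allows U+FEFF to be placed at the very start of a file (it has no width in any font and makes no visible difference in Chinese text; as far as I know it was introduced for the sake of displaying certain Southeast Asian languages). This makes it easy for a text editor to check both the encoding and the byte order. But including this kind of file in code often causes problems (this is a big pitfall: the compiler will think it is an illegal character, but you can't see it).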
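A small Python sketch of the invisible-character pitfall, using only the standard codecs module. The file content "int x;" is a made-up example. Decoding as plain utf-8 keeps U+FEFF as the first character, while the utf-8-sig codec strips it.

```python
import codecs

# A UTF-8 file that starts with a byte order mark (EF BB BF).
data = codecs.BOM_UTF8 + "int x;".encode("utf-8")

plain = data.decode("utf-8")         # the BOM survives as an invisible U+FEFF
stripped = data.decode("utf-8-sig")  # the BOM is removed during decoding

print(repr(plain[0]))  # '\ufeff'
print(repr(stripped))  # 'int x;'

# For UTF-16 the mark also reveals the byte order:
print(codecs.BOM_UTF16_BE.hex())  # feff
print(codecs.BOM_UTF16_LE.hex())  # fffe
```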

Encoding C: set with set fileencoding=utf-8. This is the encoding used when saving; the file is automatically converted to it on save. But if the wrong encoding was detected when the file was first opened, characters that do not exist in that encoding will not be converted correctly when you save.
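The open-then-save conversion can be sketched roughly as follows. transcode is a hypothetical helper that mimics "decode with encoding A, encode with encoding C"; it is not what vim runs internally. The second call shows what happens when the file is wrongly read as latin-1.

```python
def transcode(data: bytes, read_as: str, save_as: str) -> bytes:
    """Decode bytes with the detected encoding, then re-encode for saving."""
    text = data.decode(read_as)
    return text.encode(save_as)


original = b"\xe6\x97\xa5"  # the character 日 stored as UTF-8

# Correct detection: the bytes change, but the text is still 日.
good = transcode(original, "utf-8", "gb18030")
print(good.decode("gb18030") == "\u65e5")  # True

# Wrong detection: each byte is treated as its own latin-1 character,
# so saving as UTF-8 writes six bytes that no longer mean 日.
bad = transcode(original, "latin-1", "utf-8")
print(len(bad))  # 6
```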

So f1.txt saved as gb18030 may not actually have gone through an encoding conversion.

"The question is, I want the actual stored data to be "\xE6\x97\xA5", but have it interpreted as gb18030. How do I do that?" What does this mean?

PHP中文网 · Answer

A file's encoding is the concrete specification of how its text is stored as bytes. Let me answer your question first: 日 is \xE6\x97\xA5 in UTF-8, so it is impossible for the character 日 to still come out as \xE6\x97\xA5 when it is encoded in GB18030.
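A quick way to check this in Python, as a sketch using only built-in codecs: the same character yields different bytes under the two encodings, and the UTF-8 bytes read as GB18030 do not give 日 back.

```python
ch = "\u65e5"  # the character 日

utf8 = ch.encode("utf-8")
gb = ch.encode("gb18030")

print(utf8.hex())  # e6979a5 would be wrong; see the real value printed here
print(gb.hex())
print(utf8 == gb)  # False

# Keeping the bytes E6 97 A5 and merely interpreting them as GB18030
# produces some other text, not 日.
misread = utf8.decode("gb18030", errors="replace")
print(misread == ch)  # False
```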

Editors identify the encoding of a text file in different ways. Some encodings come with a magic header, so the editor can identify them directly from the first few bytes. Most text files, however, have no such marker, and it is entirely up to the editor to guess based on context and the user's locale.
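A minimal sketch of that strategy in Python: look for a magic header first, then fall back to a guess. guess_encoding is a hypothetical helper. Trying UTF-8 before the locale is an assumption of this sketch, and real editors use more elaborate heuristics.

```python
import codecs
import locale

# UTF-32 marks are checked before UTF-16 because the UTF-32 LE mark
# (FF FE 00 00) begins with the UTF-16 LE mark (FF FE).
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(data: bytes) -> str:
    """Guess an encoding name from a magic header, else from content and locale."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name  # identified from the first few bytes
    try:
        data.decode("utf-8")
        return "utf-8"  # no header, but the bytes are valid UTF-8
    except UnicodeDecodeError:
        # No header and not UTF-8: fall back to the user's locale encoding.
        return locale.getpreferredencoding(False)


print(guess_encoding(codecs.BOM_UTF8 + b"abc"))  # utf-8-sig
print(guess_encoding(b"\xe6\x97\xa5"))           # utf-8
```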