For example, I have a file f.txt on my Mac, where the system encoding is UTF-8.
It contains the bytes "\xE6\x97\xA5", which is the Chinese character "日" encoded in UTF-8.
I then use UltraEdit to save f.txt as the following files:
f1.txt file
The bytes actually stored are "\xE6\x97\xA5". When UltraEdit interprets them as gb18030, they are displayed as garbled characters in the UltraEdit interface. I then saved the file as gb18030, but when it is opened on the Mac it is treated as UTF-8 and displays normally.
f2.txt file
The bytes actually stored are "\xE6\x97\xA5". They are interpreted as UTF-8 and displayed as "日".
f3.txt file
Here I save the file directly as gb18030, so UltraEdit converts the encoding automatically, changing "\xE6\x97\xA5" to "\xC8\xD5". When vim then opens the file, it interprets it as ASCII.
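The conversion described for f3.txt can be checked with a short Python sketch. The helper name transcode is made up for illustration, and this only mirrors what a "convert on save" does in principle, not UltraEdit's actual implementation.

```python
# The UTF-8 bytes of "日" from the example above.
utf8_bytes = b"\xe6\x97\xa5"


def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode the bytes as src, then re-encode the same text as dst.

    transcode is a hypothetical helper, not part of any editor.
    """
    return data.decode(src).encode(dst)


gb_bytes = transcode(utf8_bytes, "utf-8", "gb18030")
print(gb_bytes)  # the same character, now stored as GB18030 bytes
```

The character stays the same; only the bytes that represent it change.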
Here are my questions:
Since the bytes actually stored are "\xE6\x97\xA5", why does my editor decide to interpret them as UTF-8? What should I do if I want to see the garbled text that results from interpreting them as GBK?
Is some kind of marker added to the binary header of the file? If so, how can I view it?
Or does the editor perform some kind of encoding-based analysis of the content?
Take vim as an example.
When vim opens a text file, it reads it according to some encoding A, converts it to some encoding B internally, and converts it to another encoding C when saving. Other text editors work similarly, though they may not offer as many settings or do as much automatically as vim.
Encoding B (vim's encoding option): this has no effect on the file's contents and matters only for display. It is the encoding vim uses internally and when interacting with the operating system.
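Encoding A: set with :set fileencodings=ucs-bom,utf-8,gbk,cp936,latin-1. Vim checks the file against these encodings in the listed order. Some byte sequences cannot occur in some encodings, so if vim finds such a sequence it rules that encoding out and tries the next one; otherwise it accepts that encoding. Because latin-1 accepts every possible byte sequence, putting it first would mean every file is always displayed as latin-1. With the setting above, "\xE6\x97\xA5" is valid UTF-8 and utf-8 comes before gbk, which is why the file is opened as UTF-8.
Ordinary files contain no character encoding marker. Unicode, however, has a special character, the zero width no-break space (FEFF), while the byte-swapped value FFFE is not a valid character. The Unicode standard therefore allows this character to be placed at the very start of a file as a byte order mark (BOM). It has no width in any font and no visible effect in Chinese text; it was introduced to accommodate the display of certain Southeast Asian languages. This makes it easy for a text editor to check the encoding and the byte order. However, including such a file from source code often causes problems: the compiler treats the BOM as an illegal character, yet you cannot see it. This is a big pitfall.
Encoding C: set with :set fileencoding=utf-8. This is the encoding used when saving; vim converts the content to it automatically when writing the file. But if the wrong encoding was detected when the file was first opened, characters that do not exist under that encoding cannot be converted correctly on save. So f1.txt, although saved as gb18030, may not have gone through any encoding conversion at all.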
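The ordered check behind fileencodings can be sketched in Python as follows. This is an illustration of the idea, not vim's real algorithm: the function name guess_encoding and the candidate list are assumptions chosen to resemble the setting above, and Python's codecs stand in for vim's own converters.

```python
import codecs

# BOM table; UTF-32 must be tested before UTF-16 because the
# UTF-32 LE BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Hypothetical candidate order, modelled on utf-8,gbk,...,latin-1.
CANDIDATES = ["utf-8", "gbk", "latin-1"]


def guess_encoding(data: bytes) -> str:
    """Return the first encoding that accepts every byte of data."""
    # Step 1 (like ucs-bom): a BOM at the start decides immediately.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # Step 2: try each candidate in order; an invalid byte sequence
    # rules that candidate out.
    for name in CANDIDATES:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    # latin-1 accepts any byte sequence, so this is only a safety net.
    return "latin-1"


print(guess_encoding(b"\xe6\x97\xa5"))  # bytes stored in f.txt
print(guess_encoding(b"\xc8\xd5"))      # bytes stored in f3.txt
```

Because latin-1 never rejects anything, moving it to the front of CANDIDATES would make every BOM-less file come back as latin-1, which is the effect described above.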
"The question is, I want to get the actual stored data is "xE6x97A5", but use gb18030 encoding to explain, how to do it?" What does this mean?
A file's encoding is the actual specification of how its contents are stored as bytes. To answer your question first: 日 is "\xE6\x97\xA5" in UTF-8, so it is impossible for the character 日 to still come out as "\xE6\x97\xA5" when GB18030 is used.
Editors identify the encoding of a text file in different ways. Some file encodings have a magic header, so they can be identified directly from the first few bytes. Most text files, however, have no such identifier, and it is left entirely to the editor to guess based on context and the user's locale.
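To make the difference between stored bytes and their interpretation concrete, here is a minimal Python sketch. It keeps the bytes from the question unchanged and only varies the encoding used to read them; the helper name interpret is made up for this example, and errors="replace" is an assumption that mimics an editor showing a placeholder for bytes it cannot decode.

```python
# The bytes on disk never change in this example.
raw = b"\xe6\x97\xa5"


def interpret(data: bytes, encoding: str) -> str:
    """Read the same bytes under a given encoding (hypothetical helper).

    Undecodable bytes become U+FFFD instead of raising an error.
    """
    return data.decode(encoding, errors="replace")


as_utf8 = interpret(raw, "utf-8")
as_gb18030 = interpret(raw, "gb18030")

print(raw.hex())   # hex view of the stored bytes; no BOM in front
print(as_utf8)     # the intended character
print(as_gb18030)  # garbled text: same bytes, wrong interpretation

# Going the other way, the character itself has different bytes
# in each encoding, so the two files cannot be byte-identical.
print("日".encode("utf-8").hex(), "日".encode("gb18030").hex())
```

Three bytes do not split evenly into GB18030's two-byte units, so a strict decode fails outright. An editor that tries gb18030 only after utf-8 will therefore never pick it for this file unless it is forced to.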