Example tutorial on correctly reading Chinese encoded files in .NET (C#)

Y2J
Release: 2017-04-24 16:56:05

If you are not familiar with character encodings or the byte order mark (BOM), it is worth reading this article first: .NET (C#): Character Encoding (Encoding) and Byte Order Mark (BOM).
Encodings used for Chinese text fall into two broad categories:
1. ANSI extended encodings, such as GB2312, GBK, and GB18030. Files in these encodings carry no BOM. (The newer standards, GBK and GB18030, are both backward compatible with GB2312.)
2. Unicode encodings, such as UTF-8, UTF-16, and UTF-32. Files in these encodings may or may not start with a BOM.
Some Unicode encodings also have a byte order (endianness), either little-endian or big-endian, and each byte order has its own BOM. UTF-16 is affected by this, but UTF-8 has no byte order issue.
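To make the BOMs concrete, here is a minimal sketch that asks .NET for the preamble (BOM) each Unicode encoding writes. The class and method names (BomTable, PreambleHex, Print) are made up for illustration; Encoding.GetPreamble is the standard API.

```csharp
using System;
using System.Text;

static class BomTable
{
    // Formats the BOM (preamble) an encoding would write, e.g. "EF BB BF".
    // An encoding without a BOM yields an empty string.
    public static string PreambleHex(Encoding enc)
    {
        return BitConverter.ToString(enc.GetPreamble()).Replace("-", " ");
    }

    public static void Print()
    {
        Console.WriteLine("UTF-8:     " + PreambleHex(new UTF8Encoding(true)));      // EF BB BF
        Console.WriteLine("UTF-16 LE: " + PreambleHex(Encoding.Unicode));            // FF FE
        Console.WriteLine("UTF-16 BE: " + PreambleHex(Encoding.BigEndianUnicode));   // FE FF
        Console.WriteLine("UTF-32 LE: " + PreambleHex(Encoding.UTF32));              // FF FE 00 00
        Console.WriteLine("UTF-32 BE: " + PreambleHex(new UTF32Encoding(true, true))); // 00 00 FE FF
    }
}
```

Note that the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM, which matters when checking BOMs by hand.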

With the basics covered, let us return to the topic: how to open Chinese text files correctly. The first thing to establish is whether your Unicode-encoded files contain a BOM.

If they always include a BOM, everything is easy: finding a BOM tells us the exact encoding, and a file without a BOM is therefore not Unicode, so it can be opened with the system's default ANSI extended (Chinese) encoding.
If Unicode files may lack a BOM (and you obviously cannot guarantee that every Unicode file users give you has one), you have to work out from the raw bytes whether the file is GBK, UTF-8, or some other encoding. This requires an encoding detection algorithm (search for "charset detection" or "encoding detection"). Such algorithms are not 100% accurate, which is exactly why Windows Notepad had the "Bush hid the facts" bug and why Chrome sometimes shows garbled pages. In my experience, Notepad++'s encoding detection is quite accurate.
There are many encoding detection implementations, such as this project: https://code.google.com/p/ude
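As a rough illustration of what detection from raw bytes can look like, here is a minimal sketch of one very simple heuristic: accept the bytes as UTF-8 only if they decode as strictly valid UTF-8, otherwise use a caller-supplied fallback such as GBK. This is not the algorithm used by the ude project or by Notepad++, and the names EncodingGuess and GuessUtf8OrFallback are invented for this example.

```csharp
using System;
using System.Text;

static class EncodingGuess
{
    // Returns UTF-8 if the bytes are strictly valid UTF-8,
    // otherwise the caller-supplied fallback (for example GBK).
    public static Encoding GuessUtf8OrFallback(byte[] bytes, Encoding fallback)
    {
        // throwOnInvalidBytes: true makes decoding throw on malformed input
        // instead of silently substituting replacement characters.
        var strictUtf8 = new UTF8Encoding(false, true);
        try
        {
            strictUtf8.GetString(bytes);
            return Encoding.UTF8;
        }
        catch (DecoderFallbackException)
        {
            return fallback;
        }
    }
}
```

The heuristic has obvious limits: pure ASCII text is valid in both encodings, and a short GBK byte sequence can happen to be valid UTF-8, so a wrong guess is always possible.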


If your Unicode files always come with a BOM, no third-party library is needed. A few points still need explaining, however.

The problem is that the text-reading APIs in .NET (the File class and StreamReader) decode as UTF-8 by default, so a GBK text file opened without specifying an encoding will come out garbled.

The most effective solution is to read the text using the system default ANSI extended encoding, that is, the system's default non-Unicode encoding. Reference code:

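using System;
using System.IO;
using System.Text;

// Print the name of the system default non-Unicode encoding
Console.WriteLine(Encoding.Default.EncodingName);
// Open the file using the system default non-Unicode encoding
// (verbatim string, so that \t in the path is not read as a tab)
var fileContent = File.ReadAllText(@"C:\test.txt", Encoding.Default);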

On a Simplified Chinese Windows system, this should print:

Simplified Chinese (GB2312)...

This approach is not limited to Simplified Chinese; it picks up whatever the system's default non-Unicode encoding is.
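One caveat for newer runtimes: on .NET Core and .NET 5 or later, Encoding.Default always returns UTF-8, and legacy code pages such as GB2312 are only available after the code-pages provider has been registered. The sketch below shows one way to obtain GB2312 explicitly in that case. It assumes a runtime where CodePagesEncodingProvider is available (it ships with recent .NET versions; older .NET Core versions need the System.Text.Encoding.CodePages package), and the class and method names are made up for illustration.

```csharp
using System;
using System.Text;

static class LegacyEncodings
{
    // Returns the GB2312 encoding, registering the code-pages provider first
    // so that the lookup also works on .NET Core / .NET 5+.
    public static Encoding GetGb2312()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("GB2312");
    }

    // Decodes raw GB2312 bytes into a string.
    public static string DecodeGb2312(byte[] bytes)
    {
        return GetGb2312().GetString(bytes);
    }
}
```

With this in place, File.ReadAllText(path, LegacyEncodings.GetGb2312()) plays the role that Encoding.Default plays on a Simplified Chinese .NET Framework system.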

Of course, you can also specify an encoding explicitly, such as GBK. But if you open a Unicode file with GBK specified, will it still be read correctly? Yes, it will. By default, .NET detects the BOM when opening a file and, if one is found, uses the encoding it indicates. If there is no BOM, the file is read with the encoding specified by the caller, and if no encoding was specified, UTF-8 is used.

This BOM detection can be controlled in the StreamReader constructor through the detectEncodingFromByteOrderMarks parameter.

It cannot be changed in the corresponding File class methods (for example, File.ReadAllText).

For example, the following code reads:

GB2312 text, using GB2312 encoding with BOM detection

Unicode text, using GB2312 encoding with BOM detection

Unicode text, using GB2312 encoding without BOM detection
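The effect of detectEncodingFromByteOrderMarks can be observed without any files by reading from a MemoryStream and checking StreamReader.CurrentEncoding afterwards. The following is a minimal sketch: the helper names are invented, and ASCII is used as the specified encoding purely as a stand-in for GB2312 so the example runs without extra code pages.

```csharp
using System;
using System.IO;
using System.Text;

static class BomDetectionDemo
{
    // Reads all text from the bytes and reports (via WebName) which
    // encoding the StreamReader actually ended up using.
    public static string ReadAndReport(byte[] data, Encoding specified,
                                       bool detectBom, out string usedEncoding)
    {
        using (var sr = new StreamReader(new MemoryStream(data), specified, detectBom))
        {
            string text = sr.ReadToEnd();
            // CurrentEncoding is only reliable after the first read.
            usedEncoding = sr.CurrentEncoding.WebName;
            return text;
        }
    }

    // Builds UTF-16 little-endian bytes with a leading BOM.
    public static byte[] Utf16WithBom(string text)
    {
        byte[] bom = Encoding.Unicode.GetPreamble();
        byte[] body = Encoding.Unicode.GetBytes(text);
        byte[] all = new byte[bom.Length + body.Length];
        bom.CopyTo(all, 0);
        body.CopyTo(all, bom.Length);
        return all;
    }
}
```

With detection enabled, the BOM wins over the specified encoding and the text is read as UTF-16; with detection disabled, the reader keeps the specified encoding and the BOM and text bytes are misinterpreted.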

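using System;
using System.IO;
using System.Text;

class Program
{
    static void Main()
    {
        // On .NET Core / .NET 5+, register CodePagesEncodingProvider before requesting GB2312
        var gb2312 = Encoding.GetEncoding("GB2312");
        // GB2312 encoding, with BOM detection, reading GB2312 text
        ReadFile("gbk.txt", gb2312, true);
        // GB2312 encoding, with BOM detection, reading Unicode text
        ReadFile("unicode.txt", gb2312, true);
        // GB2312 encoding, without BOM detection, reading Unicode text
        ReadFile("unicode.txt", gb2312, false);
    }

    // Read and print text through a StreamReader
    static void ReadFile(string path, Encoding enc, bool detectEncodingFromByteOrderMarks)
    {
        using (var sr = new StreamReader(path, enc, detectEncodingFromByteOrderMarks))
        {
            Console.WriteLine(sr.ReadToEnd());
        }
    }
}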

Output:

a刘
a刘
???

The third line is garbled.

As shown above, opening a Unicode file with GB2312 specified still succeeds. Because the BOM detection parameter is true, .NET sees the file's BOM, recognizes the file as Unicode, and reads it as Unicode. If there is no BOM, the specified encoding is used. GB2312 text has no BOM, so GB2312 must be specified explicitly; otherwise .NET parses the file with its default UTF-8 encoding and the result is garbled. The third line is garbled because BOM detection is false: .NET reads a Unicode text file that has a BOM directly as GB2312, which cannot succeed.

You can also detect the BOM yourself and, if there is none, fall back to a default encoding of your choice. I wrote about this in an earlier article (.NET (C#): Encoding detection from files).

Code:

using System;
using System.IO;
using System.Text;

class Program
{
    static void Main()
    {
        PrintText("gb2312.txt");
        PrintText("unicode.txt");
    }

    // Detect the file's encoding and print its content
    static void PrintText(string path)
    {
        var enc = GetEncoding(path, Encoding.GetEncoding("GB2312"));
        using (var sr = new StreamReader(path, enc))
        {
            Console.WriteLine(sr.ReadToEnd());
        }
    }

    /// <summary>
    /// Tries to determine the character encoding of a file
    /// </summary>
    /// <param name="file">File path</param>
    /// <param name="defEnc">Default encoding returned when there is no BOM</param>
    /// <returns>null if the file cannot be read; otherwise the encoding indicated by the BOM, or the default encoding if there is no BOM.</returns>
    static Encoding GetEncoding(string file, Encoding defEnc)
    {
        using (var stream = File.OpenRead(file))
        {
            // Is the stream readable?
            if (!stream.CanRead)
                return null;
            // Byte array that holds the BOM
            var bom = new byte[4];
            // Number of bytes actually read
            int readc;
            readc = stream.Read(bom, 0, 4);
            if (readc >= 2)
            {
                if (readc >= 4)
                {
                    // UTF-32, big-endian
                    if (CheckBytes(bom, 4, 0x00, 0x00, 0xFE, 0xFF))
                        return new UTF32Encoding(true, true);
                    // UTF-32, little-endian
                    if (CheckBytes(bom, 4, 0xFF, 0xFE, 0x00, 0x00))
                        return new UTF32Encoding(false, true);
                }
                // UTF-8
                if (readc >= 3 && CheckBytes(bom, 3, 0xEF, 0xBB, 0xBF))
                    return new UTF8Encoding(true);
                // UTF-16, big-endian
                if (CheckBytes(bom, 2, 0xFE, 0xFF))
                    return new UnicodeEncoding(true, true);
                // UTF-16, little-endian
                if (CheckBytes(bom, 2, 0xFF, 0xFE))
                    return new UnicodeEncoding(false, true);
            }
            return defEnc;
        }
    }

    // Helper: checks whether the first count bytes equal the given values
    static bool CheckBytes(byte[] bytes, int count, params int[] values)
    {
        for (int i = 0; i < count; i++)
            if (bytes[i] != values[i])
                return false;
        return true;
    }
}

In the code above, GetEncoding returns the Unicode encoding indicated by the BOM (for the UTF-16 test file, big-endian or little-endian UTF-16 as appropriate), while a file without a BOM gets the default that was passed in, here GB2312.
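The same BOM check can also be written against the preambles that the encodings themselves report, instead of hard-coded byte values. This is a minimal alternative sketch, not the code from the earlier article; PreambleMatcher and its members are invented names, and it works on a byte array holding the first few bytes of the file.

```csharp
using System;
using System.Text;

static class PreambleMatcher
{
    // Candidates are ordered so that longer BOMs are tested first:
    // the UTF-32 LE BOM (FF FE 00 00) starts with the UTF-16 LE BOM (FF FE).
    static readonly Encoding[] Candidates =
    {
        new UTF32Encoding(true, true),    // UTF-32, big-endian
        new UTF32Encoding(false, true),   // UTF-32, little-endian
        new UTF8Encoding(true),           // UTF-8
        new UnicodeEncoding(true, true),  // UTF-16, big-endian
        new UnicodeEncoding(false, true), // UTF-16, little-endian
    };

    // Returns the candidate whose BOM prefixes the data, or defEnc if none does.
    public static Encoding Detect(byte[] head, Encoding defEnc)
    {
        foreach (var enc in Candidates)
        {
            if (StartsWith(head, enc.GetPreamble()))
                return enc;
        }
        return defEnc;
    }

    // True if data begins with all bytes of prefix.
    static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length)
            return false;
        for (int i = 0; i < prefix.Length; i++)
            if (data[i] != prefix[i])
                return false;
        return true;
    }
}
```

Like the hand-written version, this treats FF FE 00 00 as UTF-32, so a UTF-16 little-endian file whose first character is U+0000 would be misidentified; in practice that case is rare.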

Related Posts:

.NET (C#): Encoding detection from files

.NET (C#): Character Encoding (Encoding) and Byte Order Mark (BOM)

.NET (C#): Use the System.Text.Decoder class to process "stream text"

.NET (C#): A brief discussion of assembly manifest resources and RESX resources
