


Example tutorial on correctly reading Chinese encoded files in .NET (C#)
First, if you are not familiar with character encodings or the BOM, it is worth reading this article beforehand: .NET (C#): Character Encoding (Encoding) and Byte Order Mark (BOM).
Chinese text encodings fall into two broad categories:
1. ANSI-extension encodings, such as GB2312, GBK, and GB18030. Files in these encodings normally carry no BOM. The newer standards, GBK and GB18030, are backward compatible with GB2312.
2. Unicode encodings, such as UTF-8, UTF-16, and UTF-32. A file in one of these encodings may or may not start with a BOM.
Some Unicode encodings, such as UTF-16, also have a byte-order (endianness) issue: they come in little-endian and big-endian variants, and each byte order has its own BOM. UTF-8 has no byte-order issue.
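To make the BOM values concrete, the following minimal sketch prints the BOM bytes that .NET associates with each Unicode encoding. The class name BomDemo and the helper BomHex are made up for illustration; Encoding.GetPreamble and BitConverter.ToString are standard .NET APIs.

```csharp
using System;
using System.Text;

static class BomDemo
{
    // Returns the BOM (preamble) bytes of an encoding as hex, e.g. "EF-BB-BF".
    // BomHex is a hypothetical helper name, not part of .NET.
    public static string BomHex(Encoding enc)
    {
        return BitConverter.ToString(enc.GetPreamble());
    }

    static void Main()
    {
        Console.WriteLine("UTF-8:     " + BomHex(Encoding.UTF8));                  // EF-BB-BF
        Console.WriteLine("UTF-16 LE: " + BomHex(Encoding.Unicode));               // FF-FE
        Console.WriteLine("UTF-16 BE: " + BomHex(Encoding.BigEndianUnicode));      // FE-FF
        Console.WriteLine("UTF-32 LE: " + BomHex(Encoding.UTF32));                 // FF-FE-00-00
        Console.WriteLine("UTF-32 BE: " + BomHex(new UTF32Encoding(true, true)));  // 00-00-FE-FF
    }
}
```

Note that the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian BOM, so any manual check has to test the four-byte patterns first.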
With the basics covered, let us return to the topic: how to open Chinese text files correctly. The first thing to establish is: do your Unicode-encoded files contain a BOM?
If they always do, everything is simple. When we find a BOM, we know the exact encoding. When no BOM is found, the file is not Unicode, and we can open it with the system's default ANSI-extension Chinese encoding.
If a Unicode file may lack a BOM (and obviously you cannot guarantee that every Unicode file users give you has one), then you have to decide from the raw bytes whether it is GBK, UTF-8, or some other encoding. This requires an encoding detection algorithm (search for "charset detection" or "encoding detection"). Such algorithms are not 100% accurate. That is why Windows Notepad had the "Bush hid the facts" bug, and why you occasionally run into garbled pages in Chrome. In my experience, Notepad++'s encoding detection is quite accurate.
There are many encoding detection libraries, such as this project: https://code.google.com/p/ude
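As a rough illustration of what such detection involves, here is a minimal sketch of one common heuristic: try to decode the bytes as strict UTF-8 and treat a failure as "probably a legacy encoding such as GBK". This is not the algorithm used by the ude project or by Notepad++; the names Utf8Sniffer and LooksLikeUtf8 are made up for this example.

```csharp
using System;
using System.Text;

static class Utf8Sniffer
{
    // Strict UTF-8: emits no BOM and throws on invalid byte sequences.
    static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // Hypothetical helper: true if the bytes form valid UTF-8.
    // Pure ASCII also passes, since ASCII is a subset of UTF-8.
    public static bool LooksLikeUtf8(byte[] bytes)
    {
        try
        {
            StrictUtf8.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    static void Main()
    {
        byte[] utf8 = Encoding.UTF8.GetBytes("a刘");
        // 0xC1 can never appear in valid UTF-8, so this sequence must be rejected.
        byte[] notUtf8 = { 0x61, 0xC1, 0xF5 };

        Console.WriteLine(LooksLikeUtf8(utf8));    // True
        Console.WriteLine(LooksLikeUtf8(notUtf8)); // False
    }
}
```

The heuristic can misfire in both directions on short inputs, because a short GBK byte sequence may happen to be valid UTF-8. This is the kind of ambiguity that makes detection less than 100% accurate.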
If your Unicode files always come with a BOM, no third-party library is needed. However, a few things still need explaining.
The problem is that the text-reading APIs in .NET (the File class and StreamReader) decode as UTF-8 by default, so a GBK text file opened without specifying an encoding is bound to come out garbled.
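The simplest fix is to read the text with the system's default ANSI-extension encoding, that is, the system's default non-Unicode encoding. (This is .NET Framework behavior; on .NET Core and later, Encoding.Default is UTF-8.) Reference code:

    // Print the name of the system default non-Unicode encoding
    Console.WriteLine(Encoding.Default.EncodingName);
    // Open the file with the system default non-Unicode encoding
    // (verbatim string, so that "\t" is not interpreted as a tab)
    var fileContent = File.ReadAllText(@"C:\test.txt", Encoding.Default);

On Simplified Chinese Windows, the first line should output: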
Simplified Chinese (GB2312)
This approach is not limited to Simplified Chinese: it uses whatever non-Unicode encoding the system is configured with.
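On newer runtimes, legacy code pages such as GB2312 are not available until a code page provider is registered. The following is a minimal sketch, assuming .NET Core 3.0 or later (or .NET 5+), where CodePagesEncodingProvider ships with the runtime; GbkDemo and GetGb2312 are names made up for this example.

```csharp
using System;
using System.Text;

static class GbkDemo
{
    // Hypothetical helper: returns the GB2312 encoding, registering the
    // code page provider first. Registering more than once is harmless.
    public static Encoding GetGb2312()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding("GB2312");
    }

    static void Main()
    {
        Encoding gb = GetGb2312();
        byte[] bytes = gb.GetBytes("a刘");

        Console.WriteLine(bytes.Length);                  // 3: one byte for 'a', two for '刘'
        Console.WriteLine(gb.GetString(bytes) == "a刘");  // True: the text round-trips
    }
}
```

Naming the encoding explicitly like this makes the program independent of the machine's regional settings, at the cost of no longer adapting to non-Chinese systems.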
Of course, you can also specify an encoding explicitly, such as GBK. But if you open a Unicode file with GBK specified, will it still be read correctly? Yes, provided the file has a BOM. By default, .NET detects the BOM when opening a file and uses the encoding the BOM indicates. If there is no BOM, the file is decoded with the encoding the caller specified, and if the caller specified none, UTF-8 is used.
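The behavior described here can be checked with a small experiment. The sketch below writes a UTF-8 file with and without a BOM and reads each back through File.ReadAllText while asking for GB2312. It assumes .NET Core 3.0 or later for CodePagesEncodingProvider; ReadAllTextBomDemo and RoundTrip are hypothetical names.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

static class ReadAllTextBomDemo
{
    // Hypothetical helper: writes the bytes to a temporary file and reads
    // them back with File.ReadAllText using the requested encoding.
    public static string RoundTrip(byte[] fileBytes, Encoding requested)
    {
        string path = Path.GetTempFileName();
        try
        {
            File.WriteAllBytes(path, fileBytes);
            return File.ReadAllText(path, requested);
        }
        finally
        {
            File.Delete(path);
        }
    }

    static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding gb2312 = Encoding.GetEncoding("GB2312");

        byte[] withoutBom = Encoding.UTF8.GetBytes("a刘");
        byte[] withBom = Encoding.UTF8.GetPreamble().Concat(withoutBom).ToArray();

        // BOM present: the BOM wins over the requested GB2312.
        Console.WriteLine(RoundTrip(withBom, gb2312) == "a刘");     // True
        // No BOM: the UTF-8 bytes are decoded as GB2312 and come out garbled.
        Console.WriteLine(RoundTrip(withoutBom, gb2312) == "a刘");  // False
    }
}
```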
This automatic BOM detection can be controlled through the detectEncodingFromByteOrderMarks parameter of the StreamReader constructor.
It cannot be changed in the corresponding methods of the File class (for example, File.ReadAllText).
For example, the following code reads, in order:
1. GB2312 text, using GB2312 encoding with BOM detection on
2. Unicode text, using GB2312 encoding with BOM detection on
3. Unicode text, using GB2312 encoding with BOM detection off

    using System;
    using System.IO;
    using System.Text;

    class Program
    {
        static void Main()
        {
            var gb2312 = Encoding.GetEncoding("GB2312");
            // GB2312 encoding, BOM detection on: read GB2312 text
            ReadFile("gbk.txt", gb2312, true);
            // GB2312 encoding, BOM detection on: read Unicode text
            ReadFile("unicode.txt", gb2312, true);
            // GB2312 encoding, BOM detection off: read Unicode text
            ReadFile("unicode.txt", gb2312, false);
        }

        // Read a text file through StreamReader and print its content
        static void ReadFile(string path, Encoding enc, bool detectEncodingFromByteOrderMarks)
        {
            using (var sr = new StreamReader(path, enc, detectEncodingFromByteOrderMarks))
            {
                Console.WriteLine(sr.ReadToEnd());
            }
        }
    }

Output:

    a刘
    a刘
    ???

The third line is garbled.
As the output shows, opening a Unicode file with GB2312 specified still succeeds when BOM detection is on: .NET finds the BOM, recognizes the file as Unicode, and decodes it accordingly. If there is no BOM, the specified encoding is used. GB2312 text has no BOM, so GB2312 must be specified explicitly; otherwise .NET falls back to UTF-8 and the text comes out garbled. The third line is garbled because BOM detection is off: .NET decodes a Unicode file that has a BOM directly as GB2312, which cannot work.
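To see which encoding StreamReader actually settled on, you can inspect its CurrentEncoding property after the first read. The sketch below does this in memory, so no test files are needed. It assumes .NET Core 3.0 or later for CodePagesEncodingProvider; DetectDemo and Read are hypothetical names.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

static class DetectDemo
{
    // Hypothetical helper: decodes the bytes through StreamReader and
    // reports both the text and the encoding that was actually used.
    public static string Read(byte[] data, Encoding fallback, bool detectBom, out Encoding used)
    {
        using (var sr = new StreamReader(new MemoryStream(data), fallback, detectBom))
        {
            string text = sr.ReadToEnd();
            used = sr.CurrentEncoding; // only reliable after the first read
            return text;
        }
    }

    static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding gb2312 = Encoding.GetEncoding("GB2312");

        // UTF-16 little-endian text preceded by its BOM (FF FE)
        byte[] utf16 = Encoding.Unicode.GetPreamble()
            .Concat(Encoding.Unicode.GetBytes("a刘")).ToArray();

        Encoding used;
        string text = Read(utf16, gb2312, true, out used);
        Console.WriteLine(used.CodePage);   // 1200, the UTF-16 LE code page
        Console.WriteLine(text == "a刘");    // True

        text = Read(utf16, gb2312, false, out used);
        Console.WriteLine(text == "a刘");    // False: decoded as GB2312
    }
}
```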
Of course, you can also check for the BOM yourself and, if there is none, fall back to a default encoding of your choice. I covered this in an earlier article (.NET(C#): Detect the encoding from the file).
Code:

    using System;
    using System.IO;
    using System.Text;

    class Program
    {
        static void Main()
        {
            PrintText("gb2312.txt");
            PrintText("unicode.txt");
        }

        // Detect the file's encoding and print its content
        static void PrintText(string path)
        {
            var enc = GetEncoding(path, Encoding.GetEncoding("GB2312"));
            using (var sr = new StreamReader(path, enc))
            {
                Console.WriteLine(sr.ReadToEnd());
            }
        }

        /// <summary>
        /// Tries to determine a file's character encoding from its BOM.
        /// </summary>
        /// <param name="file">File path</param>
        /// <param name="defEnc">Default encoding to return when there is no BOM</param>
        /// <returns>null if the file cannot be read; otherwise the encoding indicated by the BOM, or the default encoding when there is no BOM.</returns>
        static Encoding GetEncoding(string file, Encoding defEnc)
        {
            using (var stream = File.OpenRead(file))
            {
                // Is the stream readable?
                if (!stream.CanRead)
                    return null;
                // Buffer for the BOM bytes
                var bom = new byte[4];
                // Number of bytes actually read
                int readc = stream.Read(bom, 0, 4);
                if (readc >= 2)
                {
                    if (readc >= 4)
                    {
                        // UTF-32, big-endian
                        if (CheckBytes(bom, 4, 0x00, 0x00, 0xFE, 0xFF))
                            return new UTF32Encoding(true, true);
                        // UTF-32, little-endian (checked before UTF-16 LE, which shares FF FE)
                        if (CheckBytes(bom, 4, 0xFF, 0xFE, 0x00, 0x00))
                            return new UTF32Encoding(false, true);
                    }
                    // UTF-8
                    if (readc >= 3 && CheckBytes(bom, 3, 0xEF, 0xBB, 0xBF))
                        return new UTF8Encoding(true);
                    // UTF-16, big-endian
                    if (CheckBytes(bom, 2, 0xFE, 0xFF))
                        return new UnicodeEncoding(true, true);
                    // UTF-16, little-endian
                    if (CheckBytes(bom, 2, 0xFF, 0xFE))
                        return new UnicodeEncoding(false, true);
                }
                return defEnc;
            }
        }

        // Helper: checks whether the first 'count' bytes equal the given values
        static bool CheckBytes(byte[] bytes, int count, params int[] values)
        {
            for (int i = 0; i < count; i++)
                if (bytes[i] != values[i])
                    return false;
            return true;
        }
    }

In the code above, for the Unicode text file the GetEncoding method returns a UTF-16 encoding (more precisely, big-endian or little-endian UTF-16, depending on the BOM), while a file without a BOM gets the default value, GB2312.
Related Posts:
.NET(C#): Detect the encoding from the file
.NET(C#): Character encoding (Encoding) and byte order mark (BOM)
.NET(C#): Use the System.Text.Decoder class to process "stream text"
.NET(C#): A brief discussion of assembly manifest resources and RESX resources