When writing front-end tools with NodeJS, the files we work with most often are text files, so file encoding inevitably comes into play. The text encodings we commonly encounter are UTF8 and GBK, and UTF8 files may also carry a BOM. When reading text files in different encodings, the file content must be decoded into a JS string before it can be processed normally.
BOM Removal
The BOM (byte order mark) marks a text file as using a Unicode encoding. It is itself a Unicode character ("\uFEFF") located at the very beginning of the text file. Under different Unicode encodings, the BOM character corresponds to the following bytes:
    Bytes       Encoding
    ----------------------------
    FE FF       UTF16BE
    FF FE       UTF16LE
    EF BB BF    UTF8
Therefore, by checking the first few bytes of a text file, we can determine whether it contains a BOM and which Unicode encoding it uses. However, although the BOM character marks the file encoding, it is not part of the file content. If the BOM is not removed when reading the text file, problems arise in certain scenarios. For example, if we merge several JS files into one and a file in the middle carries a BOM, the stray BOM character causes a JS syntax error in the browser. Therefore, when reading text files with NodeJS, you generally need to remove the BOM. For example, the following code identifies and removes a UTF8 BOM.
    var fs = require('fs');

    function readText(pathname) {
        var bin = fs.readFileSync(pathname);

        // Skip the three-byte UTF8 BOM (EF BB BF) if present.
        if (bin[0] === 0xEF && bin[1] === 0xBB && bin[2] === 0xBF) {
            bin = bin.slice(3);
        }

        return bin.toString('utf-8');
    }
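The byte table above can also drive a slightly more general check. The following is only a minimal sketch and not part of the original example: `detectBOM` and `stripBOM` are hypothetical helper names, and only the three BOMs listed in the table are recognized.

```javascript
// Hypothetical helper: report which BOM (if any) a Buffer starts with.
// Only the three BOMs from the table above are recognized.
function detectBOM(bin) {
    if (bin.length >= 3 && bin[0] === 0xEF && bin[1] === 0xBB && bin[2] === 0xBF) {
        return { encoding: 'utf8', length: 3 };
    }
    if (bin.length >= 2 && bin[0] === 0xFE && bin[1] === 0xFF) {
        return { encoding: 'utf16be', length: 2 };
    }
    if (bin.length >= 2 && bin[0] === 0xFF && bin[1] === 0xFE) {
        return { encoding: 'utf16le', length: 2 };
    }
    return null; // no BOM found
}

// Hypothetical helper: return the Buffer without its leading BOM bytes.
function stripBOM(bin) {
    var bom = detectBOM(bin);
    return bom ? bin.subarray(bom.length) : bin;
}
```

Returning the BOM length together with the encoding name lets the caller skip the right number of bytes before decoding. Note that the encoding names returned here are just labels chosen for this sketch. NodeJS can decode `utf16le` directly, but it has no built-in `utf16be` decoder.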
GBK to UTF8
NodeJS lets you specify the text encoding when reading a text file or when converting a Buffer to a string, but unfortunately GBK is not among the encodings that NodeJS supports for this by itself. Therefore, we generally use the third-party package iconv-lite to convert the encoding. After installing the package with NPM, we can write a function to read a GBK text file as follows.
    var fs = require('fs');
    var iconv = require('iconv-lite');

    function readGBKText(pathname) {
        var bin = fs.readFileSync(pathname);

        // Decode the raw GBK bytes into a JS string.
        return iconv.decode(bin, 'gbk');
    }
Single-Byte Encoding
Sometimes, we cannot predict which encoding the file we need to read uses, so we cannot specify the correct encoding. For example, some of the CSS files we need to process are encoded in GBK and some in UTF8. Although it is possible to guess the text encoding based on the byte content of the file to a certain extent, what I will introduce here is a somewhat limited, but much simpler technique.
First of all, if a text file contains only English characters, such as Hello World, it can be read correctly with either GBK or UTF8. This is because under both encodings, characters in the ASCII range 0~127 use the same single-byte encoding.
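This property is easy to check in memory. The sketch below is illustrative only: `isAsciiOnly` is a hypothetical helper, and because NodeJS cannot decode GBK by itself, the built-in single-byte `latin1` encoding stands in for the second encoding.

```javascript
// Hypothetical helper: true if every byte is in the ASCII range 0x00~0x7F.
function isAsciiOnly(bin) {
    for (var i = 0; i < bin.length; i++) {
        if (bin[i] > 0x7F) {
            return false;
        }
    }
    return true;
}

var bin = Buffer.from('Hello World', 'utf8');

console.log(isAsciiOnly(bin));                                // true
console.log(bin.toString('utf8') === bin.toString('latin1')); // true
```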
On the other hand, even if a text file contains Chinese or other non-ASCII characters, as long as the characters we need to process are all within the ASCII range 0~127 (for example, JS code excluding comments and strings), we can uniformly read the file with a single-byte encoding, without caring whether its actual encoding is GBK or UTF8. The following example illustrates this approach.
1. Source file content, encoded in GBK:
    var foo = '中文';
2. Corresponding bytes:
    76 61 72 20 66 6F 6F 20 3D 20 27 D6 D0 CE C4 27 3B
3. Content obtained after reading with a single-byte encoding:
    var foo = '{garbled}{garbled}{garbled}{garbled}';
4. Content after replacing foo with bar:
    var bar = '{garbled}{garbled}{garbled}{garbled}';
5. Corresponding bytes after saving with the same single-byte encoding:
    76 61 72 20 62 61 72 20 3D 20 27 D6 D0 CE C4 27 3B
6. Content obtained when reading with GBK encoding:
    var bar = '中文';
The trick here is that no matter which garbled character a single byte larger than 0x7F is parsed into under the single-byte encoding, when that garbled character is saved with the same single-byte encoding, the byte behind it remains unchanged.
NodeJS comes with a built-in binary encoding (an alias of latin1) that can be used to implement this method, so the following code uses it to carry out the replacement from the example above.
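    var fs = require('fs');

    function replace(pathname) {
        // Read every byte as one character, whatever the real encoding is.
        var str = fs.readFileSync(pathname, 'binary');

        // Replaces the first occurrence of foo only.
        str = str.replace('foo', 'bar');

        // Write every character back as the same single byte.
        fs.writeFileSync(pathname, str, 'binary');
    }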
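The six steps of the worked example can also be reproduced in memory, which makes the round trip easy to verify. This is a minimal sketch, assuming the GBK bytes are exactly those listed in step 2. `replaceBytes` is a hypothetical helper, and `latin1` names the same single-byte encoding as `binary`.

```javascript
// Hypothetical helper: replace text in raw bytes without knowing the
// real encoding, by decoding and re-encoding with latin1.
function replaceBytes(bin, from, to) {
    var str = bin.toString('latin1');   // each byte becomes one character
    str = str.replace(from, to);        // only the first match is replaced
    return Buffer.from(str, 'latin1');  // each character becomes one byte again
}

// Bytes of "var foo = '中文';" in GBK, taken from step 2 above.
var input = Buffer.from([
    0x76, 0x61, 0x72, 0x20, 0x66, 0x6F, 0x6F, 0x20, 0x3D, 0x20,
    0x27, 0xD6, 0xD0, 0xCE, 0xC4, 0x27, 0x3B
]);

var output = replaceBytes(input, 'foo', 'bar');

// Print the result as space-separated hex so it can be compared with step 5.
console.log(output.toString('hex').match(/../g).join(' ').toUpperCase());
// 76 61 72 20 62 61 72 20 3D 20 27 D6 D0 CE C4 27 3B
```

The search and replacement strings should contain only ASCII characters. Any character above 0xFF cannot be represented in a single byte and would be corrupted when the string is written back.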