Use Node.js to deal with encoding issues of front-end code files



Released: 2016-05-16 15:15:14


When writing front-end tools with NodeJS, the files we work with most often are text files, so file encoding inevitably comes up. The text encodings we meet most often are UTF8 and GBK, and UTF8 files may also carry a BOM. When reading text files in different encodings, the file content must first be decoded into a JS string (which JS stores internally as UTF-16) before it can be processed normally.

BOM Removal
A BOM (byte order mark) marks a text file as Unicode-encoded. It is itself a Unicode character ("\uFEFF") placed at the very start of the file. Under the different Unicode encodings, the BOM character corresponds to the following bytes:

  Bytes   Encoding
----------------------------
  FE FF    UTF16BE
  FF FE    UTF16LE
  EF BB BF  UTF8

Therefore, by checking the first few bytes of a text file, we can tell whether it contains a BOM and which Unicode encoding it uses. However, although the BOM marks the file encoding, it is not part of the file content. If it is not removed when the file is read, it causes problems in some scenarios. For example, if we merge several JS files into one and a file in the middle still carries its BOM, the stray character ends up inside the merged code and the browser reports a JS syntax error. So when reading text files with NodeJS, you generally need to remove the BOM. The following code identifies and removes a UTF8 BOM.

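var fs = require('fs');

// Read a text file as UTF-8, dropping a leading UTF-8 BOM if present.
function readText(pathname) {
  var bin = fs.readFileSync(pathname);

  // EF BB BF is the UTF-8 byte sequence of the BOM character.
  if (bin[0] === 0xEF && bin[1] === 0xBB && bin[2] === 0xBF) {
    bin = bin.slice(3);
  }

  return bin.toString('utf-8');
}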
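
The same byte check can be extended to the other two rows of the BOM table. The following is only a sketch: `detectBom` and `decodeText` are hypothetical helper names, not part of NodeJS, and the UTF16BE branch assumes the remaining byte count is even (otherwise `swap16` throws).

```javascript
// Hypothetical helper: returns the encoding and BOM length, or null if no BOM.
function detectBom(bin) {
  if (bin.length >= 3 && bin[0] === 0xEF && bin[1] === 0xBB && bin[2] === 0xBF) {
    return { encoding: 'utf8', length: 3 };
  }
  if (bin.length >= 2 && bin[0] === 0xFE && bin[1] === 0xFF) {
    return { encoding: 'utf16be', length: 2 };
  }
  if (bin.length >= 2 && bin[0] === 0xFF && bin[1] === 0xFE) {
    return { encoding: 'utf16le', length: 2 };
  }
  return null;
}

// Hypothetical helper: decode a Buffer, dropping the BOM if there is one.
// Files without a BOM are assumed to be UTF-8 here.
function decodeText(bin) {
  var bom = detectBom(bin);
  if (bom === null) {
    return bin.toString('utf8');
  }
  var body = bin.slice(bom.length);
  if (bom.encoding === 'utf16be') {
    // Buffer has no UTF-16BE decoder, so swap each byte pair on a copy
    // and decode the result as UTF-16LE.
    return Buffer.from(body).swap16().toString('utf16le');
  }
  return body.toString(bom.encoding);
}
```

For instance, `decodeText(fs.readFileSync(pathname))` could stand in for `readText(pathname)` when UTF16 files are also expected.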

GBK to UTF8
NodeJS lets you specify the text encoding when reading a text file or when converting a Buffer to a string. Unfortunately, GBK is not among the encodings that Buffer supports natively, so we generally use the third-party package iconv-lite to do the conversion. After installing the package with NPM, we can write a function that reads a GBK text file as follows.

var fs = require('fs');
var iconv = require('iconv-lite');

// Read a GBK-encoded text file and decode it into a JS string.
function readGBKText(pathname) {
  var bin = fs.readFileSync(pathname);

  return iconv.decode(bin, 'gbk');
}
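
To see which encoding names Buffer accepts natively, `Buffer.isEncoding` can be queried directly. This small check is only an illustration; the list of names and the variable names are chosen for the example.

```javascript
// Encoding names to probe; 'gbk' is the one this section cares about.
var names = ['utf8', 'utf16le', 'latin1', 'binary', 'gbk'];

// Buffer.isEncoding returns true only for encodings Buffer can handle itself.
var supported = names.filter(function (name) {
  return Buffer.isEncoding(name);
});
var unsupported = names.filter(function (name) {
  return !Buffer.isEncoding(name);
});

console.log('built in:', supported.join(', '));
console.log('needs a package such as iconv-lite:', unsupported.join(', '));
```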

Single byte encoding
Sometimes, we cannot predict which encoding the file we need to read uses, so we cannot specify the correct encoding. For example, some of the CSS files we need to process are encoded in GBK and some in UTF8. Although it is possible to guess the text encoding based on the byte content of the file to a certain extent, what I will introduce here is a somewhat limited, but much simpler technique.
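
For comparison, guessing from the byte content can be sketched roughly as follows. This is a heuristic under stated assumptions, not a reliable detector: `looksLikeUtf8` is a hypothetical name, and some GBK byte sequences happen to be valid UTF8 as well, so a `true` result does not prove the file is UTF8.

```javascript
// Hypothetical heuristic: a Buffer "looks like" UTF-8 if decoding it and
// re-encoding the result reproduces exactly the same bytes.
function looksLikeUtf8(bin) {
  // Invalid UTF-8 sequences decode to U+FFFD, which re-encodes to
  // different bytes (EF BF BD), so the comparison fails for them.
  return Buffer.from(bin.toString('utf8'), 'utf8').equals(bin);
}

// The word used in the example later in this section, in both encodings.
var gbkBytes = Buffer.from([0xD6, 0xD0, 0xCE, 0xC4]);
var utf8Bytes = Buffer.from([0xE4, 0xB8, 0xAD, 0xE6, 0x96, 0x87]);

console.log(looksLikeUtf8(gbkBytes));
console.log(looksLikeUtf8(utf8Bytes));
```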

First of all, if a text file contains only English characters, such as Hello World, it can be read correctly with either GBK or UTF8. This is because both encodings represent characters in the ASCII range (0~127) with the same single byte.
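
A quick way to check this property for a given file is to test whether every byte is below 0x80. The sketch below uses a hypothetical helper name, `isPureAscii`, and compares UTF8 against Latin-1 decoding because Buffer cannot decode GBK by itself.

```javascript
// Hypothetical helper: true when every byte is in the ASCII range 0x00~0x7F.
function isPureAscii(bin) {
  return bin.every(function (byte) {
    return byte < 0x80;
  });
}

var ascii = Buffer.from('Hello World', 'ascii');

// For pure ASCII bytes, every ASCII-compatible decoding gives the same string.
var sameText = ascii.toString('utf8') === ascii.toString('latin1');

console.log(isPureAscii(ascii), sameText);
```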

On the other hand, even if a text file does contain Chinese or other non-ASCII characters, as long as the characters we need to process all fall in the ASCII range (0~127), such as JS code apart from comments and strings, we can uniformly read the file with a single-byte encoding and not care whether its actual encoding is GBK or UTF8. The following example illustrates this approach.

1. GBK encoding source file content:

  var foo = '中文';

2. Corresponding byte:

  76 61 72 20 66 6F 6F 20 3D 20 27 D6 D0 CE C4 27 3B

3. The content obtained after reading using single-byte encoding:

  var foo = '{garbled}{garbled}{garbled}{garbled}';

4. Replacement content:

  var bar = '{garbled}{garbled}{garbled}{garbled}';

5. The corresponding bytes after saving using single-byte encoding:

  76 61 72 20 62 61 72 20 3D 20 27 D6 D0 CE C4 27 3B

6. Use GBK encoding to read and get the content:

  var bar = '中文';

The trick here is that no matter what garbled character a byte larger than 0x7F is parsed into under a single-byte encoding, when that garbled character is saved again with the same single-byte encoding, the byte behind it remains unchanged.

NodeJS comes with a binary encoding (an alias of latin1, which maps each byte to exactly one character) that can be used to implement this method. The following code uses it to carry out the replacement from the example above.

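var fs = require('fs');

// Rename foo to bar in place without knowing the file's real encoding.
function replace(pathname) {
  // Each byte becomes one character, so non-ASCII bytes survive untouched.
  var str = fs.readFileSync(pathname, 'binary');
  // A string pattern replaces only the first occurrence.
  str = str.replace('foo', 'bar');
  fs.writeFileSync(pathname, str, 'binary');
}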
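
The byte-level claim can be checked in memory, without touching a file. The sketch below starts from the bytes listed in step 2 of the example and confirms that the result matches the bytes in step 5; the variable names are chosen for illustration only.

```javascript
// Bytes of the GBK-encoded line from step 2 of the example.
var input = Buffer.from([
  0x76, 0x61, 0x72, 0x20, 0x66, 0x6F, 0x6F, 0x20, 0x3D, 0x20,
  0x27, 0xD6, 0xD0, 0xCE, 0xC4, 0x27, 0x3B
]);

// Decode byte-for-byte, edit the ASCII part, and encode byte-for-byte again.
var str = input.toString('binary');
str = str.replace('foo', 'bar');
var output = Buffer.from(str, 'binary');

// Format the result the same way the example lists its bytes.
var hex = output.toString('hex').toUpperCase().match(/../g).join(' ');
console.log(hex);
```

One caveat with this design: the replacement text itself should stay within ASCII. The binary encoding keeps only the low byte of each character when writing, so inserting characters above U+00FF would corrupt the output.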
