How to Avoid Outputting the BOM Marker When Reading a UTF-8 Encoded File?

Mary-Kate Olsen

Release: 2024-11-16 22:43:03

Unicode BOM and FileReader

When you read a UTF-8 encoded file that starts with a Byte Order Mark (BOM), the BOM can show up in the output along with the file's content. Unicode defines the BOM to signal the byte order of UTF-16 and UTF-32 text; in UTF-8 it has no byte-order role and only marks the file as UTF-8. Java's UTF-8 decoder does not discard it, so the three bytes EF BB BF are decoded as the character U+FEFF at the start of the text.
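The behaviour described above can be checked with a few bytes in memory. This is a minimal sketch, not part of the original snippet: the class name BomDecodeDemo and the method decode are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows that the UTF-8 decoder keeps the BOM.
class BomDecodeDemo {
    // Decodes bytes as UTF-8 without any BOM handling.
    static String decode(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The text "Hi" preceded by the UTF-8 BOM bytes EF BB BF.
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'H', 'i'};
        String text = decode(withBom);

        System.out.println(text.length());              // prints 3: BOM + 'H' + 'i'
        System.out.println(text.charAt(0) == '\uFEFF'); // prints true
    }
}
```

The extra leading character is invisible in many terminals and editors, which is why the problem often only surfaces when the text is compared, parsed, or written elsewhere.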

In the code snippet from the question (not shown here):

fr and br are used to read the file as bytes and convert them into characters.
tmp reads each line of the file as a byte array.
text converts the byte array into a UTF-8 encoded string.
content concatenates the lines of the file, including the BOM marker as it is part of the file's content.

To keep the BOM marker out of the output:

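Read the whole file and decode it to a String in one step instead of converting each line separately. Decoding alone does not remove the BOM: Java's UTF-8 decoder turns it into the character U+FEFF at the start of the string, so strip that character after decoding (StandardCharsets is in java.nio.charset).

String content = new String(Files.readAllBytes(Paths.get(file)), StandardCharsets.UTF_8);
if (content.startsWith("\uFEFF")) content = content.substring(1); // drop the decoded BOM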
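As a fuller illustration of the String-based approach, the following sketch reads a file and removes a leading U+FEFF after decoding. The class Utf8FileReader and the methods stripBom and readWithoutBom are hypothetical names, and the temporary file exists only to make the example self-contained.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, written for illustration only.
class Utf8FileReader {
    // Removes a single leading U+FEFF (the decoded BOM), if present.
    static String stripBom(String s) {
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    // Reads the whole file as UTF-8 and drops the decoded BOM.
    static String readWithoutBom(Path path) throws IOException {
        byte[] bytes = Files.readAllBytes(path);
        return stripBom(new String(bytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Write a small BOM-prefixed file to a temporary location.
        Path tmp = Files.createTempFile("bom-demo", ".txt");
        Files.write(tmp, new byte[] {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'o', 'k'});

        System.out.println(readWithoutBom(tmp)); // prints "ok"
        Files.delete(tmp);
    }
}
```

Stripping after decoding keeps the file access to a single call and leaves files without a BOM unchanged.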


If you must work with the file's raw bytes, remove the BOM marker before converting them to a string. The UTF-8 BOM is the three-byte sequence EF BB BF, and it appears only at the very start of the file:

if (tmp.length >= 3 &&
    tmp[0] == (byte) 0xEF &&
    tmp[1] == (byte) 0xBB &&
    tmp[2] == (byte) 0xBF) {

    // Remove the BOM marker
    tmp = Arrays.copyOfRange(tmp, 3, tmp.length);
}
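The same check can be wrapped in a small method so it is easy to reuse. This is a sketch under the assumption that the array holds the bytes from the start of the file; BomBytes and stripUtf8Bom are hypothetical names.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, written for illustration only.
class BomBytes {
    // Returns the data without a leading UTF-8 BOM (EF BB BF).
    // If no BOM is present, the original array is returned unchanged.
    static byte[] stripUtf8Bom(byte[] data) {
        if (data.length >= 3
                && data[0] == (byte) 0xEF
                && data[1] == (byte) 0xBB
                && data[2] == (byte) 0xBF) {
            return Arrays.copyOfRange(data, 3, data.length);
        }
        return data;
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'H', 'i'};
        String text = new String(stripUtf8Bom(raw), StandardCharsets.UTF_8);
        System.out.println(text); // prints "Hi"
    }
}
```

The casts to byte matter because Java bytes are signed: without them, a comparison such as data[0] == 0xEF would never be true.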
