Handling BOM Markers in UTF-8 File Reading
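When reading UTF-8 encoded text files that may begin with a Byte Order Mark (BOM), the BOM has to be handled explicitly, or it ends up in the output. A BOM is the Unicode character U+FEFF placed at the very start of a file. In UTF-16 and UTF-32 it signals the byte order; UTF-8 has no byte-order ambiguity, so there it serves only as a signature marking the file as UTF-8. Encoded in UTF-8, the BOM is the three-byte sequence EF BB BF.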
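To make the byte sequence concrete, here is a minimal sketch showing that Java decodes EF BB BF to the single character U+FEFF and keeps it in the resulting string. The class and method names (BomDecodeDemo, decodeUtf8) are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {
    /** Decodes raw bytes as UTF-8 (hypothetical helper for this demo). */
    public static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The UTF-8 BOM (EF BB BF) followed by the ASCII bytes for "Hi"
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'H', 'i'};
        String decoded = decodeUtf8(withBom);

        // The three BOM bytes become one char, which stays in the string
        System.out.println(decoded.length());              // prints 3
        System.out.println(decoded.charAt(0) == '\uFEFF'); // prints true
    }
}
```

The BOM is invisible when printed, which is why it tends to surface only later, for example when the first line fails an equals() comparison.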
Consider the following code:
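    fr = new FileReader(file); // decodes with the default charset (platform-dependent before Java 18)
    br = new BufferedReader(fr);
    String tmp = null;
    while ((tmp = br.readLine()) != null) {
        // Re-encodes the line with the default charset, then decodes the bytes as UTF-8
        String text = new String(tmp.getBytes(), "UTF-8");
        content += text + System.getProperty("line.separator");
    }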
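In this code, the BOM ends up in content: Java's UTF-8 decoder does not strip a leading BOM, so it survives as the character U+FEFF at the start of the first line. The getBytes() and new String(...) round trip does not remove it either. It only re-encodes each line with the default charset and decodes the result as UTF-8, which can additionally corrupt non-ASCII text when the default charset is not UTF-8. To handle the BOM correctly, you can use the following techniques: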
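Method 1: Specify the Character Set and Strip the BOM
The standard Java charsets do not include a separate "UTF-8 with BOM" variant, and decoding with UTF-8 leaves the BOM in place as U+FEFF. Specify the character set when opening the reader, using the Charset class, and strip that character from the first line:
    Charset charset = StandardCharsets.UTF_8;
    br = new BufferedReader(new InputStreamReader(new FileInputStream(file), charset));
    String tmp = br.readLine();
    if (tmp != null && tmp.startsWith("\uFEFF")) tmp = tmp.substring(1); // drop the BOM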
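Decoding as UTF-8 does not by itself remove the BOM, so the leading U+FEFF still has to be stripped from the first line. The following is a minimal, self-contained sketch of that idea. The class and method names (Utf8BomStripper, stripBom, readWithoutBom) are hypothetical, and it assumes the file fits comfortably in memory.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8BomStripper {
    private static final String BOM = "\uFEFF";

    /** Removes a single leading U+FEFF, if present. */
    public static String stripBom(String s) {
        return s.startsWith(BOM) ? s.substring(1) : s;
    }

    /**
     * Reads the whole file as UTF-8 and drops a leading BOM.
     * Every line is terminated with the platform line separator.
     */
    public static String readWithoutBom(Path file) throws IOException {
        StringBuilder content = new StringBuilder();
        // Decode as UTF-8 up front instead of re-encoding each line afterwards
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            boolean first = true;
            while ((line = br.readLine()) != null) {
                if (first) {
                    line = stripBom(line); // only the first line can carry the BOM
                    first = false;
                }
                content.append(line).append(System.lineSeparator());
            }
        }
        return content.toString();
    }
}
```

A call such as Utf8BomStripper.readWithoutBom(Paths.get("input.txt")) would then return the text without the marker; the file name here is only a placeholder.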
Method 2: Read and Drop the BOM
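If the BOM is not required, you can read and drop it before processing the rest of the file. Once decoded, the BOM is a single char (U+FEFF), not three, so read one character and rewind if it turns out to be something else. This assumes br decodes the file as UTF-8:
    br.mark(1); // remember the start of the stream
    if (br.read() != '\uFEFF') {
        br.reset(); // no BOM: rewind so the first character is not lost
    }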
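A self-contained sketch of the read-and-drop approach follows. It uses mark() and reset() so that the first character is not lost when the input has no BOM. The names BomSkipper and skipBom are hypothetical, and a StringReader stands in for a file that was opened as UTF-8.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class BomSkipper {
    /**
     * Consumes a leading U+FEFF from the reader, if there is one.
     * Returns true when a BOM was skipped, false otherwise.
     */
    public static boolean skipBom(BufferedReader br) throws IOException {
        br.mark(1); // allow rewinding by one character
        if (br.read() == '\uFEFF') {
            return true; // BOM consumed; the reader now points at real content
        }
        br.reset(); // not a BOM (or empty input): put the character back
        return false;
    }

    public static void main(String[] args) throws IOException {
        BufferedReader br = new BufferedReader(new StringReader("\uFEFFhello"));
        skipBom(br);
        System.out.println(br.readLine()); // prints hello
    }
}
```

Calling skipBom once, right after opening the reader and before the readLine() loop, keeps the rest of the reading code unchanged.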
By implementing one of these techniques, you can ensure that the BOM is handled correctly and that the output string does not include the BOM marker.