This article explains how to specify a character encoding when creating a file in Java, and works through the question in detail with sample code. It should be a useful reference for study or work, and I hope it helps.
Recommended study: "java Video Tutorial"
Foreword: I recently studied Java IO streams, and I wanted to practice and consolidate what I had learned by reading and writing files. While using the File class to create a file, I suddenly wondered: how do I specify the encoding the file uses? And then: how do I check the encoding of an existing file?
I first searched the Internet for an answer. The results were as follows:
FileOutputStream fos = new FileOutputStream("xxxx.txt"); OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8");
The code above roughly means that characters are encoded as UTF-8 when they are written to the file. That is different from what I expected: I wanted to specify the encoding when creating the file, like this:
File myfile = new File("test.txt", "UTF-8"); if (!myfile.exists()) myfile.createNewFile();
So I checked the official Java 8 API documentation. File does not provide a constructor that takes a character encoding. (The two-String constructor File(String parent, String child) exists, but it treats the second argument as a child path name, not as an encoding.)
File also provides no setter or getter for a character encoding, which indicates that character encoding is not an inherent property of a file. Creation time, modification time, and whether the file is readable, writable, or executable are inherent attributes of a file, its meta-information, and are part of the file. Character encoding is not.
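As a minimal sketch of this point, the following creates an empty file and prints the attributes that File does expose. The class name FileMetadataDemo and the temporary file name prefix are made up for illustration.

```java
import java.io.File;
import java.io.IOException;

class FileMetadataDemo {
    // Creates an empty temporary file; the prefix and suffix are arbitrary.
    static File createEmptyFile() throws IOException {
        File file = File.createTempFile("encoding-demo", ".txt");
        file.deleteOnExit();
        return file;
    }

    public static void main(String[] args) throws IOException {
        File file = createEmptyFile();
        // Metadata that really belongs to the file:
        System.out.println("length:       " + file.length());
        System.out.println("lastModified: " + file.lastModified());
        System.out.println("canRead:      " + file.canRead());
        System.out.println("canWrite:     " + file.canWrite());
        // File has no getEncoding() or setEncoding(...) method to call here.
    }
}
```

A newly created file contains zero bytes, so at this stage there is nothing an encoding could even apply to.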
We know that any information stored in a computer is a string of 0s and 1s, and text is no exception.
Processing characters involves two steps: encoding and decoding.
Encoding: "map" characters to a string of 0s and 1s.
Decoding: "map" a string of 0s and 1s back to characters.
Different character encodings, such as GBK and UTF-8, use different rules for encoding and decoding.
Take the same text string, the Chinese word for "China". Saved with UTF-8 encoding, each Chinese character generally takes three bytes (usually displayed as the hexadecimal form of the underlying string of 0s and 1s).
Saved with GBK encoding, each Chinese character takes two bytes.
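The byte counts above can be checked with a short sketch. It assumes the text is 中国 (written with Unicode escapes so the source file's own encoding does not matter) and that the JDK in use provides the GBK charset, which full JDK distributions normally do. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodeDemo {
    // Formats bytes as space-separated uppercase hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encodes the text with the given charset and returns the bytes as hex.
    static String encodeToHex(String text, Charset charset) {
        return toHex(text.getBytes(charset));
    }

    public static void main(String[] args) {
        String text = "\u4E2D\u56FD"; // the two characters of the word for "China"
        Charset gbk = Charset.forName("GBK"); // assumed to be available in this JDK
        System.out.println("UTF-8: " + encodeToHex(text, StandardCharsets.UTF_8));
        System.out.println("GBK:   " + encodeToHex(text, gbk));
    }
}
```

With this input, the UTF-8 line prints E4 B8 AD E5 9B BD (six bytes) and the GBK line prints D6 D0 B9 FA (four bytes): same characters, different bytes.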
When we write text in a text editor and save it, the editor "maps" the text to a string of 0s and 1s according to the character encoding you selected.
The encoding you select is just the conversion rule the editor uses to turn text into a string of 0s and 1s; it is not an attribute of the text.
When the editor opens a text file, what it displays is not the underlying string of 0s and 1s but text, because the editor uses some character encoding to decode the bytes into characters. If the encoding used for decoding is the same as, or compatible with, the one used for encoding, the text is displayed correctly. If it is different and incompatible, the characters come out garbled.
For example, I have a text file saved with GBK encoding whose content is the line "When will the bright moon come out".
I open the file in VS Code (a very easy-to-use text editor from Microsoft); in the terms used here, the editor decodes the file. VS Code uses UTF-8 by default, so it decodes the file as UTF-8. But the underlying bytes of my text are GBK-encoded (two bytes per character), so decoding them as UTF-8 inevitably produces garbled characters because the encoding and decoding rules do not match. Once you manually select GBK, the decoded file is no longer garbled. Garbled text also shows indirectly that character encoding is not an inherent attribute of the file.
I have said all this to make one point: character encoding is the rule used when encoding and decoding, not an inherent attribute of the file.
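The same effect can be reproduced in Java with a small sketch. It assumes the file's text is 明月几时有 (written with Unicode escapes) and that the GBK charset is available; the class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encodes text with one charset, then decodes the resulting bytes with another.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String text = "\u660E\u6708\u51E0\u65F6\u6709";
        Charset gbk = Charset.forName("GBK"); // assumed to be available in this JDK

        String wrong = roundTrip(text, gbk, StandardCharsets.UTF_8);
        String right = roundTrip(text, gbk, gbk);

        // Decoding GBK bytes as UTF-8 does not restore the original text.
        System.out.println("GBK bytes decoded as UTF-8 match original: " + wrong.equals(text));
        // Decoding with the same charset that encoded the text restores it.
        System.out.println("GBK bytes decoded as GBK match original:   " + right.equals(text));
    }
}
```

The bytes never change between the two decoding attempts; only the rule used to interpret them does.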
I can't help but wonder: why isn't character encoding stored as one of the file's attributes? Suppose it could be set, and we set it to GBK. The operating system would then have to enforce it. Just as the operating system refuses a write when a program tries to write to a non-writable file, it would have to ensure that every byte written to the file is valid GBK. Checking the legality of every write would carry a very large performance cost, and it may not even be possible to do reliably, because some byte sequences are valid in both GBK and UTF-8, which is ambiguous.
And what would be the point? So that an editor can pick the correct encoding from the attribute when opening the file? That is not necessary. A smart editor can often infer which encoding the bytes use from the file's content, for example from its first few bytes. In addition, you can manually choose the character encoding used for decoding.
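The ambiguity mentioned above can be illustrated with a sketch: one six-byte sequence that decodes without error under both UTF-8 and GBK, but to different characters. The specific bytes, and the class and method names, are chosen for illustration, and the GBK charset is assumed to be available.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class AmbiguousBytesDemo {
    // Strict decoding: a fresh CharsetDecoder throws on malformed input
    // instead of silently substituting a replacement character.
    static String strictDecode(byte[] bytes, Charset charset) throws CharacterCodingException {
        return charset.newDecoder().decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] bytes = {
            (byte) 0xEF, (byte) 0xBF, (byte) 0xBD,
            (byte) 0xEF, (byte) 0xBF, (byte) 0xBD
        };
        String asUtf8 = strictDecode(bytes, StandardCharsets.UTF_8);
        String asGbk = strictDecode(bytes, Charset.forName("GBK"));

        // Both decodings succeed, so validity alone cannot tell the encodings apart.
        System.out.println("UTF-8 chars: " + asUtf8.length());
        System.out.println("GBK chars:   " + asGbk.length());
        System.out.println("same text:   " + asUtf8.equals(asGbk));
    }
}
```

Because neither decoder rejects these bytes, a validity check alone could not decide which encoding the writer intended.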
When creating a file, you cannot specify its encoding. When writing text to a file (for example, pressing Ctrl+S in a text editor to save, which is essentially a write operation), you can choose the encoding rule used to convert the text into a string of 0s and 1s.
For Java programs, the code is as follows, which is the code mentioned at the beginning of the article:
FileOutputStream fos = new FileOutputStream("xxxx.txt"); OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-8");
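A fuller, runnable sketch of the same idea follows. It writes to a temporary file rather than xxxx.txt, and the class and helper names are made up for illustration. The encoding is chosen at write time and again at read time; the file itself only stores bytes.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class WriteWithEncodingDemo {
    // Writes text to the file, encoding characters with the given charset.
    // try-with-resources closes the writer and the underlying stream.
    static void writeText(Path path, String text, Charset charset) throws IOException {
        try (Writer writer = new OutputStreamWriter(new FileOutputStream(path.toFile()), charset)) {
            writer.write(text);
        }
    }

    // Reads all bytes and decodes them with the given charset.
    static String readText(Path path, Charset charset) throws IOException {
        return new String(Files.readAllBytes(path), charset);
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("encoding-demo", ".txt");
        String text = "\u4E2D\u56FD";

        writeText(path, text, StandardCharsets.UTF_8);
        System.out.println("bytes on disk (UTF-8): " + Files.size(path));
        System.out.println("read back correctly:   "
                + readText(path, StandardCharsets.UTF_8).equals(text));

        Files.delete(path);
    }
}
```

Passing a Charset constant such as StandardCharsets.UTF_8 instead of the string "UTF-8" avoids the checked UnsupportedEncodingException that the String-based constructor can throw.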
The above is a detailed look at how to specify an encoding when creating a file in Java. For more information, see the other related articles on the PHP Chinese website!