Reposted from: coolcode.cn
A few days ago I wrote an article on how to make a web page display correctly under any character set. The idea is very simple: every character outside the first 128 is written as a numeric character reference (NCR). I did not describe the actual conversion method because it seemed too simple at the time, but someone has since asked about it, so I will explain it in detail here.
The first step is to convert the string from its source character set into UTF-16. The reason is that each character in UTF-16 occupies two bytes (apart from the rarely needed characters outside the Basic Multilingual Plane, which take four), so the later processing is easy, whereas working directly on the source character set would be very complicated. The source character set can be read from the meta tag of the original web page, or it can be specified separately. My program lets the user specify the source character set in a form, because I cannot guarantee that the submitted file is an HTML file. Other files work too: for example, the source file of the WordPress Chinese language pack is a .po file, and its content can be processed in the same way. Even an HTML file does not necessarily contain a meta tag that specifies the character set, so asking for the character set separately through the form is the safer choice.
You may think that converting one character set to another is complicated. It is indeed very troublesome to implement yourself, but it is easy in PHP, which already provides functions for it. The iconv function converts between a wide range of character sets. If the iconv extension is not installed on your machine, you can use the mb_convert_encoding function instead. If the Multibyte String extension is not installed either, there is not much you can do, because writing converters for that many encodings yourself is basically impossible unless you are a top expert. I recommend iconv because it is more efficient and supports more character sets.
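As a rough illustration of this first step, the conversion with its fallback could be sketched as follows. The function name to_utf16be is hypothetical, and the choice of UTF-16BE as the target (which fixes the byte order) is an assumption made for this sketch; the original script may be written differently.

```php
<?php
// Hypothetical helper (not from the original script): convert $text from
// $sourceCharset to big-endian UTF-16, preferring iconv over mbstring.
function to_utf16be(string $text, string $sourceCharset): string
{
    if (function_exists('iconv')) {
        $converted = iconv($sourceCharset, 'UTF-16BE', $text);
    } elseif (function_exists('mb_convert_encoding')) {
        // Note the argument order: target encoding first, then source.
        $converted = mb_convert_encoding($text, 'UTF-16BE', $sourceCharset);
    } else {
        throw new RuntimeException('Neither iconv nor mbstring is available.');
    }

    if ($converted === false) {
        throw new RuntimeException("Could not convert from $sourceCharset.");
    }
    return $converted;
}
```

Here $sourceCharset would be the value the user entered in the form, for example 'GBK' or 'UTF-8'.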
With that done, the next step is to process the string two bytes at a time. Each pair of bytes is converted directly into a number, which is the xxxxx in &#xxxxx;. If the number is less than 128, output the character itself (note that it becomes a single byte here); otherwise output it in the form &#xxxxx;. One thing to watch for: when the number is 65279 (hexadecimal 0xFEFF), skip it. That is the Unicode byte order mark, and since the output string now contains only the first 128 characters of ISO-8859-1, we have no need for it.
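The loop described above might look roughly like the following. This is a minimal sketch, not the contents of nochaoscode.php: the function name utf16be_to_ncr is hypothetical, and it assumes the input is big-endian UTF-16. Code points above U+FFFF, which UTF-16 stores as surrogate pairs, are not treated specially here.

```php
<?php
// Hypothetical helper (not from the original script): turn a big-endian
// UTF-16 string into ASCII text with NCRs for everything above 127.
function utf16be_to_ncr(string $utf16): string
{
    $out = '';
    // Work on whole two-byte units only; ignore a dangling odd byte.
    $length = strlen($utf16) - (strlen($utf16) % 2);

    for ($i = 0; $i < $length; $i += 2) {
        // Combine the high and low byte into one number.
        $code = (ord($utf16[$i]) << 8) | ord($utf16[$i + 1]);

        if ($code === 0xFEFF) {
            continue; // skip the byte order mark
        }

        // Below 128: emit the single-byte character itself.
        // Otherwise: emit the decimal NCR form.
        $out .= $code < 128 ? chr($code) : '&#' . $code . ';';
    }
    return $out;
}
```

For example, utf16be_to_ncr("\x00H\x00i\x4E\x2D") returns Hi&#20013; because 0x4E2D is 20013 in decimal.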
That is the basic idea. Here is the program that implements it:
Download: nochaoscode.php