This article closes out a problem I could not solve while working on a search engine project a few years ago. It is of little practical use now, but it makes up for one of my regrets.
The situation was this. Most people type ordinary search terms into the search box and submit. But there are always some users who think they are clever: they copy the URL from the address bar and edit the parameter by hand. The URL looks like http://www.xxx.com/search?keyword=%E4%B8%AD%E6%96%87 (that is how IE displays it; Chrome and Firefox show the Chinese characters in the address bar). When the user submits http://www.xxx.com/search?keyword=中文 from IE, the server (the web backend) cannot recognize the characters at all. The reason is that when a browser submits a request to the backend, its parameters are expected to be URL-encoded (percent-encoded), so that the query string contains only ASCII characters, which a container that decodes as ISO-8859-1 can handle. With IE, we have to do this encoding ourselves when writing web pages; with Chrome and Firefox it is optional, because they encode automatically when sending the request.
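To make the two URL forms concrete, here is a small Python sketch (Python is used only for illustration; it is not part of the original setup). It shows how the same keyword is percent-encoded under UTF-8 and under GBK, which is why both sides have to agree on the encoding.

```python
from urllib.parse import quote, unquote

keyword = "中文"

# Percent-encode the keyword using two different character encodings.
utf8_form = quote(keyword, encoding="utf-8")
gbk_form = quote(keyword, encoding="gbk")

print(utf8_form)  # %E4%B8%AD%E6%96%87
print(gbk_form)   # %D6%D0%CE%C4

# Decoding only round-trips when the same encoding is used on both sides.
assert unquote(utf8_form, encoding="utf-8") == keyword
assert unquote(gbk_form, encoding="gbk") == keyword
```

The UTF-8 form is the one that appears in the example URL above; the GBK form is what a GBK client would produce for the same two characters.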
The backend failing to recognize the characters is what we usually call garbled text (mojibake). It is caused by a decoding error. The web container or framework (Jetty/Tomcat/JBoss in Java, Django in Python) automatically URL-decodes the query string. Here, the characters IE submitted were never encoded, yet they are decoded anyway, and as you can imagine they can never be recovered (how many people, like me, have seen this kind of garbage and tried every remedy in desperation).
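The effect can be reproduced in a few lines. The sketch below assumes the browser sent the raw GBK bytes for the keyword and that the backend decodes them as UTF-8 or as ISO-8859-1; which decoder a given container actually applies depends on its configuration.

```python
# Raw GBK bytes for "中文", as a browser on a GBK system might send them unencoded.
raw_gbk = "中文".encode("gbk")

# A backend that assumes UTF-8 cannot decode these bytes; invalid
# sequences are replaced with U+FFFD (the replacement character).
as_utf8 = raw_gbk.decode("utf-8", errors="replace")

# A backend that assumes ISO-8859-1 decodes without error, but to the wrong characters.
as_latin1 = raw_gbk.decode("iso-8859-1")

print(repr(as_utf8))
print(as_latin1)  # ÖÐÎÄ

# After the lossy UTF-8 decode, re-encoding no longer yields the original bytes.
assert as_utf8.encode("utf-8") != raw_gbk
```

Once the replacement characters are in the string, the original bytes are gone, which is why the fix has to happen before the container decodes the parameter.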
OK, there are two ways to solve this problem. The first is to handle it before the request reaches the web backend (it cannot be done in JavaScript, because the user presses Enter directly in the address bar); that is, the server in front of the backend (nginx) preprocesses the request and URL-encodes any unencoded characters. The second is to rewrite and recompile the web container's servlet parameter-decoding logic so that it first decides whether a value needs URL-decoding at all.
Given the difficulty of the second approach, I chose the first: handle it in nginx, use Lua inside nginx to transcode the parameter, and then reverse-proxy the request to the web backend.
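Either approach needs some way to tell an already-encoded value from a raw one. The following is a minimal sketch of one possible heuristic, not the check used in the original project: a raw value contains bytes above 0x7F, while a properly encoded one is pure ASCII with `%XX` escapes. The function names `needs_encoding` and `looks_percent_encoded` are hypothetical.

```python
import re

# Matches one percent-encoded byte such as %E4.
_PCT_BYTE = re.compile(r"%[0-9A-Fa-f]{2}")


def needs_encoding(raw_value: bytes) -> bool:
    """Return True if the raw query value contains non-ASCII bytes,
    meaning the browser sent it without percent-encoding."""
    return any(b > 0x7F for b in raw_value)


def looks_percent_encoded(raw_value: bytes) -> bool:
    """Return True if the value is pure ASCII and has at least one %XX escape."""
    if needs_encoding(raw_value):
        return False
    return _PCT_BYTE.search(raw_value.decode("ascii")) is not None


print(needs_encoding("中文".encode("gbk")))
print(looks_percent_encoded(b"%E4%B8%AD%E6%96%87"))
```

A heuristic like this cannot distinguish a literal percent sign typed by the user from an escape, so it is a starting point rather than a complete answer.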
The details depend on your own project. There are several things to watch for, such as whether your project uses UTF-8 or GBK, and whether the client's environment is UTF-8 or GBK; each combination needs different handling. In my case the browser runs on Windows, so the client encoding is GBK, while my project is UTF-8, so before URL-encoding I need to convert GBK -> UTF-8.
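set_by_lua $arg_name '
    local iconv = require("luaiconv")
    local cd = iconv.new("utf-8", "gbk")  -- iconv.new(to, from)
    local value = ngx.var.arg_name or ""
    -- Plain find: a bare "%" is not a valid Lua pattern.
    if string.find(value, "%", 1, true) then
        return value  -- already percent-encoded
    end
    -- Raw GBK bytes: convert to UTF-8, then percent-encode.
    local converted, err = cd:iconv(value)
    return ngx.escape_uri(converted or value)
';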
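The conversion step can be exercised outside nginx. Below is a rough Python equivalent of the idea, written under a few assumptions that are not in the original: the function name `normalize_keyword` is hypothetical, pure-ASCII values are passed through untouched, and bytes that are already valid UTF-8 are not transcoded again.

```python
from urllib.parse import quote


def normalize_keyword(raw_value: bytes, client_encoding: str = "gbk") -> str:
    """Return a UTF-8 percent-encoded form of a raw query value."""
    if all(b < 0x80 for b in raw_value):
        # Pure ASCII: plain text or already percent-encoded; pass through.
        return raw_value.decode("ascii")
    try:
        # Keep the bytes as they are when they are already valid UTF-8.
        text = raw_value.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise assume the client encoding (GBK in the setup described here).
        text = raw_value.decode(client_encoding)
    # Percent-encode the UTF-8 bytes of the decoded text.
    return quote(text, safe="", encoding="utf-8")


print(normalize_keyword("中文".encode("gbk")))  # %E4%B8%AD%E6%96%87
```

Trying UTF-8 first is a heuristic: a short GBK byte sequence can occasionally also be valid UTF-8, so a deployment that knows its clients are all GBK may prefer to skip that branch.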
Three years ago, Google was the only search engine that could handle Chinese characters typed directly into the IE address bar. Looking back today, many companies have done the same.