python - Decoding requests headers
黄舟 2017-04-18 09:35:48

I need to download some files with Python's requests, but the files have Chinese names.

The file name shown in Chrome's developer tools is:

Content-Disposition:attachment; filename=%C9%F1%BC%B6%BB%F5%C0%C9.txt

But what requests shows is garbled:

import requests
url = 'http://www.23us.so/modules/article/txtarticle.php?id=156'
req = requests.head(url)
headers = req.headers
print( headers.get('Content-Disposition'))

>> attachment; filename=ÇàÔÆÏÉ·.txt

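I tried setting req.encoding, but it had no effect.

How can I restore the text in the header? requests does not seem to have a method for this.
Feel free to debug it yourselves.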


All replies (3)
小葫芦

Ahem, you should have posted the actual link earlier. Here is the method and the code as a follow-up:

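import requests

url = 'http://www.23us.so/modules/article/txtarticle.php?id=156'
req = requests.head(url)
headers = req.headers
# Undo the decoding that was applied to the header, then decode the bytes as GBK.
print(headers.get('Content-Disposition').encode(req.encoding).decode('gbk'))  # gb2312 also decodes correctly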

Result:

attachment; filename=青云仙路.txt
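Why the encode-then-decode round trip works can be shown without any network access. The following is a minimal sketch, assuming the server sent the file name as raw GBK bytes and that the HTTP layer decoded the header as ISO-8859-1 (which is what Python 3's http.client does with header lines). The variable names are illustrative only.

```python
# Simulate the bytes a GBK server would put on the wire for the file name.
wire_bytes = "青云仙路.txt".encode("gbk")

# Assumption: the HTTP layer decodes header bytes as ISO-8859-1,
# which maps every byte to exactly one character and never fails.
garbled = wire_bytes.decode("iso-8859-1")

# Reversing that step recovers the original bytes exactly; decoding them
# with the encoding the server really used restores the name.
restored = garbled.encode("iso-8859-1").decode("gbk")
print(restored)
```

Because ISO-8859-1 is a lossless one-byte-per-character mapping, the garbled string still carries every original byte, which is why the name can be recovered at all.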

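You can simply let req.encoding guess the target's encoding.
The comment at line 769 of models.py in the requests module says it clearly: requests can automatically detect the encoding of the target page's content, and the code that actually performs the detection is in universaldetector.py.
So we only need to use this feature to encode the text and then decode it as utf-8. Look at the code: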

import requests


url = "http://www.weather.com.cn/data/cityinfo/101010100.html"
req = requests.get(url)
print(req.text)
print(req.encoding)
print(req.text.encode(req.encoding))
print(req.text.encode(req.encoding).decode('utf-8'))

Result:

{"weatherinfo":{"city":"北京","cityid":"101010100","temp1":"-2℃","temp2":"16℃","weather":"晴","img1":"n0.gif","img2":"d0.gif","ptime":"18:00"}}
ISO-8859-1
b'{"weatherinfo":{"city":"\xe5\x8c\x97\xe4\xba\xac","cityid":"101010100","temp1":"-2\xe2\x84\x83","temp2":"16\xe2\x84\x83","weather":"\xe6\x99\xb4","img1":"n0.gif","img2":"d0.gif","ptime":"18:00"}}'
{"weatherinfo":{"city":"北京","cityid":"101010100","temp1":"-2℃","temp2":"16℃","weather":"晴","img1":"n0.gif","img2":"d0.gif","ptime":"18:00"}}
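For a response body, the round trip can also be avoided by decoding the raw bytes directly. Here is a minimal sketch, assuming the body is UTF-8 JSON as in the output above; `body` is a hypothetical, shortened stand-in for what `Response.content` would return.

```python
import json

# Hypothetical stand-in for Response.content: the raw UTF-8 bytes of a JSON body.
body = b'{"weatherinfo":{"city":"\xe5\x8c\x97\xe4\xba\xac","cityid":"101010100"}}'

# Decode the bytes with the real encoding instead of re-encoding wrongly decoded text.
text = body.decode("utf-8")
city = json.loads(text)["weatherinfo"]["city"]
print(city)
```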
伊谢尔伦

If you display all the headers, there should be a charset attribute.


Update

This is actually URI encoding (percent-escaping), escaped from Unicode.
A decoding example follows:

def decodeURI(strURI):
    strURI = strURI.replace('%','')
    URI = ''.join((chr(int(strURI[i:i+4],16)) for i in range(0,len(strURI),4)))
    return URI

n = '%C9%F1%BC%B6%BB%F5%C0%C9'
print(decodeURI(n))

Result:

짱벶믵색
That's Korean~~
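The Korean output comes from treating each pair of escaped bytes as a single Unicode code point. A small sketch of the difference, using only the first two bytes of the sample (the gb2312 interpretation is the one the next update arrives at):

```python
# The first two percent-escaped bytes from the sample, %C9%F1.
first_pair = bytes.fromhex("C9F1")

# Reading the pair as one code point, U+C9F1, lands in the Hangul Syllables block.
as_code_point = chr(int.from_bytes(first_pair, "big"))

# Reading the same two bytes as a gb2312-encoded character gives a Chinese character.
as_gb2312 = first_pair.decode("gb2312")

print(as_code_point, as_gb2312)
```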


Update again

After thinking about it carefully, it might be another encoding format, so I tried it with gb2312.

n = '%C9%F1%BC%B6%BB%F5%C0%C9'
print(bytes.fromhex(n.replace('%','')).decode('gb2312'))

The result is:

神级货郎

I think this is more reliable~
urllib already provides these methods: quote and unquote.
Example:

import urllib.parse  # a bare "import urllib" does not reliably load the parse submodule

n = 'filename=%C9%F1%BC%B6%BB%F5%C0%C9.txt'
filename = urllib.parse.unquote(n,encoding='gb2312')
print(filename)

The result is:

filename=神级货郎.txt
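Applied to a whole header value, the same call can be wrapped in a small helper. This is only a sketch: `filename_from_disposition` is a hypothetical name, the regular expression handles just the simple `filename=...` form shown in the question, and the default encoding is an assumption that has to match what the server actually used.

```python
import re
from urllib.parse import unquote


def filename_from_disposition(header, encoding="gbk"):
    """Extract and percent-decode the filename from a Content-Disposition value."""
    # Only the plain filename=... form is handled, with optional double quotes.
    match = re.search(r'filename="?([^";]+)"?', header)
    if match is None:
        return None
    # unquote turns the %XX escapes back into bytes and decodes them with `encoding`.
    return unquote(match.group(1), encoding=encoding)


print(filename_from_disposition(
    "attachment; filename=%C9%F1%BC%B6%BB%F5%C0%C9.txt"))
```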

Third update

Let me explain the principle~
If you don't know the charset, you can only guess; requests also uses chardet to guess.
Also, the req.encoding that @ferstar mentions applies to the response body (Response.content) and cannot be used for headers.
Before the questioner provided the code and the web link, I could only use the data the questioner gave:

filename=%C9%F1%BC%B6%BB%F5%C0%C9.txt

Look carefully: this is a string, not bytes! So req.encoding has no effect. As I mentioned before, this is actually a URI-escaped value, escaped from the bytes of the original characters in some encoding. % is the escape character for URIs.
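When the charset really is unknown, one way to guess is to turn the escapes back into bytes and try a few candidate encodings in order. A minimal sketch, assuming a hypothetical helper `decode_percent_escaped` and a candidate list chosen for this example. A decode that succeeds is not proof that the guess is right, so strict encodings such as utf-8 go first.

```python
from urllib.parse import unquote_to_bytes


def decode_percent_escaped(value, candidates=("utf-8", "gbk")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    # %XX escapes become raw bytes; other ASCII characters pass through unchanged.
    raw = unquote_to_bytes(value)
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # ISO-8859-1 never fails, so it serves as a last-resort fallback.
    return raw.decode("iso-8859-1"), "iso-8859-1"


print(decode_percent_escaped("%C9%F1%BC%B6%BB%F5%C0%C9.txt"))
```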

I have already written the restoration method above, and the result is correct.

Why not accept the correct answer?


Fourth update

I didn’t want to update this post, but @ferstar made a long comment, so it would be inappropriate not to reply~

After the questioner updated and improved the question, I followed up with an answer promptly, and it produced the correct result and solved the questioner's problem. That is a fact. Your answer did not appear to have been updated before the questioner accepted mine. That is also a fact. I had already posted where the relevant source code lives, which is reasonable as well. And the req.encoding approach I mentioned does work, rather than being useless for headers as you claim, which also seems to be true.

(End of the quote from @ferstar's comment.)

SF keeps a version history of content updates; check the versions and compare them.

Asker: ider
Accepted answer: ferstar

Three hours after I updated the correct answer #r3, @ider updated the question #r4 and accepted the first version of @ferstar's answer #r1, which was wrong.

After it was accepted, I raised objections in the comments, and @ferstar then updated to the second version of the answer #r2.

Also, the second version of @ferstar's answer is still wrong.

But then why does the second version of @ferstar's answer give the correct result? Because I had already found the correct encoding, gb2312; he just replaced it with the compatible encoding gbk.

One more thing: the conclusion that req.encoding cannot be applied to headers still stands. This follows from how HTTP works: the headers come before the body.

As for the correct way to write this program, I am too lazy to explain and update it, tired!
Unless @ider re-adopts my answer, I might consider it~~

洪涛

Your file name is encoded with gb2312, so it also has to be decoded as gb2312. If it is decoded as utf-8, garbled characters appear. The decoding you are using probably defaults to utf-8.
