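I need to download some files with Python's requests, but the files have Chinese names.
The file name shown in Chrome's developer tools is
Content-Disposition:attachment; filename=%C9%F1%BC%B6%BB%F5%C0%C9.txt
but what requests shows is garbled: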
import requests
url = 'http://www.23us.so/modules/article/txtarticle.php?id=156'
req = requests.head(url)
headers = req.headers
print(headers.get('Content-Disposition'))
>> attachment; filename=ÇàÔÆÏÉ·.txt
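I tried setting req.encoding, but it had no effect.
How can I recover the text in the header? requests does not seem to have a method for this.
Feel free to debug it yourselves.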
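Ahem, you should have posted the specific link earlier. Here is my follow-up with the method and code.
You can simply let req.encoding guess the target's encoding by itself. The comment at line 769 of models.py in the requests module says it very clearly: requests can automatically detect the encoding of the target page content, and the code that actually performs the detection is in universaldetector.py.
So we only need to take advantage of this feature to encode, and then decode as utf-8.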
If you display all the headers, there should be a charset attribute.
Update
This is actually URI encoding: the name is percent-escaped from Unicode.
Decoding it on that assumption gives:
짱벶믵색
which is Korean~~
Update again
After thinking about it more carefully, the name might have been escaped from a different encoding, so I tried gb2312.
I think that result is more reliable~
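The comparison described above can be sketched roughly as follows. This is only an illustration, not the answerer's original code: it assumes Python 3, uses `unquote_to_bytes` from `urllib.parse` to undo the %XX escapes, and then tries a few candidate charsets. The candidate list (including `utf-16-be` as a stand-in for "treat each byte pair as a Unicode code unit") is my assumption.

```python
from urllib.parse import unquote_to_bytes

# Percent-escaped stem copied from the question's Content-Disposition header.
escaped_stem = "%C9%F1%BC%B6%BB%F5%C0%C9"

# Undo the %XX escapes only; this yields raw bytes with no charset applied yet.
raw = unquote_to_bytes(escaped_stem)

# The candidate charsets are guesses; the header itself declares none.
for charset in ("utf-8", "utf-16-be", "gb2312"):
    try:
        print(f"{charset:>9}: {raw.decode(charset)}")
    except UnicodeDecodeError:
        print(f"{charset:>9}: not valid in this charset")
```

A charset that fails to decode can be ruled out; among those that succeed, you still have to judge by eye which result reads like a real file name.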
These methods are available in urllib; they are quote and unquote.
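A minimal round-trip sketch of these two functions, assuming Python 3, where they live in `urllib.parse` and accept an `encoding` argument. The sample name and the choice of `gbk` are placeholders of mine, not values taken from the question.

```python
from urllib.parse import quote, unquote

name = "中文文件名.txt"  # hypothetical sample name, not the file from the question

# quote() encodes the text to bytes with the given charset, then %XX-escapes them.
escaped = quote(name, encoding="gbk")

# unquote() reverses both steps, provided the same charset is passed back in.
restored = unquote(escaped, encoding="gbk")

print(escaped)
print(restored == name)
```

If `unquote` is called with a different charset than the one used for escaping, it does not raise by default; it substitutes replacement characters, so a wrong guess shows up as garbled text rather than an error.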
Third update
Let me explain the principle~
Take the asker's filename=%C9%F1%BC%B6%BB%F5%C0%C9.txt. If you don't know the charset, you can only guess; requests also uses chardet to guess. Also, the req.encoding that @ferstar mentioned applies to the response body (Response.content) and cannot be used for headers.
Before the asker provided the code and the page link, I could only work with the data the asker had given.
Look carefully: this is a string, not bytes! So req.encoding has no effect here. As I mentioned before, this is actually a URI escape, produced from some encoding of the original characters; % is the URI escape character.
I have already written the restoration method above, and the result is correct.
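To make the principle concrete, here is a small sketch that treats the header value as a plain string and unescapes only the filename parameter. The helper name `filename_from_disposition` and its default charset are hypothetical; the charset still has to be guessed or known in advance, and the parsing is deliberately naive (it does not handle quoted semicolons or the `filename*` form).

```python
from urllib.parse import unquote


def filename_from_disposition(header_value, charset="gbk"):
    """Return the unescaped filename= parameter, or None if there is none."""
    for part in header_value.split(";"):
        key, _, value = part.strip().partition("=")
        if key.lower() == "filename":
            # The value is a str of %XX escapes, so no response encoding applies;
            # the charset argument is our own guess.
            return unquote(value.strip('"'), encoding=charset)
    return None


# Header value as shown by Chrome in the question.
header = "attachment; filename=%C9%F1%BC%B6%BB%F5%C0%C9.txt"
name = filename_from_disposition(header)
print(name)
```

With requests, the string passed in would be `req.headers.get('Content-Disposition')`; nothing here depends on `req.encoding`.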
Why not accept the correct answer?
After the asker updated and improved the question, I followed up on my answer promptly, and I was able to get the correct result and solve the asker's problem. That is a fact. Your answer did not seem to have been updated before the asker accepted mine. That is also a fact. I had already posted the specific source code that implements this, which is reasonable, and that is even more of a fact. And the req.encoding method I mentioned does play a role, rather than being useless for headers as you said; that seems to be true as well.
The above quotes @ferstar's comment in full.
SF content updates keep historical version records; check them out and compare them.
Asker: ider
Accepted answer: ferstar
I raised objections in the comments, and @ferstar updated his answer to a second version, #r2.
3 hours after I updated the correct answer #r3, @ider updated the question (#r4) and accepted the first version of @ferstar's incorrect answer, #r1.
Also, the second version of @ferstar's answer is still wrong. Then why does it give the correct result? Because I had already found the correct encoding, gb2312, and he just replaced it with the compatible encoding gbk.
One more thing: req.encoding cannot act on headers. That conclusion still has not changed. It is determined by how HTTP works: headers come before the body.
As for the correct way to write this program, I am too lazy to explain and update it. Tired!
Unless @ider re-accepts my answer; then I might consider it~~
Your file name is encoded in gb2312, so it also has to be decoded as gb2312. If it is decoded as utf-8, garbled characters will appear. Perhaps your setup decodes as utf-8 by default.
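For the garbled output shown in the question (raw bytes rather than %XX escapes), a possible recovery is sketched below. It rests on an assumption: that the header reached Python as a str decoded byte-for-byte as ISO-8859-1, which is how Python 3's http.client decodes header lines. The helper name `recover_header_text` and the sample name are hypothetical.

```python
def recover_header_text(garbled, charset="gb2312"):
    """Re-decode a header value that was wrongly decoded as ISO-8859-1."""
    # latin-1 maps code points 0-255 straight back to bytes 0-255, so this
    # encode step recovers the exact bytes the server sent.
    return garbled.encode("latin-1").decode(charset)


original = "中文文件名.txt"  # hypothetical sample name

# Simulate the mis-decoding: gb2312 bytes read as if they were latin-1.
garbled = original.encode("gb2312").decode("latin-1")

recovered = recover_header_text(garbled)
print(recovered == original)
```

If the server's real charset is not gb2312, the decode step will either raise `UnicodeDecodeError` or return different garbled text, so the charset argument remains a guess to be checked against the result.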