When using the third-party library requests, you can convert the encoding like this:
import requests
html = requests.get('http://example.com')
html.encoding = 'utf-8'
Question:
When using Scrapy's Request, how do I convert the encoding of the fetched content to UTF-8?
demo:
import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        yield {
            'title': response.css('h1 a::text').extract_first(),
            'votes': response.css('.question .vote-count-post::text').extract_first(),
            'body': response.css('.question .post-text').extract_first(),
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }
Trying to answer your question, I feel your understanding of Python encodings is a bit off.
1. Both requests and Scrapy are just implementations of the HTTP protocol.
The encoding of the response message is determined by the website being visited: the charset is declared in the HTTP response headers.
For example:
import requests

r = requests.get('http://www.baidu.com')
print(r.headers['Content-Type'])
Output:
text/html;charset=UTF-8
This shows that the response body is encoded in UTF-8.
scrapy.Request works exactly the same way, as the sketch below shows.
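For instance, here is a minimal sketch of inspecting the encoding inside a Scrapy callback (reusing the parse_question callback name from the demo above; the rest of the spider is assumed):

def parse_question(self, response):
    # the charset declared by the server, e.g. b'text/html; charset=utf-8'
    print(response.headers.get('Content-Type'))
    # Scrapy's TextResponse works the encoding out for you and exposes it here
    print(response.encoding)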
2. If the returned charset is gb2312, you can decide, based on what your code needs, whether to transcode it to the encoding you want.
r = requests.get('http://www.baidu.com')
# r.content is raw bytes; decode with the charset the site declared
print(r.content[:1000].decode('utf-8'))
# re-encode the decoded text if another charset is needed
print(r.content[:1000].decode('utf-8').encode('gbk'))
Just use decode() and encode(); it is the same regardless of whether you are using Scrapy.
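Applied to the demo spider above, a minimal sketch (assuming the page comes back as an HTML response, so response.body holds the raw bytes and response.encoding the detected charset):

def parse_question(self, response):
    # raw bytes exactly as received from the server
    raw = response.body
    # decode with the site's own charset, then re-encode as UTF-8
    body_utf8 = raw.decode(response.encoding).encode('utf-8')
    # alternatively, response.text is already decoded to a unicode string
    yield {
        'title': response.css('h1 a::text').extract_first(),
        'body': body_utf8,
        'link': response.url,
    }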