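I don't really understand how Scrapy passes data around, and I have been stuck on this for almost half a month. I have read a lot of material and still don't get it. My fundamentals are weak, so I'm asking for help here.
Leaving custom setups aside, take Scrapy's default behaviour as the example.
What format does the thing a spider returns need to be in?
A dict? {a:1,b:2,.....}
Or [{a:1,aa:11},{b:2,bb:22},{......}]
Where does the returned value go?
Is it the item in the code below?
class pipeline :
    def process_item(self, item, spider):
I'm very much a beginner, but I really want to learn and hope to get your help. My code is below; please point out its shortcomings.
spider: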
# -*- coding: utf-8 -*-
import scrapy
from pm25.items import Pm25Item
import re

class InfospSpider(scrapy.Spider):
    name = "infosp"
    allowed_domains = ["pm25.com"]
    start_urls = ['http://www.pm25.com/rank/1day.html', ]

    def parse(self, response):
        item = Pm25Item()
        re_time = re.compile(r"\d+-\d+-\d+")
        date = response.xpath("/html/body/p[4]/p/p/p[2]/span").extract()[0]  # extract the DATE on its own
        # items = []
        selector = response.selector.xpath("/html/body/p[5]/p/p[3]/ul[2]/li")  # narrow the response down to the rows to parse
        for subselector in selector:  # parse the rows one by one
            try:  # guard against [0] raising an error
                rank = subselector.xpath("span[1]/text()").extract()[0]
                quality = subselector.xpath("span/em/text()")[0].extract()
                city = subselector.xpath("a/text()").extract()[0]
                province = subselector.xpath("span[3]/text()").extract()[0]
                aqi = subselector.xpath("span[4]/text()").extract()[0]
                pm25 = subselector.xpath("span[5]/text()").extract()[0]
            except IndexError:
                print(rank,quality,city,province,aqi,pm25)
            item['date'] = re_time.findall(date)[0]
            item['rank'] = rank
            item['quality'] = quality
            item['province'] = city
            item['city'] = province
            item['aqi'] = aqi
            item['pm25'] = pm25
            # items.append(item)
            yield item  # I don't understand how to use this or what format comes out;
            # some tutorials return items instead, so I would appreciate some guidance
pipeline:
import time

class Pm25Pipeline(object):
    def process_item(self, item, spider):
        today = time.strftime("%y%m%d",time.localtime())
        fname = str(today) + ".txt"
        with open(fname,"a") as f:
            for tmp in item:  # not sure whether this is written correctly;
                # my understanding is that what the spider returns is the yielded dicts,
                # [{a:1,aa:11},{b:2,bb:22},{......}]
                f.write(tmp["date"] + '\t' +
                        tmp["rank"] + '\t' +
                        tmp["quality"] + '\t' +
                        tmp["province"] + '\t' +
                        tmp["city"] + '\t' +
                        tmp["aqi"] + '\t' +
                        tmp["pm25"] + '\n'
                        )
            f.close()
        return item
items:
import scrapy

class Pm25Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    date = scrapy.Field()
    rank = scrapy.Field()
    quality = scrapy.Field()
    province = scrapy.Field()
    city = scrapy.Field()
    aqi = scrapy.Field()
    pm25 = scrapy.Field()
    pass
Part of the error output from the run:
Traceback (most recent call last):
File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '30',
'city': '新疆',
'date': '2017-04-02',
'pm25': '13 ',
'province': '伊犁哈萨克州',
'quality': '优',
'rank': '357'}
Traceback (most recent call last):
File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '28',
'city': '西藏',
'date': '2017-04-02',
'pm25': '11 ',
'province': '林芝',
'quality': '优',
'rank': '358'}
Traceback (most recent call last):
File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '28',
'city': '云南',
'date': '2017-04-02',
'pm25': '11 ',
'province': '丽江',
'quality': '优',
'rank': '359'}
Traceback (most recent call last):
File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '27',
'city': '云南',
'date': '2017-04-02',
'pm25': '15 ',
'province': '玉溪',
'quality': '优',
'rank': '360'}
Traceback (most recent call last):
File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '26',
'city': '云南',
'date': '2017-04-02',
'pm25': '10 ',
'province': '楚雄州',
'quality': '优',
'rank': '361'}
Traceback (most recent call last):
File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '24',
'city': '云南',
'date': '2017-04-02',
'pm25': '11 ',
'province': '迪庆州',
'quality': '优',
'rank': '362'}
Traceback (most recent call last):
File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.scraper] ERROR: Error processing {'aqi': '22',
'city': '云南',
'date': '2017-04-02',
'pm25': '9 ',
'province': '怒江州',
'quality': '优',
'rank': '363'}
Traceback (most recent call last):
File "d:\python35\lib\site-packages\twisted\internet\defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "D:\pypro\pm25\pm25\pipelines.py", line 23, in process_item
tmp["pm25"] + '\n'
TypeError: string indices must be integers
2017-04-03 10:23:14 [scrapy.core.engine] INFO: Closing spider (finished)
2017-04-03 10:23:14 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 328,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 38229,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 4, 3, 2, 23, 14, 972356),
'log_count/DEBUG': 2,
'log_count/ERROR': 363,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 4, 3, 2, 23, 13, 226730)}
2017-04-03 10:23:14 [scrapy.core.engine] INFO: Spider closed (finished)
I hope to get your help. Thanks again!
Just write the item directly; there is no need for a loop. Each item is passed to the pipeline individually, not as a list as you assumed.
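A minimal sketch of what writing the item directly could look like, assuming the same field names as Pm25Item in the question and the same date-stamped output file. The FIELDS tuple and the explicit utf-8 encoding are additions for illustration and are not in the original code.

```python
import time

# Field order for each output line; the names match Pm25Item in the question.
FIELDS = ("date", "rank", "quality", "province", "city", "aqi", "pm25")


class Pm25Pipeline(object):
    def process_item(self, item, spider):
        # process_item is called once per yielded item, so `item` is a
        # single dict-like record here, never a list of records.
        fname = time.strftime("%y%m%d", time.localtime()) + ".txt"
        # utf-8 is an assumption; it keeps Chinese names writable on
        # platforms whose default encoding cannot represent them.
        with open(fname, "a", encoding="utf-8") as f:
            f.write("\t".join(item[name] for name in FIELDS) + "\n")
        # Return the item so that any later pipeline stage still receives it.
        return item
```

The with block closes the file on exit, so no separate f.close() call is needed.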
Search for the error message "TypeError: string indices must be integers" to work out what it means.
Then go to the line number reported in the traceback and fix the problem there.
Scrapy's Item is similar to a Python dictionary, with some extra features.
By design, every Item the spider produces is passed to the pipeline on its own. What your pipeline does is loop over the keys of that item, and the keys are strings. Indexing a string with another string, as in tmp["pm25"], is what raises "string indices must be integers".
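To make the hand-off concrete, here is a toy sketch in plain Python. It does not use Scrapy: run_callback is a hypothetical stand-in for the engine, CollectPipeline is an invented pipeline, and the row data is made up. It illustrates that a callback can either yield dicts one at a time or return a list of dicts, and that in both cases process_item sees one dict per call.

```python
def parse_with_yield(rows):
    # Generator style: hand over one freshly built dict per row.
    for city, aqi in rows:
        yield {"city": city, "aqi": aqi}


def parse_with_return(rows):
    # List style: build all the dicts first, then return the whole list.
    items = []
    for city, aqi in rows:
        items.append({"city": city, "aqi": aqi})
    return items


class CollectPipeline(object):
    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        # `item` is one dict, so its fields can be read directly by key.
        self.seen.append(item["city"] + "\t" + item["aqi"])
        return item


def run_callback(callback, rows, pipeline):
    # Hypothetical stand-in for the engine: iterate over whatever the
    # callback produced and pass each element to the pipeline separately.
    for item in callback(rows):
        pipeline.process_item(item, spider=None)
    return pipeline.seen


rows = [("Lijiang", "28"), ("Yuxi", "27")]
via_yield = run_callback(parse_with_yield, rows, CollectPipeline())
via_return = run_callback(parse_with_return, rows, CollectPipeline())
print(via_yield == via_return)  # compare what each style delivered
```

Building a new dict (or a new Pm25Item) inside the loop also avoids yielding the same mutable object again and again, which is what happens in the question's spider because item is created once before the loop.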
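You can think of an item as a dictionary. It behaves like one (strictly speaking, scrapy.Item implements the mutable mapping interface rather than subclassing dict). When you iterate over the item directly in the pipeline with for tmp in item, each tmp you get is one of the item's keys, which is a string, so an operation like tmp['pm25'] raises TypeError: string indices must be integers.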
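The failure can be reproduced without Scrapy. This small sketch uses a plain dict with made-up values in place of the item; the variable names are for illustration only.

```python
item = {"date": "2017-04-02", "pm25": "13"}

# Iterating over a dict (or a Scrapy item) yields its keys, which are strings.
keys = sorted(item)
print(keys)

error = None
try:
    for tmp in item:
        # tmp is a key such as "date", so this indexes a string with a string.
        tmp["pm25"]
except TypeError as exc:
    error = str(exc)
    print("TypeError:", error)

# Reading the field from the item itself works as intended.
value = item["pm25"]
print(value)
```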