1. I'm using multithreading and got stuck at the queue step: how do I keep feeding the URLs I scrape into a producer queue while consumer threads keep taking them out to fetch? I tried saving all the URLs into a list first, but there are far too many of them, so that isn't workable. (A sketch of the structure I have in mind follows after point 5.)
2. I've searched Google for all kinds of tutorials, but in those the URLs are basically always a fixed list.
3. The code that collects the URLs is below; it has to keep looping through pages before it arrives at the final URLs.
4. Full script:
https://github.com/cfqtxd1/le...
# -*- coding: utf-8 -*-
import logging
import urllib2

from bs4 import BeautifulSoup
from lxml import etree

# Soup(), Link_exists() and ContentEncodingProcessor are helpers from the full script linked above (not shown here).

# Get the sound links
url = 'http://www.ximalaya.com/dq/all/'
request = urllib2.Request(url)
response = urllib2.urlopen(request)
pagecode = response.read()
soup = BeautifulSoup(pagecode, 'lxml')
sound_tag = soup.findAll('a', attrs={'class': 'tagBtn'})
host = 'http://www.ximalaya.com'
for tag in sound_tag:
    urltab = 'http://www.ximalaya.com%s' % tag['href']  # urltab is the top-level category link
    numbercode = Soup(urltab)
    pagenumber = numbercode.findAll(name='a', attrs={'class': 'pagingBar_page'})
    numberlist = []  # collect page numbers to find the highest page in this category
    for numbers in pagenumber:
        numberlist.append(numbers.string)
    try:
        maxpagenumber = int(numberlist[-2]) + 1
    except Exception:
        maxpagenumber = 1
    for i in range(1, maxpagenumber):
        urltab2 = (urltab + '%s') % i
        print 'Start crawling category %s, page %s' % (tag.string, i)
        if Link_exists(urltab2):
            code = Soup(urltab2)
            links_title = code.findAll(name='a', attrs={'class': 'discoverAlbum_title'})
            for link in links_title:
                encoding_support = ContentEncodingProcessor
                opener = urllib2.build_opener(encoding_support, urllib2.HTTPHandler)
                html = opener.open(link['href']).read()
                pagetree = etree.HTML(html)
                pagenu = pagetree.xpath("//@data-page")  # highest page number of tracks under the album
                try:
                    maxpage = pagenu[-2]
                except Exception as e:
                    logging.exception(e)
                    maxpage = 1
                for p in range(1, int(maxpage) + 1):
                    aurl = link['href'] + '?page=' + str(p)  # aurl is the album page link
                    print 'Start crawling album %s, page %s' % (link.string, p)
                    encoding_support = ContentEncodingProcessor
                    opener = urllib2.build_opener(encoding_support, urllib2.HTTPHandler)
                    # Open the page with this opener; gzip/deflate responses are decompressed automatically if the server supports them
                    html = opener.open(aurl).read()
                    ttree = etree.HTML(html)
                    sound_id = ttree.xpath("//@sound_ids")  # track ids on this album page
                    urlid = sound_id[0].split(",")
                    for id in urlid:
                        if id != '':
                            jsonurl = 'http://www.ximalaya.com/tracks/%s.json' % id  # the final url I need
5. Could you please suggest an approach? Sample code would also be fine. Thanks!
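To make point 1 concrete, this is roughly the producer/consumer structure I have in mind; it's only a sketch, not code from my script, and the names url_queue, producer and worker are made up:

# -*- coding: utf-8 -*-
import threading
import urllib2
from Queue import Queue

url_queue = Queue(maxsize=1000)   # bounded queue, so memory stays flat no matter how many urls there are

def producer():
    # This would be the crawling loops above: instead of appending jsonurl
    # to a list, put() each one into the queue as soon as it is generated.
    for id in range(1, 10):       # placeholder for the real id loop
        jsonurl = 'http://www.ximalaya.com/tracks/%s.json' % id
        url_queue.put(jsonurl)    # blocks when the queue is full

def worker():
    while True:
        jsonurl = url_queue.get()
        try:
            data = urllib2.urlopen(jsonurl).read()
            # ... parse / save data ...
        finally:
            url_queue.task_done()

for _ in range(5):                # 5 consumer threads
    t = threading.Thread(target=worker)
    t.setDaemon(True)
    t.start()

producer()
url_queue.join()                  # wait until every queued url has been handled

Is this the right direction, and how should it be wired into the loops above?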
You can store the list of results in Redis or MongoDB instead of keeping it all in memory.
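For example, a rough sketch with redis-py: push each final jsonurl onto a Redis list on the producer side and let any number of workers pop from it (the key name 'ximalaya:urls' and the two helper names are just for illustration):

# -*- coding: utf-8 -*-
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)

# Producer side: wherever jsonurl is generated in the crawl loops,
# push it onto a Redis list instead of keeping it in an in-memory list.
def push_url(jsonurl):
    r.rpush('ximalaya:urls', jsonurl)

# Consumer side: workers (threads, processes, even other machines) pop and fetch.
def pop_url():
    item = r.blpop('ximalaya:urls', timeout=30)   # blocks until a url is available or the timeout expires
    if item is None:
        return None
    key, jsonurl = item
    return jsonurl

Because the queue lives in Redis rather than inside the Python process, the full URL list never has to sit in the crawler's memory, the producer and consumers don't have to run in the same script, and URLs already queued are still there after a restart.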