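I'm a beginner and wrote a small program that scrapes text snippets. The code looks rather bloated. Are there any suggestions for improving it?
I also plan to insert the data into a MySQL database at the end of the for loop. Is that a good approach, or should I use a second for loop, so that the first loop stores the data in a two-dimensional list and the second loop inserts the rows one by one?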
# -*- coding: utf-8 -*-
import re
from pyquery import PyQuery as pq
import time

# Strip HTML tags
def stripTag(x):
    return re.sub('<(.*?)>', '', str(x))

# Convert a 'YYYY-MM-DD HH:MM' string to a timestamp
def timeStamp(x):
    return time.mktime(time.strptime(x, '%Y-%m-%d %H:%M'))

# Fetch the relevant part of the page source
d = pq(url='http://www.juexiang.com/list/1017')
d = pq(d('.left').html())
x = d('p.arttitle')

# Pattern that matches the time format
pattern = re.compile(r"[0-9]{4}(.*)[0-9]{2}")

# Loop over the entries to get title, author, and time
for i in x:
    a = pq(pq(i).html())
    title = stripTag(pq(a('a').eq(0).text()))
    author = stripTag(pq(a('a').eq(1).text()))
    time1 = str(pq(a('span').eq(2).text()))
    time1 = timeStamp((pattern.search(time1)).group())
    print(title, '\t', author, '\t', time1, '\n')
If you also put database operations inside the for loop, the code will get even harder to read. Each part can instead be split into its own function or class.
For example:
1. Write one function that fetches the page content and strips the HTML tags.
2. Combine extracting the time string and converting it to a timestamp into one function.
3. Write one function that extracts the title, author, and time from an entry.
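The three steps above could be sketched roughly as follows. This is only one possible layout, not the definitive one: the function names (`strip_tags`, `to_timestamp`, `parse_entry`, `fetch_entries`) are made up for illustration, the stricter time regex assumes the page always shows times as `YYYY-MM-DD HH:MM` (the format the original `strptime` call expects), and the selectors are copied from the question's code without being checked against the live page.

```python
import re
import time

# Assumption: times on the page look like "2015-03-01 12:30".
TIME_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}")


def strip_tags(html):
    """Remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]*>", "", str(html))


def to_timestamp(text):
    """Find a 'YYYY-MM-DD HH:MM' substring and return it as a local-time
    Unix timestamp, or None if no such substring is present."""
    match = TIME_PATTERN.search(text)
    if match is None:
        return None
    return time.mktime(time.strptime(match.group(), "%Y-%m-%d %H:%M"))


def parse_entry(entry):
    """Return (title, author, timestamp) for one PyQuery-wrapped
    p.arttitle element. .text() already returns plain text, so no
    extra tag stripping is needed here."""
    links = entry("a")
    title = links.eq(0).text()
    author = links.eq(1).text()
    posted = to_timestamp(entry("span").eq(2).text())
    return title, author, posted


def fetch_entries(url):
    """Download the list page and parse every entry on it."""
    # Imported here so the pure helpers above work without pyquery installed.
    from pyquery import PyQuery as pq

    doc = pq(url=url)
    return [parse_entry(pq(p)) for p in doc(".left p.arttitle")]


# Usage (needs network access and pyquery):
# for title, author, posted in fetch_entries("http://www.juexiang.com/list/1017"):
#     print(title, author, posted, sep="\t")
```

With this split, the main loop only prints or stores the tuples, and `strip_tags` and `to_timestamp` can be tried out on plain strings without touching the network.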
Functions like these are reusable and easy to call. Since this is only a small program, refactoring it into a few functions is enough.
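On the database question, one common option is to collect the rows in a list first and then insert them in a single batch with the DB-API `executemany` call, which keeps the scraping loop free of database code. The sketch below uses the standard-library `sqlite3` module as a stand-in so it runs anywhere. The table name `articles`, its columns, and the sample rows are hypothetical. With a MySQL driver such as PyMySQL the same pattern should apply, but the placeholders are `%s` instead of `?`.

```python
import sqlite3


def save_rows(conn, rows):
    """Insert all (title, author, posted) tuples in one batch, then commit once."""
    cur = conn.cursor()
    cur.executemany(
        "INSERT INTO articles (title, author, posted) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()


# In-memory database and made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (title TEXT, author TEXT, posted REAL)")

rows = [
    ("First title", "alice", 1425184200.0),
    ("Second title", "bob", 1425187800.0),
]
save_rows(conn, rows)

count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
print(count)
```

For a single page of results either approach works. Batching mainly pays off in keeping the parsing and storage steps separate and in committing once instead of once per row.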