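I'm a beginner, and I wrote this code to process the contents of .htm files downloaded by a crawler. It mostly works, but some files get skipped and are never processed. The crawler downloads pages from a few article sites, so the files are essentially the same as what "Save Page As" in a browser would produce.
Could someone look over my code and suggest how to improve it so that no files are missed? It's pretty basic code, so thanks in advance for your patience!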
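# -*- coding: utf-8 -*-
import re
import glob

# Note: '*.html' does not match files ending in '.htm'.
filename_list = glob.glob('*.html')
for i in filename_list:
    # Assumes Python 3 and UTF-8 pages; undecodable bytes are replaced
    # instead of raising an error.
    with open(i, "r", encoding="utf-8", errors="replace") as htmfile:
        txt = htmfile.read()
    # Everything from the first '<hr' up to the next '<hr' (non-greedy).
    scdy = r"<hr[\s\S]*?<hr"
    onedotxt = re.findall(scdy, txt)
    if not onedotxt:
        # Report files without a matching section instead of skipping silently.
        print("skipped (no <hr ... <hr section):", i)
        continue
    r = onedotxt[0]
    twotxt = re.sub('<[^>]*>', '', r)      # strip complete HTML tags
    threetxt = re.sub('<hr', '', twotxt)   # drop the trailing '<hr'
    fourtxt = re.sub('’', '', threetxt)
    fivetxt = re.sub('”', '"', fourtxt)
    sixtxt = re.sub('“', '"', fivetxt)
    endstr = re.sub('–', '-', sixtxt)
    # The second line of the text is used as the output file name.
    lines = endstr.split('\n')
    name = lines[1].strip() if len(lines) > 1 else ""
    # Replace characters that are invalid in file names; fall back to the
    # source file name when the title line is empty.
    name = re.sub(r'[\\/:*?"<>|]', '_', name) or i
    with open(name + ".txt", "w", encoding="utf-8") as wf:
        wf.write(endstr)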
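
A few things in the script above might explain the missing output: pages where the `<hr` pattern never matches are skipped without a message, a title line containing characters such as `/` or `:` cannot be used as a file name, and two pages with the same title would overwrite each other. The following is only a rough sketch of how the extraction and naming steps could be separated so each case is visible. The names `extract_text` and `title_from` are made up for illustration, and taking the first non-empty line as the title (rather than the second line) is an assumption about the pages.

```python
import re

# Text between the first two <hr> tags, matched case-insensitively.
HR_SECTION = re.compile(r"<hr\b[^>]*>(.*?)<hr\b", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]*>")
# Characters that are not allowed in Windows file names, plus line breaks.
UNSAFE = re.compile(r'[\\/:*?"<>|\r\n\t]')

# Same substitutions as the original script.
REPLACEMENTS = {"’": "", "”": '"', "“": '"', "–": "-"}


def extract_text(html):
    """Return the tag-free text between the first two <hr> tags, or None."""
    match = HR_SECTION.search(html)
    if match is None:
        return None
    text = TAG.sub("", match.group(1))
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text


def title_from(text, fallback):
    """Use the first non-empty line as a file name, else the fallback."""
    for line in text.splitlines():
        cleaned = UNSAFE.sub("_", line).strip()
        if cleaned:
            return cleaned[:100]
    return fallback
```

In the main loop, a `None` result from `extract_text` could be printed together with the file name, which would show exactly which pages have no matching section. Passing the source file name as `fallback` keeps an output file even when the section has no usable title line.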
# To pick up both extensions, each pattern needs the * wildcard:
filename_list = glob.glob('*.html') + glob.glob('*.htm')
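
On some systems `glob` patterns are case-sensitive, so files saved as `.HTM` or `.Html` could still be missed. As a hedged alternative, the file list can be built by checking the lowercased suffix with `pathlib`. The function name `list_html_files` is hypothetical, and this sketch only looks at one folder, not its subfolders.

```python
from pathlib import Path


def list_html_files(folder="."):
    """Return files ending in .htm or .html, ignoring letter case."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in {".htm", ".html"}
    )
```

Comparing `len(list_html_files())` with the number of `.txt` files written afterwards is a quick way to check whether anything was dropped.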