This article mainly introduces the implementation code of python Selenium crawling content and storing it in MySQL database. Friends who need it can refer to
I passed it earlier An article describes how to crawl CSDN blog summaries and other information. Usually, after using Selenium crawler to crawl data, it needs to be stored in TXT text, but this is difficult to perform data processing and data analysis. This article mainly talks about crawling my personal blog information through Selenium, and then storing it in the database MySQL to analyze the data, such as analyzing which time period there are more blogs, combining WordCloud to analyze the topic of the article, article reading ranking, etc. .
This is a basic article. I hope it will be helpful to you. If there are any errors or deficiencies in the article, please forgive me. The next article will briefly explain the process of data analysis.
1. Crawling results
The crawled address is: http://blog.csdn.net/Eastmount
## The result of crawling and stored to the MySQL database is shown below:
The complete code is as follows:
# coding=utf-8 from selenium import webdriver from selenium.webdriver.common.keys import Keys import selenium.webdriver.support.ui as ui import re import time import os import codecs import MySQLdb #打开Firefox浏览器 设定等待加载时间 driver = webdriver.Firefox() wait = ui.WebDriverWait(driver,10) #获取每个博主的博客页面低端总页码 def getPage(): print 'getPage' number = 0 texts = driver.find_element_by_xpath("//p[@id='papelist']").text print '页码', texts m = re.findall(r'(\w*[0-9]+)\w*',texts) #正则表达式寻找数字 print '页数:' + str(m[1]) return int(m[1]) #主函数 def main(): #获取txt文件总行数 count = len(open("Blog_URL.txt",'rU').readlines()) print count n = 0 urlfile = open("Blog_URL.txt",'r') #循环获取每个博主的文章摘信息 while n < count: #这里爬取2个人博客信息,正常情况count个博主信息 url = urlfile.readline() url = url.strip("\n") print url driver.get(url) #获取总页码 allPage = getPage() print u'页码总数为:', allPage time.sleep(2) #数据库操作结合 try: conn=MySQLdb.connect(host='localhost',user='root', passwd='123456',port=3306, db='test01') cur=conn.cursor() #数据库游标 #报错:UnicodeEncodeError: 'latin-1' codec can't encode character conn.set_character_set('utf8') cur.execute('SET NAMES utf8;') cur.execute('SET CHARACTER SET utf8;') cur.execute('SET character_set_connection=utf8;') #具体内容处理 m = 1 #第1页 while m <= allPage: ur = url + "/article/list/" + str(m) print ur driver.get(ur) #标题 article_title = driver.find_elements_by_xpath("//p[@class='article_title']") for title in article_title: #print url con = title.text con = con.strip("\n") #print con + '\n' #摘要 article_description = driver.find_elements_by_xpath("//p[@class='article_description']") for description in article_description: con = description.text con = con.strip("\n") #print con + '\n' #信息 article_manage = driver.find_elements_by_xpath("//p[@class='article_manage']") for manage in article_manage: con = manage.text con = con.strip("\n") #print con + '\n' num = 0 print u'长度', len(article_title) while num < len(article_title): #插入数据 8个值 sql = '''insert into csdn_blog (URL,Author,Artitle,Description,Manage,FBTime,YDNum,PLNum) values(%s, %s, %s, %s, %s, %s, %s, %s)''' Artitle = article_title[num].text Description = article_description[num].text Manage = article_manage[num].text print Artitle print Description print Manage #获取作者 Author = url.split('/')[-1] #获取阅读数和评论数 mode = re.compile(r'\d+\.?\d*') YDNum = mode.findall(Manage)[-2] PLNum = mode.findall(Manage)[-1] print YDNum print PLNum #获取发布时间 end = Manage.find(u' 阅读') FBTime = Manage[:end] cur.execute(sql, (url, Author, Artitle, Description, Manage,FBTime,YDNum,PLNum)) num = num + 1 else: print u'数据库插入成功' m = m + 1 #异常处理 except MySQLdb.Error,e: print "Mysql Error %d: %s" % (e.args[0], e.args[1]) finally: cur.close() conn.commit() conn.close() n = n + 1 else: urlfile.close() print 'Load Over' main()
The analysis process is as follows.
1. Get the blogger’s total page number
First read the blogger’s address from Blog_Url.txt, and then access and obtain the total page number. The code is as follows:
#获取每个博主的博客页面低端总页码 def getPage(): print 'getPage' number = 0 texts = driver.find_element_by_xpath("//p[@id='papelist']").text print '页码', texts m = re.findall(r'(\w*[0-9]+)\w*',texts) #正则表达式寻找数字 print '页数:' + str(m[1]) return int(m[1])
The blog page turning here uses URL connection, which is more convenient. For example: http://blog.csdn.net/Eastmount/article/list/2
So you only need: 1. Get the total page number; 2. Crawl the information of each page; 3. Set the URL to loop through Page; 4. Crawl again.
You can also click "Next Page" to jump. If there is no "Next Page", the jump will stop, the crawler will end, and then the next blogger will be crawled.
Then review the elements and analyze each blog page, if using Beaut if
ulSoup crawling will report an error "Forbidden". It is found that each article is composed of a, as shown below, you only need to locate the position.
This can be climbed here to position the location. Here you need to locate the title, Abstract, and time.
代码如下所示。注意,在while中同时获取三个值,它们是对应的。
#标题 article_title = driver.find_elements_by_xpath("//p[@class='article_title']") for title in article_title: con = title.text con = con.strip("\n") print con + '\n' #摘要 article_description = driver.find_elements_by_xpath("//p[@class='article_description']") for description in article_description: con = description.text con = con.strip("\n") print con + '\n' #信息 article_manage = driver.find_elements_by_xpath("//p[@class='article_manage']") for manage in article_manage: con = manage.text con = con.strip("\n") print con + '\n' num = 0 print u'长度', len(article_title) while num < len(article_title): Artitle = article_title[num].text Description = article_description[num].text Manage = article_manage[num].text print Artitle, Description, Manage
4.特殊字符串处理
获取URL最后一个/后的博主名称、获取字符串时间、阅读数代码如下:
#获取博主姓名 url = "http://blog.csdn.net/Eastmount" print url.split('/')[-1] #输出: Eastmount #获取数字 name = "2015-09-08 18:06 阅读(909) 评论(0)" print name import re mode = re.compile(r'\d+\.?\d*') print mode.findall(name) #输出: ['2015', '09', '08', '18', '06', '909', '0'] print mode.findall(name)[-2] #输出: 909 #获取时间 end = name.find(r' 阅读') print name[:end] #输出: 2015-09-08 18:06 import time, datetime a = time.strptime(name[:end],'%Y-%m-%d %H:%M') print a #输出: time.struct_time(tm_year=2015, tm_mon=9, tm_mday=8, tm_hour=18, tm_min=6, # tm_sec=0, tm_wday=1, tm_yday=251, tm_isdst=-1)
三. 数据库相关操作
SQL语句创建表代码如下:
CREATE TABLE `csdn` ( `ID` int(11) NOT NULL AUTO_INCREMENT, `URL` varchar(100) COLLATE utf8_bin DEFAULT NULL, `Author` varchar(50) COLLATE utf8_bin DEFAULT NULL COMMENT '作者', `Artitle` varchar(100) COLLATE utf8_bin DEFAULT NULL COMMENT '标题', `Description` varchar(400) COLLATE utf8_bin DEFAULT NULL COMMENT '摘要', `Manage` varchar(100) COLLATE utf8_bin DEFAULT NULL COMMENT '信息', `FBTime` datetime DEFAULT NULL COMMENT '发布日期', `YDNum` int(11) DEFAULT NULL COMMENT '阅读数', `PLNum` int(11) DEFAULT NULL COMMENT '评论数', `DZNum` int(11) DEFAULT NULL COMMENT '点赞数', PRIMARY KEY (`ID`) ) ENGINE=InnoDB AUTO_INCREMENT=9371 DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
显示如下图所示:
其中,Python调用MySQL推荐下面这篇文字。
python专题九.Mysql数据库编程基础知识
核心代码如下所示:
# coding:utf-8 import MySQLdb try: conn=MySQLdb.connect(host='localhost',user='root',passwd='123456',port=3306, db='test01') cur=conn.cursor() #插入数据 sql = '''insert into student values(%s, %s, %s)''' cur.execute(sql, ('yxz','111111', '10')) #查看数据 print u'\n插入数据:' cur.execute('select * from student') for data in cur.fetchall(): print '%s %s %s' % data cur.close() conn.commit() conn.close() except MySQLdb.Error,e: print "Mysql Error %d: %s" % (e.args[0], e.args[1])
注意,在下载过程中,有的网站是新版本的,无法获取页码。
比如:http://blog.csdn.net/michaelzhou224
这时需要简单设置,跳过这些链接,并保存到文件中,核心代码如下所示:
#获取每个博主的博客页面低端总页码 def getPage(): print 'getPage' number = 0 #texts = driver.find_element_by_xpath("//p[@id='papelist']").text texts = driver.find_element_by_xpath("//p[@class='pagelist']").text print 'testsss' print u'页码', texts if texts=="": print u'页码为0 网站错误' return 0 m = re.findall(r'(\w*[0-9]+)\w*',texts) #正则表达式寻找数字 print u'页数:' + str(m[1]) return int(m[1])
主函数修改:
error = codecs.open("Blog_Error.txt", 'a', 'utf-8') #循环获取每个博主的文章摘信息 while n < count: #这里爬取2个人博客信息,正常情况count个博主信息 url = urlfile.readline() url = url.strip("\n") print url driver.get(url+"/article/list/1") #print driver.page_source #获取总页码 allPage = getPage() print u'页码总数为:', allPage #返回错误,否则程序总截住 if allPage==0: error.write(url + "\r\n") print u'错误URL' continue; #跳过进入下一个博主 time.sleep(2) #数据库操作结合 try: .....
The above is the detailed content of Example diagram of using python Selenium to crawl content and store MySQL database. For more information, please follow other related articles on the PHP Chinese website!