Building a Qiushibaike (糗事百科) Joke Crawler in Python

高洛峰
Release: 2017-02-24 16:07:09

I woke up this morning with nothing to do, and out of nowhere a Qiushibaike joke popped into my head. Since the site keeps sending me jokes, I figured I would write a crawler for it: partly to practice my skills, and partly just for fun.

As it happens, I have also been working with databases these past two days, so the crawled data could just as well be saved to a database for later use (a minimal sketch of that option follows the screenshot below). Okay, enough talk; let's take a look at the results the program crawled.

[Screenshot: sample output of the Qiushibaike crawler]
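Here is a minimal sketch of the database option using Python's built-in sqlite3 module; the table name and columns are my own illustration, not part of the original program:

import sqlite3

# Illustrative schema: one row per joke (author, content, vote count).
conn = sqlite3.connect('qiushibaike.db')
conn.execute('CREATE TABLE IF NOT EXISTS jokes '
             '(author TEXT, content TEXT, vote INTEGER)')

def save(author, content, vote):
    # Parameterized insert, so quotes in the joke text cause no trouble.
    conn.execute('INSERT INTO jokes VALUES (?, ?, ?)',
                 (author, content, int(vote)))
    conn.commit()

Note that an sqlite3 connection should not be shared across the crawler's worker threads as-is; either give each thread its own connection or guard the shared one with a lock.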

It is worth mentioning that I originally wanted the program to crawl all 30 pages of Qiushibaike's text section in one go, but I kept hitting connection errors. When I reduced the count to 20 pages, the program ran normally. I don't know the reason; if anyone does, please tell me, and I will be very grateful.
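I can't confirm the cause myself, but a common culprit is the site throttling rapid, header-less requests. A sketch of a more defensive fetch, with a browser-like User-Agent, a timeout, and a short pause between retries (this fetch helper is my own addition, not in the original program):

import time
import requests

HEADERS = {'User-Agent': 'Mozilla/5.0'}  # look like a browser, not a script

def fetch(url, retries=3):
    for attempt in range(retries):
        try:
            return requests.get(url, headers=HEADERS, timeout=10)
        except requests.exceptions.RequestException:
            time.sleep(2)  # back off briefly before trying again
    raise IOError('giving up on %s' % url)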

The program is very simple, so I'll just post the full source code:

# coding=utf8
# Qiushibaike text-joke crawler (Python 2): fetch a range of list pages
# concurrently, extract author / content / vote count with XPath, and
# append the results to a text file.

import re
import requests
from lxml import etree
from multiprocessing.dummy import Pool as ThreadPool
import sys

# Python 2 hack so Chinese strings can be written without an explicit
# encoding at every call site.
reload(sys)
sys.setdefaultencoding('utf-8')


def getnewpage(url, total):
    """Build the list of page URLs from the current page number up to total."""
    nowpage = int(re.search(r'(\d+)', url).group(1))
    urls = []

    for i in range(nowpage, total + 1):
        # Swap the page number in the URL. Note that re.sub's fourth
        # positional argument is count, not flags, so the original
        # re.sub(..., re.S) was passing a count by accident.
        link = re.sub(r'(\d+)', str(i), url)
        urls.append(link)

    return urls


def spider(url):
    """Download one list page and append every joke on it to the file."""
    html = requests.get(url)
    selector = etree.HTML(html.text)

    # These XPath expressions match the site's markup at the time of
    # writing and will need updating if the page structure changes.
    author = selector.xpath('//*[@id="content-left"]/p/p[1]/a[2]/@title')
    content = selector.xpath('//*[@id="content-left"]/p/p[2]/text()')
    vote = selector.xpath('//*[@id="content-left"]/p/p[3]/span/i/text()')

    for i in range(len(author)):
        # One write call per record, so output from different worker
        # threads is less likely to interleave mid-record.
        f.write('作者 : ' + author[i] + '\n' +
                '内容 :' + str(content[i]).replace('\n', '') + '\n' +
                '支持 : ' + vote[i] + '\n\n')


if __name__ == '__main__':

    f = open('info.txt', 'a')
    url = 'http://www.qiushibaike.com/text/page/1/'
    urls = getnewpage(url, 20)

    # Four worker threads fetch the pages concurrently.
    pool = ThreadPool(4)
    pool.map(spider, urls)
    f.close()
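One more note: the reload(sys) / setdefaultencoding trick only exists on Python 2 (reload is not even a builtin in Python 3). Under Python 3, the same entry point could look like this sketch, with the output file opened with an explicit encoding:

# Python 3 sketch of the __main__ block: no setdefaultencoding hack is
# needed, and the with-block guarantees the file gets closed.
if __name__ == '__main__':
    with open('info.txt', 'a', encoding='utf-8') as f:
        urls = getnewpage('http://www.qiushibaike.com/text/page/1/', 20)
        ThreadPool(4).map(spider, urls)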

If anything is unclear, you can refer to my previous three articles in order.
