
python3 crawls WeChat articles

巴扎黑
Release: 2017-07-21 13:46:32

Prerequisites:

Python 3.4

Windows

Function: Search for WeChat articles through Sogou's WeChat search interface, and save the titles and their links to an Excel spreadsheet.

Note: The xlsxwriter module is required. The program was written on 2017/7/11; if it stops working later, the most likely cause is a change to the website. The program is fairly simple: a little over 40 lines, not counting comments.

Main text:

Idea: Open the initial URL --> extract the titles and links with a regular expression --> repeat the second step for each result page --> write the collected titles and links to Excel

The first step of writing a crawler is to go through the process by hand (a small aside).

Open the search page, enter a query such as "image recognition", and search. The URL becomes "", in which the important parameters were marked in red. type=1 searches for official accounts; ignore that case, since the code below uses type=2 to search for articles. query= carries the search keywords in URL-encoded form, and there is also a hidden parameter, page=1.

When you go to the second page, you can see "", where the page parameter is now visible. The request URL is therefore built by concatenation, ending in +search+'&page='+str(page).
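
To check which parameters a results URL carries, the query string can be split with the standard library. This is only a minimal sketch: the sample URL is an assumed example modelled on the one the crawler builds later, not a copy from the browser, and query_params is a hypothetical helper name.

```python
from urllib.parse import urlsplit, parse_qs

# Assumed example URL, modelled on the one the crawler builds later.
sample_url = 'http://weixin.sogou.com/weixin?type=2&query=image+recognition&page=1'

def query_params(url):
    """Return the query-string parameters of url as a dict of lists."""
    return parse_qs(urlsplit(url).query)

params = query_params(sample_url)
print(params['type'], params['query'], params['page'])
```

parse_qs also decodes the keyword, so an encoded query can be read back in plain form.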

search is the keyword to search for. Encode it with quote() before inserting it into the URL:

search = urllib.request.quote(search)
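
What quote() does to a Chinese keyword can be seen in isolation. A small sketch, using urllib.parse.quote, which is where the function is documented (urllib.request exposes the same function); the keyword is just an example value.

```python
from urllib.parse import quote, unquote

keyword = '图像识别'        # "image recognition"
encoded = quote(keyword)    # each non-ASCII character becomes %XX escapes of its UTF-8 bytes

print(encoded)
print(unquote(encoded) == keyword)   # decoding restores the original keyword
```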

page is the loop variable:

for page in range(1,pagenum+1):
    url = 'http://weixin.sogou.com/weixin?type=2&query='+search+'&page='+str(page)
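
As an alternative to string concatenation, the query string could be assembled with urllib.parse.urlencode, which encodes the values itself. A sketch under that assumption; BASE_URL and build_search_url are hypothetical names, and the keyword is passed in unencoded.

```python
from urllib.parse import urlencode

BASE_URL = 'http://weixin.sogou.com/weixin'

def build_search_url(keyword, page):
    """Build the article-search URL (type=2) for one result page."""
    params = [('type', 2), ('query', keyword), ('page', page)]
    return BASE_URL + '?' + urlencode(params)

urls = [build_search_url('image recognition', page) for page in range(1, 4)]
for url in urls:
    print(url)
```

Note that urlencode writes a space as +, whereas quote() writes it as %20.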

The complete URL is now available. Next, request the URL and read the data (create an opener object and add a header to it):

import urllib.request

# Send a browser-like User-Agent with every request made through urlopen()
header = ('User-Agent','Mozilla/5.0')
opener = urllib.request.build_opener()
opener.addheaders = [header]
urllib.request.install_opener(opener)
# Download the page and decode the bytes (UTF-8 by default) into a str
data = urllib.request.urlopen(url).read().decode()

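
The same header can also be attached to a single request instead of a globally installed opener. The following is only a sketch of that variant, not the article's method; make_request and fetch are hypothetical helpers, and the 10-second timeout is an assumed value.

```python
import urllib.request

def make_request(url):
    """Attach a browser-like User-Agent to a single request."""
    return urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})

def fetch(url, timeout=10):
    """Return the decoded page body, or None if the request fails."""
    try:
        with urllib.request.urlopen(make_request(url), timeout=timeout) as resp:
            # Use the charset announced by the server, falling back to UTF-8.
            charset = resp.headers.get_content_charset() or 'utf-8'
            return resp.read().decode(charset, errors='replace')
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        print('request failed:', exc)
        return None
```

Returning None on failure lets the page loop skip a bad page rather than abort the whole run.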
With the page content in hand, use a regular expression to extract the relevant data:

import re

finddata = re.compile('<a target="_blank" href="(.*?)".*?uigs="article_title_.*?">(.*?)</a>').findall(data)
# finddata is a list of (link, title) tuples: [('',''),('','')]

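
The pattern can be tried without touching the network. In the sketch below the HTML fragment is hypothetical, shaped like one search result rather than copied from a real page, and ARTICLE_RE is an assumed name for the compiled pattern.

```python
import re

ARTICLE_RE = re.compile(
    '<a target="_blank" href="(.*?)".*?uigs="article_title_.*?">(.*?)</a>'
)

# Hypothetical fragment shaped like one search result; not real page output.
sample = (
    '<a target="_blank" href="http://example.com/s?a=1&amp;b=2" '
    'id="x" uigs="article_title_0">'
    'About <em><!--red_beg-->image<!--red_end--></em> recognition</a>'
)

finddata = ARTICLE_RE.findall(sample)
for link, title in finddata:
    print(link)    # still contains the '&amp;' entity
    print(title)   # still contains the highlight markup
```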
The data captured by the regular expression contains noise in the links ('amp;') and leftover markup in the titles ('<...><....>'). Use replace() to strip both:

title = title.replace('<em><!--red_beg-->','')
title = title.replace('<!--red_end--></em>','')

link = link.replace('amp;','')
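
A more general cleanup is possible with the standard library's html.unescape (available from Python 3.4), which decodes every entity rather than only 'amp;'. This is a sketch of that alternative, not what the article's program does; TAG_RE, clean_title and clean_link are hypothetical names.

```python
import html
import re

TAG_RE = re.compile(r'<.*?>', re.S)   # any tag or comment, e.g. <em>, <!--red_beg-->

def clean_title(raw):
    """Drop markup from a title, then decode entities such as &ldquo;."""
    return html.unescape(TAG_RE.sub('', raw))

def clean_link(raw):
    """Decode entities in a link, turning &amp; back into &."""
    return html.unescape(raw)

print(clean_title('<em><!--red_beg-->image<!--red_end--></em> recognition'))
print(clean_link('http://example.com/s?a=1&amp;b=2'))
```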

Save the processed link and title in the list, link first:

title_link.append(link)
title_link.append(title)

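
Because links and titles alternate in one flat list, they can be paired up again with slicing. A small sketch with made-up sample values:

```python
# Flat layout used in the article: [link0, title0, link1, title1, ...]
title_link = ['http://example.com/1', 'First', 'http://example.com/2', 'Second']

# Even positions hold links, odd positions hold titles.
pairs = list(zip(title_link[0::2], title_link[1::2]))

for link, title in pairs:
    print(title, '->', link)
```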
The titles and links for the search have now been collected. Next, write them to Excel.

First create the Excel workbook and a worksheet:

import xlsxwriter
workbook = xlsxwriter.Workbook(search+'.xlsx')
worksheet = workbook.add_worksheet('WeChat')

Write the data in title_link to Excel. Row 1 is left free for the header row that the complete code writes, so the data starts at row 2:

for i in range(0,len(title_link),2):
    row = i//2+2
    worksheet.write('A'+str(row),title_link[i+1])
    worksheet.write('C'+str(row),title_link[i])
workbook.close()

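
The mapping from positions in the flat list to spreadsheet cells can be checked without creating a workbook. A minimal sketch, assuming row 1 is reserved for headers; cell_rows and first_row are hypothetical names and the sample data is made up.

```python
def cell_rows(title_link, first_row=2):
    """Yield (title_cell, title, link_cell, link) for a flat [link, title, ...] list."""
    for i in range(0, len(title_link), 2):
        row = first_row + i // 2          # pair 0 -> row 2, pair 1 -> row 3, ...
        yield 'A' + str(row), title_link[i + 1], 'C' + str(row), title_link[i]

sample = ['http://example.com/1', 'First', 'http://example.com/2', 'Second']
layout = list(cell_rows(sample))
for title_cell, title, link_cell, link in layout:
    print(title_cell, title, link_cell, link)
```

Each cell name and value produced this way can be passed straight to worksheet.write().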
Complete code:

'''
python3.4 + windows
羽凡-2017/7/11-
Searches WeChat articles and saves the titles and links to Excel.
Waits 10 seconds after each page to avoid being rate-limited.
Uses: urllib.request, xlsxwriter, re, time
'''
import re
import time
import urllib.request

import xlsxwriter

search = str(input('Search WeChat articles: '))
pagenum = int(input('Number of pages to search: '))
workbook = xlsxwriter.Workbook(search+'.xlsx')
search = urllib.request.quote(search)
title_link = []

# Send a browser-like User-Agent with every request made through urlopen()
header = ('User-Agent','Mozilla/5.0')
opener = urllib.request.build_opener()
opener.addheaders = [header]
urllib.request.install_opener(opener)

# Captures (link, title) tuples: finddata = [('',''),('','')]
pattern = re.compile('<a target="_blank" href="(.*?)".*?uigs="article_title_.*?">(.*?)</a>')

for page in range(1,pagenum+1):
    url = 'http://weixin.sogou.com/weixin?type=2&query='+search+'&page='+str(page)
    data = urllib.request.urlopen(url).read().decode()
    finddata = pattern.findall(data)
    for link, title in finddata:
        title = title.replace('<em><!--red_beg-->','')
        title = title.replace('<!--red_end--></em>','')
        # Titles may contain quotation-mark entities
        title = title.replace('&ldquo;','"')
        title = title.replace('&rdquo;','"')
        link = link.replace('amp;','')
        title_link.append(link)
        title_link.append(title)
    print('Page '+str(page))
    time.sleep(10)

worksheet = workbook.add_worksheet('WeChat')
worksheet.set_column('A:A',70)
worksheet.set_column('C:C',100)
bold = workbook.add_format({'bold':True})
worksheet.write('A1','Title',bold)
worksheet.write('C1','Link',bold)
# Data starts at row 2 so that it does not overwrite the header row
for i in range(0,len(title_link),2):
    row = i//2+2
    worksheet.write('A'+str(row),title_link[i+1])
    worksheet.write('C'+str(row),title_link[i])
workbook.close()
print('Finished writing to Excel!')
