Python extracts the most popular Q&A content on Zhihu

大家讲道理

Released: 2016-11-09 11:29:25

# -*- coding: utf-8 -*-
import re
import urllib.request


def yunpan_search():
    url = "https://www.zhihu.com/explore"
    # Send browser-like headers so the request resembles a normal page visit.
    req = urllib.request.Request(url, headers={
        'Connection': 'Keep-Alive',
        'Accept': 'text/html, application/xhtml+xml, */*',
        'Accept-Language': 'en-US,en;q=0.8,zh-Hans-CN;q=0.5,zh-Hans;q=0.3',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64; Trident/7.0; rv:11.0) like Gecko'
    })
    with urllib.request.urlopen(req) as response:
        html = response.read().decode('utf-8')

    # Capture everything between each hidden answer textarea and its date link.
    rex = r'(?<=<textarea class="content hidden">\n).*?(?=<span class="answer-date-link-wrap">)'
    m = re.findall(rex, html, re.S)
    with open('/root/Desktop/zhihu.txt', 'w', encoding='utf-8') as f:
        for i in m:
            f.write(i)
            f.write('\n\n')
    print("Fetch succeeded!")

    # Second pass: strip ASCII word characters (tag and entity names), then the
    # "&;" shells left behind by escaped entities such as "&lt;".
    # re.ASCII keeps \w from matching Chinese text; Python 3 rejects re.L on str patterns.
    p = re.compile(r'\w+', re.ASCII)
    pp = re.compile(r'(?:&;)+')
    with open('/root/Desktop/zhihu.txt', 'r+', encoding='utf-8') as file:
        fullfile = file.readlines()
        text = [pp.sub('', p.sub('', line)) for line in fullfile]
        file.seek(0)
        file.truncate(0)
        file.writelines(text)
    print("Processing succeeded!")


if __name__ == '__main__':
    yunpan_search()
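
The extraction and cleanup steps in the script can be tried without touching the network or the disk. The following is a minimal sketch, not the author's code: the helper names extract_answers and clean_line are hypothetical, the sample markup is made up, and re.ASCII is assumed as the Python 3 way to limit \w to ASCII characters so that Chinese text is left in place.

```python
import re

# Hypothetical helper names; the script above performs these steps inline.
ASCII_WORD = re.compile(r'\w+', re.ASCII)   # ASCII letters, digits, underscore only
ENTITY_SHELL = re.compile(r'(?:&;)+')       # what remains of "&lt;" once "lt" is removed


def extract_answers(html):
    """Return the raw text between each hidden textarea and its date link."""
    pattern = (r'(?<=<textarea class="content hidden">\n)'
               r'.*?(?=<span class="answer-date-link-wrap">)')
    return re.findall(pattern, html, re.S)


def clean_line(line):
    """Drop ASCII word characters, then the emptied '&;' entity shells."""
    return ENTITY_SHELL.sub('', ASCII_WORD.sub('', line))


# Made-up sample markup, for illustration only.
sample = ('<textarea class="content hidden">\n'
          '&lt;p&gt;你好，世界&lt;/p&gt;\n'
          '<span class="answer-date-link-wrap">')

for answer in extract_answers(sample):
    print(clean_line(answer), end='')
```

Note that this cleanup removes all ASCII letters and digits, including any English words or numbers inside an answer, and leaves other punctuation such as the slash from a closing tag in place.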


