
Writing a Python Crawler from Scratch: Scraping Qiushibaike (Code Included)

Jun 06, 2016 am 11:20 AM
python crawler

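Project:

A web crawler for Qiushibaike, written in Python.

Usage:

Create a file named Bug.py, paste the code into it, and double-click the file to run it.

Features:

Browse Qiushibaike from the command prompt.

How it works:

First, take a look at the Qiushibaike home page: http://www.qiushibaike.com/hot/page/1
The number after page/ in the link is the page number. Keep this in mind, because the code relies on it later.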
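The page-number observation can be sketched in a few lines. This is only an illustration, assuming Python 3; `BASE_URL` and `build_page_url` are hypothetical names that do not appear in the original script.

```python
# Hypothetical helper, not part of the original script.
BASE_URL = "http://www.qiushibaike.com/hot/page/"


def build_page_url(page):
    """Return the hot-list URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return BASE_URL + str(page)


# Build the URLs of the first three pages.
urls = [build_page_url(n) for n in range(1, 4)]
for url in urls:
    print(url)
```

The crawler below does the same thing inline by appending the page number to a fixed prefix.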
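Then right-click the page and view its source.

Inspecting the source shows that each joke is wrapped in a div tag whose class is always content and whose title attribute holds the posting time, so a regular expression is enough to "pick" the jokes out.
With the principle clear, what remains is the regular expression itself. This article is a useful reference:
http://www.bitsCN.com/article/57150.htm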
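As a minimal sketch of that extraction step, the snippet below runs a regular expression over a small hand-written HTML string. It assumes Python 3; `SAMPLE_HTML`, `PATTERN`, and `extract_jokes` are hypothetical names, and the sample markup only imitates the structure described above, so the real page source may differ.

```python
import re

# Hypothetical sample markup; the real page source may differ.
SAMPLE_HTML = """
<div class="content" title="2014-06-03 10:15:00">
First joke text
</div>
<div class="author">someone</div>
<div class="content" title="2014-06-03 11:20:00">
Second joke text
</div>
"""

# Group 1 captures the title attribute, group 2 the div body.
# re.S lets "." match newlines, so multi-line bodies are captured too.
PATTERN = re.compile(
    r'<div.*?class="content".*?title="(.*?)">(.*?)</div>', re.S
)


def extract_jokes(html):
    """Return a list of (post_time, text) tuples found in html."""
    return [(title, body.strip()) for title, body in PATTERN.findall(html)]


jokes = extract_jokes(SAMPLE_HTML)
for post_time, text in jokes:
    print(post_time, text)
```

The non-greedy `.*?` keeps each match as short as possible, which is what stops one match from swallowing several jokes at once.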

Code:


# -*- coding: utf-8 -*-

import urllib2
import re
import thread
import time

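#----------- Load and process Qiushibaike -----------
class Spider_Model:

    def __init__(self):
        self.page = 1        # number of the next page to fetch
        self.pages = []      # buffer of pages that have been fetched
        self.enable = False  # True while the crawler is running
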
    # Extract every joke on a page, add them to a list, and return the list
    def GetPage(self, page):
        myUrl = "http://m.qiushibaike.com/hot/page/" + page
        user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
        headers = {'User-Agent': user_agent}
        req = urllib2.Request(myUrl, headers=headers)
        myResponse = urllib2.urlopen(req)
        myPage = myResponse.read()
        # encode converts a unicode string into a byte string in another encoding
        # decode converts a byte string in another encoding into a unicode string
        unicodePage = myPage.decode("utf-8")

        # Find every div tag with class="content"
        # re.S is DOTALL mode, so . also matches newlines
        # Group 1 is the title attribute, group 2 is the div body
        myItems = re.findall('<div.*?class="content".*?title="(.*?)">(.*?)</div>', unicodePage, re.S)
        items = []
        for item in myItems:
            # item[0] is the div's title, i.e. the posting time
            # item[1] is the div's body, i.e. the joke text
            items.append([item[0].replace("\n", ""), item[1].replace("\n", "")])
        return items
   
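    # Load new jokes in the background
    def LoadPage(self):
        # Keep running until the user enters quit
        while self.enable:
            # If fewer than 2 pages are buffered, fetch another one
            if len(self.pages) < 2:
                try:
                    # Fetch the jokes on the next page
                    myPage = self.GetPage(str(self.page))
                    self.page += 1
                    self.pages.append(myPage)
                except Exception:
                    print u'Unable to connect to Qiushibaike!'
                    time.sleep(1)  # pause before retrying instead of looping at full speed
            else:
                time.sleep(1)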
           
    # Print the jokes of one page, one per press of Enter
    def ShowPage(self, nowPage, page):
        for items in nowPage:
            print u'Page %d' % page, items[0], items[1]
            myInput = raw_input()
            if myInput == "quit":
                self.enable = False
                break
           
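    def Start(self):
        self.enable = True
        page = self.page

        print u'Loading, please wait......'

        # Start a background thread that loads and stores jokes
        thread.start_new_thread(self.LoadPage, ())

        #----------- Load and process Qiushibaike -----------
        while self.enable:
            # If the pages list holds at least one page
            if self.pages:
                nowPage = self.pages[0]
                del self.pages[0]
                self.ShowPage(nowPage, page)
                page += 1
            else:
                # Wait for the loader thread instead of spinning
                time.sleep(0.1)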
    
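#----------- Program entry point -----------
print u"""
---------------------------------------
   Program:  Qiubai crawler
   Version:  0.3
   Author:   why
   Date:     2014-06-03
   Language: Python 2.7
   Controls: enter quit to stop reading Qiushibaike
   Function: press Enter to browse today's hot Qiubai posts one by one
---------------------------------------
"""
print u"Press Enter to browse today's Qiubai content:"
raw_input(' ')
myModel = Spider_Model()
myModel.Start()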
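
The listing above targets Python 2.7, where urllib2 is available. As a rough sketch only, and assuming Python 3 (which this article does not cover), the header setup in GetPage might look like the following with urllib.request. `USER_AGENT` and `build_request` are hypothetical names, and the request is only constructed here, never sent.

```python
import urllib.request

# Same browser-like User-Agent string the original GetPage sends.
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)"


def build_request(page):
    """Build (but do not send) a request for one page of hot posts."""
    url = "http://m.qiushibaike.com/hot/page/" + str(page)
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


req = build_request(1)
print(req.full_url)
# Request stores header names capitalized, hence "User-agent" here.
print(req.get_header("User-agent"))
```

Passing the resulting object to `urllib.request.urlopen` would perform the actual fetch.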

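Q&A:
1. Why was Qiushibaike unavailable for a while?
A: Qiushibaike recently added a check on request headers, which blocked the crawler, so the code has to send browser-like headers. The code has since been updated and works again.

2. Why start a separate thread?
A: The flow is as follows: the crawler starts a background thread that keeps two pages of Qiushibaike buffered and fetches another page whenever fewer than two remain. Pressing Enter only takes content from this buffer instead of going to the network, so browsing feels smoother. Loading could also be done on the main thread, but then the user would wait a long time during each fetch.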
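
The buffering idea in answer 2 can be sketched as a small producer/consumer pair. This is not the script's own mechanism: the original polls a plain list with time.sleep, while this sketch swaps in a bounded `queue.Queue`, assumes Python 3, and replaces the network call with a made-up `fake_fetch`. All names here are hypothetical.

```python
import queue
import threading


def fake_fetch(page):
    """Stand-in for a network fetch; returns made-up items for a page."""
    return ["page %d, joke %d" % (page, i) for i in (1, 2)]


def loader(buffer, last_page):
    """Fetch pages 1..last_page; put() blocks while the buffer holds 2 pages."""
    for page in range(1, last_page + 1):
        buffer.put(fake_fetch(page))
    buffer.put(None)  # sentinel: no more pages


def read_all(last_page):
    """Consume pages from the buffer in order and return every item."""
    buffer = queue.Queue(maxsize=2)  # at most two pages kept in stock
    worker = threading.Thread(target=loader, args=(buffer, last_page))
    worker.daemon = True
    worker.start()
    shown = []
    while True:
        page_items = buffer.get()  # blocks until the loader has a page ready
        if page_items is None:
            break
        shown.extend(page_items)
    worker.join()
    return shown


result = read_all(3)
print(result)
```

Because the queue is bounded, the loader thread pauses by itself once two pages are in stock, and the reader never waits unless the buffer is empty.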

