
Example of implementing a web crawler in Python with RabbitMQ

Jun 06, 2016 11:29 AM

Write tasks.py

The code is as follows:


from celery import Celery
from tornado.httpclient import HTTPClient, HTTPError

app = Celery('tasks')
app.config_from_object('celeryconfig')

@app.task
def get_html(url):
    # Fetch the page synchronously inside the Celery worker process.
    http_client = HTTPClient()
    try:
        response = http_client.fetch(url, follow_redirects=True)
        return response.body
    except HTTPError:
        return None
    finally:
        # Always release the client, whether the fetch succeeded or not.
        http_client.close()
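For a quick local check, the task can also be called as a plain function, which runs it in the current process without touching RabbitMQ; only get_html.delay(...) routes the call through the broker. The URL below is just a placeholder:

html = get_html("http://example.com")  # direct call, no broker needed
print(html[:200] if html else "fetch failed")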

Write celeryconfig.py

The code is as follows:


CELERY_IMPORTS = ('tasks',)                     # modules the worker imports on startup
BROKER_URL = 'amqp://guest@localhost:5672//'    # RabbitMQ broker on localhost
CELERY_RESULT_BACKEND = 'amqp://'               # send task results back over AMQP
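With RabbitMQ running on localhost and a worker started (for example with celery -A tasks worker --loglevel=info), a minimal end-to-end check might look like the sketch below; the URL is only a placeholder:

from tasks import get_html

result = get_html.delay("http://example.com")   # message is published to RabbitMQ
print(result.get(timeout=10))                   # blocks until a worker returns the body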

Write spider.py

The code is as follows:


from tasks import get_html
from queue import Queue
from bs4 import BeautifulSoup
from urllib.parse import urlparse, urljoin
import threading


class spider(object):
    def __init__(self):
        self.visited = {}
        self.queue = Queue()

    def process_html(self, html):
        # Placeholder: parse or store the fetched page here.
        pass

    def _add_links_to_queue(self, url_base, html):
        soup = BeautifulSoup(html, 'html.parser')
        links = soup.find_all('a')
        for link in links:
            try:
                url = link['href']
            except KeyError:
                continue
            url_com = urlparse(url)
            if not url_com.netloc:
                # Relative link: resolve it against the current page's URL.
                self.queue.put(urljoin(url_base, url))
            else:
                self.queue.put(url_com.geturl())

    def start(self, url):
        self.queue.put(url)
        for _ in range(20):
            t = threading.Thread(target=self._worker)
            t.daemon = True
            t.start()
        self.queue.join()

    def _worker(self):
        while True:
            url = self.queue.get()
            if url in self.visited:
                self.queue.task_done()
                continue
            # Dispatch the fetch to a Celery worker via RabbitMQ.
            result = get_html.delay(url)
            try:
                html = result.get(timeout=5)
            except Exception as e:
                print(url)
                print(e)
                self.queue.task_done()
                continue
            if html:
                self.process_html(html)
                self._add_links_to_queue(url, html)
            self.visited[url] = True
            self.queue.task_done()


s = spider()
s.start("http://www.bitsCN.com/")

Because of some special cases in the HTML, the program still needs further refinement.
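One possible refinement, purely as a sketch (the filtering rules here are assumptions, not part of the original program): skip non-HTTP links such as mailto: and javascript:, and strip URL fragments before queueing.

from urllib.parse import urldefrag, urlparse

def is_crawlable(url):
    # Keep only http/https links (and scheme-less relative links).
    scheme = urlparse(url).scheme
    return scheme in ('http', 'https', '')

# Inside _add_links_to_queue, before putting a URL on the queue, one could write:
#     url, _ = urldefrag(url)
#     if not is_crawlable(url):
#         continue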
