Scrapy and the scrapy-splash framework: quickly loading JS pages



小云云

Mar 07, 2018 pm 02:01 PM

javascript scrapy

1. Preface

When we crawl web pages, static pages are generally fairly simple to handle, and we have written many examples of that before. But how do we crawl pages whose content is loaded dynamically with JS?

There are several ways to crawl dynamic JS pages:

Selenium + PhantomJS.

PhantomJS is a headless browser and Selenium is an automated testing framework. The page is requested through the headless browser, the JS is given time to load, and the data is then extracted through Selenium. Because a headless browser consumes a lot of resources, this approach performs poorly.

The scrapy-splash framework.

Splash is a JS rendering service: a lightweight browser engine built on Twisted and Qt that exposes an HTTP API directly. Because it is fast and lightweight, it is easy to use in a distributed setup.
Splash integrates with the Scrapy crawler framework through scrapy-splash. The two are compatible with each other and crawl more efficiently than the Selenium approach.
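
Since Splash exposes its rendering over plain HTTP, a rendered page can in principle be requested without Scrapy at all. The sketch below only builds the request URL for Splash's render.html endpoint using its url and wait parameters. The base address http://localhost:8050 and the helper name build_render_url are assumptions made for illustration, and fetching the URL requires a running Splash instance (see section 2).

```python
from urllib.parse import urlencode


def build_render_url(splash_base, page_url, wait=1.0):
    """Return a Splash render.html URL that renders page_url after `wait` seconds."""
    # urlencode percent-escapes the target URL so it can travel as a query value.
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_base.rstrip('/')}/render.html?{query}"


if __name__ == "__main__":
    # Assumed address; with Docker Toolbox the host is usually the VM's IP instead.
    print(build_render_url("http://localhost:8050", "https://example.com/"))
```

Opening the resulting URL in a browser, or fetching it with any HTTP client, should return the HTML of the page after its JS has run.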

2. Splash environment construction

The Splash service runs in a Docker container, so Docker needs to be installed first.

2.1 Docker installation (Windows 10 Home)

On Windows 10 Pro or other operating systems, installation is straightforward. On Windows 10 Home, Docker has to be installed through Docker Toolbox (the latest version is required).

For details on installing Docker, refer to the document: Install Docker on WIN10

2.2 Splash installation

docker pull scrapinghub/splash


2.3 Start the Splash service

docker run -p 8050:8050 scrapinghub/splash


At this point, open a browser and go to 192.168.99.100:8050. The Splash web interface is displayed.

In its URL input box you can enter any URL and click Render me! to see what the page looks like after rendering.

2.4 Install the Python scrapy-splash package

pip install scrapy-splash


3. Testing a Scrapy crawler on a JS-loaded page, using Google News as an example

For business reasons, we crawl some foreign news websites, such as Google News. It turned out that the page content is actually produced by JS code, so we used the scrapy-splash framework together with Splash's JS rendering service to obtain the data. See the following code for details:

3.1 settings.py configuration information

# URL of the Splash rendering service
SPLASH_URL = 'http://192.168.99.100:8050'

# Splash-aware duplicate filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# Splash-aware HTTP cache storage
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.89 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}

# Item pipelines
ITEM_PIPELINES = {
    'news.pipelines.NewsPipeline': 300,
}
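
The address 192.168.99.100 is typically the IP of the Docker Toolbox virtual machine. On other setups Splash is usually reachable on localhost instead. One option, sketched here as an assumption and not part of the original project, is to read the address from an environment variable and fall back to the value used in this article. The variable name SPLASH_URL and the helper resolve_splash_url are hypothetical.

```python
import os

DEFAULT_SPLASH_URL = "http://192.168.99.100:8050"  # value used in this article


def resolve_splash_url(environ=os.environ):
    """Pick the Splash URL from the (hypothetical) SPLASH_URL env var, else the default."""
    url = environ.get("SPLASH_URL", "").strip()
    # Drop a trailing slash so the value has one consistent form.
    return url.rstrip("/") if url else DEFAULT_SPLASH_URL


# In settings.py the fixed string would be replaced by this call.
SPLASH_URL = resolve_splash_url()
```

With this in place, the same settings.py can be used on a Docker Toolbox machine and on a host where Splash listens on localhost, without editing the file.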


3.2 items field definition

import scrapy


class NewsItem(scrapy.Item):
    # Title
    title = scrapy.Field()
    # URL of the image
    image_url = scrapy.Field()
    # News source
    source = scrapy.Field()
    # URL opened when the item is clicked
    action_url = scrapy.Field()


3.3 Spider code

In the spiders directory, create a file named new_spider.py with the following content:

from scrapy import Spider
from scrapy_splash import SplashRequest

from news.items import NewsItem


class GoogleNewsSpider(Spider):
    name = "google_news"

    start_urls = ["https://news.google.com/news/headlines?ned=cn&gl=CN&hl=zh-CN"]

    def start_requests(self):
        for url in self.start_urls:
            # Request the page through Splash and wait 1 second for the JS to render
            yield SplashRequest(url, self.parse, args={'wait': 1})

    def parse(self, response):
        # Each matched element is one news entry
        for element in response.xpath('//p[@class="qx0yFc"]'):
            actionUrl = element.xpath('.//a[@class="nuEeue hzdq5d ME7ew"]/@href').extract_first()
            title = element.xpath('.//a[@class="nuEeue hzdq5d ME7ew"]/text()').extract_first()
            source = element.xpath('.//span[@class="IH8C7b Pc0Wt"]/text()').extract_first()
            imageUrl = element.xpath('.//img[@class="lmFAjc"]/@src').extract_first()

            item = NewsItem()
            item['title'] = title
            item['image_url'] = imageUrl
            item['action_url'] = actionUrl
            item['source'] = source
            yield item
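
To see how the nested XPath queries in parse() behave, it can help to run the same expressions against a small offline snippet. The sketch below uses the standard library's xml.etree.ElementTree in place of Scrapy's selectors (a deliberate substitution so it runs without Scrapy or Splash). The markup in SAMPLE is hypothetical and much simpler than the real page, and extract_items is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup reusing the class names from the spider.
SAMPLE = """
<div>
  <p class="qx0yFc">
    <a class="nuEeue hzdq5d ME7ew" href="https://example.com/a">First headline</a>
    <span class="IH8C7b Pc0Wt">Example Times</span>
    <img class="lmFAjc" src="https://example.com/a.png" />
  </p>
  <p class="qx0yFc">
    <a class="nuEeue hzdq5d ME7ew" href="https://example.com/b">Second headline</a>
    <span class="IH8C7b Pc0Wt">Sample Daily</span>
  </p>
</div>
"""


def extract_items(markup):
    """Mirror the spider's parse(): one dict per news block, None for missing fields."""
    root = ET.fromstring(markup)
    items = []
    # Outer query selects each news block; inner queries are relative to it.
    for block in root.findall('.//p[@class="qx0yFc"]'):
        link = block.find('.//a[@class="nuEeue hzdq5d ME7ew"]')
        source = block.find('.//span[@class="IH8C7b Pc0Wt"]')
        image = block.find('.//img[@class="lmFAjc"]')
        items.append({
            "title": link.text if link is not None else None,
            "action_url": link.get("href") if link is not None else None,
            "source": source.text if source is not None else None,
            "image_url": image.get("src") if image is not None else None,
        })
    return items
```

The second block has no image, so its image URL comes back as None. extract_first() in the spider behaves the same way when nothing matches, which is worth keeping in mind before the values are written to the database.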


3.4 pipelines.py code

Store the item data in a MySQL database.

Create the db_news database

CREATE DATABASE db_news;


Create the tb_google_news table

CREATE TABLE tb_google_news(
    id INT AUTO_INCREMENT,
    title VARCHAR(50),
    image_url VARCHAR(200),
    action_url VARCHAR(200),
    source VARCHAR(30),
    PRIMARY KEY(id)
)ENGINE=INNODB DEFAULT CHARSET=utf8;
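
The VARCHAR widths above are fairly tight: a headline longer than 50 characters will either be rejected or cut off by MySQL, depending on the server's SQL mode. A minimal sketch of one way to handle this on the Python side is to trim each value to its column width before inserting. The helper name fit_to_columns is hypothetical, the image column is written as image_url, and trimming is only one possible policy (widening the columns is another).

```python
# Column widths taken from the table definition (image column written as image_url).
COLUMN_WIDTHS = {"title": 50, "image_url": 200, "action_url": 200, "source": 30}


def fit_to_columns(row, widths=COLUMN_WIDTHS):
    """Return a copy of row with each string cut to its column width; None is kept."""
    fitted = {}
    for column, width in widths.items():
        value = row.get(column)
        # Only strings are sliced; missing or None values pass through unchanged.
        fitted[column] = value[:width] if isinstance(value, str) else value
    return fitted
```

Note that a trimmed URL is no longer a valid link, so for the two URL columns it may be preferable to skip the row or enlarge the column instead.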


NewsPipeline class

import pymysql


class NewsPipeline(object):
    def __init__(self):
        # Connect to the db_news database created above
        self.conn = pymysql.connect(host='localhost', port=3306, user='root', passwd='root', db='db_news', charset='utf8')
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Parameterized insert; pymysql escapes the values
        sql = '''insert into tb_google_news (title,image_url,action_url,source) values(%s,%s,%s,%s)'''
        self.cursor.execute(sql, (item["title"], item["image_url"], item["action_url"], item["source"]))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Scrapy passes the spider instance to this hook
        self.cursor.close()
        self.conn.close()
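
To try out the pipeline lifecycle without a MySQL server, the same pattern can be sketched with the standard library's sqlite3 module. This is a substitution for local experimentation, not the setup used in this article: the class name SqliteNewsPipeline and the in-memory database are assumptions, and the connection is opened in Scrapy's open_spider hook rather than in __init__.

```python
import sqlite3


class SqliteNewsPipeline:
    """Same open / insert / close lifecycle as NewsPipeline, backed by sqlite3."""

    def open_spider(self, spider):
        # In-memory database: data disappears when the connection is closed.
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE tb_google_news ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "title TEXT, image_url TEXT, action_url TEXT, source TEXT)"
        )

    def process_item(self, item, spider):
        # sqlite3 uses ? placeholders where pymysql uses %s.
        self.conn.execute(
            "INSERT INTO tb_google_news (title, image_url, action_url, source) "
            "VALUES (?, ?, ?, ?)",
            (item["title"], item["image_url"], item["action_url"], item["source"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()
```

Both hooks receive the spider as their second argument, and process_item returns the item so that any later pipeline still receives it.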


3.5 Run the Scrapy crawler

Run the following in the console:

scrapy crawl google_news


After the crawl finishes, the scraped news items are stored in the tb_google_news table.


