Home Backend Development Python Tutorial Python uses Phantomjs to crawl the web page after rendering JS

Python uses Phantomjs to crawl the web page after rendering JS

Nov 07, 2016 am 09:59 AM

I recently needed to crawl a website, but unfortunately the pages were all generated after JS rendering. Ordinary crawler frameworks couldn't handle it, so I thought of using Phantomjs to build a proxy.
There seems to be no ready-made third-party library for Python to call Phantomjs (if there is, please inform Xiao2). After taking a walk, I found that only pyspider provides a ready-made solution.
After a brief trial, I feel that pyspider is more like a crawler tool built for novices, like an old lady, sometimes meticulous, sometimes chatty.
Lightweight gadgets should be more popular. With a little selfishness, I can use my favorite BeautifulSoup together without having to learn PyQuery (pyspider is used to parse HTML), let alone endure the browser writing Python Bad experience (laughing).
So I spent an afternoon taking out the part of pyspider that implements the Phantomjs agent and turning it into a small crawler module. I hope everyone will like it (thanks binux!).

Preparation

Of course you need Phantomjs, nonsense! (It is best to use supervisord to guard under Linux. Phantomjs must be kept open when crawling)

Start with phantomjs_fetcher.js in the project path: phantomjs phantomjs_fetcher.js [port]

Install tornado dependencies (using tornado httpclient module)

The call is super simple

from tornado_fetcher import Fetcher
# 创建一个爬虫
>>> fetcher=Fetcher(
    user_agent='phantomjs', # 模拟浏览器的User-Agent
    phantomjs_proxy='http://localhost:12306', # phantomjs的地址
    poolsize=10, # 最大的httpclient数量
    async=False # 同步还是异步
    )
# 开始连接Phantomjs的代理,可以渲染JS!
>>> fetcher.phantomjs_fetch(url)
# 渲染成功后执行额外的JS脚本(注意用function包起来!)
>>> fetcher.phantomjs_fetch(url, js_script='function(){setTimeout("window.scrollTo(0,100000)}", 1000)')
Copy after login


Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. How to Fix Audio if You Can't Hear Anyone
3 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

What is the function of C language sum? What is the function of C language sum? Apr 03, 2025 pm 02:21 PM

There is no built-in sum function in C language, so it needs to be written by yourself. Sum can be achieved by traversing the array and accumulating elements: Loop version: Sum is calculated using for loop and array length. Pointer version: Use pointers to point to array elements, and efficient summing is achieved through self-increment pointers. Dynamically allocate array version: Dynamically allocate arrays and manage memory yourself, ensuring that allocated memory is freed to prevent memory leaks.

Who gets paid more Python or JavaScript? Who gets paid more Python or JavaScript? Apr 04, 2025 am 12:09 AM

There is no absolute salary for Python and JavaScript developers, depending on skills and industry needs. 1. Python may be paid more in data science and machine learning. 2. JavaScript has great demand in front-end and full-stack development, and its salary is also considerable. 3. Influencing factors include experience, geographical location, company size and specific skills.

Is distinctIdistinguish related? Is distinctIdistinguish related? Apr 03, 2025 pm 10:30 PM

Although distinct and distinct are related to distinction, they are used differently: distinct (adjective) describes the uniqueness of things themselves and is used to emphasize differences between things; distinct (verb) represents the distinction behavior or ability, and is used to describe the discrimination process. In programming, distinct is often used to represent the uniqueness of elements in a collection, such as deduplication operations; distinct is reflected in the design of algorithms or functions, such as distinguishing odd and even numbers. When optimizing, the distinct operation should select the appropriate algorithm and data structure, while the distinct operation should optimize the distinction between logical efficiency and pay attention to writing clear and readable code.

Does H5 page production require continuous maintenance? Does H5 page production require continuous maintenance? Apr 05, 2025 pm 11:27 PM

The H5 page needs to be maintained continuously, because of factors such as code vulnerabilities, browser compatibility, performance optimization, security updates and user experience improvements. Effective maintenance methods include establishing a complete testing system, using version control tools, regularly monitoring page performance, collecting user feedback and formulating maintenance plans.

How to understand !x in C? How to understand !x in C? Apr 03, 2025 pm 02:33 PM

!x Understanding !x is a logical non-operator in C language. It booleans the value of x, that is, true changes to false, false changes to true. But be aware that truth and falsehood in C are represented by numerical values ​​rather than boolean types, non-zero is regarded as true, and only 0 is regarded as false. Therefore, !x deals with negative numbers the same as positive numbers and is considered true.

What does sum mean in C language? What does sum mean in C language? Apr 03, 2025 pm 02:36 PM

There is no built-in sum function in C for sum, but it can be implemented by: using a loop to accumulate elements one by one; using a pointer to access and accumulate elements one by one; for large data volumes, consider parallel calculations.

How to obtain real-time application and viewer data on the 58.com work page? How to obtain real-time application and viewer data on the 58.com work page? Apr 05, 2025 am 08:06 AM

How to obtain dynamic data of 58.com work page while crawling? When crawling a work page of 58.com using crawler tools, you may encounter this...

Copy and paste Love code Copy and paste Love code for free Copy and paste Love code Copy and paste Love code for free Apr 04, 2025 am 06:48 AM

Copying and pasting the code is not impossible, but it should be treated with caution. Dependencies such as environment, libraries, versions, etc. in the code may not match the current project, resulting in errors or unpredictable results. Be sure to ensure the context is consistent, including file paths, dependent libraries, and Python versions. Additionally, when copying and pasting the code for a specific library, you may need to install the library and its dependencies. Common errors include path errors, version conflicts, and inconsistent code styles. Performance optimization needs to be redesigned or refactored according to the original purpose and constraints of the code. It is crucial to understand and debug copied code, and do not copy and paste blindly.

See all articles