Implementation code for web scraping using PhantomJS

May 16, 2016, 04:35 PM
phantomjs web scraping

PhantomJS is a headless browser: it executes JavaScript and builds the page's DOM without displaying a window, which makes it well suited to web scraping.

For example, suppose we want to batch-scrape the "Today in History" entries from http://www.todayonhistory.com/.

Inspecting the DOM structure shows that we only need the title attribute of each element matching .list li a. So we use a CSS selector to collect those nodes and concatenate their titles:

var d = '';
var c = document.querySelectorAll('.list li a'); // every link inside the list
var l = c.length;
for (var i = 0; i < l; i++) {
    d = d + c[i].title + '\n'; // one title per line
}
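
The same loop can be wrapped in a small function, which makes it easier to try out on sample data. The following is only a sketch: the name collectTitles is hypothetical, and unlike the loop above it skips links without a title and does not append a trailing newline. Plain objects stand in for the anchor elements so it can run outside a browser.

```javascript
// Hypothetical helper: collect the title of every node in an array-like list.
// Written in ES5 so the same code can also run inside PhantomJS.
function collectTitles(nodes) {
    var titles = [];
    for (var i = 0; i < nodes.length; i++) {
        // Skip links that have no title set.
        if (nodes[i].title) {
            titles.push(nodes[i].title);
        }
    }
    // Join with newlines; no trailing newline is added.
    return titles.join('\n');
}

// In a browser console this would be called as:
//   collectTitles(document.querySelectorAll('.list li a'));
// Here plain objects stand in for the anchor elements.
var sample = [{ title: 'Event A' }, { title: '' }, { title: 'Event B' }];
console.log(collectTitles(sample));
```

Note that a function passed to page.evaluate runs in the page's own context, so a helper like this would have to be defined inside that function rather than referenced from the outer script.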

After that, all that remains is to run this JavaScript inside PhantomJS:

var page = require('webpage').create();
page.open('http://www.todayonhistory.com/', function (status) { // open the page
    if (status !== 'success') {
        console.log('FAIL to load the address');
    } else {
        // The function below runs inside the page and returns the titles
        console.log(page.evaluate(function () {
            var d = '';
            var c = document.querySelectorAll('.list li a');
            var l = c.length;
            for (var i = 0; i < l; i++) {
                d = d + c[i].title + '\n';
            }
            return d;
        }));
    }
    phantom.exit();
});

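Finally, we save the script as catch.js, run it from the command prompt, and redirect the output to a text file (for example, phantomjs catch.js > history.txt). Alternatively, the script can write the file itself using the file API of PhantomJS.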
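
As a rough illustration of the file API route, the sketch below writes the scraped text through the write(path, content, mode) function of the PhantomJS fs module. The helper name saveText, the file name history.txt, and the fakeFs stand-in are all assumptions made for this example; the module is passed in as a parameter so the code can be tried outside PhantomJS.

```javascript
// Hypothetical helper: write text through a PhantomJS-style fs module.
// The module is passed in so the function does not depend on a global.
function saveText(fsModule, path, text) {
    // PhantomJS: fs.write(path, content, mode); mode 'w' overwrites the file.
    fsModule.write(path, text, 'w');
    return path;
}

// Inside catch.js this might be used as (names are assumptions):
//   var fs = require('fs');
//   saveText(fs, 'history.txt', titles);

// Stand-in object that records what would have been written.
var fakeFs = {
    written: {},
    write: function (path, content, mode) {
        this.written[path] = { content: content, mode: mode };
    }
};

saveText(fakeFs, 'history.txt', 'Event A\nEvent B\n');
console.log(fakeFs.written['history.txt'].content);
```

Writing the file from the script avoids depending on shell redirection, at the cost of hard-coding or passing in an output path.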
