Detailed explanation of data collection in PHP

WBOY

Jul 21, 2016 pm 03:10 PM

Two good tools for collecting (scraping) web data with PHP are Snoopy and simple_html_dom. There are many ways to collect data (in essence only two or three; the rest are derivatives), and PHP ships with several functions that can do the job directly. Still, in the spirit of being thoroughly lazy, we can use these two tools to make collection easier.

There are many introductions to Snoopy on the Internet. The following is Snoopy's documentation, as translated by others.
////////////////////////////////////////////////////////////////
Snoopy is a PHP class that simulates a web browser: it can fetch web page content and submit forms.
Some features of Snoopy:
1 Fetches the content of a web page (fetch)
2 Fetches the text of a web page with the HTML tags removed (fetchtext)
3 Fetches the links and forms of a web page (fetchlinks, fetchform)
4 Supports proxy hosts
5 Supports basic username/password authentication
6 Supports setting the user_agent, referer, cookies, and header content
7 Supports browser redirects and can limit the redirect depth
8 Can expand links in a page into fully qualified URLs (default)
9 Submits data and retrieves the response
10 Supports following HTML frames
11 Supports passing cookies on redirects
PHP 4 or above is required. Because Snoopy is a plain PHP class, no extension is needed. It is a good choice when the server does not support curl.
Class methods:
fetch($URI)
-----------
This method fetches the content of a web page.
The $URI parameter is the URL of the page to fetch.
The fetched result is stored in $this->results.
If the page is a frameset, Snoopy follows each frame, collects the frames in an array, and stores that array in $this->results.
fetchtext($URI)
---------------
This method is similar to fetch(), except that it strips HTML tags and other irrelevant data and returns only the text content of the page.
fetchform($URI)
---------------
This method is similar to fetch(), except that it returns only the form elements of the page.
fetchlinks($URI)
----------------
This method is similar to fetch(), except that it returns only the links in the page.
By default, relative links are automatically expanded into full URLs.
submit($URI,$formvars)
----------------------
This method submits a form to the address specified by $URI. $formvars is an array holding the form parameters.
submittext($URI,$formvars)
--------------------------
This method is similar to submit(), except that it strips HTML tags and other irrelevant data and returns only the text content of the page that comes back after submission.
submitlinks($URI)
-----------------
This method is similar to submit(), except that it returns only the links in the page that comes back.
By default, relative links are automatically expanded into full URLs.
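To make "expanding relative links into full URLs" concrete, here is a minimal sketch of the idea in plain PHP (7.0+). It is not Snoopy's own code: expand_link() is a hypothetical helper, the example.com URLs are placeholders, and only the common cases are covered (absolute, protocol-relative, root-relative, and path-relative links). Fragment-only links, query-only links, and trailing slashes are not handled.

```php
<?php
// Hypothetical helper (not part of Snoopy): resolve $link against the page URL $base.
function expand_link(string $base, string $link): string
{
    // Already absolute (has a scheme such as http: or mailto:): return unchanged.
    if (preg_match('#^[a-z][a-z0-9+.\-]*:#i', $link)) {
        return $link;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '');
    if (isset($parts['port'])) {
        $origin .= ':' . $parts['port'];
    }

    // Protocol-relative link (//host/path): reuse the scheme of the base URL.
    if (strncmp($link, '//', 2) === 0) {
        return $scheme . ':' . $link;
    }

    // Root-relative link (/path): attach directly to the origin.
    if (strncmp($link, '/', 1) === 0) {
        return $origin . $link;
    }

    // Path-relative link: resolve against the directory of the base path.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);

    // Collapse "." and ".." segments.
    $resolved = [];
    foreach (explode('/', $dir . $link) as $segment) {
        if ($segment === '..') {
            array_pop($resolved);
        } elseif ($segment !== '.' && $segment !== '') {
            $resolved[] = $segment;
        }
    }
    return $origin . '/' . implode('/', $resolved);
}

$base = 'http://www.example.com/news/index.html';
echo expand_link($base, 'item.html'), "\n";     // http://www.example.com/news/item.html
echo expand_link($base, '../img/a.png'), "\n";  // http://www.example.com/img/a.png
echo expand_link($base, '/about.html'), "\n";   // http://www.example.com/about.html
```

With $expandlinks left at its default, Snoopy does this kind of completion for you; a helper like the one above is only needed if you turn that option off or collect links by other means.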
Class attributes (default values in parentheses):
$host The host to connect to
$port The port to connect to
$proxy_host The proxy host to use, if any
$proxy_port The proxy host port to use, if any
$agent The user agent to masquerade as (Snoopy v0.1)
$referer Referer information, if any
$cookies Cookies, if any
$rawheaders Other header information, if any
$maxredirs Maximum number of redirects, 0 = none allowed (5)
$offsiteok Whether to allow off-site redirects (true)
$expandlinks Whether to expand all links into fully qualified URLs (true)
$user Authentication username, if any
$pass Authentication password, if any
$accept HTTP accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)
$error Where errors are reported, if any
$response_code Response code returned by the server
$headers Headers returned by the server
$maxlength Maximum length of the returned data
$read_timeout Timeout for read operations (requires PHP 4 Beta 4+); set to 0 for no timeout
$timed_out True if a read operation timed out (requires PHP 4 Beta 4+)
$maxframes Maximum number of frames that will be followed
$status HTTP status of the fetch
$temp_dir Temporary directory that the web server can write to (/tmp)
$curl_path Path to the cURL binary; set to false if there is no cURL binary
The following is the demo

<?php
include "Snoopy.class.php";
$snoopy = new Snoopy;

// Route the request through a proxy
$snoopy->proxy_host = "www.7767.cn";
$snoopy->proxy_port = "8080";

// Browser identity, cookies, and extra headers
$snoopy->agent = "(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)";
$snoopy->referer = "http://www.7767.cn/";
$snoopy->cookies["SessionID"] = "238472834723489"; // quoted: the bare value with a trailing "l" is a parse error
$snoopy->cookies["favoriteColor"] = "RED";
$snoopy->rawheaders["Pragma"] = "no-cache";

// Redirect and link handling
$snoopy->maxredirs = 2;
$snoopy->offsiteok = false;
$snoopy->expandlinks = false;

// Basic authentication credentials
$snoopy->user = "joe";
$snoopy->pass = "bloe";

// Fetch the page text and print it, escaped, inside a <pre> block
if ($snoopy->fetchtext("http://www.7767.cn")) {
    echo "<pre>" . htmlspecialchars($snoopy->results) . "</pre>\n";
} else {
    echo "error fetching document: " . $snoopy->error . "\n";
}
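
For comparison, the introduction notes that PHP can also collect pages with its own built-in functions. The sketch below approximates a few of the demo settings (user agent, referer, cookies, redirect limit, timeout) with an HTTP stream context. It assumes PHP 7.0+; build_fetch_context() is a hypothetical helper, and the redirect and timeout values are illustrative rather than exact equivalents of the Snoopy settings.

```php
<?php
// Hypothetical helper (not part of Snoopy or PHP): build an HTTP stream context
// that roughly mirrors the Snoopy settings used in the demo.
function build_fetch_context(string $agent, string $referer, array $cookies, int $maxRedirects, float $timeout)
{
    // Join the cookies into a single "Cookie:" header value.
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . rawurlencode((string) $value);
    }

    $headers = ['Referer: ' . $referer, 'Pragma: no-cache'];
    if ($pairs !== []) {
        $headers[] = 'Cookie: ' . implode('; ', $pairs);
    }

    return stream_context_create([
        'http' => [
            'method'        => 'GET',
            'user_agent'    => $agent,
            'header'        => implode("\r\n", $headers),
            // PHP's manual notes that a value of 1 or less follows no redirects,
            // so this number is not interchangeable with Snoopy's $maxredirs.
            'max_redirects' => $maxRedirects,
            'timeout'       => $timeout,
        ],
    ]);
}

$context = build_fetch_context(
    '(compatible; MSIE 4.01; MSN 2.5; AOL 4.0; Windows 98)',
    'http://www.7767.cn/',
    ['SessionID' => '238472834723489', 'favoriteColor' => 'RED'],
    3,
    10.0
);

// Fetching is then one call (needs allow_url_fopen and network access):
// $html = file_get_contents('http://www.7767.cn/', false, $context);
```

This covers only plain fetching. Snoopy's frame following, link expansion, and form helpers have no direct counterpart here.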

//////////////////////////////////////////////////////////////
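Snoopy's hallmark is that it is "big" and "complete": a single fetch captures everything, so it works well as the first step of collection. The next step is to use simple_html_dom to carefully pick out the parts you want. Of course, if you are especially good at regular expressions and fond of them, you can also do the extraction with regex matching.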
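
As an illustration of the regex route, here is a small sketch that pulls href values out of anchor tags. The function name extract_hrefs() and the sample markup are made up for this example, and the pattern only handles quoted attribute values, so treat it as a starting point rather than a robust HTML parser.

```php
<?php
// Hypothetical helper: extract every quoted href value from <a> tags with a regex.
function extract_hrefs(string $html): array
{
    // (["\']) captures the opening quote; \1 requires the same quote to close the value.
    preg_match_all('/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is', $html, $matches);
    return $matches[2];
}

$page = '<p><a href="/news/1.html">One</a> <a class="x" href=\'/news/2.html\'>Two</a></p>';
print_r(extract_hrefs($page));
```

For the sample string this yields the two paths /news/1.html and /news/2.html. Unquoted attribute values or unusual markup would need a more careful pattern, which is where a DOM parser becomes more comfortable.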

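simple_html_dom is essentially a DOM parser. PHP itself also provides some parsing facilities, but simple_html_dom is arguably more polished: a single class covers much of the functionality you are likely to want.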
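
For reference, this is roughly what the built-in route looks like with PHP's DOM extension (DOMDocument). It is a sketch that assumes the dom extension is enabled and PHP 7.0+; collect_attributes() and the sample markup are hypothetical.

```php
<?php
// Hypothetical helper: collect one attribute from every element with the given tag name,
// using PHP's bundled DOM extension instead of simple_html_dom.
function collect_attributes(string $html, string $tag, string $attribute): array
{
    $doc = new DOMDocument();

    // Real-world pages are rarely valid HTML, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $values = [];
    foreach ($doc->getElementsByTagName($tag) as $node) {
        if ($node->hasAttribute($attribute)) {
            $values[] = $node->getAttribute($attribute);
        }
    }
    return $values;
}

$page = '<html><body><img src="a.png"><a href="/one">1</a>'
      . '<img src="b.png"><a name="x">no href</a></body></html>';

print_r(collect_attributes($page, 'img', 'src'));  // image sources
print_r(collect_attributes($page, 'a', 'href'));   // link targets
```

It works, but it takes more setup than the one-line find() calls that simple_html_dom offers.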
////////////////////////////////////////////////////////////////
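// Load the library (adjust the path to wherever the package is unpacked)
include 'simple_html_dom.php';
// Create a target document object (the target web page) from a URL or a file name
$html = file_get_html('http://www.7767.cn/');
//$html = file_get_html('test.htm');
// Use a string as the target page. You can fetch the page with Snoopy and then process it here
$myhtml = str_get_html('Hello!');
// Find all images; the result is an array
foreach ($html->find('img') as $element)
    echo $element->src . '<br>';
// Find all links
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';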

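The find method is very handy. It usually returns an array of element objects. When looking for target elements, you can select them by class, by id, or by other attributes.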

// Find a div by its class attribute. The second argument to find() is the index into the result array, counting from 0
$target_div = $html->find('div.targetclass', 0);
// To check whether the result is what you want, just echo it
echo $target_div;
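
To show what a selector such as 'div.targetclass' with index 0 amounts to, here is a comparable lookup written with PHP's built-in DOMXPath. It is a sketch, not simple_html_dom's implementation: find_first_by_class() is a hypothetical helper, it assumes PHP 7.1+ with the dom extension, and it expects plain tag and class names because they are placed into the XPath query unescaped.

```php
<?php
// Hypothetical helper: return the HTML of the first <$tag> whose class list contains $class,
// or null when nothing matches.
function find_first_by_class(string $html, string $tag, string $class): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);  // tolerate imperfect markup
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Pad the class attribute with spaces so "targetclass" does not match "targetclass2".
    $query = sprintf(
        "//%s[contains(concat(' ', normalize-space(@class), ' '), ' %s ')]",
        $tag,
        $class
    );

    $nodes = (new DOMXPath($doc))->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return $doc->saveHTML($nodes->item(0));  // index 0: the first match
}

$page = '<div class="nav">menu</div>'
      . '<div class="box targetclass"><p>wanted</p></div>'
      . '<div class="targetclass">second</div>';

echo find_first_by_class($page, 'div', 'targetclass');
```

The amount of ceremony here is a fair argument for the one-line find() call shown above.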

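// One key point: once you are done with the collection object, be sure to destroy it. Otherwise the PHP page may "hang" for around 30 seconds, depending on your server's time limit. To destroy it:
$html->clear();
unset($html);
In my opinion, the strength of simple_html_dom is that it makes collection as easy to control as working in JS. The download package includes an English manual:
simplehtmldom_1_11/simplehtmldom/manual/manual.htm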

The manual also lists DOM-style (camel-case) method names and the simple_html_dom calls they map to:

Method                                            Mapping
array $e->getAllAttributes()                      array $e->attr
string $e->getAttribute($name)                    string $e->attribute
void $e->setAttribute($name, $value)              void $e->attribute = $value
bool $e->hasAttribute($name)                      bool isset($e->attribute)
void $e->removeAttribute($name)                   void $e->attribute = null
element $e->getElementById($id)                   mixed $e->find("#$id", 0)
mixed $e->getElementsById($id [, $index])         mixed $e->find("#$id" [, int $index])
element $e->getElementByTagName($name)            mixed $e->find($name, 0)
mixed $e->getElementsByTagName($name [, $index])  mixed $e->find($name [, int $index])
element $e->parentNode()                          element $e->parent()
mixed $e->childNodes([$index])                    mixed $e->children([int $index])
element $e->firstChild()                          element $e->first_child()
element $e->lastChild()                           element $e->last_child()
element $e->nextSibling()                         element $e->next_sibling()
element $e->previousSibling()                     element $e->prev_sibling()