Here are two good tools for scraping ("collecting") web pages with PHP: Snoopy and simple_html_dom. There are many ways to scrape (in essence only two or three; the rest are derivatives), and PHP's built-in functions can also do the job directly. But in the spirit of carrying laziness through to the end, we can use these two tools to make scraping easier.
There are many introductions to Snoopy on the Internet. The following is Snoopy's documentation as translated by others:
/////////////////// ////////////////////////////////////////////
Snoopy is a PHP class that simulates a web browser: it can fetch web content and submit forms.
Some features of Snoopy:
1 Fetches the content of a web page: fetch
2 Fetches the text content of a web page (HTML tags removed): fetchtext
3 Fetches the links and forms of a web page: fetchlinks, fetchform
4 Supports proxy hosts
5 Supports basic username/password authentication
6 Supports setting user_agent, referer, cookies and header content
7 Supports browser redirects and can limit the redirect depth
8 Expands links in fetched pages into fully qualified URLs (default)
9 Submits form data and retrieves the results
10 Supports following HTML frames
11 Supports passing cookies on redirects
PHP 4 or above is required. Since Snoopy is a plain PHP class, it needs no extension, which makes it a good choice when the server does not support curl.
Class methods:
fetch($URI)
————–
This is the method used to fetch the content of the web page.
The $URI parameter is the URL address of the crawled web page.
The fetched results are stored in $this->results.
If the page you are fetching is a frameset, Snoopy follows each frame and stores the fetched frames as an array in $this->results.
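As a rough illustration of fetch(), here is a minimal sketch. It assumes Snoopy.class.php sits in the same directory as the script, and http://www.example.com/ is only a placeholder address; neither comes from the original text.

```php
<?php
// Assumption: Snoopy.class.php is in the same directory as this script.
require_once __DIR__ . '/Snoopy.class.php';

$snoopy = new Snoopy();

// Placeholder URL; replace it with the page you want to fetch.
$ok = $snoopy->fetch('http://www.example.com/');

if ($ok) {
    // Status line and headers returned by the server.
    echo 'response code: ' . $snoopy->response_code . "<br>\n";
    foreach ($snoopy->headers as $header) {
        echo htmlspecialchars($header) . "<br>\n";
    }
    // For a normal page, results is a string; for a frameset it is an array.
    if (is_string($snoopy->results)) {
        echo '<pre>' . htmlspecialchars($snoopy->results) . "</pre>\n";
    }
} else {
    echo 'error fetching document: ' . $snoopy->error . "\n";
}
```

fetch() returns true or false, so checking the return value before reading $snoopy->results avoids printing stale or empty data.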
fetchtext($URI)
————
This method is similar to fetch(). The only difference is that it removes HTML tags and other irrelevant data and returns only the text content of the web page.
fetchform($URI)
————
This method is similar to fetch(). The only difference is that it removes HTML tags and other irrelevant data and returns only the form content (form) of the web page.
fetchlinks($URI)
————-
This method is similar to fetch(). The only difference is that it removes HTML tags and other irrelevant data and returns only the links (link) in the web page.
By default, relative links will be automatically completed and converted into full URLs.
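A small sketch of fetchlinks() may make the difference from fetch() clearer. It assumes Snoopy.class.php is in the same directory, and the URL is a placeholder.

```php
<?php
// Assumption: Snoopy.class.php is in the same directory as this script.
require_once __DIR__ . '/Snoopy.class.php';

$snoopy = new Snoopy();

// expandlinks defaults to true, so relative links come back as full URLs.
$expand = $snoopy->expandlinks;

// Placeholder URL; replace it with the page whose links you want.
$ok = $snoopy->fetchlinks('http://www.example.com/');

if ($ok && is_array($snoopy->results)) {
    // After fetchlinks(), results holds the list of links found on the page.
    foreach ($snoopy->results as $link) {
        if (is_string($link)) {
            echo $link . "\n";
        }
    }
} else {
    echo 'error fetching links: ' . $snoopy->error . "\n";
}
```

fetchtext() and fetchform() are called the same way; only the content left in $snoopy->results differs.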
submit($URI,$formvars)
——————-
This method submits a form to the address specified by $URI. $formvars is an array that holds the form parameters.
submittext($URI,$formvars)
————————–
This method is similar to submit(). The only difference is that it removes HTML tags and other irrelevant data and returns only the text content of the page that comes back after the form is submitted.
submitlinks($URI,$formvars)
————-
This method is similar to submit(). The only difference is that it removes HTML tags and other irrelevant data and returns only the links (link) in the web page.
By default, relative links will be automatically completed and converted into full URLs.
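Submitting a form could look roughly like the sketch below. The target URL and the field names q and lang are hypothetical, and Snoopy.class.php is assumed to be in the same directory.

```php
<?php
// Assumption: Snoopy.class.php is in the same directory as this script.
require_once __DIR__ . '/Snoopy.class.php';

$snoopy = new Snoopy();

// Hypothetical form fields; use the names the real form expects.
$formvars = array(
    'q'    => 'snoopy',
    'lang' => 'en',
);

// Hypothetical form handler URL.
$ok = $snoopy->submit('http://www.example.com/search.php', $formvars);

if ($ok) {
    // results holds the page returned after the form was submitted.
    if (is_string($snoopy->results)) {
        echo '<pre>' . htmlspecialchars($snoopy->results) . "</pre>\n";
    }
} else {
    echo 'error submitting form: ' . $snoopy->error . "\n";
}
```

Swapping submit() for submittext() or submitlinks() keeps the same arguments and changes only what ends up in $snoopy->results.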
Class attributes (default values are in parentheses):
$host The host to connect to
$port The port to connect to
$proxy_host The proxy host to use, if any
$proxy_port The proxy port to use, if any
$agent The user agent to masquerade as (Snoopy v0.1)
$referer Referer information to pass, if any
$cookies Cookies to pass, if any
$rawheaders Other header information to pass, if any
$maxredirs Maximum number of redirects to follow, 0 = none allowed (5)
$offsiteok Whether to allow redirects to other sites (true)
$expandlinks Whether to expand all links to fully qualified URLs (true)
$user Authentication user name, if any
$pass Authentication password, if any
$accept HTTP accept types (image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*)
$error Where errors are reported, if any
$response_code Response code returned by the server
$headers Headers returned by the server
$maxlength Maximum length of the returned data
$read_timeout Timeout for read operations (requires PHP 4 Beta 4+); set to 0 for no timeout
$timed_out True if a read operation timed out (requires PHP 4 Beta 4+)
$maxframes Maximum number of frames to follow
$status HTTP status of the fetch
$temp_dir Temporary directory that the web server can write to (/tmp)
$curl_path Path to the cURL binary; set it to false if there is no cURL binary
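These attributes are set directly on the object before calling any of the methods. The sketch below only configures a Snoopy instance and makes no request; the user agent string, referer, cookie value and proxy address are made-up values, and Snoopy.class.php is assumed to be in the same directory.

```php
<?php
// Assumption: Snoopy.class.php is in the same directory as this script.
require_once __DIR__ . '/Snoopy.class.php';

$snoopy = new Snoopy();

// Pretend to be a regular browser (made-up user agent string).
$snoopy->agent = 'Mozilla/5.0 (compatible; ExampleBot/1.0)';
$snoopy->referer = 'http://www.example.com/';

// Cookies and extra headers are plain associative arrays.
$snoopy->cookies['SessionID'] = '238472834723489';
$snoopy->rawheaders['Pragma'] = 'no-cache';

// Follow at most two redirects and stay on the same site.
$snoopy->maxredirs = 2;
$snoopy->offsiteok = false;

// Give up on reads after 10 seconds (0 would mean no timeout).
$snoopy->read_timeout = 10;

// Optional proxy (hypothetical host and port):
// $snoopy->proxy_host = 'proxy.example.com';
// $snoopy->proxy_port = '8080';
```

Any fetch or submit call made on $snoopy afterwards uses these settings.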
The following is the demo; this line prints the fetched page:
echo "<pre>".htmlspecialchars($snoopy->results)."</pre>\n";
DOM-style method | simple_html_dom equivalent
array $e->getAllAttributes () | array $e->attr
string $e->getAttribute ( $name ) | string $e->attribute
void $e->setAttribute ( $name, $value ) | void $e->attribute = $value
bool $e->hasAttribute ( $name ) | bool isset($e->attribute)
void $e->removeAttribute ( $name ) | void $e->attribute = null
element $e->getElementById ( $id ) | mixed $e->find ( "#$id", 0 )
mixed $e->getElementsById ( $id [,$index] ) | mixed $e->find ( "#$id" [, int $index] )
element $e->getElementByTagName ( $name ) | mixed $e->find ( $name, 0 )
mixed $e->getElementsByTagName ( $name [, $index] ) | mixed $e->find ( $name [, int $index] )
element $e->parentNode () | element $e->parent ()
mixed $e->childNodes ( [$index] ) | mixed $e->children ( [int $index] )
element $e->firstChild () | element $e->first_child ()
element $e->lastChild () | element $e->last_child ()
element $e->nextSibling () | element $e->next_sibling ()
element $e->previousSibling () | element $e->prev_sibling ()
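To see a few of these pairs side by side, here is a small sketch that parses an inline HTML string, so no network access is needed. It assumes simple_html_dom.php is in the same directory; the HTML snippet is made up for the example.

```php
<?php
// Assumption: simple_html_dom.php is in the same directory as this script.
require_once __DIR__ . '/simple_html_dom.php';

// Made-up markup, parsed from a string.
$html = str_get_html(
    '<ul id="menu"><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
);

// DOM-style call and its native equivalent find the same element.
$menuDom    = $html->getElementById('menu');
$menuNative = $html->find('#menu', 0);

// Walk the tree with the native names (firstChild/nextSibling also work).
$firstItem  = $menuNative->first_child();
$secondItem = $firstItem->next_sibling();
$link       = $secondItem->find('a', 0);

// Reading an attribute: getAttribute('href') versus the magic property.
$hrefDom    = $link->getAttribute('href');
$hrefNative = $link->href;

// Writing an attribute: setAttribute() versus plain assignment.
$link->setAttribute('title', 'second');
$link->rel = 'nofollow';

echo $menuDom->tag . ' ' . $hrefDom . ' ' . $hrefNative . "\n";
echo $link->outertext . "\n";
```

Which spelling to use is a matter of taste; the DOM-style names are handy if you are used to JavaScript, the native ones are shorter.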