


How to Crawl and Analyze Web Pages with PHP
This article walks through an example of crawling and analyzing web pages with PHP. The details are as follows:
Crawling and analyzing a file is simple. This tutorial takes you through an example step by step. Let's get started!
First, we must decide which URL we will crawl. It can be set in the script or passed in through the query string ($_SERVER['QUERY_STRING'] in current PHP). For simplicity, let's set the variable directly in the script.
<?php $url = 'http://www.php.net'; ?>
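If the URL is taken from the query string instead, it should be validated before use. The following is a minimal sketch under stated assumptions: the helper name choose_url() and the query parameter name "url" are hypothetical and do not come from the original tutorial.

```php
<?php
// Hypothetical helper (not part of the original tutorial): choose the URL to
// crawl from a query-string array such as $_GET, falling back to a default.
function choose_url(array $query, string $default): string
{
    $candidate = $query['url'] ?? '';
    // Accept only syntactically valid http/https URLs.
    if (filter_var($candidate, FILTER_VALIDATE_URL) !== false
        && preg_match('/^https?:\/\//i', $candidate) === 1) {
        return $candidate;
    }
    return $default;
}

// With no "url" parameter present, the default is used.
$url = choose_url($_GET, 'http://www.php.net');
```

Restricting the scheme to http and https keeps inputs such as ftp:// or file:// URLs from being passed on to the fetching step.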
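In the second step, we fetch the specified file and store it in an array with the file() function, which returns one array element per line. Reading a URL this way requires the allow_url_fopen setting to be enabled.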
<?php $url = 'http://www.php.net'; $lines_array = file($url); ?>
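file() returns false when the source cannot be read, so it is worth checking the result. Here is a minimal sketch of that check; the wrapper name fetch_lines() is hypothetical, and a temporary local file stands in for the remote page so the example runs offline.

```php
<?php
// Hypothetical wrapper around file(): returns the lines of a URL or local
// path, or throws if the source cannot be read.
function fetch_lines(string $source): array
{
    // file() returns false on failure; @ suppresses the warning so the
    // error can be reported as an exception instead.
    $lines = @file($source);
    if ($lines === false) {
        throw new RuntimeException("Could not read $source");
    }
    return $lines;
}

// Demonstrated on a temporary local file; a URL such as
// 'http://www.php.net' is read the same way when allow_url_fopen is on.
$tmp = tempnam(sys_get_temp_dir(), 'crawl');
file_put_contents($tmp, "<html>\n<head><title>Demo</title></head>\n</html>\n");
$lines_array = fetch_lines($tmp);
echo count($lines_array), " lines read\n"; // prints "3 lines read"
unlink($tmp);
```

Each element of the returned array still ends with its newline character, which matters for the joining step that follows.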
Okay, now we have the file in an array. However, the text we want to analyze may not all be on one line. To make the file easier to parse, we can convert the array $lines_array into a single string with implode(x, y). If you plan to split the string back into an array with explode() later, it may be better to set x to "|", "!", or a similar delimiter. For our purposes, it's best to set x to an empty string so the text is joined unchanged. y is the other required parameter: the array you want implode() to process.
<?php $url = 'http://www.php.net'; $lines_array = file($url); $lines_string = implode('', $lines_array); ?>
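The difference between the two choices of glue can be seen in a small sketch. The sample array below is made up for illustration and stands in for the result of file().

```php
<?php
// Sample lines standing in for the array returned by file(); each element
// still ends with its newline.
$lines_array = ["<head>\n", "<title>Demo</title>\n", "</head>\n"];

// Empty-string glue: the original text is reproduced exactly.
$lines_string = implode('', $lines_array);

// A delimiter as glue makes the join reversible with explode(), as long as
// the delimiter does not occur in the lines themselves.
$joined = implode('|', $lines_array);
$round_trip = explode('|', $joined);

var_dump($lines_string === "<head>\n<title>Demo</title>\n</head>\n"); // bool(true)
var_dump($round_trip === $lines_array);                              // bool(true)
```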
Now that the crawling work is done, it's time to analyze. For this example, we want to get everything between <head> and </head>. To parse out that string, we need a regular expression. Older code used eregi() for this, but that function was removed in PHP 7.0; preg_match() with the i (case-insensitive) and s (dot matches newlines) modifiers does the same job.
<?php $url = 'http://www.php.net'; $lines_array = file($url); $lines_string = implode('', $lines_array); preg_match('/<head>(.*)<\/head>/is', $lines_string, $head); ?>
Let's take a look at the code. As you can see, preg_match() is called in the following format:
preg_match('/<head>(.*)<\/head>/is', $lines_string, $head);
"(.*)" matches everything, so the pattern can be read as "capture everything between <head> and </head>". $lines_string is the string we are analyzing, and $head is the array where the results are stored.
Finally, we can output the data. $head[0] holds the entire match, including the tags themselves, and $head[1] holds the text captured by "(.*)". Since a page has only one <head> element, $head[1] is the content we want. Let's print it out.
<?php $url = 'http://www.php.net'; $lines_array = file($url); $lines_string = implode('', $lines_array); if (preg_match('/<head>(.*)<\/head>/is', $lines_string, $head)) { echo $head[1]; } ?>
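The extraction step can be tried without a network connection. The following is a minimal sketch using preg_match(), the PCRE function available in current PHP versions; the helper name extract_head() and the sample HTML are hypothetical.

```php
<?php
// Hypothetical helper: return the text between <head> and </head>, or null
// when the page has no such element.
function extract_head(string $html): ?string
{
    // i: case-insensitive, s: let "." match newlines.
    // "\b[^>]*" tolerates attributes on the opening tag without also
    // matching a <header> tag.
    if (preg_match('/<head\b[^>]*>(.*?)<\/head>/is', $html, $head) === 1) {
        return $head[1]; // $head[0] would be the whole match, tags included
    }
    return null;
}

$sample = "<html>\n<HEAD>\n<title>Demo</title>\n</HEAD>\n<body></body></html>";
echo trim(extract_head($sample)), "\n"; // prints "<title>Demo</title>"
```

Index 1 of the match array is returned because index 0 always contains the full matched text, including the tags.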
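That's all the code for the basic example. The functions below extend the idea: they collect URLs from a series of index pages and extract linked content from each page.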
<?php
// Collect the URLs found on every index page and save them to a file.
// $host_url is prepended to each relative link that is found.
function get_index($save_file, $prefix = "index_", $host_url = "") {
    $count = 68;
    $i = 1;
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open " . $save_file . " failed");
    while ($i < $count) {
        $url = $prefix . $i . ".htm";
        echo "Get " . $url . "...";
        $url_str = get_content_url($host_url, get_url($url));
        echo " OK\n";
        fwrite($fp, $url_str);
        ++$i;
    }
    fclose($fp);
}

// Fetch each saved URL and extract the target multimedia objects.
function get_object($url_file, $save_file, $split = "|--:**:--|") {
    if (!file_exists($url_file)) die($url_file . " does not exist");
    $file_arr = file($url_file);
    if (!is_array($file_arr) || empty($file_arr)) die($url_file . " has no content");
    $url_arr = array_unique($file_arr);
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open save file " . $save_file . " failed");
    foreach ($url_arr as $url) {
        $url = trim($url); // file() keeps the trailing newline on each line
        if (empty($url)) continue;
        echo "Get " . $url . "...";
        $html_str = get_url($url);
        $obj_str = get_content_object($html_str, $split);
        echo " OK\n";
        fwrite($fp, $obj_str);
    }
    fclose($fp);
}

// Walk a directory and extract objects from the content of every file.
function get_dir($save_file, $dir) {
    $dp = opendir($dir) or die("Open dir " . $dir . " failed");
    if (file_exists($save_file)) @unlink($save_file);
    $fp = fopen($save_file, "a+") or die("Open save file " . $save_file . " failed");
    while (($file = readdir($dp)) !== false) {
        if ($file != "." && $file != "..") {
            echo "Read file " . $file . "...";
            $file_content = file_get_contents($dir . $file); // $dir must end with "/"
            $obj_str = get_content_object($file_content);
            echo " OK\n";
            fwrite($fp, $obj_str);
        }
    }
    closedir($dp);
    fclose($fp);
}

// Fetch the content of the given URL.
function get_url($url) {
    $reg = '/^http:\/\/[^\/].+$/';
    if (!preg_match($reg, $url)) die($url . " invalid");
    $fp = fopen($url, "r") or die("Open url: " . $url . " failed.");
    $content = "";
    while ($fc = fread($fp, 8192)) {
        $content .= $fc;
    }
    fclose($fp);
    if (empty($content)) {
        die("Get url: " . $url . " content failed.");
    }
    return $content;
}

// Fetch a page over a raw socket.
function get_content_by_socket($url, $host) {
    $fp = fsockopen($host, 80) or die("Open " . $url . " failed");
    $header = "GET /" . $url . " HTTP/1.1\r\n";
    $header .= "Accept: */*\r\n";
    $header .= "Accept-Language: zh-cn\r\n";
    // No Accept-Encoding header is sent because this function does not
    // decompress gzip/deflate responses.
    $header .= "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; Maxthon; InfoPath.1; .NET CLR 2.0.50727)\r\n";
    $header .= "Host: " . $host . "\r\n";
    $header .= "Connection: Close\r\n\r\n";
    fwrite($fp, $header);
    $contents = "";
    while (!feof($fp)) {
        $contents .= fgets($fp, 8192);
    }
    fclose($fp);
    return $contents;
}

// Extract the URLs contained in the given content.
function get_content_url($host_url, $file_contents) {
    //$reg = '/^(#|javascript.*?|ftp:\/\/.+|http:\/\/.+|.*?href.*?|play.*?|index.*?|.*?asp)+$/i';
    //$reg = '/^(down.*?\.html|\d+_\d+\.htm.*?)$/i';
    $rex = "/([hH][rR][eE][Ff])\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*/i";
    $reg = '/^(down.*?\.html)$/i';
    preg_match_all($rex, $file_contents, $r);
    $result = "";
    // $r[2] holds the captured href values.
    foreach ($r[2] as $d) {
        if (preg_match($reg, $d)) {
            $result .= $host_url . $d . "\n";
        }
    }
    return $result;
}

// Extract the multimedia file link and its label from the given content.
function get_content_object($str, $split = "|--:**:--|") {
    $regx = "/href\s*=\s*['\"]*([^>'\"\s]+)[\"'>]*\s*(.*?<\/b>)/i";
    preg_match_all($regx, $str, $result);
    if (empty($result[1])) return ""; // nothing matched
    // Strip the "多媒体: " ("Multimedia: ") label and spaces from the text.
    $result[2] = str_replace("多媒体: ", "", $result[2]);
    $result[2] = str_replace(" ", "", $result[2]);
    return $result[1][0] . $split . $result[2][0] . "\n";
}
?>
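The link-filtering idea used in get_content_url() can be tried on a small in-memory sample. This is a simplified sketch, not the code above: the helper name extract_links(), the sample HTML, and the base URL http://example.com/ are all assumptions made for illustration.

```php
<?php
// Hypothetical helper: collect href values from $html that match $filter,
// and prefix each with $base_url.
function extract_links(string $html, string $base_url, string $filter): array
{
    // Capture the href value whether it is double-quoted, single-quoted,
    // or unquoted.
    preg_match_all('/href\s*=\s*["\']?([^>"\'\s]+)/i', $html, $matches);
    $links = [];
    foreach (array_unique($matches[1]) as $href) {
        if (preg_match($filter, $href) === 1) {
            $links[] = $base_url . $href;
        }
    }
    return $links;
}

$html = '<a href="down_1.html">One</a> <a HREF=\'about.html\'>About</a> '
      . '<a href=down_2.html>Two</a> <a href="down_1.html">Again</a>';
$links = extract_links($html, 'http://example.com/', '/^down.*?\.html$/i');
print_r($links);
```

Returning an array rather than a newline-joined string leaves it to the caller to decide how the links are stored. For anything beyond simple pages, an HTML parser such as PHP's DOMDocument is usually more robust than regular expressions.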
I hope this article will be helpful to everyone in PHP programming.
