A web crawler is a program that automatically fetches and parses information from the Internet, and it is one of the main tools for data collection and information processing. In the Internet age, data is an extremely valuable asset, and both businesses and individuals benefit from obtaining information from target websites quickly and accurately. A web crawler does this far more efficiently than collecting the data by hand.
PHP is well suited to writing web crawlers: its network programming features, such as the curl function library, and its rich set of open source libraries cover fetching, parsing, and storing data. This article describes in detail how to develop a web crawler program in PHP.
1. Basic principles of crawler programs
A web crawler works by fetching the source code of web pages over a network protocol (usually HTTP), parsing the information according to specific rules, and finally storing the required data in a database or in a file. The general process is as follows:
1. Send a request to the target URL and obtain the web page source code
2. Parse the information in the source code, such as links, text, pictures, etc.
3. Store the required information in a database or in files
4. Repeat the above steps until the crawling task is completed
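The four steps above can be sketched as a small loop. This is a minimal illustration rather than a complete crawler: the function name crawl() and its $fetch, $extract, and $store callbacks are hypothetical names introduced here so that the loop can be shown without network or database access.

```php
<?php
/**
 * Minimal crawl loop (illustrative sketch; all names here are hypothetical).
 *
 * @param string[] $seeds    URLs to start from
 * @param callable $fetch    step 1: takes a URL, returns the page source or null on failure
 * @param callable $extract  step 2: takes the page source, returns the links found in it
 * @param callable $store    step 3: takes the URL and the page source, saves what is needed
 * @param int      $maxPages upper bound so that the loop always terminates
 * @return string[] URLs that were fetched, in order
 */
function crawl(array $seeds, callable $fetch, callable $extract, callable $store, int $maxPages = 100): array
{
    $queue = $seeds;
    $seen = array_fill_keys($seeds, true); // URLs already queued, to avoid loops
    $visited = [];

    // Step 4: repeat until the queue is empty or the page limit is reached
    while ($queue !== [] && count($visited) < $maxPages) {
        $url = array_shift($queue);
        $html = $fetch($url);
        if ($html === null) {
            continue; // skip pages that could not be fetched
        }
        $store($url, $html);
        $visited[] = $url;
        foreach ($extract($html) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $visited;
}

// Usage with an in-memory "site" standing in for real HTTP requests:
// each page body is simply a space-separated list of the pages it links to.
$site = ['a' => 'b c', 'b' => 'a c', 'c' => ''];
$stored = [];
$visited = crawl(
    ['a'],
    function (string $url) use ($site): ?string {
        return $site[$url] ?? null;
    },
    function (string $html): array {
        return preg_split('/\s+/', $html, -1, PREG_SPLIT_NO_EMPTY);
    },
    function (string $url, string $html) use (&$stored): void {
        $stored[$url] = $html;
    }
);
print_r($visited); // a, b, c
```

Passing the fetch, parse, and store steps in as callbacks is only a design choice for this sketch; it keeps the loop itself independent of curl, DOMDocument, and the database.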
The core of a crawler is the parser, whose task is to extract the required information from the fetched page source. Parsing is usually implemented with regular expressions or with the parsing functions provided by a library or framework. Regular expressions are flexible, but they become complex and error-prone on real-world HTML; library parsers are easier to use, but have limitations of their own.
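To make the trade-off concrete, the sketch below extracts links from the same small HTML fragment in both ways. The fragment and the function names linksByRegex() and linksByDom() are made up for illustration, and the regular expression is deliberately simple: it only handles double-quoted href attributes.

```php
<?php
// Hypothetical fragment: the second link uses single quotes around its attributes
$html = '<ul>'
      . '<li><a href="/news/1">First story</a></li>'
      . "<li><a class='hot' href='/news/2'>Second <b>story</b></a></li>"
      . '</ul>';

/** Extract href => text pairs with a regular expression. */
function linksByRegex(string $html): array
{
    $links = [];
    preg_match_all('/<a\s[^>]*href="([^"]*)"[^>]*>(.*?)<\/a>/is', $html, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $links[$m[1]] = trim(strip_tags($m[2]));
    }
    return $links;
}

/** Extract href => text pairs with the DOM parser. */
function linksByDom(string $html): array
{
    $previous = libxml_use_internal_errors(true); // do not print parser warnings
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $links[$a->getAttribute('href')] = trim($a->textContent);
    }
    return $links;
}

print_r(linksByRegex($html)); // only /news/1: the single-quoted link is missed
print_r(linksByDom($html));   // /news/1 and /news/2
```

The regular expression could be extended to cover single quotes, but every such extension makes it longer and harder to verify, which is the error-proneness described above.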
2. Practical development of web crawler program
This article takes the development of a simple web crawler program as an example to introduce its development process.
Before developing a web crawler program, you first need to decide which website to crawl and which information to collect. This article uses the popular recommendations on Sina News as an example. The requirement is: crawl the titles and links of the recommended popular news on the Sina News homepage and store them in a database.
In PHP, you can use the curl function library to get the web page source code. The following code demonstrates how to use the curl function library to obtain the web page source code of Sina News homepage.
<?php
$url = 'http://news.sina.com.cn/';

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
// Return the response body from curl_exec() instead of printing it
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch); // false if the request fails
curl_close($ch);

echo $html;
?>
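The above code uses the curl function library to send a request to the Sina News homepage and obtain its source code. Setting CURLOPT_RETURNTRANSFER to true with curl_setopt() makes curl_exec() return the page as a string instead of printing it directly. This option does not set the Referer header; if a Referer is needed, it has to be set explicitly with CURLOPT_REFERER. Note that curl_exec() returns false when the request fails.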
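Because a request can fail or time out, it helps to check the result of curl_exec() before using it. The following is a sketch of a more defensive fetch function; the name fetchPage(), the timeout values, the redirect limit, and the User-Agent string are assumptions chosen for illustration and are not part of the original example.

```php
<?php
/**
 * Fetch a page and return its body, or throw on failure.
 * (Hypothetical helper; timeouts and User-Agent are example values.)
 */
function fetchPage(string $url, int $timeoutSeconds = 10): string
{
    $ch = curl_init();
    curl_setopt_array($ch, [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,  // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // follow HTTP redirects
        CURLOPT_MAXREDIRS      => 5,
        CURLOPT_CONNECTTIMEOUT => 5,     // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds, // seconds allowed for the whole request
        CURLOPT_USERAGENT      => 'ExampleCrawler/0.1',
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }

    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($status >= 400) {
        throw new RuntimeException("Unexpected HTTP status {$status} for {$url}");
    }
    return $body;
}

// Usage (not executed here because it needs network access):
// try {
//     $html = fetchPage('http://news.sina.com.cn/');
// } catch (RuntimeException $e) {
//     echo $e->getMessage(), PHP_EOL;
// }
```

Throwing an exception instead of returning false is a design choice of this sketch; it prevents a failed request from being passed on to the parser unnoticed.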
After obtaining the source code of the web page, you need to parse the information in it to extract the required data. In PHP, this can be achieved using regular expressions or parsing functions provided by the framework. The code below demonstrates how to extract news headlines and links using PHP's built-in DOMDocument class.
<?php
$url = 'http://news.sina.com.cn/';

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

if ($html === false || $html === '') {
    exit('Failed to fetch ' . $url . PHP_EOL);
}

// Parse the HTML with the DOMDocument class
libxml_use_internal_errors(true); // collect warnings about malformed HTML instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$news_list = $xpath->query('//div[@class="blk12"]/h2/a');

foreach ($news_list as $news) {
    $title = trim($news->nodeValue);
    $link = $news->getAttribute('href');
    echo $title . ' ' . $link . PHP_EOL;
}
?>
In the above code, //div[@class="blk12"]/h2/a is an XPath expression that selects every a element that is a child of an h2 element directly inside a div whose class attribute is exactly "blk12". The program uses a foreach loop to traverse the matched a elements, reading the nodeValue property for the link text and calling the getAttribute() method for the href attribute value.
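One caveat: @class="blk12" matches only when the class attribute is exactly that string, so a div written as class="blk12 clearfix" would be skipped. A common XPath 1.0 idiom for matching one class among several is sketched below. The HTML fragment and the helper name hasClass() are invented for the example and do not reflect the real Sina page.

```php
<?php
// Invented fragment: the second div carries an extra class, the third a similar-looking one
$html = '<div class="blk12"><h2><a href="/a">Story A</a></h2></div>'
      . '<div class="blk12 clearfix"><h2><a href="/b">Story B</a></h2></div>'
      . '<div class="blk123"><h2><a href="/c">Story C</a></h2></div>';

/**
 * Build an XPath predicate that is true when the class list contains $class.
 * Intended for trusted literal class names only (no quoting is performed).
 */
function hasClass(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' {$class} ')";
}

libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$exact = $xpath->query('//div[@class="blk12"]/h2/a');
$byToken = $xpath->query('//div[' . hasClass('blk12') . ']/h2/a');

echo $exact->length, PHP_EOL;   // 1: only the div whose class is exactly "blk12"
echo $byToken->length, PHP_EOL; // 2: also matches "blk12 clearfix", but not "blk123"
```

Padding the class list with spaces on both sides is what keeps "blk123" from matching as a substring.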
After obtaining the crawled information, it needs to be stored in the database. This article uses the MySQL database as an example. The code below demonstrates how to store scraped news titles and links into a MySQL database.
<?php
// Connect to the database (example credentials; replace with your own)
$host = 'localhost';
$user = 'root';
$password = 'root';
$database = 'test';
$charset = 'utf8mb4';
$dsn = "mysql:host={$host};dbname={$database};charset={$charset}";
$pdo = new PDO($dsn, $user, $password);

// Get the popular recommended news titles and links from the Sina News homepage
$url = 'http://news.sina.com.cn/';
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

// Parse the HTML with the DOMDocument class
libxml_use_internal_errors(true); // collect warnings about malformed HTML instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$news_list = $xpath->query('//div[@class="blk12"]/h2/a');

// Insert into the database
$sql = "INSERT INTO news(title, link) VALUES(:title, :link)";
$stmt = $pdo->prepare($sql);
foreach ($news_list as $news) {
    $title = trim($news->nodeValue);
    $link = $news->getAttribute('href');
    $stmt->bindParam(':title', $title);
    $stmt->bindParam(':link', $link);
    $stmt->execute();
}
?>
In the above code, PDO is used to connect to the MySQL database, and the news titles and links are written to a table named news, which must already exist with title and link columns. The program uses PDO's prepare() and bindParam() methods so that the values are sent as bound parameters rather than concatenated into the SQL string, which prevents SQL injection attacks.
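Since the example relies on the news table already existing, the sketch below shows one way the table might be created and filled. The column sizes, the choice of link as primary key, and the function name storeNews() are assumptions made for illustration. The function accepts any PDO connection; the usage example uses an in-memory SQLite database (which requires the pdo_sqlite extension) only so that it can run without a MySQL server.

```php
<?php
/**
 * Create the news table if needed and insert rows, skipping links already stored.
 * (Hypothetical helper; the schema is an assumption, not taken from the article.)
 *
 * @param array $rows list of ['title' => string, 'link' => string]
 * @return int number of newly inserted rows
 */
function storeNews(PDO $pdo, array $rows): int
{
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $pdo->exec(
        'CREATE TABLE IF NOT EXISTS news (
            link  VARCHAR(512) NOT NULL PRIMARY KEY,
            title VARCHAR(255) NOT NULL
        )'
    );

    $exists = $pdo->prepare('SELECT COUNT(*) FROM news WHERE link = :link');
    $insert = $pdo->prepare('INSERT INTO news (title, link) VALUES (:title, :link)');

    $inserted = 0;
    $pdo->beginTransaction();
    try {
        foreach ($rows as $row) {
            $exists->execute([':link' => $row['link']]);
            $alreadyStored = (int) $exists->fetchColumn() > 0;
            $exists->closeCursor();
            if ($alreadyStored) {
                continue;
            }
            $insert->execute([':title' => $row['title'], ':link' => $row['link']]);
            $inserted++;
        }
        $pdo->commit();
    } catch (Throwable $e) {
        $pdo->rollBack(); // leave the table unchanged if any insert fails
        throw $e;
    }
    return $inserted;
}

// Usage with an in-memory SQLite database standing in for MySQL
$pdo = new PDO('sqlite::memory:');
$rows = [
    ['title' => 'Story A', 'link' => '/a'],
    ['title' => 'Story B', 'link' => '/b'],
    ['title' => 'Story A again', 'link' => '/a'], // duplicate link, skipped
];
echo storeNews($pdo, $rows), PHP_EOL; // 2
```

Passing the values as an array to execute() is an alternative to calling bindParam() for each placeholder; both send the values as bound parameters. On MySQL setups with a small index key length limit, the link column may need to be shorter for the primary key to be accepted.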
The MySQL storage script shown earlier already combines all three steps (fetching the page, parsing it, and writing the results to the database), so it doubles as the complete code of this simple web crawler program.
3. Summary
Developing a web crawler draws on several technologies, including network programming, information parsing, and data storage. PHP covers all three with the tools used in this article: the curl function library for requests, DOMDocument and DOMXPath for parsing, and PDO for storage.
In actual development, a web crawler must take legal compliance, data privacy, and anti-crawler mechanisms into account, and developers should only crawl within the limits of the law. Setting a reasonable request rate, and where appropriate varying HTTP request headers or using proxy IPs, reduces the chance of being blocked by anti-crawler mechanisms.
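As one illustration of controlling the request rate, the sketch below enforces a minimum interval between requests. The class name Throttle and the interval used in the example are assumptions; a real crawler would tune the interval to the target site.

```php
<?php
/** Enforces a minimum interval between consecutive requests (illustrative sketch). */
final class Throttle
{
    private float $minInterval;           // seconds between requests
    private ?float $lastRequestAt = null; // time of the previous request, if any

    public function __construct(float $minIntervalSeconds)
    {
        $this->minInterval = $minIntervalSeconds;
    }

    /** Seconds still to wait at time $now before the next request is allowed. */
    public function secondsToWait(float $now): float
    {
        if ($this->lastRequestAt === null) {
            return 0.0; // the first request never waits
        }
        return max(0.0, $this->lastRequestAt + $this->minInterval - $now);
    }

    /** Sleep if necessary, then record the time of the request about to be made. */
    public function wait(): void
    {
        $delay = $this->secondsToWait(microtime(true));
        if ($delay > 0) {
            usleep((int) round($delay * 1000000));
        }
        $this->lastRequestAt = microtime(true);
    }
}

// Usage: call wait() before every request
$throttle = new Throttle(0.2);
$start = microtime(true);
foreach (['/page1', '/page2', '/page3'] as $path) {
    $throttle->wait();
    // the HTTP request for $path would be sent here
}
$elapsed = microtime(true) - $start;
printf("3 requests spread over %.2f seconds\n", $elapsed);
```

Keeping the waiting-time calculation in secondsToWait() separate from the actual sleep makes the timing logic easy to check without waiting.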
When developing a web crawler program, consider its actual requirements and feasibility, and choose technologies and strategies accordingly. The example code in this article is only a simple implementation; a more complete crawler requires further study of the topics above.