With the development of the Internet, the amount of information online has exploded, and news makes up a large share of it. To get the latest and most valuable news faster, people usually browse news websites, but the amount of news one person can read each day is limited, so an efficient tool for collecting news automatically is useful. This article shares practical experience of crawling Sina News with a crawler program written in PHP.
1. Basic knowledge of crawlers
A crawler is an automated program that simulates a browser making requests, parses the returned pages, and extracts and saves or downloads the required information. Common crawler programming languages include Python, Java, and JavaScript. This article uses PHP because it is well suited to web development and ships with powerful HTTP request functions and DOM parsing libraries, which make fetching pages and extracting information straightforward.
2. Write a crawler program
1. Determine the target website
Before writing a crawler program, you first need to decide which website to crawl. This article targets the Sina News website, so we first need to understand its page structure and how its data is laid out.
2. Simulate the browser to make a request
To obtain the data of the target website, you need to simulate a browser making a request to it. In PHP, we can use the cURL function library for this. For example:
$url = 'http://news.sina.com.cn/';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_ENCODING, '');
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 3);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$html = curl_exec($ch);
curl_close($ch);
This code uses cURL to issue a GET request to the Sina News homepage. Several of the options are worth noting: CURLOPT_RETURNTRANSFER tells cURL to return the response as a string instead of printing it directly to output; CURLOPT_USERAGENT sets a browser User-Agent so the target site is less likely to treat the request as coming from a crawler; CURLOPT_FOLLOWLOCATION makes cURL follow redirects automatically so the complete page source can be obtained.
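The cURL calls above can be wrapped in a small reusable helper so that every page fetch shares the same options and error handling. The following is a minimal sketch; the function name fetchPage and the HTTP-status check are our own additions, not part of the original article:

```php
<?php
// Reusable wrapper around the cURL request shown above.
// Returns the page body as a string, or false on any failure.
function fetchPage(string $url, int $timeout = 10)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_ENCODING       => '',
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
            . 'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS      => 3,
        CURLOPT_TIMEOUT        => $timeout,
    ]);
    $html = curl_exec($ch);
    // Also treat HTTP error statuses (4xx/5xx) as failures.
    $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
    curl_close($ch);
    if ($html === false || $status >= 400) {
        return false;
    }
    return $html;
}
```

Centralizing the options this way also makes it easy to change the User-Agent or timeout in one place when the crawler grows.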
3. Parse page data
After obtaining the page source code, we need to parse it and extract the required information. Parsing can be divided into two steps: first, convert the HTML code into a DOM object, which reflects the hierarchical structure of the document; then filter the required information out of the DOM object according to the corresponding selection rules.
In PHP, we can use the DOMDocument class (and, for XML, the SimpleXMLElement class) to parse HTML and XML. For example, the following code snippet extracts the news title, link, and summary from the Sina News homepage:
// Create the DOM object
$dom = new DOMDocument();
// Tolerate the imperfect HTML found on real-world pages
libxml_use_internal_errors(true);
// Convert the HTML code into a DOM object
$dom->loadHTML($html);
libxml_clear_errors();
// Get the news list container
$newsList = $dom->getElementById('syncad_1');
// Traverse the news list and extract information
$data = [];
foreach ($newsList->getElementsByTagName('li') as $item) {
    // Extract the link
    $linkNode = $item->getElementsByTagName('a')->item(0);
    $link = $linkNode->getAttribute('href');
    // Extract the title
    $titleNode = $linkNode->getElementsByTagName('span')->item(0);
    $title = $titleNode->nodeValue;
    // Extract the summary
    $summaryNode = $item->getElementsByTagName('p')->item(0);
    $summary = $summaryNode->nodeValue;
    // Save the data into an array
    $data[] = [
        'title' => $title,
        'link' => $link,
        'summary' => $summary,
    ];
}
In the above code example, we first use the getElementById method to locate the news list container, then use the getElementsByTagName method to filter out the li elements and traverse them to extract the required information, using the getAttribute method for attribute values and the nodeValue property for text content.
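As an alternative to chained getElementsByTagName calls, PHP's DOMXPath class lets you select nodes with a single query. The sketch below is self-contained: it uses a small inline HTML fragment standing in for the real Sina page (the id syncad_1 mirrors the one used above; the markup itself is our simplified assumption):

```php
<?php
// Parse a small HTML fragment with DOMDocument and query it with XPath.
$html = '<ul id="syncad_1">'
      . '<li><a href="http://news.sina.com.cn/a.html"><span>Title A</span></a><p>Summary A</p></li>'
      . '<li><a href="http://news.sina.com.cn/b.html"><span>Title B</span></a><p>Summary B</p></li>'
      . '</ul>';

$dom = new DOMDocument();
libxml_use_internal_errors(true);   // tolerate imperfect real-world HTML
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);
$data = [];
// One XPath query replaces the nested getElementsByTagName calls.
foreach ($xpath->query('//ul[@id="syncad_1"]/li') as $li) {
    $a = $xpath->query('.//a', $li)->item(0);
    $p = $xpath->query('.//p', $li)->item(0);
    $data[] = [
        'title'   => trim($xpath->query('.//span', $a)->item(0)->nodeValue),
        'link'    => $a->getAttribute('href'),
        'summary' => trim($p->nodeValue),
    ];
}
```

XPath queries tend to survive small page-layout changes better than hard-coded tag traversal, which matters when the target site is redesigned.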
4. Save data
After extracting the required information, we need to save it to a local file or database for subsequent use. This article saves the data to a MySQL database, using PHP's built-in PDO extension to connect to and operate on it. The following is a code example for saving the data to MySQL:
// Connect to the database
$dsn = 'mysql:host=127.0.0.1;dbname=news;charset=utf8';
$username = 'root';
$password = '123456';
$options = [
    PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
];
$pdo = new PDO($dsn, $username, $password, $options);
// Save the data to the database
$stmt = $pdo->prepare('INSERT INTO news (title, link, summary) VALUES (:title, :link, :summary)');
foreach ($data as $item) {
    $stmt->bindParam(':title', $item['title']);
    $stmt->bindParam(':link', $item['link']);
    $stmt->bindParam(':summary', $item['summary']);
    $stmt->execute();
}
The code above assumes a table named news already exists to hold the title, link, and summary of each item. It then uses the PDO function library to connect to MySQL, prepare the INSERT statement, bind parameters with the bindParam method, and execute the SQL statement once per row. Using prepared statements with bound parameters also protects against SQL injection.
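For completeness, a schema that the INSERT above could run against might look like the following; the column types, lengths, and the added id and created_at columns are our assumptions, since the original article does not show the table definition:

```sql
CREATE TABLE news (
    id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title      VARCHAR(255) NOT NULL,
    link       VARCHAR(512) NOT NULL,
    summary    TEXT,
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
```

A unique index on link would additionally prevent the same article from being stored twice across repeated crawl runs.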
3. Summary
This article introduced how to write a crawler program in PHP, using the Sina News website as an example. The example code covered the main steps: choosing the target, issuing a request, parsing the data, and saving the data. In practice, you may also need to consider anti-crawling measures on the website, data cleaning, concurrent crawling, and similar issues; these are more advanced crawler techniques worth studying further.
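As a small taste of the anti-crawling concerns mentioned above, a polite crawler usually waits between requests and retries transient failures rather than hammering the site. A minimal sketch, where the helper name fetchWithRetry and the delay values are our own choices:

```php
<?php
// Retry a fetch-like callable up to $maxAttempts times, pausing between tries.
// $fn should return false on failure and anything else on success.
function fetchWithRetry(callable $fn, int $maxAttempts = 3, int $delaySeconds = 1)
{
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $result = $fn();
        if ($result !== false) {
            return $result;
        }
        if ($attempt < $maxAttempts) {
            sleep($delaySeconds);   // back off before the next attempt
        }
    }
    return false;   // all attempts failed
}
```

In a real crawler, $fn would wrap the cURL request from earlier; throttling like this reduces load on the target site and the chance of being blocked.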
The above is the detailed content of Practical PHP crawler for crawling Sina News. For more information, please follow other related articles on the PHP Chinese website!