A web crawler is an automated data-collection tool that fetches data from the web by simulating user requests, then stores or analyzes it. PHP, a widely used web development language, offers a rich set of tools and techniques for crawler development.
This article shows how to use PHP's fsockopen function to send HTTP requests and thereby build a simple web crawler. fsockopen is PHP's low-level socket function: it opens a network connection over TCP/IP. To make an HTTP request over such a connection, you must follow the HTTP protocol yourself, sending a correctly formatted request line, headers, and body, then reading back the target page's response. The steps below walk through this process.
When establishing a connection with fsockopen, you specify the target server's hostname and port number; a plain TCP connection works for HTTP, while HTTPS requires PHP's ssl:// stream transport. The following is a simple connection example:
$hostname = 'example.com'; // Target server hostname
$port = 80;                // Target server port
$protocol = 'tcp';         // Use the TCP/IP protocol
$handle = fsockopen($protocol . '://' . $hostname, $port, $errno, $errstr);
if (!$handle) {
    echo 'Network connection error: ' . $errstr;
}
In this example, we connect to the host example.com over TCP on port 80. On success, fsockopen returns a socket handle in $handle; on failure it returns false, and the error code and message are available in $errno and $errstr.
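To crawl HTTPS pages, you would connect with the ssl:// transport on port 443 instead. The helper below is an illustrative sketch (crawler_transport is a hypothetical name, not part of the original example) that maps a URL scheme to the transport string and default port that fsockopen expects:

```php
// Illustrative helper (an assumption, not from the original article): map a
// URL scheme to the stream transport and default port for fsockopen.
// For HTTPS, PHP's ssl:// transport performs the TLS handshake for us.
function crawler_transport(string $scheme, string $hostname): array
{
    if ($scheme === 'https') {
        return array('ssl://' . $hostname, 443);
    }
    return array('tcp://' . $hostname, 80);
}

list($remote, $port) = crawler_transport('https', 'example.com');
// $remote is now 'ssl://example.com' and $port is 443, ready for:
// $handle = fsockopen($remote, $port, $errno, $errstr, 10);
```

The trailing 10 in the commented-out fsockopen call is the optional connection timeout in seconds, which is worth setting in a crawler so a slow host cannot stall the whole run.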
After establishing the connection, we send the request in accordance with the HTTP protocol. Specifically, we define the request method, request path, request headers, and request body, and concatenate them into a string that follows the HTTP/1.1 message format. The following is an example of sending an HTTP GET request:
$path = '/';      // Request path
$method = 'GET';  // Request method

// Assemble the request headers
$headers = array(
    'Host: ' . $hostname,
    'Connection: close',
    'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
);

// Assemble the request body
$body = '';

// Build the HTTP request
$request = $method . ' ' . $path . " HTTP/1.1\r\n";
$request .= implode("\r\n", $headers) . "\r\n";
$request .= "\r\n"; // Blank line separates headers from body
$request .= $body;

// Send the request
fwrite($handle, $request);
In this example, the request path is the root directory / and the method is GET. The request headers include Host, Connection, and User-Agent. For simplicity we use a fixed User-Agent here; in real crawling you may want to rotate among several realistic UA strings to avoid being blocked by the server. The request body is empty, as is usual for a GET request. Finally, we concatenate the pieces into a complete HTTP request and send it to the server with fwrite. Note that every header line, and the blank line separating headers from body, must end with CRLF ("\r\n"), as the HTTP specification requires.
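The assembly steps above can be wrapped into a reusable helper. This is a sketch under the same assumptions as the example (build_http_request is a name introduced here for illustration):

```php
// Sketch of a request-builder helper. Header lines and the header/body
// separator must use CRLF ("\r\n"), per the HTTP/1.1 message format.
function build_http_request(string $method, string $path, array $headers, string $body = ''): string
{
    $request = $method . ' ' . $path . " HTTP/1.1\r\n"; // Request line
    $request .= implode("\r\n", $headers) . "\r\n";     // Header lines
    $request .= "\r\n";                                 // Blank line before body
    $request .= $body;
    return $request;
}

$request = build_http_request('GET', '/', array(
    'Host: example.com',
    'Connection: close',
));
// fwrite($handle, $request);
```

Sending "Connection: close" asks the server to close the socket after responding, which lets the read loop below terminate naturally at feof.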
When the target server receives the request, it returns an HTTP response consisting of response headers and a response body. We read the response from the socket handle with PHP's fgets function, then split it into the header and body portions. Here is an example:
// Receive the response
$response = '';
while (!feof($handle)) {
    $response .= fgets($handle);
}

// Close the connection
fclose($handle);

// Parse the response: headers and body are separated by a blank line
list($header, $body) = explode("\r\n\r\n", $response, 2);
$headers = explode("\r\n", $header);
$status = array_shift($headers);
list($version, $code, $reason) = explode(' ', $status, 3);
In this example, we loop over the socket, reading the response line by line into the $response variable until the server closes the connection (feof). We then close the socket. Next, we use explode to split the response into header and body at the first blank line, and extract the status code and reason phrase from the status line. In real development you may also need to parse other response headers, such as Content-Type and Set-Cookie.
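The parsing logic above can be collected into one helper that also folds the remaining "Name: value" lines into an associative array, which makes headers such as Content-Type easy to look up. This is a sketch; parse_http_response is a hypothetical name, not a PHP built-in:

```php
// Sketch of a response parser for a raw HTTP/1.1 response string.
function parse_http_response(string $raw): array
{
    // Headers and body are separated by the first blank line
    list($head, $body) = explode("\r\n\r\n", $raw, 2);
    $lines  = explode("\r\n", $head);
    $status = array_shift($lines); // e.g. "HTTP/1.1 200 OK"
    list($version, $code, $reason) = explode(' ', $status, 3);

    // Fold the remaining "Name: value" lines into an associative array
    $headers = array();
    foreach ($lines as $line) {
        list($name, $value) = explode(':', $line, 2);
        $headers[trim($name)] = trim($value);
    }

    return array(
        'version' => $version,
        'code'    => (int) $code,
        'reason'  => $reason,
        'headers' => $headers,
        'body'    => $body,
    );
}

$parsed = parse_http_response(
    "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
);
// $parsed['code'] is 200, $parsed['body'] is '<html></html>'
```

Note this simple parser assumes the body is not chunked; since the request sends "Connection: close" rather than keep-alive, most servers will reply with a plain, unchunked body, but a production crawler should handle Transfer-Encoding: chunked as well.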
At this point we have implemented a simple HTTP request/response cycle. You can extend the crawler's functionality and performance from here according to your needs, for example by routing requests through a proxy server or adding random delays between requests. At the same time, abide by crawler norms and ethics: do not abuse crawling tools, and do not infringe on websites' legitimate rights and interests or on user privacy.
The above is the detailed content of using fsockopen to implement HTTP requests in a PHP web crawler.