When a web server returns HTTP chunked data, we may want a callback each time a chunk arrives, rather than a single callback after the whole response has been returned, for example when the server is icomet.
The code for using curl in PHP is as follows:
<?php
$url = "http://127.0.0.1:8100/stream";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_WRITEFUNCTION, 'myfunc');
$result = curl_exec($ch);
curl_close($ch);

function myfunc($ch, $data){
    $bytes = strlen($data);
    // process $data
    return $bytes;
}
However, there is a problem here: for a single chunk, the callback function may be called multiple times, each time with about 16 KB of data. This is obviously not what we want. Because each icomet chunk ends with "\n", the callback function can buffer the data and process only complete lines.
function myfunc($ch, $data){
    $bytes = strlen($data);
    static $buf = '';
    $buf .= $data;
    while(1){
        $pos = strpos($buf, "\n");
        if($pos === false){
            break;
        }
        $data = substr($buf, 0, $pos+1);
        $buf = substr($buf, $pos+1);
        // process $data (one complete chunk)
    }
    // curl aborts the transfer unless the callback returns the number of bytes it received
    return $bytes;
}
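The buffering idea can also be written without a static variable. The following is only a sketch: makeLineBufferedWriter is a hypothetical helper name, not part of curl, and it assumes (as described for icomet) that every chunk ends with "\n". The calls at the bottom simulate curl delivering data in arbitrary pieces, so the block runs without a server.

```php
<?php
// Hypothetical helper (an assumption, not a curl API): wraps a per-line
// handler in a callback that could be passed to CURLOPT_WRITEFUNCTION.
function makeLineBufferedWriter(callable $onLine): callable
{
    $buf = '';
    return function ($ch, string $data) use (&$buf, $onLine): int {
        $buf .= $data;
        // Hand over every complete line currently held in the buffer.
        while (($pos = strpos($buf, "\n")) !== false) {
            $onLine(substr($buf, 0, $pos + 1));
            $buf = substr($buf, $pos + 1);
        }
        // curl expects the number of bytes received in this call.
        return strlen($data);
    };
}

// Simulate curl splitting the stream at arbitrary positions.
$lines = [];
$writer = makeLineBufferedWriter(function (string $line) use (&$lines) {
    $lines[] = $line;
});
$writer(null, "first ");
$writer(null, "message\nsecond\nthi");
$writer(null, "rd\n");
print_r($lines);
```

With real curl, the closure would be registered with curl_setopt($ch, CURLOPT_WRITEFUNCTION, $writer), and the first argument would be the curl handle instead of null.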
Next, let's look at using fsockopen in PHP to read chunked data (Transfer-Encoding: chunked).
I ran into a puzzling problem when using fsockopen to read data. The details are as follows:
URL being read: http://blog.maxthon.cn/?feed=rss2
Read code:
<?php
$fp = fsockopen("blog.maxthon.cn", 80, $errno, $errstr, 30);
if (!$fp) {
    echo "$errstr ($errno)<br />\n";
} else {
    $out = "GET /?feed=rss2 HTTP/1.1\r\n";
    $out .= "Host: blog.maxthon.cn\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fp, $out);
    while (!feof($fp)) {
        echo fgets($fp, 128);
    }
    fclose($fp);
}
?>
HTTP content returned:
Date: Mon, 29 Mar 2010 10:16:13 GMT
Server: Apache/2.2.8 (Unix) mod_ssl/2.2.8 OpenSSL/0.9.8b PHP/5.2.6
X-Powered-By: PHP/5.2.6
X-Pingback: http://blog.maxthon.cn/xmlrpc.php
Last-Modified: Wed, 03 Mar 2010 03:13:41 GMT
ETag: "8f16b619f32188bde3bc008a60c2cc11"
Keep-Alive: timeout=15, max=120
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/xml; charset=UTF-8

22de
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
<description><![CDATA[2009年12月31日 1711
.......
1fe8
]]></description>
<content:encoded><![CDATA[<p>2009年12月31日<br /> 1711</p>
Note the four-character values that stand on their own lines in the output above (22de and 1fe8). They turn up at intervals throughout the data, yet data retrieved by other means, such as curl or file_get_contents, does not contain them. When I tried fetching other websites the same way, only a few of them showed this behavior. After many searches turned up no solution, I happened to notice this line in the response headers above: Transfer-Encoding: chunked, and that the usual Content-Length field was missing. Roughly speaking, this header says that the body is transferred in segments.
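To check for this situation in code, the raw response can be split at the first blank line and the headers inspected. The following is a rough sketch under stated assumptions: splitHttpResponse is a hypothetical helper name, the sample response is made up for illustration, and the header check is deliberately simple (it only looks for the word "chunked" on a Transfer-Encoding line).

```php
<?php
// Hypothetical helper: split a raw HTTP response into its header block and
// body, and report whether the headers announce chunked transfer encoding.
function splitHttpResponse(string $raw): array
{
    // Headers and body are separated by the first empty line (CRLF CRLF).
    $parts = explode("\r\n\r\n", $raw, 2);
    $headerBlock = $parts[0];
    $body = $parts[1] ?? '';

    $chunked = false;
    foreach (explode("\r\n", $headerBlock) as $line) {
        // Header names are case-insensitive, hence stripos().
        if (stripos($line, 'Transfer-Encoding:') === 0
            && stripos($line, 'chunked') !== false) {
            $chunked = true;
        }
    }
    return ['headers' => $headerBlock, 'body' => $body, 'chunked' => $chunked];
}

// Made-up sample response, for illustration only.
$raw = "HTTP/1.1 200 OK\r\n"
     . "Transfer-Encoding: chunked\r\n"
     . "Content-Type: text/xml; charset=UTF-8\r\n"
     . "\r\n"
     . "5\r\nhello\r\n0\r\n\r\n";
$response = splitHttpResponse($raw);
var_dump($response['chunked']);
```

If 'chunked' is true, the 'body' element still contains the chunk-size lines and has to be decoded before use.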
Searching for this keyword on Google turns up the following explanation on Wikipedia:
Chunked Transfer Encoding is a mechanism that allows HTTP messages to be split in several parts. This can be applied to both HTTP requests (from client to server) and HTTP responses (from server to client).
For example, let us consider the way in which an HTTP server may transmit data to a client application (usually a web browser). Normally, data delivered in HTTP responses is sent in one piece, whose length is indicated by the Content-Length header field. The length of the data is important, because the client needs to know where the response ends and any following response starts. With chunked encoding, however, the data is broken up into a series of blocks of data and transmitted in one or more "chunks" so that a server may start sending data before it knows the final size of the content that it's sending. Often, the size of these blocks is the same, but this is not always the case.
After understanding the general meaning, let’s look at an example:
Chunked encoding is formed by concatenating several chunks and ends with a chunk whose length is 0. Each chunk has two parts: a header and a body. The header gives the size of the body that follows as a hexadecimal number (the unit is generally not written; the size counts bytes). The body is the actual content of exactly that length. Each of the two parts is terminated by a carriage return and line feed (CRLF). The content that follows the final zero-length chunk is called the footer; it carries additional header fields and can usually be ignored. The chunked encoding format looks like this:
Encoded response content:
HTTP/1.1 200 OK
Content-Type: text/plain
Transfer-Encoding: chunked

25
This is the data in the first chunk

1A
and this is the second one
0

Decoded data:
This is the data in the first chunk
and this is the second one
Here the first chunk is 0x25 = 37 bytes: the 35 characters of its text plus a CRLF that is part of the data. The second chunk is 0x1A = 26 bytes.
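The format described above can be decoded step by step: read a size line, take that many bytes, skip the CRLF after them, and stop at a size of 0. The following is a minimal sketch rather than a complete implementation. decodeChunkedBody is a hypothetical name, the input is assumed to be the body only (headers already removed), anything after the zero-length chunk is ignored, and the sample body is one whose chunk sizes are 0x25 and 0x1A.

```php
<?php
// Hypothetical decoder for a chunked body (headers must already be removed).
function decodeChunkedBody(string $body): string
{
    $out = '';
    $pos = 0;
    while (true) {
        $eol = strpos($body, "\r\n", $pos);
        if ($eol === false) {
            throw new RuntimeException('Missing chunk-size line');
        }
        // Ignore anything after ';' on the size line, then parse the hex size.
        $sizeField = trim(explode(';', substr($body, $pos, $eol - $pos))[0]);
        if ($sizeField === '' || !ctype_xdigit($sizeField)) {
            throw new RuntimeException('Invalid chunk size');
        }
        $size = (int) hexdec($sizeField);
        $pos = $eol + 2;
        if ($size === 0) {
            break; // zero-length chunk: end of data, the footer is ignored
        }
        if ($pos + $size > strlen($body)) {
            throw new RuntimeException('Truncated chunk');
        }
        $out .= substr($body, $pos, $size);
        $pos += $size + 2; // skip the chunk data and the CRLF that follows it
    }
    return $out;
}

// Sample body: a 0x25-byte chunk (text plus CRLF) and a 0x1A-byte chunk.
$body = "25\r\nThis is the data in the first chunk\r\n\r\n"
      . "1A\r\nand this is the second one\r\n"
      . "0\r\n\r\n";
$decoded = decodeChunkedBody($body);
echo $decoded, "\n";
```

Unlike a regular expression, this walks the body strictly by the declared sizes, so data that happens to look like a size line is never misread.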
The situation is clear, so how do we decode this encoded data?
In the comments under the fsockopen function in the official PHP manual, many people have already proposed solutions
Method 1.
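<?php
function unchunk($result) {
    return preg_replace_callback(
        '/(?:(?:\r\n|\n)|^)([0-9A-F]+)(?:\r\n|\n){1,2}(.*?)' .
        '((?:\r\n|\n)(?:[0-9A-F]+(?:\r\n|\n))|$)/si',
        // create_function() was removed in PHP 8.0, so a closure is used instead
        function ($matches) {
            // Strip the size line only when it matches the length of the text that follows
            return hexdec($matches[1]) == strlen($matches[2]) ? $matches[2] : $matches[0];
        },
        $result
    );
}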
Method 2.
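function unchunkHttp11($data) {
    $fp = 0;       // current read offset into $data
    $outData = "";
    while ($fp < strlen($data)) {
        // Read up to and including the next CRLF: this is the chunk-size line
        $rawnum = substr($data, $fp, strpos(substr($data, $fp), "\r\n") + 2);
        $num = hexdec(trim($rawnum));
        $fp += strlen($rawnum);
        // Copy exactly $num bytes of chunk data
        $chunk = substr($data, $fp, $num);
        $outData .= $chunk;
        $fp += strlen($chunk);
        // The CRLF after the data is read as an empty size line (0 bytes) on the next pass
    }
    return $outData;
}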
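Note: the first function can be given the raw HTTP data as returned (including headers), because it leaves anything that does not look like a chunk untouched. The second function parses from the first byte, so it should be given only the body, that is, everything after the blank line that ends the headers.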
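Instead of decoding after the whole response has been collected, the chunks can also be decoded while reading from the socket. The sketch below is an assumption-laden illustration, not the approach from the manual comments: readChunkedStream is a hypothetical name, the stream is assumed to be positioned just after the response headers, and a php://memory stream stands in for the fsockopen handle so that the example runs without a network connection.

```php
<?php
// Hypothetical helper: read a chunked body from a stream that is positioned
// immediately after the HTTP response headers.
function readChunkedStream($fp): string
{
    $out = '';
    while (($line = fgets($fp)) !== false) {
        // The size line holds a hex number, possibly followed by ';...'.
        $field = trim(explode(';', $line)[0]);
        if ($field === '' || !ctype_xdigit($field)) {
            throw new RuntimeException('Invalid chunk size line');
        }
        $size = (int) hexdec($field);
        if ($size === 0) {
            break; // zero-length chunk marks the end of the body
        }
        // fread() may return fewer bytes than requested, so loop until done.
        $chunk = '';
        while (strlen($chunk) < $size) {
            $piece = fread($fp, $size - strlen($chunk));
            if ($piece === false || $piece === '') {
                throw new RuntimeException('Stream ended inside a chunk');
            }
            $chunk .= $piece;
        }
        $out .= $chunk;
        fgets($fp); // consume the CRLF that follows the chunk data
    }
    return $out;
}

// A memory stream stands in for the socket returned by fsockopen().
$fp = fopen('php://memory', 'r+');
fwrite($fp, "5\r\nhello\r\n7\r\n, world\r\n0\r\n\r\n");
rewind($fp);
$decoded = readChunkedStream($fp);
fclose($fp);
echo $decoded, "\n";
```

With a real socket, the header lines would first be read with fgets until an empty line is reached, and the same handle would then be passed to this function.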