Three ways to fix garbled output when PHP file_get_contents fetches a Gzip-compressed page
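Fetching a web page with file_get_contents() can return garbled content. There are two possible causes: a character-encoding mismatch, or Gzip compression enabled on the target page. This article covers the second case: how to fetch a Gzip-enabled page without garbled output.
If encoding is the cause, converting the fetched content is enough, for example $content = iconv("GBK", "UTF-8//IGNORE", $content); for a GBK-encoded page. What we discuss here is fetching pages served with Gzip. How do you tell? If the headers you get back contain Content-Encoding: gzip, the content is GZIP-compressed. Firebug shows at a glance whether a page has gzip enabled. Below is the header information Firebug shows for my blog, where Gzip is enabled.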
Request headers (raw):
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Encoding: gzip, deflate
Accept-Language: zh-cn,zh;q=0.8,en-us;q=0.5,en;q=0.3
Connection: keep-alive
Cookie: __utma=225240837.787252530.1317310581.1335406161.1335411401.1537; __utmz=225240837.1326850415.887.3.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=%E4%BB%BB%E4%BD%95%E9%A1%B9%E7%9B%AE%E9%83%BD%E4%B8%8D%E4%BC%9A%E9%82%A3%E4%B9%88%E7%AE%80%E5%8D%95%20site%3Awww.nowamagic.net; PHPSESSID=888mj4425p8s0m7s0frre3ovc7; __utmc=225240837; __utmb=225240837.1.10.1335411401
Host: www.nowamagic.net
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:12.0) Gecko/20100101 Firefox/12.0
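The same check can be done in PHP itself rather than in Firebug. The following is a minimal sketch, assuming the response header lines are available as an array of strings (the shape PHP stores in $http_response_header after an HTTP file_get_contents() call). The function name response_is_gzipped and the sample headers are made up for illustration.

```php
<?php
// Hypothetical helper: returns true when any response header line
// declares gzip as a content encoding.
function response_is_gzipped(array $headerLines): bool
{
    foreach ($headerLines as $line) {
        // Header names are case-insensitive, hence the "i" flag.
        if (preg_match('/^Content-Encoding:\s*(.*)$/i', trim($line), $m)) {
            // The value may list several codings, e.g. "gzip, br".
            $codings = array_map('trim', explode(',', strtolower($m[1])));
            if (in_array('gzip', $codings, true) || in_array('x-gzip', $codings, true)) {
                return true;
            }
        }
    }
    return false;
}

// Sample header lines, invented for this demonstration.
$sample = [
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=utf-8',
    'Content-Encoding: gzip',
];
var_dump(response_is_gzipped($sample));
```

If the function returns true for a fetched page, the body needs one of the decompression approaches described next before it is readable.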
Here are some solutions:
1. Use the built-in zlib library
If the zlib extension is installed on the server, the following line solves the problem easily.
$data = file_get_contents("compress.zlib://".$url);
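To see what the compress.zlib:// prefix does without touching the network, here is a small sketch that applies it to a local gzip file. It assumes the zlib extension is loaded; read_via_zlib is a hypothetical helper name, and the temporary file stands in for the remote page.

```php
<?php
// Hypothetical wrapper around the compress.zlib:// prefix.
// Returns the decompressed content, or false if zlib is unavailable
// or the read fails.
function read_via_zlib(string $target)
{
    if (!extension_loaded('zlib')) {
        return false; // the compress.zlib:// wrapper needs the zlib extension
    }
    return file_get_contents('compress.zlib://' . $target);
}

// Offline demonstration: write a gzip-compressed file, then read it back.
$tmp = tempnam(sys_get_temp_dir(), 'gz');
file_put_contents($tmp, gzencode('<html>hello</html>'));
$plain = read_via_zlib($tmp);
unlink($tmp);

echo $plain, "\n";
```

The extension_loaded() check is only a guard for servers where zlib is missing; with zlib present the helper does the same thing as the one-liner.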
2. Use cURL instead of file_get_contents
function curl_get($url, $gzip = false)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);
    if ($gzip) {
        curl_setopt($curl, CURLOPT_ENCODING, "gzip"); // this is the key line
    }
    $content = curl_exec($curl);
    curl_close($curl);
    return $content;
}
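A possible variation on the function above, shown only as a sketch: passing an empty string to CURLOPT_ENCODING asks cURL to advertise every encoding it can decode, so the caller does not have to know in advance whether the page is gzipped. The name curl_get_any_encoding and the error logging are additions for illustration, and the cURL extension is assumed to be installed.

```php
<?php
// Hypothetical variant of curl_get(): accepts any encoding cURL supports
// and logs transport errors instead of failing silently.
function curl_get_any_encoding(string $url, int $timeout = 10)
{
    $curl = curl_init($url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, $timeout);
    // An empty string makes cURL send an Accept-Encoding header listing all
    // encodings it supports and decode the response body automatically.
    curl_setopt($curl, CURLOPT_ENCODING, '');
    $content = curl_exec($curl);
    if ($content === false) {
        error_log('cURL error: ' . curl_error($curl));
    }
    // The handle is released when $curl goes out of scope.
    return $content; // string on success, false on failure
}
```

It is called the same way as curl_get(), for example $html = curl_get_any_encoding($url);, and the result should be compared against false before use.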
3. Use a gzip decompression function
// PHP 5.4 and later ship a built-in gzdecode(), so define this
// fallback only when the function does not exist yet.
if (!function_exists('gzdecode')) {
    function gzdecode($data)
    {
        $len = strlen($data);
        if ($len < 18 || strcmp(substr($data, 0, 2), "\x1f\x8b")) {
            return null; // Not GZIP format (see RFC 1952)
        }
        $method = ord(substr($data, 2, 1)); // Compression method
        $flags  = ord(substr($data, 3, 1)); // Flags
        // Parentheses are required: != binds tighter than & in PHP.
        if (($flags & 31) != $flags) {
            // Reserved bits are set -- NOT ALLOWED by RFC 1952
            return null;
        }
        // NOTE: $mtime may be negative (PHP integer limitations)
        $mtime = unpack("V", substr($data, 4, 4));
        $mtime = $mtime[1];
        $xfl = substr($data, 8, 1); // Extra flags
        $os  = substr($data, 9, 1); // Operating system
        $headerlen = 10;
        $extralen = 0;
        $extra = "";
        if ($flags & 4) {
            // 2-byte length prefixed EXTRA data in header
            if ($len - $headerlen - 2 < 8) {
                return false; // Invalid format
            }
            // The length field follows the fixed 10-byte header.
            $extralen = unpack("v", substr($data, $headerlen, 2));
            $extralen = $extralen[1];
            if ($len - $headerlen - 2 - $extralen < 8) {
                return false; // Invalid format
            }
            $extra = substr($data, $headerlen + 2, $extralen);
            $headerlen += 2 + $extralen;
        }
        $filenamelen = 0;
        $filename = "";
        if ($flags & 8) {
            // C-style string file NAME data in header
            if ($len - $headerlen - 1 < 8) {
                return false; // Invalid format
            }
            // Search for the terminating NUL from the current header offset.
            $filenamelen = strpos(substr($data, $headerlen), chr(0));
            if ($filenamelen === false || $len - $headerlen - $filenamelen - 1 < 8) {
                return false; // Invalid format
            }
            $filename = substr($data, $headerlen, $filenamelen);
            $headerlen += $filenamelen + 1;
        }
        $commentlen = 0;
        $comment = "";
        if ($flags & 16) {
            // C-style string COMMENT data in header
            if ($len - $headerlen - 1 < 8) {
                return false; // Invalid format
            }
            $commentlen = strpos(substr($data, $headerlen), chr(0));
            if ($commentlen === false || $len - $headerlen - $commentlen - 1 < 8) {
                return false; // Invalid header format
            }
            $comment = substr($data, $headerlen, $commentlen);
            $headerlen += $commentlen + 1;
        }
        $headercrc = "";
        if ($flags & 1) {
            // 2 bytes (lowest order) of CRC32 on header present
            if ($len - $headerlen - 2 < 8) {
                return false; // Invalid format
            }
            $calccrc = crc32(substr($data, 0, $headerlen)) & 0xffff;
            $headercrc = unpack("v", substr($data, $headerlen, 2));
            $headercrc = $headercrc[1];
            if ($headercrc != $calccrc) {
                return false; // Bad header CRC
            }
            $headerlen += 2;
        }
        // GZIP FOOTER - These may be negative due to PHP's integer limitations
        $datacrc = unpack("V", substr($data, -8, 4));
        $datacrc = $datacrc[1];
        $isize = unpack("V", substr($data, -4));
        $isize = $isize[1];
        // Perform the decompression:
        $bodylen = $len - $headerlen - 8;
        if ($bodylen < 1) {
            // This should never happen - IMPLEMENTATION BUG!
            return null;
        }
        $body = substr($data, $headerlen, $bodylen);
        $data = "";
        if ($bodylen > 0) {
            switch ($method) {
                case 8:
                    // Currently the only supported compression method:
                    $data = gzinflate($body);
                    break;
                default:
                    // Unknown compression method
                    return false;
            }
        } else {
            // I'm not sure if zero-byte body content is allowed.
            // Allow it for now... Do nothing...
        }
        // Verify decompressed size and CRC32:
        // NOTE: This may fail with large data sizes depending on how
        // PHP's integer limitations affect strlen() since $isize
        // may be negative for large sizes.
        if ($isize != strlen($data) || crc32($data) != $datacrc) {
            // Bad format! Length or CRC doesn't match!
            return false;
        }
        return $data;
    }
}
Usage:
$html = file_get_contents('http://www.jb51.net/');
$html = gzdecode($html);
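Calling gzdecode() on a body that is not actually compressed fails, so it can help to check first. The sketch below is one way to do that, assuming gzdecode() is available (built into PHP since 5.4, or a userland version like the one above). maybe_gunzip is a hypothetical helper name, and the sample strings are generated locally instead of being fetched.

```php
<?php
// Hypothetical helper: decompress only when the body starts with the
// gzip magic bytes 0x1f 0x8b (RFC 1952); otherwise return it unchanged.
function maybe_gunzip(string $body): string
{
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body; // not gzip, nothing to do
    }
    // Suppress the warning gzdecode() raises on corrupt input.
    $decoded = @gzdecode($body);
    // Fall back to the original bytes if decompression failed.
    return is_string($decoded) ? $decoded : $body;
}

$compressed = gzencode('<p>hello gzip</p>');
echo maybe_gunzip($compressed), "\n";     // decompressed markup
echo maybe_gunzip('<p>plain</p>'), "\n";  // passed through unchanged
```

With a helper like this, the usage above becomes $html = maybe_gunzip(file_get_contents($url)); and works whether or not the server compressed the response.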
Those are the three methods; between them they should resolve most garbled-output problems caused by gzip when fetching pages.
