Why can't file_get_contents fetch http://fitness.39.net/food?

WBOY
Release: 2016-06-06 20:46:30

Simply running echo file_get_contents('http://fitness.39.net/food/');
produces:

<code>Warning: file_get_contents(http://fitness.39.net/food/) [function.file-get-contents]: failed to open stream: HTTP request failed!
</code>

I suspected the server was checking the browser User-Agent, so I set the following in php.ini:

<code>allow_url_fopen = On
user_agent = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
</code>

After restarting Apache the request still failed, with the same warning:

<code>Warning: file_get_contents(http://fitness.39.net/food/) [function.file-get-contents]: failed to open stream: HTTP request failed!
</code>

Any help from the experts would be appreciated.

Replies:

I found the problem. Note up front that I tested this with Node.js rather than PHP.

First attempt

I started with the following code:

<code class="lang-javascript">var spidex = require("spidex");

spidex.get("http://fitness.39.net/food/", function(html, status, respHeader) {
    console.log(html);
}, "utf8").on("error", function(err) {
    console.log(err.message);
});
</code>

The request failed with a connection error.

Hypothesis

Next I used Chrome to inspect the request headers sent during a normal visit and tried them one by one, starting with connection:

<code class="lang-javascript">var spidex = require("spidex");

var headers = {
    "connection"    : "keep-alive"
};

spidex.get("http://fitness.39.net/food/", function(html, status, respHeader) {
    console.log(html);
}, headers, "utf8").on("error", function(err) {
    console.log(err.message);
});
</code>

That still produced a connection error, until I added the accept header:

<code class="lang-javascript">var spidex = require("spidex");

var headers = {
    "connection"    : "keep-alive",
    "accept"        : "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
};

spidex.get("http://fitness.39.net/food/", function(html, status, respHeader) {
    console.log(html);
}, headers, "utf8").on("error", function(err) {
    console.log(err.message);
});
</code>

This time the page content came back.
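
The header-by-header trial above can also be automated. The following is only a sketch, assuming Node's built-in http module instead of spidex; the helper names cumulativeHeaderSets, probe and main are hypothetical, and the user-agent value is an illustrative placeholder rather than one taken from the original test.

```javascript
// Sketch: add browser headers one at a time to see which one the server needs.
const http = require("http");

// Candidate headers copied from a browser session.
// The user-agent value is an illustrative placeholder.
const candidates = [
    ["connection", "keep-alive"],
    ["accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"],
    ["user-agent", "Mozilla/5.0"]
];

// Build cumulative header sets: {a}, then {a, b}, then {a, b, c}, ...
function cumulativeHeaderSets(pairs) {
    const sets = [];
    const current = {};
    for (const [name, value] of pairs) {
        current[name] = value;
        sets.push(Object.assign({}, current));
    }
    return sets;
}

// Resolve with the HTTP status code, or with the error message
// when the connection itself fails.
function probe(url, headers) {
    return new Promise(function (resolve) {
        http.get(url, { headers: headers }, function (res) {
            res.resume(); // discard the body; only the status matters here
            resolve(res.statusCode);
        }).on("error", function (err) {
            resolve(err.message);
        });
    });
}

// Try each header set in turn and print what came back.
async function main(url) {
    for (const headers of cumulativeHeaderSets(candidates)) {
        const result = await probe(url, headers);
        console.log(Object.keys(headers).join(", "), "=>", result);
    }
}

// Only runs when a URL is passed on the command line.
const target = process.argv[2];
if (target) {
    main(target);
}
```

Saved under a name of your choice, for example probe.js, it could be run as node probe.js http://fitness.39.net/food/. The first header set that returns a status code instead of an error message points at the header the server appears to require.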

Conclusion

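My guess is that the server validates the accept header or something similar. In any case, adding an accept field to the request headers with the value text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 is enough to make the request succeed.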
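
For readers without spidex installed, the same fix can be sketched with Node's built-in http module. This is a minimal illustration, not the code used in the test above: buildOptions and fetchPage are hypothetical helper names, and decoding the body as UTF-8 is an assumption carried over from the "utf8" argument in the earlier snippets.

```javascript
// Sketch: send the accept header using only Node's built-in http module.
const http = require("http");

// The accept value reported to work in the answer above.
const ACCEPT = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";

// Hypothetical helper: merge the accept header with any extra headers.
function buildOptions(extraHeaders) {
    return { headers: Object.assign({ accept: ACCEPT }, extraHeaders || {}) };
}

// Hypothetical helper: fetch a page and call back with (err, body, statusCode).
function fetchPage(url, callback) {
    http.get(url, buildOptions({ connection: "keep-alive" }), function (res) {
        const chunks = [];
        res.on("data", function (chunk) {
            chunks.push(chunk);
        });
        res.on("end", function () {
            // Assumes the page can be decoded as UTF-8.
            callback(null, Buffer.concat(chunks).toString("utf8"), res.statusCode);
        });
    }).on("error", function (err) {
        callback(err);
    });
}

// Only runs when a URL is passed on the command line.
const target = process.argv[2];
if (target) {
    fetchPage(target, function (err, body, status) {
        if (err) {
            console.log(err.message);
            return;
        }
        console.log(status);
        console.log(body);
    });
}
```

On the PHP side, the same header can be supplied to file_get_contents through a stream context built with stream_context_create, using the http wrapper's header option. Whether that alone resolves the original warning has not been verified here.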
