对知乎内容使用爬虫爬取数据,为什么会遇到403问题?

WBOY
Freigeben: 2016-08-17 10:01:25
Original
4229 Leute haben es durchsucht

我想抓取知乎上用户的关注信息,如查看A关注了哪些人,通过www.zhihu.com/people/XXX/followees这个页面来获得followee的列表,但是在抓取中遇到了403问题。
1.爬虫仅仅是为了搜集用户关注信息,用于学术研究,绝非商业或其他目的
2.使用PHP,利用curl构造请求,使用simple_html_dom来解析文档
3.在用户的关注者(Followees)列表,应该是使用Ajax进行动态加载更多的followees,于是我想直接爬接口的数据,通过firebug查看到,加载更多的关注者似乎是通过zhihu.com/node/ProfileF 进行的,并且post的数据有_xsrf,method,parmas,于是我在模拟保持登录的情况下,对这个链接提交请求,并带有post过去的所需要的参数,但是返回的是403。
4.但是我同样模拟登录的情况下,可以解析到如赞同数、感谢数这些不需要Ajax的数据
5.我使用curl_setopt($ch, CURLOPT_HTTPHEADER, $header );来设置请求头,使其与我在浏览器中提交的请求的请求头一致,但是这样任然导致403错误
6.我尝试打印出curl的请求头与浏览器发出的请求头进行比较,但是没有找到正确的方式(百度出的curl_getinfo()似乎打印出的相应报文)
7.有许多人曾因为没有设置User-Agent或者X-Requested-With遭遇403,但是我在5中描述设置请求头时都设置了
8.如果叙述不详需要贴出代码,我可以贴出代码
9.这个爬虫是我毕设的一部分,需要获取数据来进行接下来的工作,如1所说,爬取数据纯粹是为了学术研究

回复内容:

如果带有防火墙功能的服务器,连续抓取可能被干掉,除非你有很多代理服务器。或者最简单用adsl不断重新拨号更换ip 你先找个浏览器,研究一下request的HTTP Header再来抓 这两天刚好做了一个抓取用户的关注着和追随者的的爬虫在抓数据,使用的是Python。这里给你一段python的代码,你可以对着代码看一下你的代码问题。
403应该就是请求的时候一些数据发错了,下面的代码中涉及到一个打开的文本,文本中的内容是用户的id,文本里面的内容样式我截了图放在最后面。
<code class="language-python3"><span class="c">#encoding=utf8</span>
<span class="kn">import</span> <span class="nn">urllib2</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">requests</span>
<span class="kn">from</span> <span class="nn">bs4</span> <span class="k">import</span> <span class="n">BeautifulSoup</span>

<span class="n">Default_Header</span> <span class="o">=</span> <span class="p">{</span><span class="s">'X-Requested-With'</span><span class="p">:</span> <span class="s">'XMLHttpRequest'</span><span class="p">,</span>
                  <span class="s">'Referer'</span><span class="p">:</span> <span class="s">'http://www.zhihu.com'</span><span class="p">,</span>
                  <span class="s">'User-Agent'</span><span class="p">:</span> <span class="s">'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; '</span>
                                <span class="s">'rv:39.0) Gecko/20100101 Firefox/39.0'</span><span class="p">,</span>
                  <span class="s">'Host'</span><span class="p">:</span> <span class="s">'www.zhihu.com'</span><span class="p">}</span>
<span class="n">_session</span> <span class="o">=</span> <span class="n">requests</span><span class="o">.</span><span class="n">session</span><span class="p">()</span>
<span class="n">_session</span><span class="o">.</span><span class="n">headers</span><span class="o">.</span><span class="n">update</span><span class="p">(</span><span class="n">Default_Header</span><span class="p">)</span> 
<span class="n">resourceFile</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">'/root/Desktop/UserId.text'</span><span class="p">,</span><span class="s">'r'</span><span class="p">)</span>
<span class="n">resourceLines</span> <span class="o">=</span> <span class="n">resourceFile</span><span class="o">.</span><span class="n">readlines</span><span class="p">()</span>
<span class="n">resultFollowerFile</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">'/root/Desktop/userIdFollowees.text'</span><span class="p">,</span><span class="s">'a+'</span><span class="p">)</span>
<span class="n">resultFolloweeFile</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">'/root/Desktop/userIdFollowers.text'</span><span class="p">,</span><span class="s">'a+'</span><span class="p">)</span>

<span class="n">BASE_URL</span> <span class="o">=</span> <span class="s">'https://www.zhihu.com/'</span>
<span class="n">CAPTURE_URL</span> <span class="o">=</span> <span class="n">BASE_URL</span><span class="o">+</span><span class="s">'captcha.gif?r=1466595391805&type=login'</span>
<span class="n">PHONE_LOGIN</span> <span class="o">=</span> <span class="n">BASE_URL</span> <span class="o">+</span> <span class="s">'login/phone_num'</span>

<span class="k">def</span> <span class="nf">login</span><span class="p">():</span>
    <span class="sd">'''登录知乎'''</span>
    <span class="n">username</span> <span class="o">=</span> <span class="s">''</span><span class="c">#用户名</span>
    <span class="n">password</span> <span class="o">=</span> <span class="s">''</span><span class="c">#密码,注意我这里用的是手机号登录,用邮箱登录需要改一下下面登录地址</span>
    <span class="n">cap_content</span> <span class="o">=</span> <span class="n">urllib2</span><span class="o">.</span><span class="n">urlopen</span><span class="p">(</span><span class="n">CAPTURE_URL</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
    <span class="n">cap_file</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">'/root/Desktop/cap.gif'</span><span class="p">,</span><span class="s">'wb'</span><span class="p">)</span>
    <span class="n">cap_file</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">cap_content</span><span class="p">)</span>
    <span class="n">cap_file</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
    <span class="n">captcha</span> <span class="o">=</span> <span class="n">raw_input</span><span class="p">(</span><span class="s">'capture:'</span><span class="p">)</span>
    <span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="s">"phone_num"</span><span class="p">:</span><span class="n">username</span><span class="p">,</span><span class="s">"password"</span><span class="p">:</span><span class="n">password</span><span class="p">,</span><span class="s">"captcha"</span><span class="p">:</span><span class="n">captcha</span><span class="p">}</span>
    <span class="n">r</span> <span class="o">=</span> <span class="n">_session</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="n">PHONE_LOGIN</span><span class="p">,</span> <span class="n">data</span><span class="p">)</span>
    <span class="nb">print</span> <span class="p">(</span><span class="n">r</span><span class="o">.</span><span class="n">json</span><span class="p">())[</span><span class="s">'msg'</span><span class="p">]</span>
    
<span class="k">def</span> <span class="nf">readFollowerNumbers</span><span class="p">(</span><span class="n">followerId</span><span class="p">,</span><span class="n">followType</span><span class="p">):</span>
    <span class="sd">'''读取每一位用户的关注者和追随者,根据type进行判断'''</span>
    <span class="nb">print</span> <span class="n">followerId</span>
    <span class="n">personUrl</span> <span class="o">=</span> <span class="s">'https://www.zhihu.com/people/'</span> <span class="o">+</span> <span class="n">followerId</span><span class="o">.</span><span class="n">strip</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
    <span class="n">xsrf</span> <span class="o">=</span><span class="n">getXsrf</span><span class="p">()</span>
    <span class="n">hash_id</span> <span class="o">=</span> <span class="n">getHashId</span><span class="p">(</span><span class="n">personUrl</span><span class="p">)</span>
    <span class="n">headers</span> <span class="o">=</span> <span class="nb">dict</span><span class="p">(</span><span class="n">Default_Header</span><span class="p">)</span>
    <span class="n">headers</span><span class="p">[</span><span class="s">'Referer'</span><span class="p">]</span><span class="o">=</span> <span class="n">personUrl</span> <span class="o">+</span> <span class="s">'/follow'</span><span class="o">+</span><span class="n">followType</span>
    <span class="n">followerUrl</span> <span class="o">=</span> <span class="s">'https://www.zhihu.com/node/ProfileFollow'</span><span class="o">+</span><span class="n">followType</span><span class="o">+</span><span class="s">'ListV2'</span>
    <span class="n">params</span> <span class="o">=</span> <span class="p">{</span><span class="s">"offset"</span><span class="p">:</span><span class="mi">0</span><span class="p">,</span><span class="s">"order_by"</span><span class="p">:</span><span class="s">"created"</span><span class="p">,</span><span class="s">"hash_id"</span><span class="p">:</span><span class="n">hash_id</span><span class="p">}</span>
    <span class="n">params_encode</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">params</span><span class="p">)</span>
    <span class="n">data</span> <span class="o">=</span> <span class="p">{</span><span class="s">"method"</span><span class="p">:</span><span class="s">"next"</span><span class="p">,</span><span class="s">"params"</span><span class="p">:</span><span class="n">params_encode</span><span class="p">,</span><span class="s">'_xsrf'</span><span class="p">:</span><span class="n">xsrf</span><span class="p">}</span>
    
    <span class="n">signIndex</span> <span class="o">=</span> <span class="mi">20</span>
    <span class="n">offset</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">while</span> <span class="n">signIndex</span> <span class="o">==</span> <span class="mi">20</span><span class="p">:</span>
        <span class="n">params</span><span class="p">[</span><span class="s">'offset'</span><span class="p">]</span> <span class="o">=</span> <span class="n">offset</span>
        <span class="n">data</span><span class="p">[</span><span class="s">'params'</span><span class="p">]</span> <span class="o">=</span> <span class="n">json</span><span class="o">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">params</span><span class="p">)</span>
        <span class="n">followerUrlJSON</span> <span class="o">=</span> <span class="n">_session</span><span class="o">.</span><span class="n">post</span><span class="p">(</span><span class="n">followerUrl</span><span class="p">,</span><span class="n">data</span><span class="o">=</span><span class="n">data</span><span class="p">,</span><span class="n">headers</span> <span class="o">=</span> <span class="n">headers</span><span class="p">)</span>
        <span class="n">signIndex</span> <span class="o">=</span> <span class="nb">len</span><span class="p">((</span><span class="n">followerUrlJSON</span><span class="o">.</span><span class="n">json</span><span class="p">())[</span><span class="s">'msg'</span><span class="p">])</span>
        <span class="n">offset</span> <span class="o">=</span> <span class="n">offset</span> <span class="o">+</span> <span class="n">signIndex</span>
        <span class="n">followerHtml</span> <span class="o">=</span>  <span class="p">(</span><span class="n">followerUrlJSON</span><span class="o">.</span><span class="n">json</span><span class="p">())[</span><span class="s">'msg'</span><span class="p">]</span>
        <span class="k">for</span> <span class="n">everHtml</span> <span class="ow">in</span> <span class="n">followerHtml</span><span class="p">:</span>
            <span class="n">everHtmlSoup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">everHtml</span><span class="p">)</span>
            <span class="n">personId</span> <span class="o">=</span>  <span class="n">everHtmlSoup</span><span class="o">.</span><span class="n">a</span><span class="p">[</span><span class="s">'href'</span><span class="p">]</span>
            <span class="n">resultFollowerFile</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">personId</span><span class="o">+</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">)</span>
            <span class="nb">print</span> <span class="n">personId</span>
            
    
<span class="k">def</span> <span class="nf">getXsrf</span><span class="p">():</span>
    <span class="sd">'''获取用户的xsrf这个是当前用户的'''</span>
    <span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">_session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">BASE_URL</span><span class="p">)</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>
    <span class="n">_xsrf</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'input'</span><span class="p">,</span><span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">'name'</span><span class="p">:</span><span class="s">'_xsrf'</span><span class="p">})[</span><span class="s">'value'</span><span class="p">]</span>
    <span class="k">return</span> <span class="n">_xsrf</span>
    
<span class="k">def</span> <span class="nf">getHashId</span><span class="p">(</span><span class="n">personUrl</span><span class="p">):</span>
    <span class="sd">'''这个是需要抓取的用户的hashid,不是当前登录用户的hashid'''</span>
    <span class="n">soup</span> <span class="o">=</span> <span class="n">BeautifulSoup</span><span class="p">(</span><span class="n">_session</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="n">personUrl</span><span class="p">)</span><span class="o">.</span><span class="n">content</span><span class="p">)</span>
    <span class="n">hashIdText</span> <span class="o">=</span> <span class="n">soup</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'script'</span><span class="p">,</span> <span class="n">attrs</span><span class="o">=</span><span class="p">{</span><span class="s">'data-name'</span><span class="p">:</span> <span class="s">'current_people'</span><span class="p">})</span>
    <span class="k">return</span> <span class="n">json</span><span class="o">.</span><span class="n">loads</span><span class="p">(</span><span class="n">hashIdText</span><span class="o">.</span><span class="n">text</span><span class="p">)[</span><span class="mi">3</span><span class="p">]</span>

<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="n">login</span><span class="p">()</span>
    <span class="n">followType</span> <span class="o">=</span> <span class="nb">input</span><span class="p">(</span><span class="s">'请配置抓取类别:0-抓取关注了谁 其它-被哪些人关注'</span><span class="p">)</span>
    <span class="n">followType</span> <span class="o">=</span> <span class="s">'ees'</span> <span class="k">if</span> <span class="n">followType</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">else</span> <span class="s">'ers'</span>
    <span class="k">for</span> <span class="n">followerId</span> <span class="ow">in</span> <span class="n">resourceLines</span><span class="p">:</span>
        <span class="k">try</span><span class="p">:</span>
            <span class="n">readFollowerNumbers</span><span class="p">(</span><span class="n">followerId</span><span class="p">,</span><span class="n">followType</span><span class="p">)</span>
            <span class="n">resultFollowerFile</span><span class="o">.</span><span class="n">flush</span><span class="p">()</span>
        <span class="k">except</span><span class="p">:</span>
            <span class="k">pass</span>
   
<span class="k">if</span> <span class="n">__name__</span><span class="o">==</span><span class="s">'__main__'</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code>
Nach dem Login kopieren
无非就是那些, useragent,referer,token,cookie 觉得可能会是 2 个原因造成的:
  1. 没带 cookies
  2. _xsrf 或 hash_id 错误
这个问题我来回答下吧,知乎在“_xsrf”这个字段搞了个小动作,并不是首页页面取到的那个_xsrf 的值,而是在登录成功后通过cookie返回的那个“_xsrf ”的值,所以你需要获取正确的这个值,不然一直会报403错误(我是在Post提问时发现的,相信你遇到的问题类似,直接上代码):

///
/// 知乎提问
///

/// 提问标题
/// 详细内容
/// 登录后获取的cookie
public void ZhiHuFaTie(string question_title,string question_detail,CookieContainer cookie)
{
question_title=“提问内容”;
question_detail=“问题详细描述”;

//遍历cookie,获取_xsrf 的值
var list = GetAllCookies(cookie);
foreach (var item in list)
{
if (item.Name == "_xsrf")
{
xsrf = item.Value;
break;
}
}
//发帖
var FaTiePostUrl = "zhihu.com/question/add";
var dd = topicStr.ToCharArray();
var FaTiePostStr = "question_title=" + HttpUtility.UrlEncode(question_title) + "&question_detail=" + HttpUtility.UrlEncode(question_detail) + "&anon=0&topic_ids=" + topicId + "&new_topics=&_xsrf="+xsrf;
var FaTieResult = nhp.PostResultHtml(FaTiePostUrl, cookie, "http://www.zhihu.com/", FaTiePostStr);
}


///
/// 遍历CookieContainer
///

///
///
public static List GetAllCookies(CookieContainer cc)
{
List lstCookies = new List();

Hashtable table = (Hashtable)cc.GetType().InvokeMember("m_domainTable",
System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.GetField |
System.Reflection.BindingFlags.Instance, null, cc, new object[] { });

foreach (object pathList in table.Values)
{
SortedList lstCookieCol = (SortedList)pathList.GetType().InvokeMember("m_list",
System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.GetField
| System.Reflection.BindingFlags.Instance, null, pathList, new object[] { });
foreach (CookieCollection colCookies in lstCookieCol.Values)
foreach (Cookie c in colCookies) lstCookies.Add(c);
}
return lstCookies;
} 修改header的X-Forwarded-For字段伪装ip 真的是很巧,昨天晚上刚刚遇到了这个问题。原因可能有有很多,我只说自己遇到的,仅供参考,提供一种思路。我爬取的是新浪微博,使用了代理。出现403是因为访问时网站拒绝,我在浏览器上操作也是一样,随便看里面几个网页就会出现403,不过刷新几次就好了。在代码中实现就是多请求几次。 看了楼上的答案,瞬间被镇住了。大牛真多,不过我建议题主去问问李开复好了~哈哈 话说接口是怎么抓到的...为何我用firebug抓不到接口..chrome的network也抓不到接口
话说直接请求followees也可以直接获取到,剩下的也就是正则了
Verwandte Etiketten:
Quelle:php.cn
Erklärung dieser Website
Der Inhalt dieses Artikels wird freiwillig von Internetnutzern beigesteuert und das Urheberrecht liegt beim ursprünglichen Autor. Diese Website übernimmt keine entsprechende rechtliche Verantwortung. Wenn Sie Inhalte finden, bei denen der Verdacht eines Plagiats oder einer Rechtsverletzung besteht, wenden Sie sich bitte an admin@php.cn
Beliebte Tutorials
Mehr>
Neueste Downloads
Mehr>
Web-Effekte
Quellcode der Website
Website-Materialien
Frontend-Vorlage
Über uns Haftungsausschluss Sitemap
Chinesische PHP-Website:Online-PHP-Schulung für das Gemeinwohl,Helfen Sie PHP-Lernenden, sich schnell weiterzuentwickeln!