I want to crawl the follow information of Zhihu users, for example to see who user A is following, by fetching the followee list from the page www.zhihu.com/people/XXX/followees, but I get a 403 error while crawling.
1. The crawler is only used to collect the user information I am interested in; it is not for commercial or any other purposes.
2. I am using PHP, with curl to build the requests and simple_html_dom to parse the documents.
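3. A user's followee list (Followees) loads additional entries dynamically through Ajax, so I wanted to crawl the data directly from that interface. With Firebug I found that the endpoint is http://www.zhihu.com/node/ProfileFolloweesListV2 and that the POST data contains _xsrf, method, and params. While in a simulated logged-in state, I submitted a request to this URL with the required POST parameters, but it returns 403.
4. However, with the same simulated login I can parse data that does not need Ajax, such as the upvote and thanks counts.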
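5. I set the request headers to match the ones the browser submits, but I still get the 403 error.
6. I tried printing the request headers that curl sends so I could compare them with the browser's, but could not find the right way to do it (from a Baidu search, curl_getinfo() appears to be what outputs this information).
7. A 403 is often caused by a missing User-Agent or X-Requested-With header, but as described in 5, I have set all the request headers.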
8. If the description is unclear and the code is needed, I can post it.
9. This crawler is part of my graduation project, so I need this data for the next steps. As stated in 1, the crawl is purely for academic research.
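For the header comparison described in items 5 and 6, one possible approach is to dump both sets of headers into dictionaries and diff them. In PHP, the headers curl actually sent can be read back with `curl_getinfo($ch, CURLINFO_HEADER_OUT)` after enabling the `CURLINFO_HEADER_OUT` option. The sketch below is in Python; `diff_headers` is a hypothetical helper, not part of any library, and the header values are illustrative only.

```python
def diff_headers(browser, client):
    """Return headers that are missing from, or differ in, the client request.

    Header names are compared case-insensitively, since HTTP header names
    are not case-sensitive. Each result maps the browser's header name to
    a (browser_value, client_value) pair; client_value is None if missing.
    """
    client_lower = {name.lower(): value for name, value in client.items()}
    diff = {}
    for name, value in browser.items():
        sent = client_lower.get(name.lower())
        if sent != value:
            diff[name] = (value, sent)
    return diff


# Illustrative values only: copy the real ones from the browser's network panel.
browser_headers = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "http://www.zhihu.com/people/XXX/followees",
}
client_headers = {
    "user-agent": "Mozilla/5.0",
    "Referer": "http://www.zhihu.com",
}

for name, (expected, sent) in diff_headers(browser_headers, client_headers).items():
    print(name, "browser:", expected, "client:", sent)
```

A non-empty result points at the headers worth checking first. An empty result suggests the difference lies elsewhere, for example in the cookies or in the `_xsrf` value of the POST body.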
Reply:
<code class="language-python3">
# encoding=utf8
# Python 3 version of the script; requires the requests and beautifulsoup4 packages.
import json

import requests
from bs4 import BeautifulSoup

Default_Header = {'X-Requested-With': 'XMLHttpRequest',
                  'Referer': 'http://www.zhihu.com',
                  'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; '
                                'rv:39.0) Gecko/20100101 Firefox/39.0',
                  'Host': 'www.zhihu.com'}
_session = requests.session()
_session.headers.update(Default_Header)

# One user id per line.
resourceFile = open('/root/Desktop/UserId.text', 'r')
resourceLines = resourceFile.readlines()
# Note: every result is written to resultFollowerFile, whatever followType is.
resultFollowerFile = open('/root/Desktop/userIdFollowees.text', 'a+')
resultFolloweeFile = open('/root/Desktop/userIdFollowers.text', 'a+')

BASE_URL = 'https://www.zhihu.com/'
CAPTURE_URL = BASE_URL + 'captcha.gif?r=1466595391805&type=login'
PHONE_LOGIN = BASE_URL + 'login/phone_num'


def login():
    '''Log in to Zhihu.'''
    username = ''  # user name
    password = ''  # password; this logs in by phone number, change the login URL to log in by email
    # The original fetched the captcha with urllib2.urlopen; the session is used
    # here so that the captcha cookie is shared with the login request.
    cap_content = _session.get(CAPTURE_URL).content
    with open('/root/Desktop/cap.gif', 'wb') as cap_file:
        cap_file.write(cap_content)
    captcha = input('capture:')
    data = {"phone_num": username, "password": password, "captcha": captcha}
    r = _session.post(PHONE_LOGIN, data)
    print(r.json()['msg'])


def readFollowerNumbers(followerId, followType):
    '''Read a user's followees or followers, depending on followType.'''
    print(followerId)
    personUrl = 'https://www.zhihu.com/people/' + followerId.strip('\n')
    xsrf = getXsrf()
    hash_id = getHashId(personUrl)
    headers = dict(Default_Header)
    headers['Referer'] = personUrl + '/follow' + followType
    followerUrl = 'https://www.zhihu.com/node/ProfileFollow' + followType + 'ListV2'
    params = {"offset": 0, "order_by": "created", "hash_id": hash_id}
    # params is sent as a JSON string inside the form data.
    data = {"method": "next", "params": json.dumps(params), '_xsrf': xsrf}
    signIndex = 20
    offset = 0
    # A full page holds 20 entries; a different count means the last page was reached.
    while signIndex == 20:
        params['offset'] = offset
        data['params'] = json.dumps(params)
        followerUrlJSON = _session.post(followerUrl, data=data, headers=headers)
        followerHtml = followerUrlJSON.json()['msg']
        signIndex = len(followerHtml)
        offset = offset + signIndex
        for everHtml in followerHtml:
            everHtmlSoup = BeautifulSoup(everHtml, 'html.parser')
            personId = everHtmlSoup.a['href']
            resultFollowerFile.write(personId + '\n')
            print(personId)


def getXsrf():
    '''Get the _xsrf token; it belongs to the logged-in user.'''
    soup = BeautifulSoup(_session.get(BASE_URL).content, 'html.parser')
    _xsrf = soup.find('input', attrs={'name': '_xsrf'})['value']
    return _xsrf


def getHashId(personUrl):
    '''Get the hash_id of the user being crawled, not of the logged-in user.'''
    soup = BeautifulSoup(_session.get(personUrl).content, 'html.parser')
    hashIdText = soup.find('script', attrs={'data-name': 'current_people'})
    return json.loads(hashIdText.text)[3]


def main():
    login()
    followType = input('Choose what to crawl: 0 = who the user follows, anything else = who follows the user: ')
    # input() returns a string in Python 3, so compare against '0'.
    followType = 'ees' if followType.strip() == '0' else 'ers'
    for followerId in resourceLines:
        try:
            readFollowerNumbers(followerId, followType)
            resultFollowerFile.flush()
        except Exception as e:
            # Skip users that fail, but report why instead of failing silently.
            print('skipped', followerId.strip(), e)


if __name__ == '__main__':
    main()
</code>
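The two details in the script above that are easiest to get wrong are the shape of the POST body (`params` is itself a JSON string nested inside ordinary form fields) and the stop condition of the paging loop. The following is a minimal offline sketch of just those two parts, using only the standard library. `build_payload`, `crawl_all`, `fake_fetch`, and the fake data are hypothetical names introduced for illustration, and the page size of 20 is taken from the script's loop, not from any Zhihu documentation.

```python
import json
from urllib.parse import urlencode, parse_qs


def build_payload(hash_id, offset, xsrf):
    """Form fields for one 'load more' request; params is a JSON string."""
    params = {"offset": offset, "order_by": "created", "hash_id": hash_id}
    return {"method": "next", "params": json.dumps(params), "_xsrf": xsrf}


def crawl_all(fetch_page, hash_id, xsrf, page_size=20):
    """Collect entries page by page until a page is not full.

    fetch_page is a hypothetical callable that takes the form fields and
    returns the list found under 'msg' in the JSON response.
    """
    items, offset = [], 0
    while True:
        page = fetch_page(build_payload(hash_id, offset, xsrf))
        items.extend(page)
        offset += len(page)
        if len(page) != page_size:
            return items


# Stand-in for the server: 45 fake entries served 20 at a time.
FAKE_ENTRIES = ["/people/user%d" % i for i in range(45)]


def fake_fetch(form):
    offset = json.loads(form["params"])["offset"]
    return FAKE_ENTRIES[offset:offset + 20]


collected = crawl_all(fake_fetch, "example-hash-id", "example-token")
print(len(collected))

# What actually goes over the wire is URL-encoded form data.
encoded = urlencode(build_payload("example-hash-id", 20, "example-token"))
print(encoded)
```

Passing the fetch function in as an argument keeps the paging logic testable without a network connection. In a real run it would wrap the `_session.post(...)` call from the script.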