Why do I get a 403 error when using a crawler to scrape data from Zhihu?

WBOY
Release: 2016-08-17 10:01:25

I want to crawl users' follow information on Zhihu, for example to see who user A follows, and to get the list of followees from the page www.zhihu.com/people/XXX/followees, but I run into a 403 error while crawling.
1. The crawler only collects users' follow information, for academic research; it is not for commercial or any other purpose.
2. I use PHP, construct the requests with curl, and parse the documents with simple_html_dom.
3. The followees list on a user's page loads more entries dynamically via Ajax, so I want to crawl the interface data directly. Through Firebug I can see that loading more followees seems to go through http://www.zhihu.com/node/ProfileFolloweesListV2, and the POST data includes _xsrf, method, and params. So, while simulating a logged-in session, I submitted a request to this URL with the required POST parameters, but it returns 403.
4. When I simulate login the same way, I can parse data that does not require Ajax, such as the number of likes and thanks.
5. I use curl_setopt($ch, CURLOPT_HTTPHEADER, $header); to set the request headers so that they match the headers my browser sends, but this still results in a 403 error (a sketch of this request pattern follows this list).
6. I tried to print out curl's request headers to compare them with the ones the browser sends, but did not find the right way to do it (according to Baidu, curl_getinfo() seems to print the corresponding information).
7. Many people have hit 403 because User-Agent or X-Requested-With was not set, but I do set both when setting the request headers as described in 5.
8. If the description is unclear and you need the code, I can post it.
9. This crawler is part of my graduation project and I need the data for the next stage of the work. As stated in 1, the crawling is purely for academic research.
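A minimal sketch (not the actual project code) of the request pattern described in points 3 and 5, in PHP with curl. The endpoint and the POST fields (method, params, _xsrf) come from the question; the cookie-file path, the hash_id, and the _xsrf value are placeholders:

<code class="language-php"><?php
// Sketch only - placeholder values, not the real project code.
$xsrf    = 'XSRF_VALUE_OF_LOGGED_IN_SESSION';   // placeholder
$hash_id = 'HASH_ID_OF_TARGET_USER';            // placeholder

$post = http_build_query(array(
    'method' => 'next',
    'params' => json_encode(array('offset' => 0, 'order_by' => 'created', 'hash_id' => $hash_id)),
    '_xsrf'  => $xsrf,
));

$ch = curl_init('http://www.zhihu.com/node/ProfileFolloweesListV2');
curl_setopt_array($ch, array(
    CURLOPT_POST           => true,
    CURLOPT_POSTFIELDS     => $post,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_COOKIEFILE     => '/tmp/zhihu_cookies.txt',   // cookie jar saved when logging in (placeholder path)
    CURLOPT_COOKIEJAR      => '/tmp/zhihu_cookies.txt',
    CURLOPT_HTTPHEADER     => array(
        'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:39.0) Gecko/20100101 Firefox/39.0',
        'X-Requested-With: XMLHttpRequest',
        'Referer: http://www.zhihu.com/people/XXX/followees',
    ),
));
$response = curl_exec($ch);
curl_close($ch);
echo $response;   // JSON whose "msg" field holds one HTML fragment per followee
</code>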
Reply content:

If the server has firewall-style protection, continuous crawling may get you blocked unless you have many proxy servers (a curl proxy sketch follows the code below); the simplest alternative is to use ADSL to redial constantly and change your IP address. First study the HTTP headers of the request in a browser, then crawl accordingly. In the past couple of days I happened to write a crawler that captures users' followees and followers, in Python. Here is the Python code; compare it with yours to find the problem.

A 403 usually means some data in the request was sent incorrectly. The code below opens a text file whose lines are user IDs; I put a screenshot of the file's format at the end.
<code class="language-python">#encoding=utf8
import urllib2
import json
import requests
from bs4 import BeautifulSoup

Default_Header = {'X-Requested-With': 'XMLHttpRequest',
                  'Referer': 'http://www.zhihu.com',
                  'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; '
                                'rv:39.0) Gecko/20100101 Firefox/39.0',
                  'Host': 'www.zhihu.com'}
_session = requests.session()
_session.headers.update(Default_Header)
resourceFile = open('/root/Desktop/UserId.text', 'r')   # one user id per line
resourceLines = resourceFile.readlines()
resultFollowerFile = open('/root/Desktop/userIdFollowees.text', 'a+')
resultFolloweeFile = open('/root/Desktop/userIdFollowers.text', 'a+')

BASE_URL = 'https://www.zhihu.com/'
CAPTURE_URL = BASE_URL + 'captcha.gif?r=1466595391805&type=login'
PHONE_LOGIN = BASE_URL + 'login/phone_num'

def login():
    '''Log in to Zhihu.'''
    username = ''  # username
    password = ''  # password; note this logs in by phone number - for email login, change the login URL below
    cap_content = urllib2.urlopen(CAPTURE_URL).read()
    cap_file = open('/root/Desktop/cap.gif', 'wb')
    cap_file.write(cap_content)
    cap_file.close()
    captcha = raw_input('captcha:')
    data = {"phone_num": username, "password": password, "captcha": captcha}
    r = _session.post(PHONE_LOGIN, data)
    print (r.json())['msg']

def readFollowerNumbers(followerId, followType):
    '''Read one user's followees or followers, depending on followType.'''
    print followerId
    personUrl = 'https://www.zhihu.com/people/' + followerId.strip('\n')
    xsrf = getXsrf()
    hash_id = getHashId(personUrl)
    headers = dict(Default_Header)
    headers['Referer'] = personUrl + '/follow' + followType
    followerUrl = 'https://www.zhihu.com/node/ProfileFollow' + followType + 'ListV2'
    params = {"offset": 0, "order_by": "created", "hash_id": hash_id}
    params_encode = json.dumps(params)
    data = {"method": "next", "params": params_encode, '_xsrf': xsrf}

    signIndex = 20
    offset = 0
    while signIndex == 20:  # each page returns at most 20 entries; fewer means the last page
        params['offset'] = offset
        data['params'] = json.dumps(params)
        followerUrlJSON = _session.post(followerUrl, data=data, headers=headers)
        signIndex = len((followerUrlJSON.json())['msg'])
        offset = offset + signIndex
        followerHtml = (followerUrlJSON.json())['msg']
        for everHtml in followerHtml:
            everHtmlSoup = BeautifulSoup(everHtml)
            personId = everHtmlSoup.a['href']
            resultFollowerFile.write(personId + '\n')
            print personId

def getXsrf():
    '''Get the _xsrf token of the currently logged-in user.'''
    soup = BeautifulSoup(_session.get(BASE_URL).content)
    _xsrf = soup.find('input', attrs={'name': '_xsrf'})['value']
    return _xsrf

def getHashId(personUrl):
    '''Get the hash_id of the user being crawled, not of the logged-in user.'''
    soup = BeautifulSoup(_session.get(personUrl).content)
    hashIdText = soup.find('script', attrs={'data-name': 'current_people'})
    return json.loads(hashIdText.text)[3]

def main():
    login()
    followType = input('Crawl type: 0 - who the user follows; anything else - who follows the user')
    followType = 'ees' if followType == 0 else 'ers'
    for followerId in resourceLines:
        try:
            readFollowerNumbers(followerId, followType)
            resultFollowerFile.flush()
        except:
            pass

if __name__ == '__main__':
    main()
</code>
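As for the proxy suggestion at the start of this reply: with PHP curl (which the question uses), routing requests through a pool of proxies mostly comes down to setting CURLOPT_PROXY per request. A minimal sketch, with made-up proxy addresses:

<code class="language-php"><?php
// Hypothetical proxy pool - replace with your own proxies.
$proxies = array('10.0.0.1:8080', '10.0.0.2:8080', '10.0.0.3:8080');

function fetch_via_proxy($url, $proxies) {
    // Pick a proxy at random for each request to spread requests across IPs.
    $proxy = $proxies[array_rand($proxies)];
    $ch = curl_init($url);
    curl_setopt_array($ch, array(
        CURLOPT_PROXY          => $proxy,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 10,
    ));
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}
</code>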
It's nothing more than the usual suspects: User-Agent, Referer, token, cookie. I think it is probably caused by one of two things (a sketch of how to fetch both values follows this list):
  1. No cookies are being sent
  2. The _xsrf or hash_id is wrong
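For point 2, you need the correct _xsrf and hash_id. A rough PHP counterpart of the getXsrf()/getHashId() helpers in the Python code above; it only assumes you pass in HTML you have already fetched with your logged-in cookies (the regexes are illustrative, not taken from any reply):

<code class="language-php"><?php
// Rough PHP equivalents of getXsrf()/getHashId() from the Python answer above.
function get_xsrf($homepage_html) {
    // The _xsrf token sits in a hidden <input name="_xsrf" value="..."> on the page.
    if (preg_match('/name="_xsrf" value="([^"]+)"/', $homepage_html, $m)) {
        return $m[1];
    }
    return null;
}

function get_hash_id($profile_html) {
    // The profile page embeds a <script data-name="current_people"> tag whose JSON
    // array contains the hash_id of that profile (index 3, as in getHashId above).
    if (preg_match('/data-name="current_people"[^>]*>(.*?)<\/script>/s', $profile_html, $m)) {
        $arr = json_decode($m[1], true);
        return $arr ? $arr[3] : null;
    }
    return null;
}
</code>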
Let me answer this question. Zhihu plays a small trick with the _xsrf field: the value you need is not the _xsrf scraped from the home page, but the _xsrf returned in the cookie after a successful login. You have to use that value, otherwise a 403 error will keep being reported. (I found this out when posting a question via the API; I believe you will hit a similar problem, so here is the code directly.)

<code class="language-csharp">/// <summary>
/// Post a question to Zhihu
/// </summary>
/// <param name="question_title">Question title</param>
/// <param name="question_detail">Question detail content</param>
/// <param name="cookie">Cookie obtained after logging in</param>
public void ZhiHuFaTie(string question_title, string question_detail, CookieContainer cookie)
{
    question_title = "Question Content";
    question_detail = "Question Detailed Description";

    // Traverse the cookies and get the value of _xsrf
    // (xsrf, topicStr, topicId and nhp are class-level fields in the original code)
    var list = GetAllCookies(cookie);
    foreach (var item in list)
    {
        if (item.Name == "_xsrf")
        {
            xsrf = item.Value;
            break;
        }
    }
    // Post the question
    var FaTiePostUrl = "zhihu.com/question/add";
    var dd = topicStr.ToCharArray();
    var FaTiePostStr = "question_title=" + HttpUtility.UrlEncode(question_title)
                     + "&question_detail=" + HttpUtility.UrlEncode(question_detail)
                     + "&anon=0&topic_ids=" + topicId + "&new_topics=&_xsrf=" + xsrf;
    var FaTieResult = nhp.PostResultHtml(FaTiePostUrl, cookie, "http://www.zhihu.com/", FaTiePostStr);
}

/// <summary>
/// Traverse a CookieContainer and return all cookies it holds
/// </summary>
public static List<Cookie> GetAllCookies(CookieContainer cc)
{
    List<Cookie> lstCookies = new List<Cookie>();

    Hashtable table = (Hashtable)cc.GetType().InvokeMember("m_domainTable",
        System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.GetField |
        System.Reflection.BindingFlags.Instance, null, cc, new object[] { });

    foreach (object pathList in table.Values)
    {
        SortedList lstCookieCol = (SortedList)pathList.GetType().InvokeMember("m_list",
            System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.GetField
            | System.Reflection.BindingFlags.Instance, null, pathList, new object[] { });
        foreach (CookieCollection colCookies in lstCookieCol.Values)
            foreach (Cookie c in colCookies) lstCookies.Add(c);
    }
    return lstCookies;
}
</code>
Modify the X-Forwarded-For field of the request header to disguise your IP address (see the sketch at the end of these replies).

What a coincidence, I ran into this exact problem just last night. There can be many reasons; I will only describe the one I hit, for reference and as an idea. I was crawling Sina Weibo through a proxy, and the 403 appeared because the site rejected the access. The same thing happens in the browser: if I just open a few pages I get a 403, but it goes away after refreshing a few times. In code, the equivalent is simply to retry the request several times.

After reading the answers above I was instantly stunned. There are so many talented people here, but I suggest you ask Kai-Fu Lee ~ Haha

Let me ask: how did you capture that interface in the first place? I can't catch it with Firebug, and Chrome's Network panel doesn't show it either.
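Two of the replies above suggest, respectively, forging the X-Forwarded-For header and simply retrying when a 403 comes back. A minimal sketch of both ideas with PHP curl (the forged IP, retry count, and back-off are arbitrary examples; neither trick is guaranteed to work against Zhihu):

<code class="language-php"><?php
// Sketch only: retry a few times and attach a forged X-Forwarded-For header,
// as suggested in the replies above. Values here are arbitrary placeholders.
function fetch_with_retry($url, $max_tries = 3) {
    for ($i = 0; $i < $max_tries; $i++) {
        $ch = curl_init($url);
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_HTTPHEADER     => array('X-Forwarded-For: 123.45.67.89'),
        ));
        $body = curl_exec($ch);
        $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);
        if ($status != 403) {
            return $body;      // not blocked (or a different error worth inspecting)
        }
        sleep(2);              // back off briefly before retrying
    }
    return false;              // still blocked after all retries
}
</code>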
Speaking of which, you can get the followees just by requesting the followees page directly; the rest is regular-expression work (a rough sketch follows).
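A minimal sketch of that suggestion, assuming $html already holds the HTML of http://www.zhihu.com/people/XXX/followees fetched with a logged-in session (the regex is an illustration, not taken from the original reply):

<code class="language-php"><?php
// Extract the user ids linked from the followees page with a regular expression.
// $html is assumed to hold the page fetched with your logged-in cookies.
preg_match_all('#href="(?:https?://www\.zhihu\.com)?/people/([^"/]+)"#', $html, $matches);
$followees = array_unique($matches[1]);   // one entry per linked profile
print_r($followees);
</code>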