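Why do I get a 403 when crawling Zhihu content with a scraper?
I want to scrape users' follow information on Zhihu, for example which people user A follows. I get the followee list from the page www.zhihu.com/people/XXX/followees, but I ran into a 403 while scraping.
1. The crawler only collects users' follow information for academic research, not for commercial or any other purpose.
2. I use PHP, build the requests with curl, and parse the documents with simple_html_dom.
3. The followees list appears to load more followees dynamically via Ajax, so I wanted to hit the endpoint directly. In Firebug, loading more followees seems to go through http://www.zhihu.com/node/ProfileFolloweesListV2, and the POST data contains _xsrf, method, and params. So, while simulating a logged-in session, I submitted a request to that URL with the required POST parameters, but the response was 403.
4. With the same simulated login, however, I can parse data that does not need Ajax, such as upvote counts and thanks counts.
5. I set the request headers with curl_setopt($ch, CURLOPT_HTTPHEADER, $header); so that they match the headers my browser sends, but I still get the 403.
6. I tried to print curl's request headers to compare them with the ones the browser sends, but could not find the right way (the curl_getinfo() I found via Baidu seems to print the response message).
7. Many people have hit a 403 because they did not set User-Agent or X-Requested-With, but I set both when setting the headers as described in 5.
8. If the description is not detailed enough and code is needed, I can post the code.
9. This crawler is part of my graduation project, and I need the data for the next steps. As said in 1, the scraping is purely for academic research.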
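The comparison described in items 5 and 6 can be made mechanical once both header sets are written down. Below is a minimal Python sketch, assuming the browser's headers were copied by hand from Firebug; `diff_headers` and both sample dictionaries are hypothetical and only illustrate the idea. On the PHP side, the curl extension's `CURLINFO_HEADER_OUT` option is, as far as I know, the setting intended for reading back the request headers through `curl_getinfo()`.

```python
def diff_headers(browser, script):
    """Compare two header dicts, ignoring the case of header names.

    Returns (missing, different): names the script does not send at all,
    and names whose values differ from the browser's.
    """
    b = {k.lower(): v.strip() for k, v in browser.items()}
    s = {k.lower(): v.strip() for k, v in script.items()}
    missing = sorted(name for name in b if name not in s)
    different = sorted(name for name in b if name in s and b[name] != s[name])
    return missing, different


# Hypothetical header sets, for illustration only.
browser_headers = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "http://www.zhihu.com/people/XXX/followees",
    "Cookie": "_xsrf=abc",
}
script_headers = {
    "user-agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "http://www.zhihu.com",
}

missing, different = diff_headers(browser_headers, script_headers)
print("missing:", missing)
print("different:", different)
```

Any name reported as missing or different is a candidate cause to rule out before looking elsewhere.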
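Replies:
If the server has firewall features, continuous scraping may get you blocked, unless you have many proxy servers. Or, simplest of all, keep redialing an ADSL connection to change your IP.

First open a browser and study the HTTP headers of the request, then scrape.

I happened to write a crawler for users' followees and followers in the last couple of days, in Python, and it is scraping data right now. Here is a piece of that Python code; you can check your own code against it. A 403 most likely means some of the data in the request was wrong. The code below opens a text file whose contents are user ids; I put a screenshot of the file's format at the end.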
# encoding=utf8
import json
from urllib.request import urlopen

import requests
from bs4 import BeautifulSoup

Default_Header = {'X-Requested-With': 'XMLHttpRequest',
                  'Referer': 'http://www.zhihu.com',
                  'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; '
                                'rv:39.0) Gecko/20100101 Firefox/39.0',
                  'Host': 'www.zhihu.com'}
# One shared session, so the login cookies are sent with every later request
_session = requests.session()
_session.headers.update(Default_Header)

resourceFile = open('/root/Desktop/UserId.text', 'r')
resourceLines = resourceFile.readlines()
resultFollowerFile = open('/root/Desktop/userIdFollowees.text', 'a+')
resultFolloweeFile = open('/root/Desktop/userIdFollowers.text', 'a+')

BASE_URL = 'https://www.zhihu.com/'
CAPTURE_URL = BASE_URL + 'captcha.gif?r=1466595391805&type=login'
PHONE_LOGIN = BASE_URL + 'login/phone_num'


def login():
    '''Log in to Zhihu.'''
    username = ''  # user name
    password = ''  # password; this logs in by phone number, change the login URL below for email login
    # Save the captcha image to disk so it can be read and typed in by hand
    cap_content = urlopen(CAPTURE_URL).read()
    with open('/root/Desktop/cap.gif', 'wb') as cap_file:
        cap_file.write(cap_content)
    captcha = input('capture:')
    data = {"phone_num": username, "password": password, "captcha": captcha}
    r = _session.post(PHONE_LOGIN, data)
    print(r.json()['msg'])


def readFollowerNumbers(followerId, followType):
    '''Read one user's followees or followers, depending on followType.'''
    print(followerId)
    personUrl = 'https://www.zhihu.com/people/' + followerId.strip('\n')
    xsrf = getXsrf()
    hash_id = getHashId(personUrl)
    headers = dict(Default_Header)
    headers['Referer'] = personUrl + '/follow' + followType
    followerUrl = 'https://www.zhihu.com/node/ProfileFollow' + followType + 'ListV2'
    params = {"offset": 0, "order_by": "created", "hash_id": hash_id}
    params_encode = json.dumps(params)
    data = {"method": "next", "params": params_encode, '_xsrf': xsrf}

    signIndex = 20
    offset = 0
    # A full page holds 20 entries; a shorter page means the list is exhausted
    while signIndex == 20:
        params['offset'] = offset
        data['params'] = json.dumps(params)
        followerUrlJSON = _session.post(followerUrl, data=data, headers=headers)
        followerHtml = followerUrlJSON.json()['msg']
        signIndex = len(followerHtml)
        offset = offset + signIndex
        for everHtml in followerHtml:
            everHtmlSoup = BeautifulSoup(everHtml, 'html.parser')
            personId = everHtmlSoup.a['href']
            resultFollowerFile.write(personId + '\n')
            print(personId)


def getXsrf():
    '''Get the _xsrf token; it belongs to the currently logged-in user.'''
    soup = BeautifulSoup(_session.get(BASE_URL).content, 'html.parser')
    _xsrf = soup.find('input', attrs={'name': '_xsrf'})['value']
    return _xsrf


def getHashId(personUrl):
    '''Get the hash id of the user being scraped, not of the logged-in user.'''
    soup = BeautifulSoup(_session.get(personUrl).content, 'html.parser')
    hashIdText = soup.find('script', attrs={'data-name': 'current_people'})
    return json.loads(hashIdText.string)[3]


def main():
    login()
    followType = input('Choose what to scrape: 0 = who the user follows, anything else = who follows the user ')
    followType = 'ees' if followType.strip() == '0' else 'ers'
    for followerId in resourceLines:
        try:
            readFollowerNumbers(followerId, followType)
            resultFollowerFile.flush()
        except Exception as e:
            # Skip users that fail, but show why instead of failing silently
            print('skipped', followerId.strip(), e)


if __name__ == '__main__':
    main()
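One detail of the script above that is easy to get wrong when porting it to another language is the shape of the POST body: `params` is a single JSON string nested inside an ordinary form field, next to `method` and `_xsrf`. The sketch below isolates that step and the offset-based paging without touching the network. `build_payload` and `next_offset` are hypothetical helper names, and the placeholder values stand in for what the logged-in session and the profile page would provide.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_payload(hash_id, xsrf, offset=0):
    """Form fields for one page of the follow list, mirroring the script above."""
    params = {"offset": offset, "order_by": "created", "hash_id": hash_id}
    return {
        "method": "next",
        # params travels as one JSON string, not as separate form fields
        "params": json.dumps(params),
        "_xsrf": xsrf,
    }


def next_offset(offset, page_size, got):
    """Return the next offset, or None when a short page signals the end."""
    return offset + got if got == page_size else None


# Placeholder values, for illustration only.
payload = build_payload("HASH_ID_PLACEHOLDER", "XSRF_PLACEHOLDER", offset=20)
body = urlencode(payload)  # the urlencoded form body that would be posted
print(body)
```

Decoding `body` again with `parse_qs` and `json.loads` is a quick way to confirm that a PHP or C# port produces the same structure.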
- The cookies were not sent
- The _xsrf or hash_id is wrong
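Both suspects in the list above can be checked before the POST is sent. The following is a minimal sketch using only the Python standard library. It assumes the token sits in a hidden `<input name="_xsrf">` element, as the Python script in the earlier reply expects. `XsrfFinder`, `check_request`, the sample HTML, and the cookie name are all hypothetical.

```python
from html.parser import HTMLParser


class XsrfFinder(HTMLParser):
    """Collect the value of <input name="_xsrf" ...> from a page."""

    def __init__(self):
        super().__init__()
        self.value = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("name") == "_xsrf":
            self.value = attrs.get("value")


def check_request(cookies, page_html, hash_id):
    """Return a list of likely problems found before the POST is sent."""
    problems = []
    if not cookies:
        problems.append("no cookies")
    finder = XsrfFinder()
    finder.feed(page_html)
    if not finder.value:
        problems.append("_xsrf not found in page")
    if not hash_id:
        problems.append("hash_id is empty")
    return problems


# Made-up page fragment and cookie, for illustration only.
sample_html = '<form><input type="hidden" name="_xsrf" value="abc123"/></form>'
print(check_request({}, sample_html, ""))
print(check_request({"session": "placeholder"}, sample_html, "some-hash"))
```

An empty list only means these basics are present; it does not prove the server will accept the values.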
/// <summary>
/// Ask a question on Zhihu
/// </summary>
/// Question title
/// Detailed content
// ...
    // Iterate over the cookies to get the value of _xsrf
    var list = GetAllCookies(cookie);
    foreach (var item in list)
    {
        if (item.Name == "_xsrf")
        {
            xsrf = item.Value;
            break;
        }
    }
    // Post the question
    var FaTiePostUrl = "http://www.zhihu.com/question/add";
    // ... (FaTiePostStr, the POST body, is built here)
    nhp.PostResultHtml(FaTiePostUrl, cookie, "http://www.zhihu.com/", FaTiePostStr);
}
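/// <summary>
/// Enumerate a CookieContainer
/// </summary>
public static List<Cookie> GetAllCookies(CookieContainer cc)
{
    List<Cookie> lstCookies = new List<Cookie>();
    // CookieContainer does not expose its contents, so read the private field by reflection
    Hashtable table = (Hashtable)cc.GetType().InvokeMember("m_domainTable",
        System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.GetField
        | System.Reflection.BindingFlags.Instance, null, cc, new object[] { });
    foreach (object pathList in table.Values)
    {
        SortedList lstCookieCol = (SortedList)pathList.GetType().InvokeMember("m_list",
            System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.GetField
            | System.Reflection.BindingFlags.Instance, null, pathList, new object[] { });
        foreach (CookieCollection colCookie in lstCookieCol.Values)
            foreach (Cookie c in colCookie) lstCookies.Add(c);
    }
    return lstCookies;
}

Modify the X-Forwarded-For field in the header to disguise your IP.

What a coincidence, I ran into this problem only last night. There can be many causes; I will only describe the one I hit, as one idea for reference. I was scraping Sina Weibo through a proxy. The 403 appeared because the site refused the visit. The same thing happened when I browsed by hand: after looking at a few pages I would get a 403, but it went away after refreshing a few times. In code, that simply means sending the request a few more times.

After reading the answers above I was floored. So many experts here. Still, I suggest the asker just go and ask Kai-Fu Lee, haha.

By the way, how did you find the endpoint? I cannot catch it with Firebug, and Chrome's Network tab does not show it either.

By the way, requesting the followees page directly also returns the data; the rest is just regular expressions.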
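The "send the request a few more times" advice from the Weibo reply above could be sketched as a small retry helper. This is a minimal illustration under stated assumptions, not that poster's code: `fetch_with_retry` is a hypothetical name, `fetch` stands for any zero-argument callable returning a status code and a body, and the canned responses replace real network calls.

```python
import time


def fetch_with_retry(fetch, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a status other than 403.

    fetch is any zero-argument callable returning (status_code, body).
    At most `attempts` calls are made; the last result is returned.
    """
    status, body = fetch()
    for attempt in range(1, attempts):
        if status != 403:
            break
        sleep(delay * attempt)  # wait a little longer before each retry
        status, body = fetch()
    return status, body


# Canned responses standing in for a server that refuses twice, then answers.
responses = iter([(403, ""), (403, ""), (200, "ok")])
status, body = fetch_with_retry(lambda: next(responses), sleep=lambda s: None)
print(status, body)
```

Retrying only helps when the refusal is intermittent, as in that reply; a 403 caused by a wrong token or missing cookie will come back on every attempt.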
