Many Python users have written a web crawler at some point. Collecting data from the web automatically is satisfying, and Python makes it easy. However, crawlers often run into login and verification barriers, which is frustrating (from the website's side, it is just as frustrating to have crawlers scraping the site every day). Crawlers and anti-crawler measures are locked in a game of cat and mouse: whenever one side gets ahead, the other catches up, and the contest never ends.
Because HTTP is a stateless protocol, login state is maintained by passing cookies. Once you log in through the browser, the browser saves the cookie that carries the login information. The next time you open the website, the browser automatically sends the saved cookies along with the request. As long as the cookies have not expired, you remain logged in to the website.
The browsercookie module is a tool for extracting these saved cookies from the browser. It loads your browser's cookies into a cookiejar object, which makes it easy for a crawler to download web content that requires a login.
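The cookie mechanism described above can be sketched offline with the Python 3 standard library. This is only an illustration and not part of browsercookie: the make_cookie and cookie_header_for helpers, the sessionid cookie name, and the example.com host are made up for the example. It stores a cookie in a cookiejar, asks the jar to attach its cookies to a request, and shows that an expired cookie is no longer sent.

```python
import time
import urllib.request
from http.cookiejar import Cookie, CookieJar


def make_cookie(name, value, domain, expires):
    """Build a simple version-0 cookie (hypothetical helper for this sketch)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith('.'),
        path='/', path_specified=True,
        secure=False, expires=expires, discard=False,
        comment=None, comment_url=None, rest={},
    )


def cookie_header_for(jar, url):
    """Return the Cookie header the jar would attach to a request for url."""
    request = urllib.request.Request(url)  # no network access happens here
    jar.add_cookie_header(request)
    return request.get_header('Cookie')


now = int(time.time())

# A cookie that is still valid is sent back to the site automatically.
valid_jar = CookieJar()
valid_jar.set_cookie(make_cookie('sessionid', 'abc123', 'example.com', now + 3600))
valid_header = cookie_header_for(valid_jar, 'http://example.com/account')

# An expired cookie is dropped, so the site sees an anonymous visitor again.
expired_jar = CookieJar()
expired_jar.set_cookie(make_cookie('sessionid', 'abc123', 'example.com', now - 3600))
expired_header = cookie_header_for(expired_jar, 'http://example.com/account')

print(valid_header)
print(expired_header)
```

A browser does the same bookkeeping on every request, which is why reusing its saved cookies is enough to appear logged in.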
Installation
pip install browsercookie
On Windows, the built-in sqlite module throws an error when loading the Firefox cookie database, so sqlite needs to be updated:
pip install pysqlite
Usage
The following helper extracts the title from a web page:
>>> import re
>>> get_title = lambda html: re.findall('<title>(.*?)</title>', html, flags=re.DOTALL)[0].strip()
This is the title of the page downloaded without logging in:
>>> import urllib2
>>> url = '...'
>>> public_html = urllib2.urlopen(url).read()
>>> get_title(public_html)
'Git and Mercurial code management for teams'
Next, use browsercookie to load the cookies from a Firefox profile that is already logged in to Bitbucket, and download the page again:
>>> import browsercookie
>>> cj = browsercookie.firefox()
>>> opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
>>> login_html = opener.open(url).read()
>>> get_title(login_html)
'richardpenman / home — Bitbucket'
The code above is for Python 2. In Python 3, the equivalent calls are:
>>> import urllib.request
>>> public_html = urllib.request.urlopen(url).read()
>>> opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
In the logged-in result, your username appears in the title, which shows that browsercookie successfully loaded the cookies from Firefox.
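In Python 3, read() returns bytes, so the page has to be decoded before a string regular expression such as the one in get_title can be applied to it. A minimal Python 3 sketch of the same flow follows. The fetch_title function, the UTF-8 assumption, and the tolerant decoding are choices made for this illustration; cookiejar can be any cookiejar object, such as the one returned by browsercookie.firefox().

```python
import re
import urllib.request
from http.cookiejar import CookieJar


def get_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    if isinstance(html, bytes):
        # Assumption: the page is UTF-8; undecodable bytes are replaced.
        html = html.decode('utf-8', errors='replace')
    match = re.search(r'<title>(.*?)</title>', html, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def fetch_title(url, cookiejar=None):
    """Download url, sending the cookies in cookiejar, and return its title."""
    if cookiejar is None:
        cookiejar = CookieJar()  # empty jar: behaves like a logged-out visitor
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookiejar))
    with opener.open(url) as response:
        return get_title(response.read())


# Usage (requires network access and the browsercookie package;
# url is the page used in the examples above):
#   import browsercookie
#   print(fetch_title(url, browsercookie.firefox()))
```

Returning None for a page without a title avoids the IndexError that the one-line lambda would raise in that case.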
The following example uses requests and loads the cookies from Chrome this time. You need to log in to Bitbucket with Chrome beforehand:
>>> import requests
>>> cj = browsercookie.chrome()
>>> r = requests.get(url, cookies=cj)
>>> get_title(r.content)
'richardpenman / home — Bitbucket'
If you don't know or don't care which browser holds the cookies you need, you can load them from all supported browsers at once:
>>> cj = browsercookie.load()
>>> r = requests.get(url, cookies=cj)
>>> get_title(r.content)
'richardpenman / home — Bitbucket'
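Conceptually, loading from every browser amounts to trying each browser's loader in turn and merging whatever cookies are found into one cookiejar. The Python 3 sketch below illustrates that idea under stated assumptions. It is not the module's actual implementation, and merge_cookiejars and its catch-all error handling are hypothetical choices for illustration.

```python
from http.cookiejar import CookieJar


def merge_cookiejars(loaders):
    """Call each loader and copy its cookies into one combined CookieJar.

    loaders is an iterable of zero-argument callables that each return an
    iterable of cookies. A loader that raises (for example because that
    browser is not installed) is skipped.
    """
    merged = CookieJar()
    for load in loaders:
        try:
            for cookie in load():
                merged.set_cookie(cookie)
        except Exception:
            # Assumption: any failure just means "no cookies from this browser".
            continue
    return merged


# Usage (requires the browsercookie package):
#   import browsercookie
#   cj = merge_cookiejars([browsercookie.chrome, browsercookie.firefox])
```

Passing the loaders as callables keeps the sketch independent of which browsers are installed on a given machine.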
Support
Currently, this module supports the following platforms:
Chrome: Linux, OSX, Windows
Firefox: Linux, OSX, Windows
So far the module has been tested with only a few browser versions, so you may run into problems when using it. You can report issues to the author:
https://bitbucket.org/richardpenman/browsercookie/