
Python crawler uses browser cookies: browsercookie

JUAN
Release: 2019-02-18 12:01:30


Many Python users have written a web crawler at some point. Collecting data from the web automatically is satisfying, and Python makes it easy. However, crawlers often run into login and verification barriers, which is frustrating (from the website's side, it is just as frustrating to have crawlers hitting the site every day). Crawlers and anti-crawler measures play a game of cat and mouse: each time one side improves, the other adapts, and the contest never really ends.

Because HTTP is a stateless protocol, websites keep track of a login by means of cookies. Once you log in through the browser, the browser saves the cookie that carries the login information. The next time you open the website, the browser automatically sends the saved cookies along with the request. As long as the cookies have not expired, you remain logged in.
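To make the mechanism concrete, here is a minimal sketch using only the Python 3 standard library. The cookie name `sessionid`, its value, and the `example.com` domain are made up for illustration, and no request is actually sent. It shows how a saved cookie ends up in the `Cookie` header of the next request, which is what the browser does for you:

```python
import http.cookiejar
import urllib.request

# A hand-built cookie standing in for one the browser saved after login.
# Name, value and domain are hypothetical.
session_cookie = http.cookiejar.Cookie(
    version=0, name="sessionid", value="abc123",
    port=None, port_specified=False,
    domain="example.com", domain_specified=True, domain_initial_dot=False,
    path="/", path_specified=True,
    secure=False, expires=None, discard=True,
    comment=None, comment_url=None, rest={},
)

jar = http.cookiejar.CookieJar()
jar.set_cookie(session_cookie)

# Prepare a request to the same site; nothing is sent over the network.
request = urllib.request.Request("http://example.com/account")

# The jar attaches every cookie that matches the request's domain and path.
jar.add_cookie_header(request)
cookie_header = request.get_header("Cookie")
print(cookie_header)
```

A request to a different domain would get no `Cookie` header from this jar, because the cookie only matches `example.com`.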

The browsercookie module extracts these saved cookies from the browser. It is a handy tool for crawlers: it loads your browser's cookies into a cookiejar object, so you can easily download web content that requires a login.

Installation

pip install browsercookie

On Windows, the built-in sqlite module throws an error when loading the Firefox database. The sqlite version needs to be updated:
pip install pysqlite

Usage

The following is an example of extracting the title from the web page:

>>> import re
>>> get_title = lambda html: re.findall('<title>(.*?)</title>', html, flags=re.DOTALL)[0].strip()
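The one-line helper above assumes the page is a text string and contains a `<title>` tag. As a hedged alternative for Python 3, where downloads usually arrive as bytes, the sketch below decodes the input first and returns `None` when no title is found. The `encoding` parameter and the sample HTML are assumptions for illustration, not part of browsercookie:

```python
import re

# Match the first <title>...</title>, across line breaks and in any letter case.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.DOTALL | re.IGNORECASE)


def get_title(html, encoding="utf-8"):
    """Return the stripped page title, or None if the page has no <title>."""
    if isinstance(html, bytes):
        # Downloaded bodies are bytes in Python 3; decode before matching.
        html = html.decode(encoding, errors="replace")
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


print(get_title(b"<html><title>\n  Example page </title></html>"))
print(get_title("<html><body>no title here</body></html>"))
```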

Here is the title of the page downloaded without logging in:

>>> import urllib2
>>> url = '...'  # the URL is missing from the source; the outputs below are from Bitbucket
>>> public_html = urllib2.urlopen(url).read()
>>> get_title(public_html)
'Git and Mercurial code management for teams'

Next, use browsercookie to load the cookies from a Firefox profile that is logged in to Bitbucket, then download the page again:

>>> import browsercookie
>>> cj = browsercookie.firefox()
>>> opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
>>> login_html = opener.open(url).read()
>>> get_title(login_html)
'richardpenman / home &mdash; Bitbucket'

The code above is for Python 2. The Python 3 equivalent is:

>>> import urllib.request
>>> public_html = urllib.request.urlopen(url).read()
>>> opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

In the logged-in download, the username appears in the title, which means that the browsercookie module successfully loaded the cookies from Firefox.
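The Python 3 snippet stops once the opener is built. A minimal sketch of the remaining steps follows, assuming UTF-8 pages. The helper names `opener_with_cookies` and `fetch_html` are made up for illustration. The browsercookie call appears only in a comment, because it needs the module installed, a logged-in Firefox profile, and network access:

```python
import http.cookiejar
import urllib.request


def opener_with_cookies(cookiejar):
    """Build an opener that sends matching cookies from `cookiejar`."""
    return urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookiejar)
    )


def fetch_html(url, cookiejar, encoding="utf-8"):
    """Download `url` with the given cookies and return the decoded text."""
    opener = opener_with_cookies(cookiejar)
    with opener.open(url) as response:
        # read() returns bytes in Python 3, so decode before regex matching.
        return response.read().decode(encoding, errors="replace")


# Offline check with an empty jar: the cookie handler is installed.
opener = opener_with_cookies(http.cookiejar.CookieJar())
print(any(isinstance(h, urllib.request.HTTPCookieProcessor)
          for h in opener.handlers))

# With browsercookie installed and Firefox logged in, usage would look like:
#   import browsercookie
#   login_html = fetch_html(url, browsercookie.firefox())
```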

The following example uses requests. This time we load the cookies from Chrome, so you need to log in to Bitbucket with Chrome beforehand:

>>> import requests
>>> cj = browsercookie.chrome()
>>> r = requests.get(url, cookies=cj)
>>> get_title(r.content)
'richardpenman / home &mdash; Bitbucket'

If you don't know or don't care which browser holds the cookies you need, you can do this:

>>> cj = browsercookie.load()
>>> r = requests.get(url, cookies=cj)
>>> get_title(r.content)
'richardpenman / home &mdash; Bitbucket'
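A jar loaded from a browser typically holds cookies for many sites. If you would rather hand a crawler only the cookies for one site, a small filter can do it. This is a sketch using the standard library, not a browsercookie feature: `cookies_for_domain` and `demo_cookie` are hypothetical helpers, and the matching rule (exact domain or parent domain) is a simplifying assumption:

```python
import http.cookiejar


def cookies_for_domain(jar, domain):
    """Return a new CookieJar with only the cookies that apply to `domain`."""
    filtered = http.cookiejar.CookieJar()
    for cookie in jar:
        cookie_domain = cookie.domain.lstrip(".")
        # Keep exact matches and cookies set on a parent domain.
        if domain == cookie_domain or domain.endswith("." + cookie_domain):
            filtered.set_cookie(cookie)
    return filtered


def demo_cookie(name, value, domain):
    """Build a throwaway cookie for the demonstration below."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


# Stand-in for a jar returned by browsercookie; the contents are made up.
all_cookies = http.cookiejar.CookieJar()
all_cookies.set_cookie(demo_cookie("session", "s1", ".bitbucket.org"))
all_cookies.set_cookie(demo_cookie("tracker", "t1", "example.com"))

bitbucket_only = cookies_for_domain(all_cookies, "bitbucket.org")
print([cookie.name for cookie in bitbucket_only])
```

The filtered jar can then be passed wherever the full jar was used, for example as the `cookies` argument in the requests example.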

Support

Currently, this module supports the following platforms:

Chrome: Linux, OSX, Windows
Firefox: Linux, OSX, Windows

So far the module has been tested against only a few browser versions, so you may run into problems when using it. You can report issues to the author:

https://bitbucket.org/richardpenman/browsercookie/


