A Hands-On Tutorial on Building Crawlers with Python's urllib and urllib2 Modules
urllib
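After finishing the Python basics I felt a little lost. I would close my eyes and a suffocating blankness kept rolling in. What I lacked was practice, so I picked crawlers to practice on. After working through Sparta's Python crawler course, I wrote up what I learned below for later reference. These notes fall into the following parts:
- 1. Write a simple crawler
- 2. A first real test: grabbing images from Baidu Tieba
- 3. Summary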
1. Write a simple crawler
First, the environment:
- Device: MBA (MacBook Air) 2012, Yosemite 10.10.1
- Python: 2.7.9
- Editor: Sublime Text 3
There is not much to explain here, so straight to the code!
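    '''
    @ urllib is a networking library that ships with Python
    @ urlopen is a urllib function that opens a connection and fetches a page;
      read() then returns the page body, which we store in content
    '''
    import urllib

    # An aside: why lifevc? Mostly because it has been getting on my nerves lately.
    url = "http://www.lifevc.com"
    html = urllib.urlopen(url)
    content = html.read()
    html.close()
    # print shows the page content
    print content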
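That is all there is to it. This is the charm of Python: a few lines of code and the job is done.
Of course, merely fetching a page has little practical value. Next, let's do something more meaningful.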
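One thing the snippet above leaves to chance is the close() call: if read() raises, the connection is never closed. Here is a minimal sketch of one way to guard against that, using contextlib.closing from the standard library. The helper name fetch is hypothetical, the try/except import is only there so the sketch also runs on Python 3, and the demo reads a temporary local file through a file: URL so that it needs no network access.

```python
import contextlib
import os
import tempfile

try:
    from urllib import urlopen, pathname2url          # Python 2.7
except ImportError:
    from urllib.request import urlopen, pathname2url  # Python 3


def fetch(url):
    """Return the raw body at url, closing the connection even on errors."""
    with contextlib.closing(urlopen(url)) as response:
        return response.read()


# Offline demo: write a small page to a temporary file and fetch it back.
fd, path = tempfile.mkstemp(suffix=".html")
os.write(fd, b"<html>hello</html>")
os.close(fd)
try:
    local_url = "file:" + pathname2url(path)
    content = fetch(local_url)
finally:
    os.remove(path)

print(content)
```

For a real page, fetch("http://www.lifevc.com") would be called the same way.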
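2. A first real test
Grabbing images from Baidu Tieba
This is also simple, but because we want images, we first need to analyze the page source.
(This assumes basic HTML knowledge; Chrome is used as the browser.)
Briefly, the steps are:
- Open the page, right-click, and choose "Inspect Element" (the last item in the menu).
- Click the leftmost icon in the panel that pops up at the bottom (it looks like a question mark); it turns blue.
- Move the mouse over the image we want to grab (here, a photo of a cute girl) and click it.
- The image's position in the page source is now shown.
Copy out the relevant source: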
<img class="BDE_Image" src="(image URL ending in .jpg)" style="cursor: url(http://tb2.bdstatic.com/tb/static-pb/img/cur_zin.cur), pointer;">
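After some analysis and comparison (omitted here), the images we want share a few features:
- they sit in img tags
- they carry the class BDE_Image
- the image format is jpg
I will cover regular expressions in a later update, so stay tuned.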
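The three features listed above translate fairly directly into a regular expression. The sketch below tries one such pattern on a made-up HTML fragment instead of the live page; the pattern, the helper name extract_image_links, and the example.com URLs are all assumptions for illustration, not the markup Tieba actually serves.

```python
import re

# Hypothetical fragment: two matching images, one with another class, one .png.
SAMPLE_HTML = """
<img class="BDE_Image" src="http://example.com/a.jpg" width="560">
<img class="avatar" src="http://example.com/face.jpg">
<img class="BDE_Image" src="http://example.com/b.jpg" height="400">
<img class="BDE_Image" src="http://example.com/c.png">
"""

# class="BDE_Image" pins the class, (.+?\.jpg) lazily captures a URL
# that ends in .jpg, and the final quote closes the src attribute.
IMG_PATTERN = re.compile(r'class="BDE_Image" src="(.+?\.jpg)"')


def extract_image_links(html):
    """Return the captured src values of all BDE_Image .jpg images."""
    return IMG_PATTERN.findall(html)


links = extract_image_links(SAMPLE_HTML)
for link in links:
    print(link)
```

The lazy quantifier (+?) matters here: a greedy .+ would run on to the last .jpg" on the same line when several images share one line.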
Based on these observations, here is the code:
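    '''
    @ This program downloads images from Baidu Tieba
    @ re is the regular expression library
    '''
    import urllib
    import re

    # Fetch the page HTML
    url = "http://tieba.baidu.com/p/2336739808"
    html = urllib.urlopen(url)
    content = html.read()
    html.close()

    # Match the image features with a regex and collect the image links.
    # Pattern reconstructed from the three features listed above (an assumption).
    img_tag = re.compile(r'class="BDE_Image" src="(.+?\.jpg)"')
    img_links = re.findall(img_tag, content)

    # Download the images; img_counter numbers them and serves as the file name
    img_counter = 0
    for img_link in img_links:
        img_name = '%s.jpg' % img_counter
        urllib.urlretrieve(img_link, "/Users/Sean/Downloads/tieba/%s" % img_name)
        img_counter += 1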
And with that, we have grabbed the images we were after.
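The counter-and-format bookkeeping in the download loop can be written more compactly with enumerate and os.path.join. This small sketch only computes the target paths and downloads nothing; plan_downloads, the folder name, and the example.com links are hypothetical, and each pair would be handed to urllib.urlretrieve as in the program above.

```python
import os


def plan_downloads(img_links, folder):
    """Pair every image link with a numbered .jpg path inside folder."""
    return [
        (link, os.path.join(folder, "%d.jpg" % index))
        for index, link in enumerate(img_links)
    ]


# Made-up links standing in for the ones found by the regex.
sample_links = ["http://example.com/a.jpg", "http://example.com/b.jpg"]
plan = plan_downloads(sample_links, "tieba")
for link, target in plan:
    print("%s -> %s" % (link, target))
```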
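3. Summary
As the two sections above show, fetching a page or downloading images takes very little effort.
One more tip: when you meet a library or function you do not quite understand, the following give you a first look:
- dir(urllib) # lists the names the library provides
- help(urllib.urlretrieve) # shows what the function does and which parameters it takes; this is the official, authoritative description
Alternatively, search https://docs.python.org/2/library/index.html.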
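The same kind of exploration can be scripted. Here is a minimal sketch that filters dir() down to public names and pulls the first docstring line with inspect.getdoc; the helper names are hypothetical, and the try/except import is an assumption so that it also runs on Python 3, where urlretrieve lives in urllib.request.

```python
import inspect

try:
    import urllib.request as url_lib   # Python 3
except ImportError:
    import urllib as url_lib           # Python 2.7


def public_names(module):
    """Names from dir(module) that do not start with an underscore."""
    return [name for name in dir(module) if not name.startswith("_")]


def first_doc_line(obj):
    """First line of the docstring, or an empty string if there is none."""
    doc = inspect.getdoc(obj) or ""
    return doc.splitlines()[0] if doc else ""


names = public_names(url_lib)
print("%d public names" % len(names))
print(first_doc_line(url_lib.urlretrieve))
```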
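Baidu works too, of course, but it is too inefficient. I recommend searching with http://xie.lu instead (you know what I mean; you will be satisfied).
This part covered fetching pages and downloading images; the next part covers sites that restrict crawling.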
urllib2
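The previous part covered fetching pages and downloading images. This part covers how to crawl sites that restrict crawling. It is organized as follows:
- 1. Fetching a restricted page
- 2. Some optimizations to the code
1. Fetching a restricted page
First, a test using what we learned in the previous part, on a site that everyone uses as an example: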
    '''
    @ This program fetches a page from blog.csdn.net
    '''
    import urllib

    url = "http://blog.csdn.net/FansUnion"
    html = urllib.urlopen(url)
    # getcode() returns the HTTP status code
    print html.getcode()
    html.close()
    # Output:
    # 403
The output here is 403, which means access is forbidden; likewise, 200 means the request succeeded and 404 means the URL was not found.
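The standard library already carries a table of these codes and their reason phrases, so there is no need to memorize them. This is a small sketch, assuming the helper name describe_status; the table lives in httplib on Python 2.7 and in http.client on Python 3, hence the try/except import.

```python
try:
    from http.client import responses   # Python 3
except ImportError:
    from httplib import responses       # Python 2.7


def describe_status(code):
    """Return the code followed by its standard reason phrase."""
    return "%d %s" % (code, responses.get(code, "Unknown"))


for code in (200, 403, 404):
    print(describe_status(code))
```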
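So CSDN blocks this kind of request, and the method from the first part cannot fetch the page. This is where a new library comes in: urllib2.
A browser can still open the page, though, so if we imitate what the browser does, shouldn't we be able to get the page as well?
As before, let's first look at how the browser submits its request to the CSDN server. The method, briefly:
- Open the page, right-click, and choose "Inspect Element" (the last item in the menu).
- Click the Network tab in the panel that pops up at the bottom.
- Refresh the page; the Network tab now lists a lot of captured requests.
- Expand one of them to see the Header of the request.
Here is the Header information after tidying: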
    Request Method: GET
    Host: blog.csdn.net
    Referer: http://blog.csdn.net/?ref=toolbar_logo
    User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36
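Then, using the extracted Header information, we use urllib2's Request class to imitate a browser submitting a request to the server. The code is as follows: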
    # coding=utf-8
    '''
    @ This program fetches a restricted page (blog.csdn.net)
    @ User-Agent: the client browser version
    @ Host: the server address
    @ Referer: the page the request came from
    @ GET: the request method (urllib2's default when no data is sent)
    '''
    import urllib2

    url = "http://blog.csdn.net/FansUnion"

    # Set custom headers to imitate a browser submitting a request
    req = urllib2.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36')
    req.add_header('Host', 'blog.csdn.net')
    req.add_header('Referer', 'http://blog.csdn.net')
    # No 'GET' header is needed: GET is the request method, not a header

    # Download the page HTML and print it
    html = urllib2.urlopen(req)
    content = html.read()
    print content
    html.close()
Heh: you restrict me, and I step around your restriction. It is said that whatever a browser can access, a crawler can fetch.
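Before sending anything, a Request object can be inspected to see what it would send. The sketch below builds one without touching the network and shows two details: header names are stored in capitalized form (User-agent), and the request method comes from the object itself (GET without data, POST with data) and not from a header. The try/except import is an assumption so that it also runs on Python 3, and the short User-Agent value is a placeholder.

```python
try:
    from urllib2 import Request          # Python 2.7
except ImportError:
    from urllib.request import Request   # Python 3

url = "http://blog.csdn.net/FansUnion"

req = Request(url)
req.add_header("User-Agent", "Mozilla/5.0 (example)")  # placeholder value
req.add_header("Referer", "http://blog.csdn.net")

print(req.get_method())               # method used when no data is attached
print(req.get_header("User-agent"))   # note the capitalized lookup key
print(sorted(req.header_items()))

# Attaching a body switches the default method.
post_req = Request(url, data=b"q=python")
print(post_req.get_method())
```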
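2. Some optimizations to the code
Simplifying how headers are submitted
Writing out that many req.add_header calls every time is torture. Is there a way to just paste the headers in and use them? The answer is yes.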
    # input:
    help(urllib2.Request)
    # output (for brevity, only the __init__ method is shown):
    __init__(self, url, data=None, headers={}, origin_req_host=None, unverifiable=False)

Looking at this, we find headers={}, which means the header information can be passed in as a dictionary. Let's try it!

    # Only the custom-header part of the code is shown.
    # GET is the request method, not a header, so it is not listed here.
    csdn_headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
        "Host": "blog.csdn.net",
        "Referer": "http://blog.csdn.net",
    }
    req = urllib2.Request(url, headers=csdn_headers)
Simple, isn't it? Thanks to Sparta for generously sharing this.
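Since the headers end up in a dictionary anyway, the text copied from the Network tab can be turned into that dictionary mechanically. This is a minimal sketch under a few assumptions: the copied text has one Name:Value pair per line, the helper name parse_headers is hypothetical, and the NOT_HEADERS tuple is my guess at the summary fields (such as Request Method) that the browser shows next to the real headers.

```python
RAW_HEADERS = """Request Method:GET
Host:blog.csdn.net
Referer:http://blog.csdn.net/?ref=toolbar_logo
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36"""

# Summary fields shown by the browser that are not HTTP headers (assumed list).
NOT_HEADERS = ("Request Method", "Request URL", "Status Code")


def parse_headers(raw):
    """Turn pasted 'Name:Value' lines into a dict usable as headers=..."""
    headers = {}
    for line in raw.splitlines():
        # Split on the first colon only, so URLs in the value stay intact.
        name, sep, value = line.partition(":")
        name = name.strip()
        if not sep or not name or name in NOT_HEADERS:
            continue
        headers[name] = value.strip()
    return headers


csdn_headers = parse_headers(RAW_HEADERS)
for name in sorted(csdn_headers):
    print("%s: %s" % (name, csdn_headers[name]))
```

The resulting dictionary can be passed straight to urllib2.Request(url, headers=csdn_headers).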
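Supplying dynamic header information
If you crawl with the method above, the server will often refuse you as a robot because the information you submit is too uniform.
So is there a smarter way to submit data that varies? The answer is again yes, and it is simple. Straight to the code!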
    # coding=utf-8
    '''
    @ This program submits Header information dynamically
    @ random: the random-number library, see <https://docs.python.org/2/library/random.html>
    '''
    import urllib2
    import random

    url = 'http://www.lifevc.com/'

    my_headers = [
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; .NET CLR 3.0.04506.30; .NET CLR 3.0.04506.648)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; .NET CLR 2.0.50727; InfoPath.1',
        'Mozilla/4.0 (compatible; GoogleToolbar 5.0.2124.2070; Windows 6.0; MSIE 8.0.6001.18241)',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
        'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; Sleipnir/2.9.8)',
        # For brevity, N more entries are omitted here
    ]

    random_header = random.choice(my_headers)
    # print random_header shows which header is being submitted

    req = urllib2.Request(url)
    req.add_header("User-Agent", random_header)
    # Host and Referer must match the site being fetched
    req.add_header('Host', 'www.lifevc.com')
    req.add_header('Referer', 'http://www.lifevc.com/')

    content = urllib2.urlopen(req).read()
    print content
That really is all: with this we have made some optimizations to the code.
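One way to keep Host and Referer in step with whatever URL is being fetched is to derive them from the URL while picking the User-Agent at random. The sketch below does only that and sends nothing; build_headers and its rng parameter are hypothetical names, the User-Agent pool is a three-entry excerpt, and the try/except import is an assumption so that it also runs on Python 3.

```python
import random

try:
    from urlparse import urlparse        # Python 2.7
except ImportError:
    from urllib.parse import urlparse    # Python 3

# A short excerpt of a User-Agent pool; a real pool would be longer.
USER_AGENTS = [
    'Mozilla/4.0 (compatible; GoogleToolbar 5.0.2124.2070; Windows 6.0; MSIE 8.0.6001.18241)',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; Sleipnir/2.9.8)',
]


def build_headers(url, rng=random):
    """Pick a random User-Agent and derive Host and Referer from url."""
    parts = urlparse(url)
    return {
        "User-Agent": rng.choice(USER_AGENTS),
        "Host": parts.netloc,
        "Referer": "%s://%s/" % (parts.scheme, parts.netloc),
    }


headers = build_headers("http://www.lifevc.com/")
for name in sorted(headers):
    print("%s: %s" % (name, headers[name]))
```

Calling build_headers once per request gives each request its own User-Agent, and the dictionary can be passed as the headers argument of urllib2.Request.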
