A Simple Python Crawler for Extracting the Links on a Page

WBOY

Released: 2016-06-10 15:18:15

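Besides C/C++, I have worked with a number of popular languages: PHP, Java, JavaScript, and Python. Of these, I find Python the most convenient to work with and the one with the fewest drawbacks.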

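A few days ago I wanted to write a crawler, but after talking it over with a friend we decided to write one together in a few days. An important part of any crawler is extracting the links from a page, so here is a simple implementation of that part.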

First we need an open-source module, requests. It is not part of the Python standard library, so it has to be downloaded, unpacked, and installed:

$ curl -OL https://github.com/kennethreitz/requests/zipball/master
# unpack the downloaded archive and change into the extracted directory, then:
$ python setup.py install

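Windows users can download the archive directly from the link below. After unpacking it, run python setup.py install in the extracted directory to install the module. https://github.com/kennethreitz/requests/zipball/master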

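I am slowly translating this module's documentation and will upload it once it is finished (the English version is in the attachment for now). As its description says, it is built for human beings. It is easy to use; see the documentation for details. The simplest call, requests.get(), sends a GET request.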
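To make "sends a GET request" a little more concrete, here is a minimal sketch that builds a GET request with requests and inspects it without sending anything over the network. The q=python query parameter is an illustrative assumption and is not part of the original script.

```python
import requests

# Build (but do not send) a GET request of the kind requests.get() issues.
# The 'q' query parameter is purely illustrative.
request = requests.Request('GET', 'http://www.163.com', params={'q': 'python'})
prepared = request.prepare()

print(prepared.method)  # the HTTP verb
print(prepared.url)     # the final URL, with the query string encoded
```

requests.get(url, params=...) goes through the same preparation step and then actually sends the request.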

The code is as follows:

# coding:utf-8
import re
import requests

# Fetch the page content
r = requests.get('http://www.163.com')
data = r.text

# Use a regular expression to find all links
link_list = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", data)
for url in link_list:
    print(url)

The script first imports the re and requests modules; re is the standard-library module for working with regular expressions.

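r = requests.get('http://www.163.com') sends a GET request to the NetEase home page and returns a Response object, r. r.text is the source of the page decoded as text, and it is stored in the string data.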
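The request above is made with no timeout and no error handling. A more defensive fetch might look like the sketch below; the function name fetch_html and the 10-second default timeout are my own illustrative choices, not something the original script defines.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page source as text, or None if the request fails.

    The name fetch_html and the default timeout are illustrative choices.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into an exception instead of
        # handing an error page to the link extraction step.
        response.raise_for_status()
    except requests.RequestException as exc:
        print('request failed: %s' % exc)
        return None
    return response.text
```

With this helper, data = fetch_html('http://www.163.com') would stand in for the two lines that call requests.get() and read r.text, and the caller would check for None before searching for links.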

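Next, a regular expression finds all the links in data. My regex is fairly crude: it simply captures whatever sits between the quotes of href="" or href='', which is the link information we want.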
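To try the href extraction without fetching a live page, the sketch below runs a similar pattern over a small hand-written HTML fragment. The fragment and its URLs are made up for illustration, and the pattern (a capturing group plus a backreference to the opening quote) is a variation on the approach described above, not necessarily the author's exact regex.

```python
import re

# A small hand-written HTML fragment, used here instead of a live page.
sample_html = (
    '<a href="http://news.163.com/">News</a>'
    "<a href='/sports/index.html'>Sports</a>"
    '<a class="nav" href="#top">Top</a>'
)

# Capture whatever sits between the quotes that follow href=.
# Group 1 remembers the opening quote so the same quote must close the value.
href_pattern = re.compile(r'''href=(["'])(.*?)\1''')

links = [match.group(2) for match in href_pattern.finditer(sample_html)]
for url in links:
    print(url)
```

Like the original pattern, this is still a crude text match: it does not understand HTML, so unquoted attribute values or href text inside scripts and comments are not handled correctly.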

re.findall returns a list; the for loop iterates over that list and prints each entry.

This is part of all the links I obtained.

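The above is a simple implementation of collecting all the links on a page. It handles no exceptions and does not take the type of each hyperlink into account, so treat the code as a reference only. The requests documentation is in the attachment.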
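As one possible way to take link types into account, the sketch below resolves each raw href against the page URL and keeps only http and https links. The function name classify_links and the sample href values are hypothetical and only serve the illustration; it uses the Python 3 standard library module urllib.parse.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def classify_links(base_url, hrefs):
    """Split raw href values into absolute http(s) URLs and everything else."""
    web_links = []
    skipped = []
    for href in hrefs:
        # Resolve relative paths such as '/sports/' against the page URL,
        # then drop any '#fragment' part.
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        parsed = urlparse(absolute)
        if parsed.scheme in ('http', 'https') and parsed.netloc:
            web_links.append(absolute)
        else:
            # javascript:, mailto: and similar schemes are not crawlable pages.
            skipped.append(href)
    return web_links, skipped


# Hand-written href values for illustration only.
raw_hrefs = [
    'http://news.163.com/',
    '/sports/index.html',
    'javascript:void(0)',
    'mailto:someone@example.com',
    '#top',
]
web_links, skipped = classify_links('http://www.163.com/', raw_hrefs)
print(web_links)
print(skipped)
```

Note that a bare anchor such as #top resolves to the page itself, so a real crawler would probably also want to remove duplicates before following the links.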