
Theoretical Background for Web Crawlers_html/css_WEB-ITnose

WBOY

Jun 21, 2016 am 08:54 AM

References: Wang Hai, Python Web Crawler; W3School HTML Tutorial; Xie Xiren, Computer Networks (2nd edition).

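A web crawler is a program or script that automatically fetches information from the World Wide Web according to a set of rules. A crawler finds a web page through its link address, retrieves the page content, and then follows the other links found in that page, repeating the cycle.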
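The fetch-and-follow loop described above can be sketched as a breadth-first traversal. This is a minimal illustration, not a production crawler: the `PAGES` dictionary, the example.com URLs, and the `crawl` function are all hypothetical stand-ins, and a real crawler would download each page over HTTP instead of reading it from memory.

```python
import re
from collections import deque

# Hypothetical in-memory "web": URL -> HTML text (an assumption for this sketch).
PAGES = {
    "http://example.com/": '<a href="http://example.com/a">A</a> <a href="http://example.com/b">B</a>',
    "http://example.com/a": '<a href="http://example.com/b">B</a>',
    "http://example.com/b": '<a href="http://example.com/">home</a>',
}

# Naive link pattern; good enough for the double-quoted hrefs used above.
HREF = re.compile(r'href="([^"]+)"')


def crawl(start_url, fetch=PAGES.get):
    """Visit pages breadth-first, following the links found on each page."""
    seen = {start_url}          # URLs already queued, so no page is visited twice
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:        # page could not be fetched; skip it
            continue
        order.append(url)
        for link in HREF.findall(html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


print(crawl("http://example.com/"))
```

The `seen` set is what keeps the loop from running forever when pages link back to each other, as the third page does here.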

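1 The process of browsing a web page

Browsing a web page means that the browser, acting as a "client", sends a request to the server, "fetches" the server's files to the local machine, and then interprets and displays them.

Uniform resource locators (URLs) identify the documents on the World Wide Web, so that every document has a unique identifier, its URL, across the whole Internet.
The HyperText Transfer Protocol (HTTP) implements the links on the World Wide Web and uses TCP connections for reliable transfer.
The HyperText Markup Language (HTML) lets page designers easily link from any place in a page to any other web page, and lets the page be displayed on the user's own screen.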

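2 Uniform resource locator (URL)

A URL expresses the location of a resource available on the Internet and the method for accessing it. It gives the resource's location an abstract form of identification and uses that form to locate the resource. Once a resource can be located, the system can operate on it in various ways, such as accessing, updating, replacing, or looking up its attributes. A URL is in effect a file name extended to network scope, so a URL is a pointer to any accessible object on a machine connected to the Internet. Because different objects are accessed with different protocols, a URL also indicates which protocol to use when reading an object. The general form of a URL is:

 <protocol>://<host>:<port>/<path>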


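The protocol says which protocol is used to fetch the document, such as http or ftp. The host is the domain name of the machine that holds the document. The port and path can sometimes be omitted. Web sites are accessed over HTTP, whose default port number is 80, so the port is usually left out. If the path is omitted as well, the URL points to a home page on the Internet, for example: www.baidu.com.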
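As a rough illustration of the four URL parts and of filling in omitted ones, the sketch below uses `urlsplit` from Python's standard `urllib.parse` module. The `describe` helper, the `DEFAULT_PORTS` table, and the port 8080 in the second URL are assumptions made for this example.

```python
from urllib.parse import urlsplit

# Default ports assumed when the URL does not name one.
DEFAULT_PORTS = {"http": 80, "ftp": 21}


def describe(url):
    """Return (protocol, host, port, path), filling in omitted parts."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    path = parts.path or "/"    # an omitted path means the site's home page
    return parts.scheme, parts.hostname, port, path


print(describe("http://www.baidu.com"))
print(describe("http://www.bilibili.com:8080/video/douga.html"))
```

Note that `urlsplit` only recognizes the host when the protocol prefix is present; a bare `www.baidu.com` is parsed as a path, so a crawler typically prepends `http://` before parsing.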

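3 HyperText Transfer Protocol (HTTP)

HTTP defines how a browser requests a World Wide Web document from a web server and how the server transfers the document to the browser. The figure below shows roughly how the World Wide Web works.

Figure: How the World Wide Web works

HTTP specifies that every interaction between an HTTP client and an HTTP server consists of a request made up of an ASCII string and a "MIME-like" response. HTTP messages are normally carried over TCP connections.

HTTP has two kinds of messages: request messages (sent from the client to the server) and response messages (the server's reply to the client). Both consist of three parts, and the only difference in format between the two is the start line.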

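Start line: distinguishes a request message from a response message. It is called the request line in a request message and the status line in a response message.
Header lines: describe the browser, the server, or the message body.
Entity body: generally unused in request messages, and may also be absent from response messages.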
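The three-part layout above can be sketched with plain string handling. This is only an illustration: the `split_message` and `kind` helpers and the sample response text are made up for this example, and a real client would rely on an HTTP library rather than parse messages by hand.

```python
def split_message(raw):
    """Split an HTTP message into (start line, header dict, entity body)."""
    # A blank line (CRLF CRLF) separates the header lines from the body.
    head, _, body = raw.partition("\r\n\r\n")
    lines = head.split("\r\n")
    start_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return start_line, headers, body


def kind(start_line):
    """A status line begins with the HTTP version; a request line with a method."""
    return "response" if start_line.startswith("HTTP/") else "request"


# Hypothetical response message used only for this sketch.
response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 13\r\n"
    "\r\n"
    "<p>hello</p>\n"
)

start_line, headers, body = split_message(response)
print(kind(start_line), start_line)
print(headers)
print(repr(body))
```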

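The request line has only three parts: the method, the URL of the requested resource, and the HTTP version. The table below lists the methods commonly used in request messages.

Method	Meaning
GET	Request to read the information identified by the URL
OPTIONS	Request information about the available options
HEAD	Request to read the headers of the information identified by the URL
POST	Add information to the server, such as a comment
PUT	Store a document under the specified URL
DELETE	Delete the resource identified by the specified URL
CONNECT	Used by proxy servers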

GET http://www.bilibili.com/video/douga.html HTTP/1.1


Below is an example of a request message.

Figure: Request message
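To make the request message concrete, here is a minimal sketch that assembles a GET request as text for the bilibili URL used in the request line above. The `build_get_request` helper is hypothetical, and the choice of headers (`Host` and `Connection: close`) is an assumption for illustration. The sketch puts only the path in the request line and the host name in the `Host` header, which is the form normally sent directly to a server; the absolute-URL form shown in the text is the one used when talking to a proxy.

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Build the text of a minimal HTTP/1.1 GET request for the given URL."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    lines = [
        f"GET {target} HTTP/1.1",   # request line: method, resource, version
        f"Host: {parts.netloc}",    # header lines
        "Connection: close",
        "",                         # blank line ends the headers
        "",                         # no entity body in a GET request
    ]
    return "\r\n".join(lines)


print(build_get_request("http://www.bilibili.com/video/douga.html"))
```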

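4 HyperText Markup Language (HTML)

HTML stands for HyperText Markup Language; it uses markup tags to describe web pages.

HTML tags are keywords surrounded by angle brackets, such as <html>. Tags usually come in pairs, such as <b> and </b>: the first tag in a pair is the start tag and the second is the end tag.

An HTML document contains HTML tags and plain text, and is also called a web page. A web browser reads HTML documents and displays them as web pages. The browser does not display the HTML tags; it uses them to interpret the content of the page.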

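Four basic tags, plus the grouping tag <div>

<h1> - <h6>: define HTML headings.
<p>: defines an HTML paragraph.
<a>: defines an HTML link.
<img>: defines an HTML image.
<div>: the HTML grouping tag; defines a division or section in the document.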

<h1>This is a heading</h1>
<h2>This is a heading</h2>
<h3>This is a heading</h3>
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
<a href="http://www.w3school.com.cn">This is a link</a>
<img src="w3school.jpg" />


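An HTML element is everything from the start tag to the end tag. The element's content is whatever lies between the start tag and the end tag. Most HTML elements can be nested (they can contain other HTML elements), and an HTML document is made up of nested HTML elements. The following example contains three HTML elements.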

<html>
    <body>
        <p>This is my first paragraph.</p>
    </body>
</html>
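One way to see the three nested elements is to walk the document with `HTMLParser` from Python's standard `html.parser` module and record how deep each start tag sits. The `ElementLister` class is a hypothetical helper written for this sketch. It assumes every start tag has a matching end tag, so a void tag written without a closing slash (for example a bare `<img>`) would throw the depth count off.

```python
from html.parser import HTMLParser


class ElementLister(HTMLParser):
    """Record each element together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.elements = []      # list of (depth, tag name)

    def handle_starttag(self, tag, attrs):
        self.elements.append((self.depth, tag))
        self.depth += 1         # anything that follows is nested inside this tag

    def handle_endtag(self, tag):
        self.depth -= 1         # leaving the element


doc = "<html><body><p>This is my first paragraph.</p></body></html>"
lister = ElementLister()
lister.feed(doc)
for depth, tag in lister.elements:
    print("  " * depth + tag)
```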


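HTML attributes: HTML tags can have attributes, which provide more information about the HTML element. Attributes always come as name/value pairs, such as name="value", and are always specified in the element's start tag. Attribute values should always be enclosed in quotation marks; double quotes are the most common, but single quotes work just as well.

An HTML link is defined by the <a> tag, and the link's address is specified in the href attribute: <a href="http://www.w3school.com.cn">This is a link</a>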
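Since a crawler depends on reading the href attribute of each link, here is a minimal sketch that collects link addresses with `HTMLParser` from Python's standard `html.parser` module. The `LinkCollector` class and the `/about` link are hypothetical additions for illustration; the second link uses single quotes to show that both quoting styles are read the same way.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs taken from the start tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


page = (
    '<a href="http://www.w3school.com.cn">This is a link</a> '
    "<a href='/about'>About</a>"
)
collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

A relative address such as `/about` would still have to be resolved against the page's own URL (for example with `urllib.parse.urljoin`) before it can be fetched.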
