This Python crawler example covers the basics in three steps: fetch pages with urllib (including sending a User-Agent header and POST data generated with urlencode), parse them with BeautifulSoup, and store the results in MySQL with pymysql.
1. urllib and BeautifulSoup
Fetch a page:
```python
from urllib import request

req = request.urlopen("http://www.baidu.com")
print(req.read().decode("utf-8"))
```
Simulate a real browser: send a User-Agent header
(this keeps the server from identifying the request as a crawler; without the header, some sites return an error)
```python
req = request.Request(url)  # url is the address of the target page
req.add_header(key, value)  # key is "User-Agent", value is the browser version string
resp = request.urlopen(req)
print(resp.read().decode("utf-8"))
```
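The snippet above is a template; a concrete, runnable version follows. The target URL and the User-Agent string here are just illustrative placeholders, not specific values from the original:

```python
from urllib import request

# example.com stands in for any target site; the UA string is illustrative
req = request.Request("http://www.example.com")
req.add_header("User-Agent",
               "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")

# urllib normalizes header names, so look it up as "User-agent"
print(req.get_header("User-agent"))
# resp = request.urlopen(req)  # would fetch the page with the spoofed header
```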
Send a POST request

Import parse from the urllib library:

```python
from urllib import parse
```
Use urlencode to generate the POST data:

```python
postData = parse.urlencode([
    (key1, val1),
    (key2, val2),
    (keyn, valn)
])
```
Send the request:

```python
resp = request.urlopen(req, data=postData.encode("utf-8"))  # send the POST request with postData
resp.status  # HTTP status code of the response
resp.reason  # reason phrase returned by the server
```
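A concrete illustration of what urlencode produces (the field names `username` and `lang` and the login URL are made up for this example):

```python
from urllib import parse, request

# Hypothetical form fields for illustration
postData = parse.urlencode([
    ("username", "alice"),
    ("lang", "python"),
])
print(postData)  # username=alice&lang=python

# To actually send it, pass the encoded bytes as data (not executed here):
# resp = request.urlopen("https://example.com/login", data=postData.encode("utf-8"))
```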
Complete code example (crawling links on the Wikipedia main page):
```python
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup as bs
from urllib.request import urlopen
import re
import ssl

# Globally disable certificate verification so the HTTPS request succeeds
ssl._create_default_https_context = ssl._create_unverified_context

# Request the URL and decode the result as UTF-8
req = urlopen("https://en.wikipedia.org/wiki/Main_Page").read().decode("utf-8")

# Parse with BeautifulSoup
soup = bs(req, "html.parser")
# print(soup)

# Get all <a> tags whose href attribute starts with "/wiki/Special"
urllist = soup.findAll("a", href=re.compile("^/wiki/Special"))
for url in urllist:
    # Skip links ending in .jpg or .JPG
    if not re.search(r"\.(jpg|JPG)$", url["href"]):
        # get_text() returns all text inside the tag, including child tags;
        # string returns only the tag's own text, or None if it has children
        print(url.get_text() + "----->" + url["href"])
        # print(url)
```
2. Store data in MySQL
Install pymysql
Install via pip:
$ pip install pymysql
or install from source:
$ python setup.py install
Usage
```python
# Import the package
import pymysql.cursors

# Get a database connection
connection = pymysql.connect(host="localhost",
                             user="root",
                             password="123456",
                             db="wikiurl",
                             charset="utf8mb4")
try:
    # Get a cursor
    with connection.cursor() as cursor:
        # Build the SQL statement
        sql = "insert into `tableName`(`urlname`,`urlhref`) values(%s,%s)"
        # Execute it, passing the values as parameters
        cursor.execute(sql, (url.get_text(), "https://en.wikipedia.org" + url["href"]))
        # Commit the transaction
        connection.commit()
finally:
    # Close the connection
    connection.close()
```
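In practice the insert runs once per link found by the crawler. One way to tidy this up is to batch all the rows and commit once; the sketch below is a minimal version of that idea, assuming a table named `urls` with the same two columns, and taking the open pymysql connection as a parameter:

```python
def save_links(rows, connection, table="urls"):
    """Insert (urlname, urlhref) pairs in a single batch, then commit.

    rows       -- list of (name, href) tuples, e.g. built from the crawl loop
    connection -- an open pymysql (or any DB-API 2.0) connection
    """
    sql = f"insert into `{table}`(`urlname`,`urlhref`) values (%s, %s)"
    with connection.cursor() as cursor:
        # executemany sends one parameterized statement per row;
        # the driver escapes the values, so no manual quoting is needed
        cursor.executemany(sql, rows)
    connection.commit()
```

With the crawl loop above, it would be called roughly as `save_links([(u.get_text(), "https://en.wikipedia.org" + u["href"]) for u in urllist], connection)`.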
3. Precautions for crawlers
The Robots protocol (also called the crawler protocol), in full the "Web Crawler Exclusion Protocol", is how a website tells search engines and crawlers which pages may be crawled and which may not. It normally sits at the site root, e.g. https://en.wikipedia.org/robots.txt
Disallow: paths that must not be crawled; Allow: paths that may be crawled
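Python's standard library can check these rules for you via urllib.robotparser. The sketch below parses a few sample rules inline so it runs offline; the rules and paths are made up for illustration. Against a live site you would call `set_url("https://en.wikipedia.org/robots.txt")` followed by `read()` instead:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Offline sample rules; for a real site use:
#   rp.set_url("https://en.wikipedia.org/robots.txt"); rp.read()
rp.parse([
    "User-agent: *",
    "Allow: /w/load.php",
    "Disallow: /w/",
])

print(rp.can_fetch("*", "https://example.org/wiki/Main_Page"))  # True: no rule matches
print(rp.can_fetch("*", "https://example.org/w/index.php"))     # False: under /w/
print(rp.can_fetch("*", "https://example.org/w/load.php"))      # True: explicitly allowed
```

Checking `can_fetch()` before each request is the polite way to honor a site's robots.txt.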