This article mainly introduces Python's method of obtaining all links in the current page. It compares and analyzes four commonly used methods of obtaining page links in Python with examples. It also comes with the method of obtaining links within the iframe framework. Friends who need it You can refer to the following examples
This article describes the four methods of Python to obtain all links in the current page. Share it with everyone for your reference, the details are as follows:
''' 得到当前页面所有连接 ''' import requests import re from bs4 import BeautifulSoup from lxml import etree from selenium import webdriver url = 'http://www.testweb.com' r = requests.get(url) r.encoding = 'gb2312' # 利用 re (太黄太暴力!) matchs = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')" , r.text) for link in matchs: print(link) print() # 利用 BeautifulSoup4 (DOM树) soup = BeautifulSoup(r.text,'lxml') for a in soup.find_all('a'): link = a['href'] print(link) print() # 利用 lxml.etree (XPath) tree = etree.HTML(r.text) for link in tree.xpath("//@href"): print(link) print() # 利用selenium(要开浏览器!) driver = webdriver.Firefox() driver.get(url) for link in driver.find_elements_by_tag_name("a"): print(link.get_attribute("href")) driver.close()
Note: If the page contains an iframe, all tags of the page contained in the iframe cannot be used. Four ways to get it! ! ! At this time:
# 再打开所有iframe查找全部的a标签 for iframe in soup.find_all('iframe'): url_ifr = iframe['src'] # 取得当前iframe的src属性值 rr = requests.get(url_ifr) rr.encoding = 'gb2312' soup_ifr = BeautifulSoup(rr.text,'lxml') for a in soup_ifr.find_all('a'): link = a['href'] m = re.match(r'http:\/\/.*?(?=\/)',link) #print(link) if m: all_urls.add(m.group(0))
The above is the detailed content of Python uses four methods to achieve comparative analysis of all links in the current page. For more information, please follow other related articles on the PHP Chinese website!