Comparing four common ways to locate elements in Python crawlers: which one do you prefer? - Python Tutorial - php.cn

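When collecting data with a Python crawler, a key task is extracting data from the requested web page, and correctly locating the data you want is the first step.

This article compares several methods commonly used in Python crawlers to locate web page elements, for readers to learn from: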


Traditional BeautifulSoup operations

CSS selectors based on BeautifulSoup (similar to PyQuery)

XPath

Regular expressions

The reference page used throughout is:

http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1

We take the titles of the first 20 books as an example. First, check whether the site has anti-crawling measures in place, that is, whether a plain request directly returns the content to be parsed:

import requests

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text
print(response)

Careful inspection shows that the returned content contains all the data we need, so anti-crawling measures do not have to be considered.

Inspecting the page elements shows that each book's information is contained in an li element, which belongs to the ul whose class is bang_list clearfix bang_list_mode.

Further inspection also reveals where the book title sits within each li, which is the basis for all of the parsing methods that follow.
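The two checks described above (does a plain request return the data, and is the markup shaped as expected) can be tried offline before writing any parser. The sketch below is a minimal illustration, not the article's code: `SAMPLE_HTML` is a hand-written fragment that only imitates the structure described above, the book titles in it are placeholders, and `has_book_list` is a hypothetical helper name.

```python
# Hand-written fragment imitating the described structure
# (ul.bang_list > li > div.name > a[title]); the titles are placeholders.
SAMPLE_HTML = """
<ul class="bang_list clearfix bang_list_mode">
  <li><div class="name"><a href="http://product.dangdang.com/1.html" target="_blank" title="Placeholder Book A">Placeholder Book A</a></div></li>
  <li><div class="name"><a href="http://product.dangdang.com/2.html" target="_blank" title="Placeholder Book B">Placeholder Book B</a></div></li>
</ul>
"""


def has_book_list(page_text):
    """Return True if the text contains the markers the parsers rely on."""
    markers = (
        'class="bang_list clearfix bang_list_mode"',
        '<div class="name">',
    )
    return all(marker in page_text for marker in markers)


print(has_book_list(SAMPLE_HTML))
print(has_book_list("<html><body>Access denied</body></html>"))
```

In practice the same function would be called on the text returned by requests.get(url).text. A False result suggests the page is blocked or filled in by script, in which case none of the four methods below would find anything in the raw response.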
1. Traditional BeautifulSoup operations
The classic BeautifulSoup approach starts with from bs4 import BeautifulSoup, converts the text into a structured tree with soup = BeautifulSoup(html, "lxml"), and then parses it with the find family of methods. The code is as follows:
import requests
from bs4 import BeautifulSoup

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def bs_for_parse(response):
    soup = BeautifulSoup(response, "lxml")
    li_list = soup.find('ul', class_='bang_list clearfix bang_list_mode').find_all('li')  # lock onto the ul, then get its 20 li elements
    for li in li_list:
        title = li.find('div', class_='name').find('a')['title']  # parse each li in turn to get the book title
        print(title)

if __name__ == '__main__':
    bs_for_parse(response)
All 20 book titles are retrieved successfully. Some of them are rather long and could be trimmed with regular expressions or other string methods, which this article does not cover in detail.
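As a rough idea of what such post-processing could look like, the sketch below trims a long title at the first opening bracket or at a run of two or more spaces. The splitting rule, the sample titles, and the helper name `shorten_title` are all assumptions made for illustration; real titles may well need different rules.

```python
import re

# Cut at the first opening bracket (ASCII or full-width) or at 2+ whitespace.
_CUT = re.compile(r"[(（\[【]|\s{2,}")


def shorten_title(title):
    """Keep only the part of the title before the first cut point."""
    head = _CUT.split(title, maxsplit=1)[0].strip()
    # Fall back to the original if the rule would leave nothing.
    return head or title.strip()


samples = [
    "Placeholder Novel (Collector's Edition, with bonus booklet)",
    "Placeholder Guide  2nd printing, revised",
    "Short Title",
]
for raw in samples:
    print(shorten_title(raw))
```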
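2. CSS selectors based on BeautifulSoup
This method is essentially PyQuery's CSS selectors carried over to another module, and the usage is similar. For the detailed CSS selector syntax, see: http://www.w3school.com.cn/cssref/css_selectors.asp
Because it is built on BeautifulSoup, the imports and the conversion of the text into a tree are the same as before: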
import requests
from bs4 import BeautifulSoup

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def css_for_parse(response):
    soup = BeautifulSoup(response, "lxml")
    print(soup)

if __name__ == '__main__':
    css_for_parse(response)
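The next step is to use soup.select with the appropriate CSS syntax to pick out specific content. As before, this rests on careful inspection of the page elements: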
import requests
from bs4 import BeautifulSoup

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def css_for_parse(response):
    soup = BeautifulSoup(response, "lxml")
    li_list = soup.select('ul.bang_list.clearfix.bang_list_mode > li')  # li elements that are direct children of the target ul
    for li in li_list:
        title = li.select('div.name > a')[0]['title']  # title attribute of the first matching link
        print(title)

if __name__ == '__main__':
    css_for_parse(response)
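To try the selectors without hitting the site, the same calls can be run on a small hand-written fragment. This is only a sketch under a few assumptions: `SAMPLE_HTML` imitates the structure described earlier with placeholder titles, the built-in "html.parser" is used instead of "lxml" so that no extra parser is needed, and `select_one` (which returns the first match or None) stands in for `select(...)[0]` so that an li without a title link is skipped rather than raising an IndexError.

```python
from bs4 import BeautifulSoup

# Hand-written fragment; the last li has no div.name, like an ad row might.
SAMPLE_HTML = """
<ul class="bang_list clearfix bang_list_mode">
  <li><div class="name"><a href="http://product.dangdang.com/1.html" title="Placeholder Book A">A</a></div></li>
  <li><div class="name"><a href="http://product.dangdang.com/2.html" title="Placeholder Book B">B</a></div></li>
  <li><div class="other">no title here</div></li>
</ul>
"""


def css_titles(page_text):
    """Collect title attributes using the same CSS selectors as above."""
    soup = BeautifulSoup(page_text, "html.parser")
    titles = []
    for li in soup.select("ul.bang_list.clearfix.bang_list_mode > li"):
        link = li.select_one("div.name > a")  # None when nothing matches
        if link is not None and link.has_attr("title"):
            titles.append(link["title"])
    return titles


print(css_titles(SAMPLE_HTML))
```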
3. XPath
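XPath, the XML Path Language, is a language for addressing parts of an XML document. If you use Chrome, installing the XPath Helper extension is recommended, as it makes writing XPath expressions much faster.
The earlier crawler articles were mostly based on XPath, so readers should be fairly familiar with it and the code is given directly: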
import requests
from lxml import html

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def xpath_for_parse(response):
    selector = html.fromstring(response)
    books = selector.xpath("//ul[@class='bang_list clearfix bang_list_mode']/li")
    for book in books:
        title = book.xpath('div[@class="name"]/a/@title')[0]
        print(title)

if __name__ == '__main__':
    xpath_for_parse(response)
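The pattern in the loop above, one path to find every li and then a second path relative to each li, can be tried offline. The sketch below swaps in the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath and requires well-formed markup, so it is not a replacement for lxml on real pages. `SAMPLE_XML` is a hand-written fragment with placeholder titles, and because ElementTree cannot select attributes with /@title, the attribute is read with .get() instead.

```python
import xml.etree.ElementTree as ET

# Hand-written, well-formed fragment; the titles are placeholders.
SAMPLE_XML = """
<html><body>
<ul class="bang_list clearfix bang_list_mode">
  <li><div class="name"><a href="http://product.dangdang.com/1.html" title="Placeholder Book A">A</a></div></li>
  <li><div class="name"><a href="http://product.dangdang.com/2.html" title="Placeholder Book B">B</a></div></li>
</ul>
</body></html>
"""


def xpath_titles(markup):
    """Collect title attributes with ElementTree's XPath subset."""
    root = ET.fromstring(markup.strip())
    titles = []
    # ".//" searches all descendants, playing the role of "//" in lxml.
    for li in root.findall(".//ul[@class='bang_list clearfix bang_list_mode']/li"):
        # No leading slash: the path is relative to the current li.
        link = li.find("div[@class='name']/a")
        if link is not None:
            titles.append(link.get("title"))
    return titles


print(xpath_titles(SAMPLE_XML))
```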
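4. Regular expressions
If you are not familiar with HTML, the previous parsing methods can all be hard going. Regular expressions offer a catch-all alternative: you only need to notice what distinctive patterns surround the text itself, and can then extract the content with a matching rule. The module required is re.
First, look again at the directly returned content to see what is distinctive about the text before and after the parts we need: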
import requests
import re

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text
print(response)
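Printing the whole page can make the relevant lines hard to spot. One option, sketched here with a hypothetical helper `show_context` and a hand-written placeholder string rather than the real response, is to print only a window of characters around each occurrence of a marker such as class="name".

```python
def show_context(text, marker, width=40, limit=3):
    """Return up to `limit` snippets of `text` centred on `marker`."""
    snippets = []
    start = text.find(marker)
    while start != -1 and len(snippets) < limit:
        lo = max(0, start - width)
        hi = min(len(text), start + len(marker) + width)
        snippets.append(text[lo:hi])
        # Continue searching after the current occurrence.
        start = text.find(marker, start + len(marker))
    return snippets


# Placeholder markup imitating one entry of the page.
sample = (
    '<li><div class="name"><a href="http://product.dangdang.com/1.html" '
    'target="_blank" title="Placeholder Book A">Placeholder Book A</a></div></li>'
)
for snippet in show_context(sample, 'class="name"'):
    print(snippet)
```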
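A look at a few entries should make the answer clear: <div class="name"><a href="http://product.dangdang.com/xxxxxxxx.html" target="_blank" title="xxxxxxx"> The book title is hidden in the string above, and the number at the end of the embedded link changes from book to book.
With that analysis done, the regular expression can be written: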
import requests
import re

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def re_for_parse(response):
    # Raw string so that \d reaches the regex engine intact; dots are escaped to match a literal "."
    reg = r'<div class="name"><a href="http://product\.dangdang\.com/\d+\.html" target="_blank" title="(.*?)">'
    for title in re.findall(reg, response):
        print(title)

if __name__ == '__main__':
    re_for_parse(response)
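As you can see, the regular expression version is the shortest to write, but it requires real fluency with regex rules. Regex for the win, as they say!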
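Two details of such a pattern are easy to get wrong: writing it as a raw string so that \d reaches the regex engine intact, and escaping the dots so they match only a literal dot. The sketch below checks both on a hand-written placeholder string (not real page content); `TITLE_RE` and `regex_titles` are names chosen here for illustration.

```python
import re

# Raw string, with literal dots escaped; the capture group is the title.
TITLE_RE = re.compile(
    r'<div class="name"><a href="http://product\.dangdang\.com/\d+\.html" '
    r'target="_blank" title="(.*?)">'
)


def regex_titles(page_text):
    """Return every captured title attribute in page_text."""
    return TITLE_RE.findall(page_text)


# The second entry has "102xhtml" instead of "102.html"; an unescaped
# dot would accept it, the escaped one does not.
sample = (
    '<div class="name"><a href="http://product.dangdang.com/101.html" '
    'target="_blank" title="Placeholder Book A">Placeholder Book A</a></div>'
    '<div class="name"><a href="http://product.dangdang.com/102xhtml" '
    'target="_blank" title="Not A Product Link">x</a></div>'
)
print(regex_titles(sample))
```

The pattern is tied to the exact attribute order and spacing of the markup, so a change in the page layout means rewriting it.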
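Of course, every method has scenarios it suits. In practice we still need to analyse the page structure to decide how to locate elements efficiently. Finally, here is the complete code for the four methods covered in this article, so you can run it yourself and get a better feel for them: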
import requests
from bs4 import BeautifulSoup
from lxml import html
import re

url = 'http://bang.dangdang.com/books/bestsellers/01.00.00.00.00.00-24hours-0-0-1-1'
response = requests.get(url).text

def bs_for_parse(response):
    # Method 1: BeautifulSoup find / find_all
    soup = BeautifulSoup(response, "lxml")
    li_list = soup.find('ul', class_='bang_list clearfix bang_list_mode').find_all('li')
    for li in li_list:
        title = li.find('div', class_='name').find('a')['title']
        print(title)

def css_for_parse(response):
    # Method 2: BeautifulSoup CSS selectors
    soup = BeautifulSoup(response, "lxml")
    li_list = soup.select('ul.bang_list.clearfix.bang_list_mode > li')
    for li in li_list:
        title = li.select('div.name > a')[0]['title']
        print(title)

def xpath_for_parse(response):
    # Method 3: XPath via lxml
    selector = html.fromstring(response)
    books = selector.xpath("//ul[@class='bang_list clearfix bang_list_mode']/li")
    for book in books:
        title = book.xpath('div[@class="name"]/a/@title')[0]
        print(title)

def re_for_parse(response):
    # Method 4: regular expression (raw string, literal dots escaped)
    reg = r'<div class="name"><a href="http://product\.dangdang\.com/\d+\.html" target="_blank" title="(.*?)">'
    for title in re.findall(reg, response):
        print(title)

if __name__ == '__main__':
    # Uncomment the method you want to try
    # bs_for_parse(response)
    # css_for_parse(response)
    # xpath_for_parse(response)
    re_for_parse(response)