Determine the target: decide which data, on which pages, of which website to crawl. In this example we capture the title and introduction of the Baidu Encyclopedia Python entry page and of the related entry pages it links to.
Analyze the target: analyze the format of the URLs to be crawled so that the crawling scope can be limited; analyze the format of the data to be captured, which here means the tags that contain the title and the introduction; and analyze the page encoding, which must be specified in the web page parser before parsing can be done correctly.
Write the code: the results of the target analysis are used in the web page parser.
Run the crawler: perform the data capture.
1. URL format
Open the Baidu Encyclopedia Python entry page. The links to related entries on the page are fairly uniform, mostly of the form /view/xxx.htm.
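Such links can be matched with a regular expression. A minimal sketch, using the same pattern as the parser further below:

import re

# matches relative entry links such as /view/21087.htm
link_pattern = re.compile(r'/view/\d+\.htm')

print(bool(link_pattern.search('/view/21087.htm')))  # True
print(bool(link_pattern.search('/item/Python')))     # False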
2. Data format
The title sits in the h1 tag inside the node with class lemmaWgt-lemmaTitle-title, and the introduction sits in the node with class lemma-summary.
3. Encoding format
Check the page encoding format, which is utf-8.
The above analysis gives three results: entry URLs match /view/\d+\.htm, the title and introduction live in the lemmaWgt-lemmaTitle-title and lemma-summary nodes, and the page encoding is utf-8.
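These results can be checked quickly in an interactive session before writing the full project. This is only a sketch that assumes the page structure described above; from_encoding is where the utf-8 result is applied:

# coding:utf-8
import urllib.request
from bs4 import BeautifulSoup

html_cont = urllib.request.urlopen('http://baike.baidu.com/view/21087.htm').read()
soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')

# title: the h1 child of the node with class lemmaWgt-lemmaTitle-title
print(soup.find('dd', class_='lemmaWgt-lemmaTitle-title').find('h1').get_text())
# introduction: the node with class lemma-summary
print(soup.find('p', class_='lemma-summary').get_text())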
In Sublime Text, create a new folder baike-spider as the project root directory.
Create a new spider_main.py as the crawler scheduler.
Create a new url_manager.py as the url manager.
Create a new html_downloader.py as the html downloader.
Create a new html_parser.py as the html parser.
Create a new html_outputer.py as the tool that writes out the collected data.
The final project structure is as follows:
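baike-spider/
├── spider_main.py      # crawler scheduler
├── url_manager.py      # url manager
├── html_downloader.py  # html downloader
├── html_parser.py      # html parser
├── html_outputer.py    # writes the collected data to output.html
└── output.html         # generated when the crawler runs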
spider_main.py:

# coding:utf-8
import url_manager, html_downloader, html_parser, html_outputer


class SpiderMain(object):
    def __init__(self):
        self.urls = url_manager.UrlManager()
        self.downloader = html_downloader.HtmlDownloader()
        self.parser = html_parser.HtmlParser()
        self.outputer = html_outputer.HtmlOutputer()

    def craw(self, root_url):
        count = 1
        self.urls.add_new_url(root_url)
        while self.urls.has_new_url():
            try:
                new_url = self.urls.get_new_url()
                print('craw %d : %s' % (count, new_url))
                html_cont = self.downloader.download(new_url)
                new_urls, new_data = self.parser.parse(new_url, html_cont)
                self.urls.add_new_urls(new_urls)
                self.outputer.collect_data(new_data)
                if count == 10:
                    break
                count = count + 1
            except:
                print('craw failed')
        self.outputer.output_html()


if __name__ == '__main__':
    root_url = 'http://baike.baidu.com/view/21087.htm'
    obj_spider = SpiderMain()
    obj_spider.craw(root_url)
url_manager.py:

# coding:utf-8
class UrlManager(object):
    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add_new_url(self, url):
        if url is None:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        if urls is None or len(urls) == 0:
            return
        for url in urls:
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        new_url = self.new_urls.pop()
        self.old_urls.add(new_url)
        return new_url
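A small usage sketch (not part of the project files) that shows the de-duplication behaviour:

from url_manager import UrlManager

manager = UrlManager()
manager.add_new_url('http://baike.baidu.com/view/21087.htm')
manager.add_new_url('http://baike.baidu.com/view/21087.htm')  # duplicate, ignored
print(manager.has_new_url())  # True
url = manager.get_new_url()   # moves the url from new_urls to old_urls
manager.add_new_url(url)      # ignored: the url is already in old_urls
print(manager.has_new_url())  # False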
html_downloader.py:

# coding:utf-8
import urllib.request


class HtmlDownloader(object):
    def download(self, url):
        if url is None:
            return None
        response = urllib.request.urlopen(url)
        if response.getcode() != 200:
            return None
        return response.read()
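The downloader can also be tried on its own; note that it returns the raw bytes of the page and leaves decoding to the parser (a sketch, assuming the entry page is reachable):

from html_downloader import HtmlDownloader

downloader = HtmlDownloader()
html_cont = downloader.download('http://baike.baidu.com/view/21087.htm')
print(type(html_cont))  # <class 'bytes'>
print(len(html_cont))   # size of the downloaded page in bytes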
html_parser.py:

# coding:utf-8
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup


class HtmlParser(object):
    def _get_new_urls(self, page_url, soup):
        new_urls = set()
        # relative entry links such as /view/123.htm
        links = soup.find_all('a', href=re.compile(r'/view/\d+\.htm'))
        for link in links:
            new_url = link['href']
            new_full_url = urljoin(page_url, new_url)
            new_urls.add(new_full_url)
        return new_urls

    def _get_new_data(self, page_url, soup):
        res_data = {}
        # url
        res_data['url'] = page_url
        # <dd class="lemmaWgt-lemmaTitle-title"> <h1>Python</h1>
        title_node = soup.find('dd', class_='lemmaWgt-lemmaTitle-title').find('h1')
        res_data['title'] = title_node.get_text()
        # <p class="lemma-summary" label-module="lemmaSummary">
        summary_node = soup.find('p', class_='lemma-summary')
        res_data['summary'] = summary_node.get_text()
        return res_data

    def parse(self, page_url, html_cont):
        if page_url is None or html_cont is None:
            return
        soup = BeautifulSoup(html_cont, 'html.parser')
        new_urls = self._get_new_urls(page_url, soup)
        new_data = self._get_new_data(page_url, soup)
        return new_urls, new_data
html_outputer.py:

# coding:utf-8
class HtmlOutputer(object):
    def __init__(self):
        self.datas = []

    def collect_data(self, data):
        if data is None:
            return
        self.datas.append(data)

    def output_html(self):
        fout = open('output.html', 'w', encoding='utf-8')
        fout.write('<html>')
        fout.write('<body>')
        fout.write('<table>')
        for data in self.datas:
            fout.write('<tr>')
            fout.write('<td>%s</td>' % data['url'])
            fout.write('<td>%s</td>' % data['title'])
            fout.write('<td>%s</td>' % data['summary'])
            fout.write('</tr>')
        fout.write('</table>')
        fout.write('</body>')
        fout.write('</html>')
        fout.close()
At the command line, execute python spider_main.py.
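If everything is wired up correctly, the scheduler prints one line per page and stops after 10 pages. The output looks roughly like this (every URL after the first depends on which links the parser finds):

craw 1 : http://baike.baidu.com/view/21087.htm
craw 2 : http://baike.baidu.com/view/xxx.htm
...
craw 10 : http://baike.baidu.com/view/xxx.htm

The collected titles and summaries are written to output.html in the project directory.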
Problem description: UnicodeEncodeError: 'gbk' codec can't encode character '\xa0' in position ...
This problem comes up very often when writing files in Python, especially when writing a network data stream to a local file. There are many articles on the Internet about how to solve it, and they all boil down to encoding and decoding, but is that really getting at the cause? No: many of us call decode and encode and try every encoding we can think of, utf8, utf-8, gbk, gb2312 and so on, yet the error still occurs, which is frustrating.
When writing Python scripts under Windows, encoding problems are particularly troublesome. Writing a network data stream to a file involves several encodings:
1. The encoding declaration #encoding='XXX'
The encoding here (the declaration on the first line of the Python file) refers to the encoding of the Python script file itself, and it does not matter much: as long as XXX matches the encoding the file is actually saved in, everything is fine.
For example, various encodings can be set in the "Format" menu of Notepad++; whatever is set there must be the same as XXX, otherwise an error will be reported.
2. Encoding of network data stream
For example, if you fetch a web page, the encoding of the network data stream is the encoding of that web page, and it needs to be decoded into a unicode string with decode.
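A minimal sketch using the page from this article (the argument to decode is whatever encoding the page declares, utf-8 here):

import urllib.request

response = urllib.request.urlopen('http://baike.baidu.com/view/21087.htm')
html_bytes = response.read()             # bytes, in the page's own encoding
html_text = html_bytes.decode('utf-8')   # now a unicode str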
3. Encoding of the target file
Finally, the network data stream is written to a new file. The file writing code is as follows:
fout = open('output.html', 'w')
fout.write(str)
Under Windows, a newly opened file defaults to gbk encoding, so the Python interpreter tries to encode our network data stream str with gbk when writing it. But str is already a decoded unicode string, and characters that gbk cannot represent (such as \xa0) make the write fail with the error above. The solution is to specify the encoding of the target file:
fout = open('output.html','w', encoding='utf-8')