This article mainly introduces the method of Python to capture HTML web pages and save them in the form of PDF files. It analyzes the installation of the PyPDF2 module and the related operating skills of Python to capture HTML pages and generate PDF files based on the PyPDF2 module in the form of examples. , Friends who need it can refer to
The example in this article describes how Python can capture HTML web pages and save them as PDF files. Share it with everyone for your reference, the details are as follows:
1. Preface
Today I will introduce how to capture HTML web pages and save them as PDF , without further ado, go directly to the tutorial.
2. Preparation
1. Installation and use of PyPDF2 (used to merge PDF):
PyPDF2 version: 1.25.1
Installation:
pip install PyPDF2
Usage example:
from PyPDF2 import PdfFileMerger merger = PdfFileMerger() input1 = open("hql_1_20.pdf", "rb") input2 = open("hql_21_40.pdf", "rb") merger.append(input1) merger.append(input2) # Write to an output PDF document output = open("hql_all.pdf", "wb") merger.write(output)
2. requests and beautifulsoup are two major artifacts of crawlers, reuqests is used for network requests, and beautifulsoup is used to operate html data. With these two shuttles, work can be done quickly. We don’t need crawler frameworks like scrapy. Using it on such a small program is a bit overkill. In addition, since you are converting html files to pdf, you must also have corresponding library support. wkhtmltopdf is a very useful tool that can convert html to pdf suitable for multiple platforms. pdfkit is the Python package of wkhtmltopdf. First install the following dependency packages
pip install requests pip install beautifulsoup4 pip install pdfkit
3. Install wkhtmltopdf
Windows platform directly at http:// wkhtmltopdf.org/downloads.html Download the stable version of wkhtmltopdf and install it. After the installation is completed, add the execution path of the program to the system environment $PATH variable. Otherwise, pdfkit cannot find wkhtmltopdf and the error "No wkhtmltopdf executable found" will appear. Ubuntu and CentOS can be installed directly using the command line
$ sudo apt-get install wkhtmltopdf # ubuntu $ sudo yum intsall wkhtmltopdf # centos
3. Data preparation
1. Get the url of each article
def get_url_list(): """ 获取所有URL目录列表 :return: """ response = requests.get("http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000") soup = BeautifulSoup(response.content, "html.parser") menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1] urls = [] for li in menu_tag.find_all("li"): url = "http://www.liaoxuefeng.com" + li.a.get('href') urls.append(url) return urls
2. Save the HTML of each article using a template through the article url File
html template:
html_template = """ <!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> </head> <body> {content} </body> </html> """
Save:
def parse_url_to_html(url, name): """ 解析URL,返回HTML内容 :param url:解析的url :param name: 保存的html文件名 :return: html """ try: response = requests.get(url) soup = BeautifulSoup(response.content, 'html.parser') # 正文 body = soup.find_all(class_="x-wiki-content")[0] # 标题 title = soup.find('h4').get_text() # 标题加入到正文的最前面,居中显示 center_tag = soup.new_tag("center") title_tag = soup.new_tag('h1') title_tag.string = title center_tag.insert(1, title_tag) body.insert(1, center_tag) html = str(body) # body中的img标签的src相对路径的改成绝对路径 pattern = "(<img .*?src=\")(.*?)(\")" def func(m): if not m.group(3).startswith("http"): rtn = m.group(1) + "http://www.liaoxuefeng.com" + m.group(2) + m.group(3) return rtn else: return m.group(1)+m.group(2)+m.group(3) html = re.compile(pattern).sub(func, html) html = html_template.format(content=html) html = html.encode("utf-8") with open(name, 'wb') as f: f.write(html) return name except Exception as e: logging.error("解析错误", exc_info=True)
3. Convert html to pdf
def save_pdf(htmls, file_name): """ 把所有html文件保存到pdf文件 :param htmls: html文件列表 :param file_name: pdf文件名 :return: """ options = { 'page-size': 'Letter', 'margin-top': '0.75in', 'margin-right': '0.75in', 'margin-bottom': '0.75in', 'margin-left': '0.75in', 'encoding': "UTF-8", 'custom-header': [ ('Accept-Encoding', 'gzip') ], 'cookie': [ ('cookie-name1', 'cookie-value1'), ('cookie-name2', 'cookie-value2'), ], 'outline-depth': 10, } pdfkit.from_file(htmls, file_name, options=options)
4. Merge the converted single PDFs into one PDF
merger = PdfFileMerger() for pdf in pdfs: merger.append(open(pdf,'rb')) print u"合并完成第"+str(i)+'个pdf'+pdf
Full source code:
# coding=utf-8 import os import re import time import logging import pdfkit import requests from bs4 import BeautifulSoup from PyPDF2 import PdfFileMerger html_template = """ <!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> </head> <body> {content} </body> </html> """ def parse_url_to_html(url, name): """ 解析URL,返回HTML内容 :param url:解析的url :param name: 保存的html文件名 :return: html """ try: response = requests.get(url) soup = BeautifulSoup(response.content, 'html.parser') # 正文 body = soup.find_all(class_="x-wiki-content")[0] # 标题 title = soup.find('h4').get_text() # 标题加入到正文的最前面,居中显示 center_tag = soup.new_tag("center") title_tag = soup.new_tag('h1') title_tag.string = title center_tag.insert(1, title_tag) body.insert(1, center_tag) html = str(body) # body中的img标签的src相对路径的改成绝对路径 pattern = "(<img .*?src=\")(.*?)(\")" def func(m): if not m.group(3).startswith("http"): rtn = m.group(1) + "http://www.liaoxuefeng.com" + m.group(2) + m.group(3) return rtn else: return m.group(1)+m.group(2)+m.group(3) html = re.compile(pattern).sub(func, html) html = html_template.format(content=html) html = html.encode("utf-8") with open(name, 'wb') as f: f.write(html) return name except Exception as e: logging.error("解析错误", exc_info=True) def get_url_list(): """ 获取所有URL目录列表 :return: """ response = requests.get("http://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000") soup = BeautifulSoup(response.content, "html.parser") menu_tag = soup.find_all(class_="uk-nav uk-nav-side")[1] urls = [] for li in menu_tag.find_all("li"): url = "http://www.liaoxuefeng.com" + li.a.get('href') urls.append(url) return urls def save_pdf(htmls, file_name): """ 把所有html文件保存到pdf文件 :param htmls: html文件列表 :param file_name: pdf文件名 :return: """ options = { 'page-size': 'Letter', 'margin-top': '0.75in', 'margin-right': '0.75in', 'margin-bottom': '0.75in', 'margin-left': '0.75in', 'encoding': "UTF-8", 'custom-header': [ ('Accept-Encoding', 'gzip') ], 'cookie': [ ('cookie-name1', 'cookie-value1'), ('cookie-name2', 'cookie-value2'), ], 'outline-depth': 10, } pdfkit.from_file(htmls, file_name, options=options) def main(): start = time.time() file_name = u"liaoxuefeng_Python3_tutorial" urls = get_url_list() for index, url in enumerate(urls): parse_url_to_html(url, str(index) + ".html") htmls =[] pdfs =[] for i in range(0,124): htmls.append(str(i)+'.html') pdfs.append(file_name+str(i)+'.pdf') save_pdf(str(i)+'.html', file_name+str(i)+'.pdf') print u"转换完成第"+str(i)+'个html' merger = PdfFileMerger() for pdf in pdfs: merger.append(open(pdf,'rb')) print u"合并完成第"+str(i)+'个pdf'+pdf output = open(u"廖雪峰Python_all.pdf", "wb") merger.write(output) print u"输出PDF成功!" for html in htmls: os.remove(html) print u"删除临时文件"+html for pdf in pdfs: os.remove(pdf) print u"删除临时文件"+pdf total_time = time.time() - start print(u"总共耗时:%f 秒" % total_time) if __name__ == '__main__': main()
Related recommendations:
Python implements simple crawler sharing to capture links on the page
Python implements crawling the website title information of Baidu search results page
The above is the detailed content of Python implements a method to capture HTML web pages and save them as PDF files. For more information, please follow other related articles on the PHP Chinese website!