How to convert HTML to PDF? Method introduction

In this article, I will show you how to generate a PDF document from a complexly styled React page using Node.js, Puppeteer, headless Chrome, and Docker.

Background: A few months ago, a client asked us to develop a feature that lets users get the content of a React page in PDF format. The page is basically a report with data visualizations for patient cases, and it contains a lot of SVGs. There were also special requests to manipulate the layout and rearrange some HTML elements. So the PDF should have different styling and extra content compared to the original React page.

Since the task was more complex than anything simple CSS rules could solve, we first explored the possible ways to do it. We found three main solutions. This blog post walks you through each of them and the final implementation.

Contents:

Client side or server side?
Option 1: Make a screenshot from the DOM
Option 2: Just use the PDF library
Final option 3: Node.js, Puppeteer and headless Chrome
- Style control
- Send the file to the client and save it
Using Puppeteer in Docker
Option 3 + 1: CSS print rules
Summary

Client side or server side?

PDF files can be generated on both the client and server sides. But it probably makes more sense to let the backend handle it, since you don't want to use up all the resources the user's browser can provide.

Even so, I will still show the solution for both methods.

Option 1: Make a screenshot from the DOM

At first glance, this solution seems to be the simplest, and it turns out that it is, but it has its limitations. It is an easy-to-use method if you don't have special needs such as selectable or searchable text in the PDF.

The method is straightforward: create a screenshot of the page and put it into a PDF file. We can use two packages to achieve this:

html2canvas, which generates a screenshot from the DOM
jsPDF, a library that generates PDFs

Start coding:

npm install html2canvas jspdf

import html2canvas from 'html2canvas'
import jsPdf from 'jspdf'

function printPDF () {
    const domElement = document.getElementById('your-id')
    // onclone receives the cloned document, so the live page is left untouched
    return html2canvas(domElement, { onclone: (clonedDocument) => {
        clonedDocument.getElementById('print-button').style.visibility = 'hidden'
    }})
    .then((canvas) => {
        const imgData = canvas.toDataURL('image/png')
        const pdf = new jsPdf()
        // Scale the image to the page width and keep its aspect ratio
        const width = pdf.internal.pageSize.getWidth()
        const height = (canvas.height * width) / canvas.width
        pdf.addImage(imgData, 'PNG', 0, 0, width, height)
        pdf.save('your-filename.pdf')
    })
}

That’s it!

Please note the onclone option of html2canvas. It's very handy when you need to manipulate the DOM before taking the screenshot (e.g. hiding the print button). I've seen many projects using this package. Unfortunately, this is not what we want, because we need to create the PDF on the backend.
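The width and height passed to addImage decide how the screenshot sits on the page. Here is a minimal sketch of one way to compute them, assuming you want the image scaled to fit inside the page with its aspect ratio preserved. The helper name fitToPage is hypothetical, and the default page size of 210 x 297 assumes jsPDF's default A4 portrait page measured in millimetres.

```javascript
// Hypothetical helper: scale an image to fit inside a page, keeping its aspect ratio.
// The defaults assume an A4 portrait page measured in millimetres.
function fitToPage(imageWidth, imageHeight, pageWidth = 210, pageHeight = 297) {
  // Use the smaller of the two ratios so neither dimension overflows the page.
  const scale = Math.min(pageWidth / imageWidth, pageHeight / imageHeight)
  return { width: imageWidth * scale, height: imageHeight * scale }
}

// Example: a 1000 x 500 pixel canvas is limited by the page width.
const { width, height } = fitToPage(1000, 500)
console.log(width, height)
```

In the snippet above you would call it with canvas.width and canvas.height. Note that a very tall page still ends up as a single, heavily shrunk image; splitting it across several PDF pages is a separate problem.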

Option 2: Just use the PDF library

There are several libraries on NPM, such as jsPDF (as mentioned above) or PDFKit. The problem with them is that if I want to use these libraries, I will have to restructure the page. This definitely hurts maintainability as I would need to apply all subsequent changes to both the PDF template and the React page.

Please see the code below. You have to build the PDF document by hand: walk through the DOM, find each element, and convert it to PDF format, which is tedious work. There must be an easier way.

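const fs = require('fs')
const PDFDocument = require('pdfkit')

// Create a document and stream it to a file
const doc = new PDFDocument()
doc.pipe(fs.createWriteStream('output.pdf'))

// Embed a font, set the font size, and render some text
doc.font('fonts/PalatinoBold.ttf')
   .fontSize(25)
   .text('Some text with an embedded font!', 100, 100)

// Add an image, constrained to a given size and centered
doc.image('path/to/image.png', {
   fit: [250, 300],
   align: 'center',
   valign: 'center'
})

// Add another page with more text
doc.addPage()
   .fontSize(25)
   .text('Here is some vector graphics...', 100, 100)

// Finalize the PDF file
doc.end()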

This code snippet comes from the PDFKit documentation. The approach can still be useful if your goal is to generate a PDF file directly rather than convert an existing (and ever-changing) HTML page.

Final option 3: Node.js, Puppeteer and headless Chrome

What is Puppeteer? Its documentation reads:

Puppeteer is a Node library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs Chrome or Chromium in headless mode by default, but it can also be configured to run in full (non-headless) mode.

It is essentially a browser that you can run from Node.js. If you read its documentation, the first thing it mentions is that you can use Puppeteer to generate screenshots and PDFs of pages. Excellent! This is exactly what we want.

First, install Puppeteer with npm i puppeteer, then implement our function.

const puppeteer = require('puppeteer')

async function printPDF() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // Wait until the network has gone idle before rendering the page
  await page.goto('https://blog.risingstack.com', {waitUntil: 'networkidle0'});
  const pdf = await page.pdf({ format: 'A4' });

  await browser.close();
  return pdf
}

This is a simple function that navigates to a URL and generates a PDF file of the site.

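First, we launch the browser (PDF generation is only supported in headless mode), then open a new page and navigate to the provided URL.

Setting the waitUntil: 'networkidle0' option means that Puppeteer considers navigation to be finished when there have been no network connections for at least 500 ms. (See the API docs for more information.)

After that, we save the PDF to a variable, close the browser, and return the PDF.

Note: The page.pdf method receives an options object, and you can use its 'path' option to save the file to disk. If no path is provided, the PDF is not saved to disk and you get a buffer instead. (I will discuss how to handle it later.)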
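To make the two modes concrete, here is a small sketch, assuming page is a Puppeteer page that has already navigated to the target URL. The wrapper name renderPdf and its outputPath parameter are hypothetical; format, printBackground and path are existing page.pdf options.

```javascript
// Hypothetical wrapper around page.pdf; `page` is assumed to be a Puppeteer Page.
async function renderPdf(page, outputPath) {
  // printBackground keeps CSS backgrounds, which page.pdf omits by default.
  const options = { format: 'A4', printBackground: true }
  // Only set `path` when the caller also wants the file written to disk.
  if (outputPath) {
    options.path = outputPath
  }
  // page.pdf resolves with the PDF data whether or not `path` is set.
  return page.pdf(options)
}
```

Calling renderPdf(page) only returns the data, while renderPdf(page, 'report.pdf') also writes the file.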

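If you need to log in first to generate a PDF from a protected page, navigate to the login page, inspect the form elements for their ID or name, fill them in, and then submit the form: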

await page.type('#email', process.env.PDF_USER)
await page.type('#password', process.env.PDF_PASSWORD)
await page.click('#submit')

Always store login credentials in environment variables. Do not hardcode them!
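A missing environment variable would otherwise only surface as a confusing failure in the middle of the login step. As a sketch, assuming the same PDF_USER and PDF_PASSWORD variable names as above, a small helper (getCredentials is a hypothetical name) can fail fast instead:

```javascript
// Hypothetical helper: read the login credentials from the environment and fail fast.
function getCredentials(env = process.env) {
  const { PDF_USER: user, PDF_PASSWORD: password } = env
  if (!user || !password) {
    throw new Error('PDF_USER and PDF_PASSWORD must be set')
  }
  return { user, password }
}
```

The login step then reads the values from the returned object, for example await page.type('#email', user).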

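Style control

Puppeteer has a solution for this kind of style manipulation too. You can insert style tags before generating the PDF, and Puppeteer will generate a file with the modified styles.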

await page.addStyleTag({ content: '.nav { display: none} .navbar { border: 0px} #print-button {display: none}' })
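When the list of elements to hide grows, a hand-written CSS string gets hard to read. One possible sketch is to build the stylesheet from a list of selectors; the helper name buildHideCss is hypothetical, and the selectors are the ones used above.

```javascript
// Hypothetical helper: build a stylesheet that hides every given selector.
function buildHideCss(selectors) {
  return selectors
    .map((selector) => `${selector} { display: none !important; }`)
    .join('\n')
}

const css = buildHideCss(['.nav', '#print-button'])
console.log(css)
```

The result can be passed straight to Puppeteer with await page.addStyleTag({ content: css }).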

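Send the file to the client and save it

Okay, now you have generated a PDF file on the backend. What comes next?

As mentioned above, if you don't save the file to disk, you get a buffer. You just need to send that buffer to the front end with the proper content type.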

printPDF().then(pdf => {
    res.set({ 'Content-Type': 'application/pdf', 'Content-Length': pdf.length })
    res.send(pdf)
})
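The snippet above has no error path: if the browser fails to launch, the request never gets a response. Below is a sketch of an Express-style handler that covers this case. The factory name createPdfHandler is hypothetical, and generatePdf stands for any function that resolves with the PDF data, such as printPDF.

```javascript
// Hypothetical factory: wraps a PDF generator in an Express-style request handler.
function createPdfHandler(generatePdf) {
  return async function pdfHandler(req, res) {
    try {
      const pdf = await generatePdf()
      // Some Puppeteer versions resolve with a Uint8Array instead of a Buffer;
      // normalizing makes sure the body is sent as binary data either way.
      const body = Buffer.isBuffer(pdf) ? pdf : Buffer.from(pdf)
      res.set({ 'Content-Type': 'application/pdf', 'Content-Length': body.length })
      res.send(body)
    } catch (err) {
      // Log the details on the server and send the client a generic message.
      console.error('PDF generation failed:', err)
      res.status(500).send('Could not generate the PDF')
    }
  }
}
```

With Express it could be registered as app.get('/your-pdf-endpoint', createPdfHandler(printPDF)).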

Now you can simply send a request to the server from the browser to get the generated PDF.

function getPDF() {
  return axios.get(`${API_URL}/your-pdf-endpoint`, {
    // Ask axios for the raw bytes instead of a parsed string
    responseType: 'arraybuffer',
    headers: {
      'Accept': 'application/pdf'
    }
  })
}

Once you have sent the request, the content of the buffer should start downloading. The last step is to convert the buffer data into a PDF file.

savePDF = () => {
    this.openModal('Loading…') // open modal
    return getPDF() // API call
      .then((response) => {
        const blob = new Blob([response.data], {type: 'application/pdf'})
        const link = document.createElement('a')
        link.href = window.URL.createObjectURL(blob)
        link.download = 'your-file-name.pdf'
        link.click()
        this.closeModal() // close modal
      })
      .catch((err) => {
        /* error handling */
      })
}

<button onClick={this.savePDF}>Save as PDF</button>
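One detail the snippet above leaves out is that an object URL stays in memory until it is revoked. Here is a sketch of a download helper that cleans up after itself. The name downloadPdf and the one-second delay are assumptions; doc and urlApi default to the browser globals and are parameters only so the function can be exercised outside a browser.

```javascript
// Hypothetical helper: trigger a download for binary data, then release the object URL.
function downloadPdf(data, filename, doc = document, urlApi = URL) {
  const blob = new Blob([data], { type: 'application/pdf' })
  const href = urlApi.createObjectURL(blob)
  const link = doc.createElement('a')
  link.href = href
  link.download = filename
  link.click()
  // Revoke a little later so the browser has time to start the download.
  setTimeout(() => urlApi.revokeObjectURL(href), 1000)
  return blob
}
```

Inside savePDF, the body of the then callback would become downloadPdf(response.data, 'your-file-name.pdf') followed by this.closeModal().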

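That's it! If you click the save button, the browser will save the PDF.

Using Puppeteer in Docker

I think this is the trickiest part of the implementation, so let me save you a few hours of searching.

The official documentation states that "getting headless Chrome up and running in Docker can be tricky". It has a troubleshooting section where you can find all the necessary information about installing Puppeteer with Docker.

If you install Puppeteer on an Alpine image, make sure you scroll down a bit further when you reach that part of the page. Otherwise you may miss the fact that you cannot run the latest Puppeteer version and that you also need to disable shm usage with a flag: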

const browser = await puppeteer.launch({
  headless: true,
  args: ['--disable-dev-shm-usage']
});

Otherwise, the Puppeteer subprocess may run out of memory before it even starts properly.
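If the same code runs both locally and in a container, it can help to assemble the launch options in one place. This is a sketch under assumptions: the helper name buildLaunchOptions is hypothetical, and the executablePath parameter is only needed when the image ships its own Chromium binary. headless, args and executablePath are existing puppeteer.launch options.

```javascript
// Hypothetical helper: assemble Puppeteer launch options for a containerized run.
function buildLaunchOptions({ executablePath } = {}) {
  const options = {
    headless: true,
    // Keep Chrome away from /dev/shm, which is small by default in Docker.
    args: ['--disable-dev-shm-usage']
  }
  // On images that ship their own Chromium, point Puppeteer at that binary.
  if (executablePath) {
    options.executablePath = executablePath
  }
  return options
}

console.log(buildLaunchOptions())
```

The result is passed to puppeteer.launch(buildLaunchOptions()). On Alpine, the path of the system Chromium depends on the image, so check it before hardcoding one.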

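Option 3 + 1: CSS print rules

One might think that simply using CSS print rules is easy from a developer's point of view. No NPM modules, just pure CSS. But how do they fare when it comes to cross-browser compatibility?

When you choose CSS print rules, you have to test the result in every browser to make sure it produces the same layout, and even then it is not 100% guaranteed.

For example, inserting a break after a given element with break-after is hardly an esoteric technique, but you may be surprised to find that you need workarounds to get it working in Firefox.

Unless you are a battle-hardened CSS expert with a lot of experience in creating printable pages, this can be time-consuming.

Print rules work well if you can keep the print stylesheet simple.

Let's look at an example.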

@media print {
    /* Hide the print button in the printed output */
    .print-button {
        display: none;
    }

    /* Force a page break after every div inside .content */
    .content div {
        break-after: page;
    }
}

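The CSS above hides the print button and inserts a page break after every div inside an element with the content class. There is a great article that summarizes what you can do with print rules and what problems they have, including browser compatibility.

All things considered, CSS print rules work very well if you want to generate a PDF from a page that is not too complex.

Summary

Let's quickly go through the options we covered for generating PDF files from HTML pages: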

Screenshot from the DOM: This can be useful when you need to create a snapshot of a page (for example, a thumbnail), but it falls short when you have a lot of data to handle.
PDF library only: If you need to create PDF files programmatically from scratch, this is a perfect solution. Otherwise, you have to maintain both the HTML and the PDF templates, which is a definite no-go.
Puppeteer: Although it was relatively difficult to get working in Docker, it provided the best result for our use case and was also the easiest to code.
CSS print rules: If your users know how to print a page to a file and your pages are relatively simple, this can be the most painless solution. As you saw, it was not an option in our case.

Happy printing!

Original English article: https://blog.risingstack.com/pdf-from-html-node-js-puppeteer/
