I receive an HTML file and would like to convert it to a PDF that lives entirely in memory. I don't want the conversion to use any external location; everything should stay in memory.
So far I have tried a few Java libraries, but they always create a temporary file somewhere and then read from and write to it. I don't want any file I/O during the conversion.
The HTMLWorker class was deprecated many years ago. Its goal was to convert small, simple HTML fragments into iText objects. It was never intended to convert complete HTML pages to PDF, but that's how many developers tried to use it. This caused a lot of frustration because HTMLWorker didn't support all HTML tags, didn't parse CSS files, and so on. To avoid this frustration, HTMLWorker has been removed from the latest version of iText.
In 2011, iText Group released XML Worker as a universal XML-to-PDF tool, built on iText 5. The default implementation converted XHTML (data) and CSS (styles) to PDF, mapping HTML tags such as <p>, <img>, and <li> to iText 5 objects such as Paragraph, Image, and ListItem. We don't know of any implementations that used XML Worker for other XML formats, but many developers used XML Worker in combination with jsoup as an HTML2PDF converter.
XML Worker wasn't a URL2PDF tool, though. It expected predictable HTML created for the sole purpose of being converted to PDF. A common use case was the creation of invoices. Rather than programming the design of an invoice in Java or C#, developers chose to create a simple HTML template defining the structure of the document, plus some CSS defining the styles. They then populated the HTML with data and used XML Worker to create the invoices as PDF documents, throwing away the original HTML. We'll take a closer look at this use case in chapter 4, converting XML to HTML in memory using XSLT, then converting that HTML to PDF using the pdfHTML add-on.
iText 5 was originally designed as a tool to generate PDF as quickly as possible, flushing each page to the OutputStream as soon as it was complete. Several design choices that made perfect sense when iText was first released in 2000 were still present in iText 5 sixteen years later. Unfortunately, some of these choices made it very difficult, if not impossible, to extend the functionality of XML Worker to the level of quality that many developers expected. If we really wanted to create a great HTML to PDF converter, we would have to rewrite iText from scratch. So we did.
In 2016 we released iText 7, a completely new version of iText that is not compatible with previous versions but was created with pdfHTML in mind. A lot of work went into the new renderer framework. When you create a document using iText 7, a tree of renderers and their child renderers is built. The layout is created by traversing this tree, an approach that is better suited to HTML to PDF conversion. The iText objects were completely redesigned to better match HTML tags and to allow styles to be set "the CSS way".
For instance: in iText 5, you had a PdfPTable and a PdfPCell object to create a table and its cells. If you wanted every cell to contain text in a font different from the default font, you needed to set that font for the content of every separate cell. In iText 7, you have a Table and Cell object, and when you set a different font for the complete table, this font is inherited as the default font for every cell. That was a major step forward in terms of architectural design, especially if the goal is to convert HTML to PDF.
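The inheritance idea described above can be illustrated with a minimal, self-contained sketch. Note that StyledNode and its methods are hypothetical names invented for this illustration; they are not iText classes and say nothing about how iText 7 implements its renderers. The sketch only shows the general principle: a node without its own font looks up the tree until it finds one.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tree node for illustration only; NOT part of the iText API. */
class StyledNode {
    private StyledNode parent;
    private String font; // null means "not set here, inherit from an ancestor"
    private final List<StyledNode> children = new ArrayList<>();

    /** Attaches a child and returns this node so calls can be chained. */
    StyledNode add(StyledNode child) {
        child.parent = this;
        children.add(child);
        return this;
    }

    StyledNode setFont(String font) {
        this.font = font;
        return this;
    }

    /** Resolves the font "the CSS way": own value, else nearest ancestor's, else a default. */
    String resolveFont() {
        for (StyledNode n = this; n != null; n = n.parent) {
            if (n.font != null) {
                return n.font;
            }
        }
        return "Helvetica"; // assumed default, chosen arbitrarily for this sketch
    }
}

class InheritanceSketch {
    public static void main(String[] args) {
        StyledNode table = new StyledNode().setFont("Times-Roman");
        StyledNode plainCell = new StyledNode();                         // no font of its own
        StyledNode overriddenCell = new StyledNode().setFont("Courier"); // local override
        table.add(plainCell).add(overriddenCell);

        System.out.println(plainCell.resolveFont());      // prints "Times-Roman"
        System.out.println(overriddenCell.resolveFont()); // prints "Courier"
    }
}
```

Setting the font once on the table is enough for every cell that doesn't override it, which is the behaviour the paragraph above contrasts with the per-cell approach of iText 5.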
But let's not dwell on the past; let's see what pdfHTML can do for us. In the first chapter, we'll take a look at different variations of the convertToPdf()/ConvertToPdf() method, and we'll discover how the converter is configured.
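For the original question (HTML in, PDF bytes out, no files written by your own code), a minimal sketch could look like the following. It assumes the pdfHTML add-on (html2pdf) is on the classpath and uses the HtmlConverter.convertToPdf(String, OutputStream) variant with a ByteArrayOutputStream as the target. The class name InMemoryPdf and the method htmlToPdfBytes are made up for this example. The sketch itself opens no files; whether the library does any internal I/O (for example to load fonts or resources referenced by the HTML) is not something this snippet demonstrates.

```java
import com.itextpdf.html2pdf.HtmlConverter;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

/** Hypothetical helper class for illustration; requires the pdfHTML (html2pdf) dependency. */
class InMemoryPdf {

    /** Converts an HTML string to PDF bytes, writing only to an in-memory buffer. */
    static byte[] htmlToPdfBytes(String html) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // The converter writes the finished PDF into the stream it is given.
        HtmlConverter.convertToPdf(html, out);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] pdf = htmlToPdfBytes("<html><body><p>Hello, PDF</p></body></html>");
        System.out.println("PDF size in bytes: " + pdf.length);
    }
}
```

The resulting byte array can be handed straight to whatever needs it, such as an HTTP response body or a message queue, without a temporary file in between.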