Preserving Line Breaks in HTML-to-Text Conversion Using Jsoup
When converting HTML to plain text with jsoup, preserving line breaks is often crucial for the readability and structure of the output. By default, jsoup's text() method normalizes whitespace, so the breaks implied by br and p tags are collapsed into single spaces.
Solution:
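To preserve line breaks, define a small helper method, br2nl() (a custom method, not part of the jsoup API), that marks each break position before the markup is stripped:
br tags: \n is appended to each br element, so an explicit line break becomes a newline.
p tags: \n\n is prepended to the contents of each p element to signify a new paragraph.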
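One way to carry break positions through the tag-stripping step is a two-character text marker (a backslash followed by the letter n) that is swapped for a real newline at the end. The escaping is easy to get wrong in Java, so here is a minimal standalone sketch of just that swap. It does not use jsoup, and the names MarkerDemo, MARKER and markersToNewlines are illustrative only.

```java
final class MarkerDemo {
    // The Java literal "\\n" is two characters: a backslash and the letter n.
    static final String MARKER = "\\n";

    // Swap every literal backslash-n marker for a real newline.
    // The backslash is escaped twice in the pattern: once for the Java
    // string literal and once for the regex engine.
    static String markersToNewlines(String text) {
        return text.replaceAll("\\\\n", "\n");
    }

    public static void main(String[] args) {
        // Stand-in for text that has already been marked at break positions.
        String marked = "hello world" + MARKER + MARKER + "yo googlez";
        System.out.println(markersToNewlines(marked));
    }
}
```

Note that a pattern written as "\\n" would match a real newline rather than the marker, which turns the replacement into a no-op.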
Usage:
<code class="java">import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist; // renamed to Safelist in newer jsoup releases

public class LineBreakPreserver {

    public static String br2nl(String html) {
        if (html == null) {
            return html;
        }
        Document document = Jsoup.parse(html);
        // Disable pretty-printing so html() keeps whitespace exactly as it is in the tree.
        document.outputSettings(new Document.OutputSettings().prettyPrint(false));
        // Insert a literal backslash-n marker (two characters) at each break position.
        document.select("br").append("\\n");
        document.select("p").prepend("\\n\\n");
        // Turn every marker into a real newline.
        String s = document.html().replaceAll("\\\\n", "\n");
        // Strip all tags, again without pretty-printing, so the newlines survive.
        return Jsoup.clean(s, "", Whitelist.none(), new Document.OutputSettings().prettyPrint(false));
    }

    public static void main(String[] args) {
        String html = "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN \">"
                + "<HTML> <HEAD> <TITLE></TITLE> <style>body{ font-size: 12px;font-family: verdana, arial, helvetica, sans-serif;}</style> </HEAD> <BODY><p><b>hello world</b></p><p><br><b>yo</b> <a href=\"http://google.com\">googlez</a></p></BODY> </HTML> ";
        String result = br2nl(html);
        System.out.println(result);
    }
}</code>
Output (blank lines omitted):
hello world
yo googlez