Detailed explanation of using HTMLParser (2)
HTMLParser stores the parsed information as a tree structure. Node is the basic data type used to store that information.
Please look at the definition of Node:
public interface Node extends Cloneable;
The methods declared in Node fall into several groups.
Methods for traversing the tree structure; these are the easiest to understand:
Node getParent(): returns the parent node
NodeList getChildren(): returns the list of child nodes
Node getFirstChild(): returns the first child node
Node getLastChild(): returns the last child node
Node getPreviousSibling(): returns the previous sibling node
Node getNextSibling(): returns the next sibling node
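To make the navigation methods concrete, here is a minimal sketch in plain Java. SketchNode, sampleTree, and walk are hypothetical stand-ins invented for this illustration; they only mirror the method names listed above and are not HTMLParser's own classes, whose implementation may differ.

```java
import java.util.ArrayList;
import java.util.List;

class TreeWalkSketch {
    // Hypothetical stand-in for a parsed node. It only mirrors the navigation
    // method names listed above and is NOT HTMLParser's Node interface.
    static class SketchNode {
        private final String name;
        private SketchNode parent;
        private final List<SketchNode> children = new ArrayList<>();

        SketchNode(String name, SketchNode... kids) {
            this.name = name;
            for (SketchNode kid : kids) {
                kid.parent = this;
                children.add(kid);
            }
        }

        String getName() { return name; }
        SketchNode getParent() { return parent; }
        SketchNode getFirstChild() { return children.isEmpty() ? null : children.get(0); }
        SketchNode getLastChild() { return children.isEmpty() ? null : children.get(children.size() - 1); }

        SketchNode getNextSibling() {
            if (parent == null) return null;
            int i = parent.children.indexOf(this);
            return i + 1 < parent.children.size() ? parent.children.get(i + 1) : null;
        }

        SketchNode getPreviousSibling() {
            if (parent == null) return null;
            int i = parent.children.indexOf(this);
            return i > 0 ? parent.children.get(i - 1) : null;
        }
    }

    // Depth-first walk that relies only on getFirstChild() and getNextSibling().
    static void walk(SketchNode node, int depth, List<String> out) {
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < depth; i++) line.append("  ");
        out.add(line.append(node.getName()).toString());
        for (SketchNode c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, depth + 1, out);
        }
    }

    // A small hand-built tree: html -> (head -> title), (body -> div -> a).
    static SketchNode sampleTree() {
        return new SketchNode("html",
            new SketchNode("head", new SketchNode("title")),
            new SketchNode("body", new SketchNode("div", new SketchNode("a"))));
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        walk(sampleTree(), 0, lines);
        for (String line : lines) System.out.println(line);
    }
}
```

Running main prints one node name per line, indented by depth. The same first-child/next-sibling loop should carry over to real Node objects, with a null check at each step.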
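Methods for obtaining the content of a Node:
String getText(): returns the text
String toPlainTextString(): returns the plain-text content
String toHtml(): returns the HTML (the original HTML)
String toHtml(boolean verbatim): returns the HTML (the original HTML)
String toString(): returns a string description of the Node
Page getPage(): returns the Page object this Node belongs to
int getStartPosition(): returns the start position of this Node in the HTML page
int getEndPosition(): returns the end position of this Node in the HTML page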
Method used for Filter-based filtering:
void collectInto(NodeList list, NodeFilter filter): tests this node against the condition of filter and puts the nodes that match into list.
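The idea behind collectInto can be sketched without the library. SketchNode and SketchFilter below are hypothetical stand-ins, and the recursion into child nodes is an assumption about how a composite node would behave, not a copy of HTMLParser's code.

```java
import java.util.ArrayList;
import java.util.List;

class CollectIntoSketch {
    // Hypothetical stand-in for a filter: one yes/no decision per node.
    interface SketchFilter {
        boolean accept(SketchNode node);
    }

    // Hypothetical stand-in for a node with children; NOT HTMLParser's Node.
    static class SketchNode {
        final String name;
        final List<SketchNode> children = new ArrayList<>();

        SketchNode(String name, SketchNode... kids) {
            this.name = name;
            for (SketchNode kid : kids) children.add(kid);
        }

        // Adds this node if the filter accepts it, then offers every descendant
        // to the same filter (assumed behavior for a node that has children).
        void collectInto(List<SketchNode> list, SketchFilter filter) {
            if (filter.accept(this)) list.add(this);
            for (SketchNode child : children) child.collectInto(list, filter);
        }
    }

    static SketchNode sampleTree() {
        return new SketchNode("html",
            new SketchNode("head", new SketchNode("title")),
            new SketchNode("body",
                new SketchNode("div", new SketchNode("div", new SketchNode("a")))));
    }

    // Collects the names of all nodes whose name equals the given tag name.
    static List<String> collectNames(SketchNode root, String tagName) {
        List<SketchNode> hits = new ArrayList<>();
        root.collectInto(hits, node -> node.name.equals(tagName));
        List<String> names = new ArrayList<>();
        for (SketchNode hit : hits) names.add(hit.name);
        return names;
    }

    public static void main(String[] args) {
        System.out.println(collectNames(sampleTree(), "div")); // prints [div, div]
    }
}
```

The caller owns the result list, so several filters can append to the same list by calling collectInto repeatedly.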
Method used for Visitor traversal:
void accept(NodeVisitor visitor): applies visitor to this Node
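The accept method follows the visitor pattern: each node type calls the matching callback on the visitor. A minimal sketch, assuming a simplified visitor with only two callbacks; SketchVisitor, TagNode, TextNode, and TextCollector are hypothetical names for this illustration and do not reproduce HTMLParser's NodeVisitor.

```java
import java.util.Arrays;
import java.util.List;

class VisitorSketch {
    // Hypothetical visitor with empty default callbacks; subclasses override
    // only the ones they care about.
    static class SketchVisitor {
        void visitTag(String name) {}
        void visitText(String text) {}
    }

    interface SketchNode {
        void accept(SketchVisitor visitor);
    }

    static class TextNode implements SketchNode {
        final String text;
        TextNode(String text) { this.text = text; }
        // A text node reports itself through the text callback.
        public void accept(SketchVisitor visitor) { visitor.visitText(text); }
    }

    static class TagNode implements SketchNode {
        final String name;
        final List<SketchNode> children;
        TagNode(String name, SketchNode... kids) {
            this.name = name;
            this.children = Arrays.asList(kids);
        }
        // A tag reports itself, then passes the visitor on to its children.
        public void accept(SketchVisitor visitor) {
            visitor.visitTag(name);
            for (SketchNode child : children) child.accept(visitor);
        }
    }

    // Example visitor: concatenates all text it is shown and ignores tags.
    static class TextCollector extends SketchVisitor {
        final StringBuilder text = new StringBuilder();
        @Override void visitText(String t) { text.append(t); }
    }

    static String collectText(SketchNode root) {
        TextCollector collector = new TextCollector();
        root.accept(collector);
        return collector.text.toString();
    }

    public static void main(String[] args) {
        SketchNode root = new TagNode("div",
            new TextNode("Hello, "),
            new TagNode("a", new TextNode("world")));
        System.out.println(collectText(root)); // prints Hello, world
    }
}
```

Compared with a filter, a visitor keeps its own state while it walks the tree, which suits tasks such as accumulating text or counting tags.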
Methods for modifying content; these are rarely used:
void setPage(Page page): sets the Page object this Node belongs to
void setText(String text): sets the text
void setChildren(NodeList children): sets the list of child nodes
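Other methods:
void doSemanticAction(): performs the action associated with this Node (only a few Tags have one)
Object clone(): creates a copy of the Node; it is declared because Node extends Cloneable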
In practice, HTMLParser is mostly used to process HTML pages. The Filter and Visitor methods are indispensable for that, and the first two groups of methods are the ones used most often. The first group is easy to understand, so let's use an example to illustrate the second group.
The following is the HTML file used for testing:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head>
<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="top_main">
	<div id="logoindex">
		<!--这是注释-->
		白泽居-www.baizeju.com
<a href="http://www.baizeju.com">白泽居-www.baizeju.com</a>
	</div>
	白泽居-www.baizeju.com
</div>
</body>
</html>
Test code:
/**
 * @author www.baizeju.com
 */
package com.baizeju.htmlparsertester;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.FileInputStream;
import java.io.File;
import java.net.HttpURLConnection;
import java.net.URL;

import org.htmlparser.Node;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.Parser;

/**
 * @author www.baizeju.com
 */
public class Main {
    private static String ENCODE = "GBK";

    // Prints a message after re-encoding it from ENCODE to the platform's default encoding.
    private static void message( String szMsg ) {
        try {
            System.out.println(new String(szMsg.getBytes(ENCODE), System.getProperty("file.encoding")));
        } catch( Exception e ) {
            // Encoding problems are ignored; the message is simply not printed.
        }
    }

    // Reads a whole file as ENCODE text. Not called by main; kept as a helper for local files.
    public static String openFile( String szFileName ) {
        try {
            BufferedReader bis = new BufferedReader(new InputStreamReader(new FileInputStream( new File(szFileName)), ENCODE) );
            String szContent = "";
            String szTemp;
            while ( (szTemp = bis.readLine()) != null) {
                szContent += szTemp + "\n";
            }
            bis.close();
            return szContent;
        } catch( Exception e ) {
            return "";
        }
    }

    public static void main(String[] args) {
        try {
            // Parse the test page served by a local web server.
            Parser parser = new Parser( (HttpURLConnection) (new URL("http://127.0.0.1:8080/HTMLParserTester.html")).openConnection() );
            // Iterate over the top-level nodes and print each content method's result.
            for (NodeIterator i = parser.elements (); i.hasMoreNodes(); ) {
                Node node = i.nextNode();
                message("getText:" + node.getText());
                message("getPlainText:" + node.toPlainTextString());
                message("toHtml:" + node.toHtml());
                message("toHtml(true):" + node.toHtml(true));
                message("toHtml(false):" + node.toHtml(false));
                message("toString:" + node.toString());
                message("=================================================");
            }
        } catch( Exception e ) {
            System.out.println( "Exception:" + e );
        }
    }
}
Output result:
getText:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
getPlainText:
toHtml:<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toHtml(true):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toHtml(false):<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
toString:Doctype Tag : !DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd; begins at : 0; ends at : 121
=================================================
getText:
getPlainText:
toHtml:
toHtml(true):
toHtml(false):
toString:Txt (121[0,121],123[1,0]): \n
=================================================
getText:head
getPlainText:白泽居-www.baizeju.com
toHtml:<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head>
toHtml(true):<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head>
toHtml(false):<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head>
toString:HEAD: Tag (123[1,0],129[1,6]): head
Tag (129[1,6],197[1,74]): meta http-equiv="Content-Type" content="text/html; ...
Tag (197[1,74],204[1,81]): title
Txt (204[1,81],223[1,100]): 白泽居-www.baizeju.com
End (223[1,100],231[1,108]): /title
End (231[1,108],238[1,115]): /head
=================================================
getText:
getPlainText:
toHtml:
toHtml(true):
toHtml(false):
toString:Txt (238[1,115],240[2,0]): \n
=================================================
getText:html xmlns="http://www.w3.org/1999/xhtml"
getPlainText: 白泽居-www.baizeju.com 白泽居-www.baizeju.com 白泽居-www.baizeju.com
toHtml:<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="top_main">
	<div id="logoindex">
		<!--这是注释-->
		白泽居-www.baizeju.com
<a href="http://www.baizeju.com">白泽居-www.baizeju.com</a>
	</div>
	白泽居-www.baizeju.com
</div>
</body>
</html>
toHtml(true):<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="top_main">
	<div id="logoindex">
		<!--这是注释-->
		白泽居-www.baizeju.com
<a href="http://www.baizeju.com">白泽居-www.baizeju.com</a>
	</div>
	白泽居-www.baizeju.com
</div>
</body>
</html>
toHtml(false):<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="top_main">
	<div id="logoindex">
		<!--这是注释-->
		白泽居-www.baizeju.com
<a href="http://www.baizeju.com">白泽居-www.baizeju.com</a>
	</div>
	白泽居-www.baizeju.com
</div>
</body>
</html>
toString:Tag (240[2,0],283[2,43]): html xmlns="http://www.w3.org/1999/xhtml"
Txt (283[2,43],285[3,0]): \n
Tag (285[3,0],292[3,7]): body
Txt (292[3,7],294[4,0]): \n
Tag (294[4,0],313[4,19]): div id="top_main"
Txt (313[4,19],316[5,1]): \n\t
Tag (316[5,1],336[5,21]): div id="logoindex"
Txt (336[5,21],340[6,2]): \n\t\t
Rem (340[6,2],351[6,13]): 这是注释
Txt (351[6,13],376[8,0]): \n\t\t白泽居-www.baizeju.com\n
Tag (376[8,0],409[8,33]): a href="http://www.baizeju.com"
Txt (409[8,33],428[8,52]): 白泽居-www.baizeju.com
End (428[8,52],432[8,56]): /a
Txt (432[8,56],435[9,1]): \n\t
End (435[9,1],441[9,7]): /div
Txt (441[9,7],465[11,0]): \n\t白泽居-www.baizeju.com\n
End (465[11,0],471[11,6]): /div
Txt (471[11,6],473[12,0]): \n
End (473[12,0],480[12,7]): /body
Txt (480[12,7],482[13,0]): \n
End (482[13,0],489[13,7]): /html
=================================================
The first Node corresponds to the <!DOCTYPE ...> declaration line of the page, which is easy to understand.
The output also reveals the tree structure of the content, or more precisely a forest: each first-level element of the page (DOCTYPE, head, and html) forms its own top-level Node. The second and fourth Nodes may look strange at first. They are simply the two line breaks between those elements. HTMLParser turns every line break, space, tab, and similar whitespace in the page into a corresponding text node (shown as Txt in the output), which is why such Nodes exist. They hold little content, but they sit at the top level.
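Why whitespace shows up as Nodes of its own can be illustrated with a deliberately naive lexer. This is a sketch under simplifying assumptions: the lex method is hypothetical, it only splits markup into tag and text pieces, builds no tree, and does not reproduce HTMLParser's actual lexer.

```java
import java.util.ArrayList;
import java.util.List;

class WhitespaceNodeSketch {
    // Splits markup into "Tag: ..." and "Txt: ..." pieces. Everything between
    // two tags becomes a text piece, even if it is only a line break or a tab.
    static List<String> lex(String html) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < html.length()) {
            int end = html.charAt(pos) == '<' ? html.indexOf('>', pos) : -1;
            if (end >= 0) {
                // A complete <...> tag: keep what is between the angle brackets.
                tokens.add("Tag: " + html.substring(pos + 1, end));
                pos = end + 1;
            } else {
                // Text runs up to the next '<' after this position, or to the end.
                int next = html.indexOf('<', pos + 1);
                if (next < 0) next = html.length();
                String text = html.substring(pos, next)
                        .replace("\n", "\\n")
                        .replace("\t", "\\t");
                tokens.add("Txt: " + text);
                pos = next;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String page = "<head></head>\n<html>\n\t<body>hi</body>\n</html>";
        for (String token : lex(page)) System.out.println(token);
    }
}
```

In the printed list, the line break between the closing head tag and the html tag appears as a separate text piece, much like the second and fourth Nodes in the output above.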
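toPlainTextString includes everything the user can see. There are two interesting points. One is that the Title content inside the <head> tag is part of the plain text, as the getPlainText line of the head Node shows. The other, judging from the output, is that the line breaks and tabs of the HTML source also end up in the plain text, because they are stored as text nodes.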
In addition, you may notice that toHtml(), toHtml(true), and toHtml(false) produce identical results. This is indeed the case. If you trace the HTMLParser source, you will find that AbstractNode, the class that implements Node, implements toHtml() by directly calling toHtml(false). The three subclasses of AbstractNode (RemarkNode, TagNode, and TextNode) do not act on the verbatim parameter in their implementations of toHtml(boolean verbatim), so the three calls return exactly the same result. Unless you implement special handling of your own, simply use toHtml().
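The delegation described above can be pictured with a small sketch. SketchAbstractNode and SketchTextNode are hypothetical classes that only imitate the structure the paragraph describes; they are not copied from the HTMLParser source.

```java
class ToHtmlSketch {
    // Hypothetical base class: the no-argument form just forwards to the
    // one-argument form with verbatim set to false.
    static abstract class SketchAbstractNode {
        String toHtml() { return toHtml(false); }
        abstract String toHtml(boolean verbatim);
    }

    // Hypothetical subclass that never looks at the verbatim parameter,
    // so all three call forms return the same string.
    static class SketchTextNode extends SketchAbstractNode {
        private final String text;
        SketchTextNode(String text) { this.text = text; }
        @Override String toHtml(boolean verbatim) { return text; }
    }

    public static void main(String[] args) {
        SketchAbstractNode node = new SketchTextNode("<b>hi</b>");
        System.out.println(node.toHtml());
        System.out.println(node.toHtml(true));
        System.out.println(node.toHtml(false));
    }
}
```

All three lines printed by main are identical, which matches what the test output shows for the real Nodes.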