After HTMLParser traverses the content of a web page, it saves the result in a tree (forest) structure. HTMLParser offers two ways to access the result: Filters and Visitors.
(1) Filter class
As the name suggests, a Filter filters the results to obtain the required content. HTMLParser defines a total of 16 different Filters in the org.htmlparser.filters package, which can be divided into several categories.
Judgment class Filter:
TagNameFilter HasAttributeFilter HasChildFilter HasParentFilter HasSiblingFilter IsEqualFilter
Logical operation Filter:
AndFilter NotFilter OrFilter XorFilter
Other Filters:
NodeClassFilter StringFilter LinkStringFilter LinkRegexFilter RegexFilter CssSelectorNodeFilter
All Filter classes implement the org.htmlparser.NodeFilter interface. This interface has only one main function:
boolean accept (Node node);
(2) Judgment class Filter
Test code (only the main function is listed; for the complete code, refer to "Getting started with HTMLParser (2) - Node content" and add the import part yourself):
public static void main(String[] args) {
    try {
        Parser parser = new Parser((HttpURLConnection) (new URL("http://127.0.0.1:8080/HTMLParserTester.html")).openConnection());

        // This is the part that controls the test; the later examples modify only this part.
        NodeFilter filter = new TagNameFilter("DIV");
        NodeList nodes = parser.extractAllNodesThatMatch(filter);

        if (nodes != null) {
            for (int i = 0; i < nodes.size(); i++) {
                Node textnode = (Node) nodes.elementAt(i);

                message("getText:" + textnode.getText());
                message("=================================================");
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Output results:
getText:div id="top_main"
=================================================
getText:div id="logoindex"
=================================================
As you can see, both DIV nodes in the file were extracted. Further operations can then be performed on these two DIV nodes.
2.2 HasChildFilter
Let us take a look at HasChildFilter. When I first saw this Filter, I took it for granted that it returned every Tag that has a child, and directly initialized one:
NodeFilter filter = new HasChildFilter();
In fact, HasChildFilter matches nodes that have a child node accepted by another Filter, so it needs that Filter as a constructor argument. Modify the code:
NodeFilter innerFilter = new TagNameFilter("DIV");
NodeFilter filter = new HasChildFilter(innerFilter);
NodeList nodes = parser.extractAllNodesThatMatch(filter);
Output result:
getText:body
=================================================
getText:div id="top_main"
=================================================
You can see that the output is the two Tag nodes that have a DIV child Tag (body has the child node DIV "top_main", and "top_main" has the child node "logoindex").
Note that HasChildFilter also has a constructor:
public HasChildFilter (NodeFilter filter, boolean recursive)
If recursive is false, only the first-level child nodes are checked. For example, in the previous example, both body and top_main have a DIV node among their first-level child nodes, so they match. If we call it in the following way instead:
NodeFilter filter = new HasChildFilter( innerFilter, true );
Output result:
getText:html xmlns="http://www.w3.org/1999/xhtml"
=================================================
getText:body
=================================================
getText:div id="top_main"
=================================================
You can see that the output contains an additional html xmlns="http://www.w3.org/1999/xhtml"; this is the root node of the entire HTML page. Although there is no DIV node directly under this node, there is a DIV node under its child node body, so it is matched as well.
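The difference between the two modes can be sketched without the library. The following is only a minimal model that uses nothing but the JDK, not HTMLParser's own implementation: Tag, hasChild, matches and samplePage are hypothetical names, and the tree keeps only the html > body > div > div nesting of the test page.

```java
import java.util.ArrayList;
import java.util.List;

public class HasChildSketch {
    /** Hypothetical stand-in for a tag node; not an org.htmlparser type. */
    static final class Tag {
        final String name;
        final List<Tag> children = new ArrayList<>();

        Tag(String name, Tag... kids) {
            this.name = name;
            for (Tag kid : kids) {
                children.add(kid);
            }
        }
    }

    /** True if tag has a direct child (or, when recursive, any descendant) named childName. */
    static boolean hasChild(Tag tag, String childName, boolean recursive) {
        for (Tag child : tag.children) {
            if (child.name.equals(childName)) {
                return true;
            }
            if (recursive && hasChild(child, childName, true)) {
                return true;
            }
        }
        return false;
    }

    /** Collects, in document order, the names of all tags for which hasChild is true. */
    static List<String> matches(Tag root, String childName, boolean recursive) {
        List<String> out = new ArrayList<>();
        collect(root, childName, recursive, out);
        return out;
    }

    private static void collect(Tag tag, String childName, boolean recursive, List<String> out) {
        if (hasChild(tag, childName, recursive)) {
            out.add(tag.name);
        }
        for (Tag child : tag.children) {
            collect(child, childName, recursive, out);
        }
    }

    /** Builds html > body > div > div, a simplified version of the test page's nesting. */
    static Tag samplePage() {
        Tag logoindex = new Tag("div");
        Tag topMain = new Tag("div", logoindex);
        Tag body = new Tag("body", topMain);
        return new Tag("html", body);
    }

    public static void main(String[] args) {
        Tag page = samplePage();
        System.out.println(matches(page, "div", false)); // prints [body, div]
        System.out.println(matches(page, "div", true));  // prints [html, body, div]
    }
}
```

With recursive set to false only body and the outer div match; with true the html root matches too, because the search descends through body.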
2.3 HasAttributeFilter
HasAttributeFilter has 3 constructors:
public HasAttributeFilter ();
public HasAttributeFilter (String attribute);
public HasAttributeFilter (String attribute, String value);
This Filter matches nodes that have an attribute with the specified name, or nodes whose specified attribute has the specified value. It is easier to illustrate with examples.
Calling method 1:
NodeFilter filter = new HasAttributeFilter();
NodeList nodes = parser.extractAllNodesThatMatch(filter);
Output result:
Nothing is output.
Calling method 2:
NodeFilter filter = new HasAttributeFilter( "id" );
NodeList nodes = parser.extractAllNodesThatMatch(filter);
Output result:
getText:div id="top_main"
=================================================
getText:div id="logoindex"
=================================================
Calling method 3:
NodeFilter filter = new HasAttributeFilter( "id", "logoindex" );
NodeList nodes = parser.extractAllNodesThatMatch(filter);
Output result:
getText:div id="logoindex"
=================================================
2.4 Other judgment class Filters
HasParentFilter and HasSiblingFilter work similarly to HasChildFilter; try them yourself and you will understand them.
The constructor parameter of IsEqualFilter is a Node:
public IsEqualFilter (Node node) {
    mNode = node;
}
The accept function is also very simple:
public boolean accept (Node node) {
    return (mNode == node);
}
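Because accept compares with ==, it matches only the very same node object, not a different node with the same content. The following is a minimal JDK-only sketch of that behavior; Item and countSameInstance are hypothetical names, not HTMLParser classes.

```java
import java.util.ArrayList;
import java.util.List;

public class IsEqualSketch {
    /** Hypothetical stand-in for a parsed node; not an org.htmlparser type. */
    static final class Item {
        final String text;

        Item(String text) {
            this.text = text;
        }
    }

    /** Counts the items that are the very same object as target (== comparison). */
    static int countSameInstance(List<Item> items, Item target) {
        int count = 0;
        for (Item item : items) {
            if (item == target) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Item first = new Item("div");
        Item second = new Item("div"); // same text, but a different object
        List<Item> items = new ArrayList<>();
        items.add(first);
        items.add(second);

        System.out.println(countSameInstance(items, first));           // prints 1
        System.out.println(countSameInstance(items, new Item("div"))); // prints 0
    }
}
```

In practice this means the Node passed to the constructor has to come from the same parse result that is being filtered.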
4.1 NodeClassFilter
Test code:
NodeFilter filter = new NodeClassFilter(RemarkNode.class);
NodeList nodes = parser.extractAllNodesThatMatch(filter);
Output result:
getText:这是注释
=================================================
You can see that only the RemarkNode (comment) is output.
4.2 StringFilter
This Filter matches displayed text that contains the specified string. Note that only displayable strings count: content inside non-displayed strings (such as comments, link addresses, etc.) is not matched.
Modify the example HTML:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-title-www.baizeju.com</title></head>
<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="top_main">
    <div id="logoindex">
        <!--这是注释 白泽居-www.baizeju.com -->
        白泽居-字符串1-www.baizeju.com
        <a href="http://www.baizeju.com">白泽居-链接文本-www.baizeju.com</a>
    </div>
    白泽居-字符串2-www.baizeju.com
</div>
</body>
</html>
Test code:
NodeFilter filter = new StringFilter("www.baizeju.com");
NodeList nodes = parser.extractAllNodesThatMatch(filter);
Output result:
getText:白泽居-title-www.baizeju.com
=================================================
getText: 白泽居-字符串1-www.baizeju.com
=================================================
getText:白泽居-链接文本-www.baizeju.com
=================================================
getText: 白泽居-字符串2-www.baizeju.com
=================================================
4.3 LinkStringFilter
Test code:
NodeFilter filter = new LinkStringFilter("www.baizeju.com");
NodeList nodes = parser.extractAllNodesThatMatch(filter);
Output result:
getText:a href="http://www.baizeju.com"
=================================================
The remaining Filters also make string-based judgments on different fields; their main difference from the previous ones is that they support regular expressions. That is beyond the scope of this article; you can experiment with them yourself.
3.1 AndFilter
AndFilter combines two Filters; only Nodes that satisfy both conditions at the same time are matched.
Test code:
// Assumption: filterA is not defined in the original; TagNameFilter("A") is consistent with the output below.
NodeFilter filterA = new TagNameFilter("A");
NodeFilter filterID = new HasAttributeFilter( "id" );
NodeFilter filterChild = new HasChildFilter(filterA);
NodeFilter filter = new AndFilter(filterID, filterChild);
Output result:
getText:div id="logoindex"
=================================================
3.2 OrFilter
Replace the previous AndFilter with OrFilter.
Test code:
// Assumption: filterA is not defined in the original; TagNameFilter("A") is consistent with the output below.
NodeFilter filterA = new TagNameFilter("A");
NodeFilter filterID = new HasAttributeFilter( "id" );
NodeFilter filterChild = new HasChildFilter(filterA);
NodeFilter filter = new OrFilter(filterID, filterChild);
Output result:
getText:div id="top_main"
=================================================
getText:div id="logoindex"
=================================================
3.3 NotFilter
Replace the previous AndFilter with a NotFilter applied to an OrFilter.
Test code:
// Assumption: filterA is not defined in the original; TagNameFilter("A") is consistent with the output below.
NodeFilter filterA = new TagNameFilter("A");
NodeFilter filterID = new HasAttributeFilter( "id" );
NodeFilter filterChild = new HasChildFilter(filterA);
NodeFilter filter = new NotFilter(new OrFilter(filterID, filterChild));
Output result:
getText:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
=================================================
getText:
=================================================
getText:head
=================================================
getText:meta http-equiv="Content-Type" content="text/html; charset=gb2312"
=================================================
getText:title
=================================================
getText:白泽居-www.baizeju.com
=================================================
getText:/title
=================================================
getText:/head
=================================================
getText:
=================================================
getText:html xmlns="http://www.w3.org/1999/xhtml"
=================================================
getText:
=================================================
getText:body
=================================================
getText:
=================================================
getText:
=================================================
getText:
=================================================
getText:这是注释
=================================================
getText: 白泽居-www.baizeju.com
=================================================
getText:a href="http://www.baizeju.com"
=================================================
getText:白泽居-www.baizeju.com
=================================================
getText:/a
=================================================
getText:
=================================================
getText:/div
=================================================
getText: 白泽居-www.baizeju.com
=================================================
getText:/div
=================================================
getText:
=================================================
getText:/body
=================================================
getText:
=================================================
getText:/html
=================================================
getText:
=================================================
3.4 XorFilter
Test code:
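// Assumption: filterA is not defined in the original; TagNameFilter("A") is consistent with the output below.
NodeFilter filterA = new TagNameFilter("A");
NodeFilter filterID = new HasAttributeFilter( "id" );
NodeFilter filterChild = new HasChildFilter(filterA);
NodeFilter filter = new XorFilter(filterID, filterChild);
Output result:
getText:div id="top_main"
=================================================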
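The four logical Filters behave like ordinary boolean operators over the results of their operand Filters. The following is a minimal JDK-only model that reproduces the four results above with java.util.function.Predicate. It is a sketch under assumptions, not HTMLParser code: Tag, select, xor and sampleTags are hypothetical names, only three tags of the test page are modelled, and the child Filter is assumed to accept A tags.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class LogicFilterSketch {
    /** Hypothetical stand-in for a tag node; not an org.htmlparser type. */
    static final class Tag {
        final String name;
        final boolean hasId;     // models the result of HasAttributeFilter("id")
        final boolean hasChildA; // models HasChildFilter on an A child (assumption)

        Tag(String name, boolean hasId, boolean hasChildA) {
            this.name = name;
            this.hasId = hasId;
            this.hasChildA = hasChildA;
        }
    }

    static final Predicate<Tag> HAS_ID = t -> t.hasId;
    static final Predicate<Tag> HAS_CHILD_A = t -> t.hasChildA;

    /** Exclusive or: exactly one of the two predicates accepts the tag. */
    static Predicate<Tag> xor(Predicate<Tag> a, Predicate<Tag> b) {
        return t -> a.test(t) ^ b.test(t);
    }

    /** Returns the names of the tags accepted by filter, in input order. */
    static List<String> select(List<Tag> tags, Predicate<Tag> filter) {
        List<String> out = new ArrayList<>();
        for (Tag tag : tags) {
            if (filter.test(tag)) {
                out.add(tag.name);
            }
        }
        return out;
    }

    /** Three tags of the test page: body, div "top_main", div "logoindex". */
    static List<Tag> sampleTags() {
        return Arrays.asList(
                new Tag("body", false, false),
                new Tag("top_main", true, false),
                new Tag("logoindex", true, true));
    }

    public static void main(String[] args) {
        List<Tag> tags = sampleTags();
        System.out.println(select(tags, HAS_ID.and(HAS_CHILD_A)));         // prints [logoindex]
        System.out.println(select(tags, HAS_ID.or(HAS_CHILD_A)));          // prints [top_main, logoindex]
        System.out.println(select(tags, HAS_ID.or(HAS_CHILD_A).negate())); // prints [body]
        System.out.println(select(tags, xor(HAS_ID, HAS_CHILD_A)));        // prints [top_main]
    }
}
```

Predicate has no built-in xor, so the sketch writes it by hand; and, or and negate come from the JDK.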
NodeClassFilter is used to determine whether a node is of a specific Node type.
2.1 TagNameFilter
The following is the HTML file used for testing:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head>
<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="top_main">
    <div id="logoindex">
        <!--这是注释-->
        白泽居-www.baizeju.com
        <a href="http://www.baizeju.com">白泽居-www.baizeju.com</a>
    </div>
    白泽居-www.baizeju.com
</div>
</body>
</html>