Detailed explanation of HTMLParser usage (3)

黄舟
Release: 2016-12-29 15:57:20

After HTMLParser has traversed a web page, it stores the result in a tree (forest) structure. HTMLParser offers two ways to access that result: Filters and Visitors.

(1) Filter classes
As the name suggests, a Filter screens the result and returns only the content you need. HTMLParser defines 16 different Filters in the org.htmlparser.filters package, which fall into the following categories.
Judgment Filters:

TagNameFilter
HasAttributeFilter
HasChildFilter
HasParentFilter
HasSiblingFilter
IsEqualFilter

Logical operation Filter:

AndFilter
NotFilter
OrFilter
XorFilter

Other Filters:

NodeClassFilter
StringFilter
LinkStringFilter
LinkRegexFilter
RegexFilter
CssSelectorNodeFilter

All Filter classes implement the org.htmlparser.NodeFilter interface, which declares a single method:

boolean accept (Node node);
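
To make the role of accept concrete, here is a minimal sketch of the same pattern in plain Java. It does not use the HTMLParser library: SimpleNode, SimpleFilter, and extract are hypothetical stand-ins for Node, NodeFilter, and extractAllNodesThatMatch, written only to show how one accept method decides which nodes are kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AcceptSketch {
    // Hypothetical stand-in for org.htmlparser.Node: just a piece of text.
    static final class SimpleNode {
        final String text;
        SimpleNode(String text) { this.text = text; }
    }

    // Hypothetical stand-in for org.htmlparser.NodeFilter: one decision method.
    interface SimpleFilter {
        boolean accept(SimpleNode node);
    }

    // Keeps the text of every node the filter accepts, in document order.
    static List<String> extract(List<SimpleNode> nodes, SimpleFilter filter) {
        List<String> matched = new ArrayList<>();
        for (SimpleNode node : nodes) {
            if (filter.accept(node)) {
                matched.add(node.text);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<SimpleNode> nodes = Arrays.asList(
                new SimpleNode("div id=\"top_main\""),
                new SimpleNode("body"),
                new SimpleNode("div id=\"logoindex\""));
        // A filter that accepts only nodes whose text starts with "div".
        SimpleFilter divOnly = node -> node.text.startsWith("div");
        for (String text : extract(nodes, divOnly)) {
            System.out.println("getText:" + text);
        }
    }
}
```

Because the interface has a single method, a filter can be written as a lambda in this sketch; whether that is convenient with the real NodeFilter depends on the Java version you compile against.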

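(2) Judgment Filters

Test code for TagNameFilter (section 2.1). Only the main function is shown; for the rest of the code, refer to "Getting started with HTMLParser (2) - Node content" and add the import statements yourself. The test HTML file is listed at the end of this article.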

public static void main(String[] args) {
    try {
        Parser parser = new Parser((HttpURLConnection) (new URL("http://127.0.0.1:8080/HTMLParserTester.html")).openConnection());
        // This is the part that controls the test; the later examples modify only these lines.
        NodeFilter filter = new TagNameFilter("DIV");
        NodeList nodes = parser.extractAllNodesThatMatch(filter);
        if (nodes != null) {
            for (int i = 0; i < nodes.size(); i++) {
                Node textnode = (Node) nodes.elementAt(i);
                message("getText:" + textnode.getText());
                message("=================================================");
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Output results:

getText:div id="top_main"
=================================================
getText:div id="logoindex"
=================================================

As you can see, both DIV nodes in the file have been extracted. Further operations can then be performed on these two DIV nodes.

2.2 HasChildFilter
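Next, let us look at HasChildFilter. When I first saw this Filter, I took it for granted that it returns every Tag that has a child, and initialized one directly:

NodeFilter filter = new HasChildFilter();

As the next example shows, HasChildFilter actually matches nodes that have a child accepted by another Filter, so it needs an inner Filter as a constructor argument. Modify the code: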

NodeFilter innerFilter = new TagNameFilter ("DIV");
NodeFilter filter = new HasChildFilter(innerFilter);
NodeList nodes = parser.extractAllNodesThatMatch(filter);

Output result:

getText:body 
=================================================
getText:div id="top_main"
=================================================

You can see that the output is the two Tag nodes that have a DIV child Tag. (body has the child DIV "top_main", and "top_main" has the child DIV "logoindex".)

Note that HasChildFilter also has a constructor:

public HasChildFilter (NodeFilter filter, boolean recursive)

If recursive is false, only the first-level child nodes are checked. In the previous example, both body and top_main have a DIV among their first-level children, so they match. If we call it as follows instead:

NodeFilter filter = new HasChildFilter( innerFilter, true );

Output result:

getText:html xmlns="http://www.w3.org/1999/xhtml"
=================================================
getText:body 
=================================================
getText:div id="top_main"
=================================================

You can see an additional html xmlns="http://www.w3.org/1999/xhtml" in the output. This is the root node of the entire HTML page. There is no DIV directly under this node, but its child node body has a DIV child, so it also matches.
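
The difference between the two modes can be sketched in plain Java. This is an illustrative model, not the library's implementation: Tag, hasChild, collect, matches, and samplePage are hypothetical names, and the tree only mirrors the nesting of the test page (html > body > top_main > logoindex).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class HasChildSketch {
    // Hypothetical stand-in for a parsed tag: a name plus child tags.
    static final class Tag {
        final String name;
        final List<Tag> children;
        Tag(String name, Tag... children) {
            this.name = name;
            this.children = Arrays.asList(children);
        }
    }

    // True if a direct child of 'tag' satisfies 'inner'; with recursive=true,
    // deeper descendants are checked as well.
    static boolean hasChild(Tag tag, Predicate<Tag> inner, boolean recursive) {
        for (Tag child : tag.children) {
            if (inner.test(child)) return true;
            if (recursive && hasChild(child, inner, true)) return true;
        }
        return false;
    }

    // Walks the tree in document order and collects the names of matching tags.
    static void collect(Tag tag, Predicate<Tag> inner, boolean recursive, List<String> out) {
        if (hasChild(tag, inner, recursive)) out.add(tag.name);
        for (Tag child : tag.children) collect(child, inner, recursive, out);
    }

    // Names of all tags that have a DIV child (or descendant, if recursive).
    static List<String> matches(Tag root, boolean recursive) {
        List<String> out = new ArrayList<>();
        collect(root, t -> t.name.startsWith("div"), recursive, out);
        return out;
    }

    // Same nesting as the test page used in this article.
    static Tag samplePage() {
        Tag logo = new Tag("div#logoindex");
        Tag top = new Tag("div#top_main", logo);
        return new Tag("html", new Tag("body", top));
    }

    public static void main(String[] args) {
        System.out.println(matches(samplePage(), false)); // [body, div#top_main]
        System.out.println(matches(samplePage(), true));  // [html, body, div#top_main]
    }
}
```

With recursive set to true, html matches only because the search continues below body, which is the behaviour described above.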

2.3 HasAttributeFilter
HasAttributeFilter has 3 constructors:

public HasAttributeFilter ();
public HasAttributeFilter (String attribute);
public HasAttributeFilter (String attribute, String value);

This Filter matches nodes that have an attribute with the specified name, or nodes whose specified attribute has the specified value. Examples illustrate it most easily.
Calling method 1:

NodeFilter filter = new HasAttributeFilter();
NodeList nodes = parser.extractAllNodesThatMatch(filter);

Output result:

Nothing is output.

Calling method 2:

NodeFilter filter = new HasAttributeFilter( "id" );
NodeList nodes = parser.extractAllNodesThatMatch(filter);

Output result:


getText:div id="top_main"
=================================================
getText:div id="logoindex"
=================================================

Calling method 3:

NodeFilter filter = new HasAttributeFilter( "id", "logoindex" );
NodeList nodes = parser.extractAllNodesThatMatch(filter);

Output result:


getText:div id="logoindex"
=================================================

It’s very simple.
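
The three calling methods can be summarised in a small plain-Java model. It is only a sketch of the observed behaviour, not the library's code: attrs, accept, and count are hypothetical helpers, a null name stands for the no-argument constructor, and a null value stands for the one-argument constructor.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class HasAttributeSketch {
    // Builds a tag's attribute table from name/value pairs.
    static Map<String, String> attrs(String... pairs) {
        Map<String, String> table = new LinkedHashMap<>();
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            table.put(pairs[i], pairs[i + 1]);
        }
        return table;
    }

    // name == null: nothing matches (mirrors the empty output of calling method 1).
    // value == null: the attribute only has to be present (calling method 2).
    // otherwise: the attribute must be present with exactly that value (calling method 3).
    static boolean accept(Map<String, String> tag, String name, String value) {
        if (name == null || !tag.containsKey(name)) return false;
        return value == null || value.equals(tag.get(name));
    }

    // Counts matches among three model tags: body, top_main, and logoindex.
    static int count(String name, String value) {
        List<Map<String, String>> page = Arrays.asList(
                attrs(),
                attrs("id", "top_main"),
                attrs("id", "logoindex"));
        int matched = 0;
        for (Map<String, String> tag : page) {
            if (accept(tag, name, value)) matched++;
        }
        return matched;
    }

    public static void main(String[] args) {
        System.out.println(count(null, null));        // 0
        System.out.println(count("id", null));        // 2
        System.out.println(count("id", "logoindex")); // 1
    }
}
```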

2.4 Other Judgment Filters
HasParentFilter and HasSiblingFilter work in a similar way to HasChildFilter; try them yourself to see how they behave.

The constructor parameter of IsEqualFilter is a Node:

public IsEqualFilter (Node node) {
    mNode = node;
}

The accept function is also very simple:

public boolean accept (Node node) {
    return (mNode == node);
}

No need to explain too much.

(3) Logical operation Filters: see sections 3.1 to 3.4 below.

(4) Other Filters

"Getting started with HTMLParser (2) - Node content" already introduced the different types of Node. NodeClassFilter (section 4.1) filters nodes by that type.

Test code:


NodeFilter filter = new NodeClassFilter(RemarkNode.class);
NodeList nodes = parser.extractAllNodesThatMatch(filter);

Output result:

getText:这是注释
=================================================

You can see that only the RemarkNode (the comment) is output.

4.2 StringFilter

This Filter matches displayed strings that contain the specified text. Note that only displayable strings are checked: content that is not displayed (such as comments and link targets) is not matched.
Modify the example HTML as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-title-www.baizeju.com</title></head>
<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="top_main">
<div id="logoindex">
<!--这是注释 白泽居-www.baizeju.com -->
白泽居-字符串1-www.baizeju.com
<a href="http://www.baizeju.com">白泽居-链接文本-www.baizeju.com</a>
</div>
白泽居-字符串2-www.baizeju.com
</div>
</body>
</html>

Test code:

NodeFilter filter = new StringFilter("www.baizeju.com");
NodeList nodes = parser.extractAllNodesThatMatch(filter);

Output result:


getText:白泽居-title-www.baizeju.com
=================================================
getText:
白泽居-字符串1-www.baizeju.com
=================================================
getText:白泽居-链接文本-www.baizeju.com
=================================================
getText:
白泽居-字符串2-www.baizeju.com
=================================================

You can see that the title text, the two content strings, and the link text are all output, but the comment and the link Tag itself are not.


4.3 LinkStringFilter

This Filter is used to determine whether the link contains a specific string, and can be used to filter out links pointing to a specific website.

Test code:

NodeFilter filter = new LinkStringFilter("www.baizeju.com");
NodeList nodes = parser.extractAllNodesThatMatch(filter);

Output result:

getText:a href="http://www.baizeju.com"
=================================================

4.4 Several other Filters

The remaining Filters also match strings in different parts of a node; the main difference from the previous ones is that they support regular expressions. They are beyond the scope of this article, and you can experiment with them yourself.

(3) Logical operation Filters

The Filters introduced so far are simple Filters, each of which tests a single kind of condition. HTMLParser can combine simple Filters to express complex conditions, on the same principle as the logical operators of general programming languages.

3.1 AndFilter
AndFilter combines two Filters. Only Nodes that satisfy both conditions are matched.
Test code:

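// filterA is not defined in the original listing; a TagNameFilter for A tags is assumed, which is consistent with the output below.
NodeFilter filterA = new TagNameFilter("A");
NodeFilter filterID = new HasAttributeFilter("id");
NodeFilter filterChild = new HasChildFilter(filterA);
NodeFilter filter = new AndFilter(filterID, filterChild);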

Output result:

getText:div id="logoindex"
=================================================

3.2 OrFilter

Replace the previous AndFilter with OrFilter
Test code:

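// filterA is assumed to be a TagNameFilter for A tags; the original listing does not define it.
NodeFilter filterA = new TagNameFilter("A");
NodeFilter filterID = new HasAttributeFilter("id");
NodeFilter filterChild = new HasChildFilter(filterA);
NodeFilter filter = new OrFilter(filterID, filterChild);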

Output result:

getText:div id="top_main"
=================================================
getText:div id="logoindex"
=================================================

3.3 NotFilter

Wrap the previous OrFilter in a NotFilter
Test code:

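// filterA is assumed to be a TagNameFilter for A tags; the original listing does not define it.
NodeFilter filterA = new TagNameFilter("A");
NodeFilter filterID = new HasAttributeFilter("id");
NodeFilter filterChild = new HasChildFilter(filterA);
NodeFilter filter = new NotFilter(new OrFilter(filterID, filterChild));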

Output result:

getText:!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
=================================================
getText:
=================================================
getText:head
=================================================
getText:meta http-equiv="Content-Type" content="text/html; charset=gb2312"
=================================================
getText:title
=================================================
getText:白泽居-www.baizeju.com
=================================================
getText:/title
=================================================
getText:/head
=================================================
getText:
=================================================
getText:html xmlns="http://www.w3.org/1999/xhtml"
=================================================
getText:
=================================================
getText:body 
=================================================
getText:
=================================================
getText:
=================================================
getText:
=================================================
getText:这是注释
=================================================
getText:
白泽居-www.baizeju.com
=================================================
getText:a href="http://www.baizeju.com"
=================================================
getText:白泽居-www.baizeju.com
=================================================
getText:/a
=================================================
getText:
=================================================
getText:/div
=================================================
getText:
白泽居-www.baizeju.com
=================================================
getText:/div
=================================================
getText:
=================================================
getText:/body
=================================================
getText:
=================================================
getText:/html
=================================================
getText:
=================================================

All the nodes except the Tags output in 3.2 appear here.


3.4 XorFilter

Replace the previous AndFilter with XorFilter

Test code:

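// filterA is assumed to be a TagNameFilter for A tags; the original listing does not define it.
NodeFilter filterA = new TagNameFilter("A");
NodeFilter filterID = new HasAttributeFilter("id");
NodeFilter filterChild = new HasChildFilter(filterA);
NodeFilter filter = new XorFilter(filterID, filterChild);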

Output result:

getText:div id="top_main"
=================================================
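
The four combinations in sections 3.1 to 3.4 can be compared side by side in a small plain-Java model. This is a sketch that uses java.util.function.Predicate instead of the HTMLParser classes: Tag, HAS_ID, HAS_LINK_CHILD, xor, and select are hypothetical names, and each model tag records only the two facts the example filters test (whether it has an id attribute and whether it has a direct A child).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class LogicSketch {
    // Hypothetical stand-in for a tag: only the two facts the filters test.
    static final class Tag {
        final String name;
        final boolean hasId;
        final boolean hasLinkChild;
        Tag(String name, boolean hasId, boolean hasLinkChild) {
            this.name = name;
            this.hasId = hasId;
            this.hasLinkChild = hasLinkChild;
        }
    }

    // Stand-ins for HasAttributeFilter("id") and HasChildFilter(filterA).
    static final Predicate<Tag> HAS_ID = t -> t.hasId;
    static final Predicate<Tag> HAS_LINK_CHILD = t -> t.hasLinkChild;

    // Predicate offers and, or, and negate; xor has to be written by hand.
    static Predicate<Tag> xor(Predicate<Tag> a, Predicate<Tag> b) {
        return t -> a.test(t) ^ b.test(t);
    }

    // Applies a filter to three model tags and returns the matching names.
    static List<String> select(Predicate<Tag> filter) {
        List<Tag> page = Arrays.asList(
                new Tag("body", false, false),
                new Tag("top_main", true, false),
                new Tag("logoindex", true, true));
        List<String> names = new ArrayList<>();
        for (Tag tag : page) {
            if (filter.test(tag)) names.add(tag.name);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(select(HAS_ID.and(HAS_LINK_CHILD)));         // [logoindex]
        System.out.println(select(HAS_ID.or(HAS_LINK_CHILD)));          // [top_main, logoindex]
        System.out.println(select(HAS_ID.or(HAS_LINK_CHILD).negate())); // [body]
        System.out.println(select(xor(HAS_ID, HAS_LINK_CHILD)));        // [top_main]
    }
}
```

The model contains only three tags, so the negated case prints body alone, whereas the real NotFilter output in 3.3 also lists every text, comment, and end-tag node of the page.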

4.1 NodeClassFilter

This Filter determines whether a node is of a specific Node type. Its test code and output are listed above, before section 4.2.

2.1 TagNameFilter

TagNameFilter is the easiest Filter to understand: it filters based on the name of the Tag. Its test code and output are listed in section (2) above.


The following is the HTML file used for testing:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<head><meta http-equiv="Content-Type" content="text/html; charset=gb2312"><title>白泽居-www.baizeju.com</title></head>
<html xmlns="http://www.w3.org/1999/xhtml">
<body >
<div id="top_main">
<div id="logoindex">
<!--这是注释-->
白泽居-www.baizeju.com
<a href="http://www.baizeju.com">白泽居-www.baizeju.com</a>
</div>
白泽居-www.baizeju.com
</div>
</body>
</html>