Detailed explanation of using HTMLParser (1)
When you study or build a search engine, processing HTML pages is a core step. There are many open source libraries for this on the Internet. For Java, HTMLParser is a well-known and widely used one. Its homepage is http://htmlparser.sourceforge.net/, and the last release was version 1.6 in September 2006. That is not a problem in practice: HTML itself has not changed significantly for a long time, and HTMLParser handles it with essentially no trouble. HTMLParser is compact and fast. Its drawback is that there is relatively little documentation (and little even in English), so many features have to be explored on your own. Beginners will need to put in some effort, but once you get started you will find that the structure of HTMLParser is cleverly designed and very practical, and that it can meet most of your needs.
Here I have written up some introductory material based on my experience over the past few months. I hope it will be helpful to friends who are new to HTMLParser. (That said, my Chinese language score in the college entrance examination was only one point above the pass mark, so please bear with any grammatical problems.)
The core module of HTMLParser is the org.htmlparser.Parser class. This class does the actual work of parsing and analyzing HTML pages. It has the following constructors:
    public Parser ();
    public Parser (Lexer lexer, ParserFeedback fb);
    public Parser (URLConnection connection, ParserFeedback fb) throws ParserException;
    public Parser (String resource, ParserFeedback feedback) throws ParserException;
    public Parser (String resource) throws ParserException;
    public Parser (Lexer lexer);
    public Parser (URLConnection connection) throws ParserException;

and one static factory method:

    public static Parser createParser (String html, String charset);
Most users will initialize a Parser either from a URLConnection or from a string that holds the page content, or will use the static function to create a Parser object. The ParserFeedback code is very simple. It is designed for debugging and for tracing the parsing process, and generally does not need to be changed. Using a Lexer is a more advanced topic and will be discussed later.
One interesting point: unless you use a Lexer, the static function is the only one of these initializers that lets you specify the encoding of the page. For most Chinese pages, this is probably the method you will use most often.
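To see why the encoding matters, here is a small sketch that uses only the Java standard library, not HTMLParser itself. The names CharsetDemo and decode are hypothetical, and the sketch assumes your JDK ships the GBK charset, which full JDK builds normally do. It shows that the same bytes only read back correctly when decoded with the charset they were written in.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of HTMLParser.
final class CharsetDemo {

    // Decode raw page bytes using the named charset.
    static String decode(byte[] pageBytes, String charsetName) {
        return new String(pageBytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file stays ASCII.
        String original = "<p>\u4e2d\u6587</p>";

        // Simulate a page that was saved as GBK (assumes GBK is available in this JDK).
        byte[] gbkBytes = original.getBytes(Charset.forName("GBK"));

        String right = decode(gbkBytes, "GBK");        // same charset as the page
        String wrong = decode(gbkBytes, "ISO-8859-1"); // wrong charset for the page

        System.out.println(right.equals(original)); // true
        System.out.println(wrong.equals(original)); // false
    }
}
```

The charset argument of createParser plays a similar role to the second argument of decode here: it tells the parser how the page content is encoded.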
The following is an example of initializing Parser.
    /**
     * @author www.baizeju.com
     */
    package com.baizeju.htmlparsertester;

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.FileInputStream;
    import java.io.File;
    import java.net.HttpURLConnection;
    import java.net.URL;

    import org.htmlparser.visitors.TextExtractingVisitor;
    import org.htmlparser.Parser;

    /**
     * @author www.baizeju.com
     */
    public class Main {
        private static String ENCODE = "GBK";

        // Print a message, re-encoding it from ENCODE to the platform's file.encoding.
        private static void message( String szMsg ) {
            try {
                System.out.println(new String(szMsg.getBytes(ENCODE), System.getProperty("file.encoding")));
            } catch( Exception e ) {}
        }

        // Read a whole file as text using ENCODE; returns "" if anything goes wrong.
        public static String openFile( String szFileName ) {
            try {
                BufferedReader bis = new BufferedReader(new InputStreamReader(new FileInputStream( new File(szFileName)), ENCODE) );
                String szContent = "";
                String szTemp;

                while ( (szTemp = bis.readLine()) != null) {
                    szContent += szTemp + "\n";
                }
                bis.close();
                return szContent;
            } catch( Exception e ) {
                return "";
            }
        }

        public static void main(String[] args) {
            String szContent = openFile( "E:/My Sites/HTMLParserTester.html");

            try {
                // Three alternative ways to initialize the Parser; enable one at a time.
                //Parser parser = Parser.createParser(szContent, ENCODE);
                //Parser parser = new Parser( szContent );
                Parser parser = new Parser( (HttpURLConnection) (new URL("http://127.0.0.1:8080/HTMLParserTester.html")).openConnection() );

                // Collect all the text in the page and print it.
                TextExtractingVisitor visitor = new TextExtractingVisitor();
                parser.visitAllNodesWith(visitor);
                String textInPage = visitor.getExtractedText();

                message(textInPage);
            } catch( Exception e ) {
                // Report the failure instead of silently printing nothing.
                e.printStackTrace();
            }
        }
    }
The lines in main that create the Parser (two of them commented out) test several different initialization methods, and the results are shown later. For now it is enough to see that Parser produces output. How to access the content of the Parser will be discussed later.
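The openFile helper in the example can also be written with try-with-resources and a StringBuilder, so that the file is closed even when reading fails and errors reach the caller instead of turning into an empty string. The following is only a sketch under those assumptions: PageFileReader and readPage are hypothetical names, and only the Java standard library is used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of HTMLParser.
final class PageFileReader {

    // Read a whole text file decoded with the given charset,
    // normalising every line ending to "\n".
    static String readPage(Path file, String charsetName) throws IOException {
        StringBuilder content = new StringBuilder();
        // try-with-resources closes the reader even if readLine throws.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), Charset.forName(charsetName)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                content.append(line).append('\n');
            }
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        // Default path taken from the article's example; pass your own as the first argument.
        Path page = Paths.get(args.length > 0 ? args[0] : "E:/My Sites/HTMLParserTester.html");
        System.out.println(readPage(page, "GBK"));
    }
}
```

The returned string can then be handed to Parser.createParser together with the same charset name, as in the first commented-out line of the example.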