htmlparser is an HTML parsing library written in pure Java, with no dependencies on other Java libraries. It is mainly used to transform or extract HTML, can parse a page in either a linear or a nested manner, and can be thought of as a web scraping tool.
The operating environment of this tutorial: Windows 10 system, HTML5 version, Dell G3 computer.
What does htmlparser mean?
htmlparser is an HTML parsing library written in pure Java. It does not depend on other Java libraries and is mainly used to transform or extract HTML. It parses HTML quickly and reliably. The latest version of htmlparser is 2.1. It is one of the better-known tools for parsing and analyzing HTML in Java.
HTML Parser is a Java library for parsing HTML in either a linear or a nested manner. Used mainly for transformation or extraction, it offers filters, visitors, custom tags, and easy-to-use JavaBeans. It is a fast, powerful, and well-tested package.
The two basic use cases handled by the parser are extraction and transformation (the synthesis use case, creating an HTML page from scratch, is better handled by other tools closer to the data source). While earlier versions focused on extracting data from web pages, version 1.4 of HTMLParser brought substantial improvements in transforming web pages, with simpler creation and editing of tags and verbatim output from the toHtml() method.
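To make the two use cases concrete, here is a minimal sketch in plain Java. It deliberately does not use HTMLParser: the class ExtractVsTransformSketch, its methods, and the sample URLs are hypothetical, and simple string and regular-expression handling stands in for what the library's filters and tag editing would do on real pages. Extraction reads data out of the page and leaves it unchanged; transformation returns a modified copy of the page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration only: this is NOT HTMLParser's API.
final class ExtractVsTransformSketch {

    // Matches href="..." inside an <a ...> tag. Deliberately simplistic:
    // it assumes double-quoted attribute values and no '>' inside them.
    private static final Pattern LINK =
            Pattern.compile("<a\\s+[^>]*href=\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Extraction: collect link targets without changing the page.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Transformation: return a copy of the page with http links rewritten to https.
    static String upgradeLinks(String html) {
        return html.replace("href=\"http://", "href=\"https://");
    }

    public static void main(String[] args) {
        String page = "<p><a href=\"http://a.example/\">A</a> "
                + "<a class=\"x\" href=\"http://b.example/\">B</a></p>";
        System.out.println(extractLinks(page)); // extraction
        System.out.println(upgradeLinks(page)); // transformation
    }
}
```

On real pages, pattern matching of this kind breaks down quickly, which is the reason to use a parser that understands tags and attributes.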
In general, to use HTMLParser, you need to be able to write code in the Java programming language. Although some sample programs are provided that may be useful, you will most likely need (or want) to create your own or modify the provided programs to match your intended application.
To use the library, add either htmllexer.jar or htmlparser.jar to your classpath when compiling and running. htmllexer.jar provides low-level access to the generic string, comment, and tag nodes on a page in a linear, flat, sequential manner. htmlparser.jar, which includes the classes in htmllexer.jar, provides access to a page as a sequence of nested, differentiated tags containing string, comment, and other tag nodes. Successive calls to the lexer's nextNode() method therefore return each opening tag, text run, and closing tag one after another at the same level, whereas the parser's NodeIterator nests tags as children of their enclosing nodes.
The parser tries to balance opening and closing tags to present the structure of the page, while the lexer simply emits nodes in the order it finds them. If your application needs only modest knowledge of page structure and is mainly concerned with individual, independent nodes, consider using the lightweight lexer. If it needs to understand the nested structure of the page, for example to process tables, you will probably want the full parser.
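The contrast between the lexer's flat output and the parser's nested output can be sketched with a small, self-contained Java program. This is a toy illustration, not HTMLParser's API: the class LexerVsParserSketch, its lex and nest methods, and the sample page are all hypothetical, and the code assumes well-formed markup in which every opening tag has a matching closing tag (no comments, no void tags such as br, and no '>' inside attribute values).

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration only: this is NOT HTMLParser's API.
final class LexerVsParserSketch {

    // Flat view: split markup into tags and text runs, in document order.
    static List<String> lex(String html) {
        List<String> nodes = new ArrayList<>();
        int i = 0;
        while (i < html.length()) {
            int end;
            if (html.charAt(i) == '<') {
                end = html.indexOf('>', i);
                end = (end < 0) ? html.length() : end + 1; // unterminated tag: take the rest
            } else {
                end = html.indexOf('<', i);
                if (end < 0) {
                    end = html.length();
                }
            }
            String node = html.substring(i, end).trim();
            if (!node.isEmpty()) {
                nodes.add(node); // whitespace-only text is dropped
            }
            i = end;
        }
        return nodes;
    }

    // Nested view: indent each node by the number of tags currently open.
    static List<String> nest(List<String> flat) {
        List<String> lines = new ArrayList<>();
        int depth = 0;
        for (String node : flat) {
            boolean closing = node.startsWith("</");
            boolean opening = node.startsWith("<") && !closing && !node.endsWith("/>");
            if (closing && depth > 0) {
                depth--;
            }
            StringBuilder line = new StringBuilder();
            for (int d = 0; d < depth; d++) {
                line.append("  ");
            }
            lines.add(line.append(node).toString());
            if (opening) {
                depth++;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Welcome</title></head>"
                + "<body>Welcome</body></html>";
        List<String> flat = lex(html);
        System.out.println("Flat (lexer-style):");
        for (String node : flat) {
            System.out.println(node);
        }
        System.out.println("Nested (parser-style):");
        for (String line : nest(flat)) {
            System.out.println(line);
        }
    }
}
```

Running main prints the same ten nodes twice: once at a single level, and once indented by nesting depth. A real parser does considerably more, for example balancing tags that the page leaves unclosed.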