Detailed explanation of a perfect HTML parsing engine (Jumony)

零下一度
Release: 2017-05-04 14:57:37

Many people may think that today's HTML parsers are good enough, and that even simple regular expressions can meet the need to manipulate HTML documents. Indeed, the vast majority of HTML documents on the Internet largely conform to the XHTML specification, and parsing them does not require a powerful parser. But a powerful parser is one thing, and a perfect parser is another.

Jumony Core first of all provides a nearly perfect HTML parsing engine whose results come extremely close to those of a browser. Whether it is elements without end tags, elements with optional end tags, flag attributes (attributes without a value), or CSS selectors and styles, whatever a browser makes of an HTML document, legal or illegal, Jumony parses it the same way. In other words, Jumony's parsing results match the browser's, so you no longer have to worry about whether an HTML document can be recognized. If the browser can read it, Jumony can understand it.

Only one step separates a powerful parser from a perfect one, but a perfect parser means you never have to care what the HTML source document looks like.

The following is an incomplete list of features supported by the Jumony parser.

Attribute values are accepted in every common form:
    single-quoted, as in <a href='#'>
    double-quoted
    unquoted
    missing (the equals sign is present but no value follows)
    preceded by whitespace
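To make these attribute forms concrete, here is a minimal C# sketch that recognizes each of them with a single regular expression. It is not Jumony's implementation: the class name AttributeForms, the method Parse, and the pattern itself are assumptions made for illustration, and a real HTML parser has to handle many more edge cases than this.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of Jumony.
static class AttributeForms
{
    // name, optionally followed by "=" and a double-quoted, single-quoted,
    // or unquoted value. Whitespace around "=" is allowed.
    static readonly Regex Pattern = new Regex(
        @"(?<name>[^\s=/>]+)(?:\s*=\s*(?:""(?<v>[^""]*)""|'(?<v>[^']*)'|(?<v>[^\s>]*)))?");

    // Parses the text between the tag name and the closing ">".
    // A flag attribute (no "=") gets a null value; "name=" with nothing
    // after the equals sign gets an empty string.
    public static List<KeyValuePair<string, string>> Parse(string attributeText)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pattern.Matches(attributeText))
        {
            Group v = m.Groups["v"];
            result.Add(new KeyValuePair<string, string>(
                m.Groups["name"].Value, v.Success ? v.Value : null));
        }
        return result;
    }
}
```

Under these assumptions, AttributeForms.Parse("href= '#' id=x checked data=") yields four pairs: href with "#", id with "x", checked with null, and data with an empty string.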


Besides parsing HTML from text, Jumony's API can fetch documents directly from the Internet for analysis and automatically detect their encoding from the HTTP headers.
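As a rough illustration of header-based encoding detection, the following C# sketch picks a text encoding from a Content-Type header value. It uses only the .NET base class library and is not Jumony's code; the class EncodingSniffer, the method FromContentType, and the choice of UTF-8 as the fallback are all assumptions.

```csharp
using System;
using System.Net.Mime;
using System.Text;

// Hypothetical helper, not part of Jumony.
static class EncodingSniffer
{
    // Returns the encoding named by the charset parameter of a Content-Type
    // header value. Falls back to UTF-8 (an assumption of this sketch) when
    // the header is missing or malformed, has no charset, or names an
    // encoding this runtime does not know.
    public static Encoding FromContentType(string contentTypeHeader)
    {
        if (string.IsNullOrWhiteSpace(contentTypeHeader))
            return Encoding.UTF8;
        try
        {
            string charset = new ContentType(contentTypeHeader).CharSet;
            return string.IsNullOrEmpty(charset)
                ? Encoding.UTF8
                : Encoding.GetEncoding(charset);
        }
        catch (FormatException) { return Encoding.UTF8; }   // malformed header
        catch (ArgumentException) { return Encoding.UTF8; } // unknown charset name
    }
}
```

For example, EncodingSniffer.FromContentType("text/html; charset=utf-8") returns the UTF-8 encoding. On .NET Core and later, legacy code pages such as gb2312 are only available after registering CodePagesEncodingProvider.Instance with Encoding.RegisterProvider; without that, this sketch silently falls back to UTF-8 for them.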

HtmlAgilityPack, the open source HTML parsing project that currently ranks second only to Jumony, has not been updated for a long time. After so many years, it still has problems parsing the most basic elements.

2. CSS style setting support

Parsing HTML perfectly does not, by itself, bring much benefit. As mentioned above, most HTML documents can be handled by a second-rate parser, or even by simple regular expressions, so why do we need Jumony?

The answer is that an HTML engine is more than just parsing the DOM structure.

Consider this scenario: I need to set the display style of an element to none. In the browser, a simple element.style.display = "none" is all it takes. Now suppose we have obtained the DOM we need through a parser. Do we really have to concatenate strings to set the style?

No. Jumony supports CSS style parsing and even recognizes some CSS shorthand rules. In Jumony, setting a style on an element is as simple as it is in the browser.

Let's look at another example. Suppose a p element sets its padding with the shorthand property to 5px on every side. What happens if we set padding-left: 0px on this element?

In Jumony, the result will be:

<p style="padding-left: 0px; padding-right: 5px; padding-top:5px; padding-bottom: 5px"></p>

Look: the padding shorthand has been expanded automatically into its four individual properties.
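The expansion follows the standard CSS rule for the padding shorthand, which can be sketched in a few lines of C#. This is only an illustration of the rule, not Jumony's implementation; the class PaddingShorthand and the methods Expand and SetLonghand are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Jumony.
static class PaddingShorthand
{
    // Expands a CSS padding shorthand (1 to 4 whitespace-separated values)
    // into its four longhand properties:
    //   1 value:  all sides
    //   2 values: top/bottom, left/right
    //   3 values: top, left/right, bottom
    //   4 values: top, right, bottom, left
    public static Dictionary<string, string> Expand(string shorthand)
    {
        string[] v = shorthand.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        if (v.Length < 1 || v.Length > 4)
            throw new ArgumentException("padding takes 1 to 4 values", nameof(shorthand));

        string top = v[0];
        string right = v.Length > 1 ? v[1] : top;
        string bottom = v.Length > 2 ? v[2] : top;
        string left = v.Length > 3 ? v[3] : right;

        return new Dictionary<string, string>
        {
            ["padding-top"] = top,
            ["padding-right"] = right,
            ["padding-bottom"] = bottom,
            ["padding-left"] = left,
        };
    }

    // Overrides one longhand on a style that so far only holds the shorthand.
    public static Dictionary<string, string> SetLonghand(
        string shorthand, string property, string value)
    {
        Dictionary<string, string> style = Expand(shorthand);
        style[property] = value;
        return style;
    }
}
```

Calling PaddingShorthand.SetLonghand("5px", "padding-left", "0px") gives padding-left a value of 0px and leaves the other three sides at 5px, which is the same set of values as in the result shown above.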

3. CSS 3 selector support

CSS selectors are a popular query language in the HTML world: simple, powerful, and supported by many browsers. Jumony supports nearly the complete set of CSS3 selectors (all except runtime pseudo-classes and pseudo-elements). With selectors, we can easily find the objects we are interested in within an HTML document. For example, to grab all the article titles on the blog park homepage:

new JumonyParser().LoadDocument( "www.php.cn/" ).Find( ".post_item a.titlelnk" )

Fetch, parse, select, all in one go. With just a couple of lines of code, we can print the captured data to the console:

 foreach( var title in new JumonyParser().LoadDocument( "www.php.cn/" ).Find( ".post_item a.titlelnk" ) )
  Console.WriteLine( title.InnerText() );

Further parser features, with the examples that accompany them:

Feature                                          Example
A stray < is parsed as text                      < a should be parsed as the text < a
A stray > is parsed as text                      > should be parsed as the text >
Flag attributes (attributes without a value)
Elements with a missing end tag                  Test link
Elements with an optional end tag                abc 123
    "body", "colgroup", "dd", "dt", "head", "html", "li", "option", "p", "tbody", "td", "tfoot", "th", "thead", "tr"
Elements without an end tag
    "area", "base", "basefont", "br", "col", "frame", "hr", "img", "input", "isindex", "link", "meta", "param", "wbr", "bgsound", "spacer", "keygen"
CData elements                                   <script>if ( 1<a ) alert( "< p>" );</script>
    "script", "style", "textarea", "title"
Preformatted elements                            There is a space in front
Attribute values in single quotes
HTML declaration

List of CSS3 selectors supported by Jumony:
Selector             Description
*                    Selects all elements
p a                  Selects descendant elements
p>a                  Selects child elements
p+a                  Selects adjacent elements
p~a                  Selects subsequent elements
[attr]               Attribute existence
[attr=value]         Exact match of the attribute value
[attr~=value]        Approximate match of the attribute value
[attr^=value]        Attribute value starts with
[attr*=value]        Attribute value contains
[attr$=value]        Attribute value ends with
[attr!=value]        Negative match of the attribute value
:not                 Negation pseudo-class
:only-child          Only-child pseudo-class
:only-of-type        only-of-type pseudo-class
:empty               Empty element pseudo-class
:nth-child           Structural pseudo-class
:nth-last-child      Structural pseudo-class
:nth-of-type         Structural pseudo-class
:nth-last-of-type    Structural pseudo-class
:first-child         Structural pseudo-class
:last-child          Structural pseudo-class
:first-of-type       Structural pseudo-class
:last-of-type        Structural pseudo-class
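
The attribute selectors in this list differ only in how the attribute's actual value is compared with the value in the selector. The C# sketch below shows one way to express those comparisons. It follows the CSS definitions for the standard operators and is not taken from Jumony; the class AttributeSelector and the method Matches are hypothetical, and treating [attr!=value] as matching elements that lack the attribute (the jQuery convention) is an assumption.

```csharp
using System;

// Hypothetical helper, not part of Jumony.
static class AttributeSelector
{
    // actual:   the attribute's value on the element, or null if absent.
    // op:       "" for [attr], otherwise "=", "~=", "^=", "*=", "$=" or "!=".
    // expected: the value written in the selector (ignored when op is "").
    public static bool Matches(string actual, string op, string expected)
    {
        switch (op)
        {
            case "":   // [attr]: the attribute exists
                return actual != null;
            case "=":  // exact match
                return actual != null && actual == expected;
            case "~=": // one of the whitespace-separated words equals expected
                return actual != null && Array.IndexOf(
                    actual.Split((char[])null, StringSplitOptions.RemoveEmptyEntries),
                    expected) >= 0;
            case "^=": // starts with (an empty expected value never matches)
                return actual != null && expected.Length > 0
                    && actual.StartsWith(expected, StringComparison.Ordinal);
            case "*=": // contains
                return actual != null && expected.Length > 0
                    && actual.IndexOf(expected, StringComparison.Ordinal) >= 0;
            case "$=": // ends with
                return actual != null && expected.Length > 0
                    && actual.EndsWith(expected, StringComparison.Ordinal);
            case "!=": // assumption: also true when the attribute is absent
                return actual != expected;
            default:
                throw new ArgumentException("unknown operator: " + op, nameof(op));
        }
    }
}
```

With this sketch, an element whose class attribute is "post_item first" matches [class~=post_item] but not [class=post_item].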


4. Powerful extensibility

Jumony Core 3 gives users the greatest possible extensibility. You can customize the HTML specification, implement your own parser, graft other DOM models onto the Jumony API, invent your own CSS selector pseudo-classes, or even build your own style of API, such as a jQuery-style one.

Jumony Core has many derivative projects, for example for crawling websites, providing a jQuery-style API, developing websites, producing MHT files, and adding CSS selector support to HAP (HtmlAgilityPack) parsing results. All of these projects draw their power from the extensibility of Jumony Core.

