Input and output
Input and output are the basic functions of many websites: users enter data, and the website displays that data for others to browse.
Take the blog, currently so popular, as an example. Here the input is the article the author edits, and the output is the blog page generated for others to read.
There is a problem, though. User input is usually uncontrolled: it may be badly formatted or contain code that poses a security risk, yet the final output of the website must be correct HTML. This requires error correction and filtering of user input.
Never trust user input
You may say: WYSIWYG editors are everywhere now, FCKeditor, TinyMCE... you could name a lot of them. Yes, they can all automatically generate standard XHTML code, but as a web developer you must have heard "never trust user-submitted data".
Therefore it is necessary to correct and filter user input data.
Need better error correction and filtering
So far, I have not seen any implementation that satisfies me. The ones I have come across are usually inefficient and less than ideal, each with obvious flaws of one kind or another. To give a well-known example: WordPress is a very widely used blog system. It is simple to operate, powerful, and has rich plug-in support. However, its integrated TinyMCE and the pile of overly clever error correction and filtering code in the back end are quite a headache: forced replacement of half-width characters, overly conservative replacement rules, and so on make it hard to do something as basic as pasting a piece of code and having it display correctly.
Let me complain a little while I'm at it. This blog runs on WordPress. To get the code in these articles to display correctly, I searched a lot online and tried some plug-ins. In the end I went through its code, and only after commenting out some of the filtering rules could things barely be displayed decently -.-b
Of course, I don't want to criticize WordPress too much; I just want to show that it can do better.
What is Tidy and how does it work?
The Tidy man page describes it this way:
Tidy reads HTML, XHTML and XML files and writes cleaned up markup. For HTML varieties, it detects and corrects many common coding errors and strives to produce visually equivalent markup that is both W3C compliant and works on most browsers. A common use of Tidy is to convert plain HTML to XHTML. For generic XML files, Tidy is limited to correcting basic well-formedness errors and pretty printing.
In short, Tidy cleans up HTML code and generates clean markup that conforms to W3C standards, and it supports HTML, XHTML, and XML. Tidy also provides a library, TidyLib, which makes it easy to use Tidy's features in other applications. Fortunately, PHP has a corresponding tidy module.
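As a rough illustration of that cleanup, the sketch below passes a deliberately malformed fragment through PHP's tidy_repair_string with the output-xhtml option. The helper name tidyToXhtml and the sample markup are made up for this example; it assumes the tidy extension is loaded and simply returns the input unchanged when it is not.

```php
<?php
// Hypothetical helper (not from the original article): repair a fragment
// and return a full XHTML document, or the input untouched if tidy is missing.
function tidyToXhtml($html)
{
    if (!function_exists('tidy_repair_string')) {
        return $html;
    }
    // 'output-xhtml' asks Tidy to emit XHTML rather than plain HTML
    $fixed = tidy_repair_string($html, array('output-xhtml' => true), 'utf8');
    return ($fixed === false) ? $html : $fixed;
}

// Sample input with an unclosed <b> and an unclosed <p>
echo tidyToXhtml('<p>Hello <b>world');
```

With the extension available, the result is a complete document (doctype, head, and body) in which the open tags have been closed.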
Dude, why PHP again?
Uh, this question... I'm ashamed, because I only know a little bit about PHP -.-v
But fortunately, what I'm presenting here is not pure code. There is at least some analysis along the way, and sharing that is much more useful than just posting code.
Using Tidy in PHP
To use Tidy in PHP, you need to install the Tidy module, which means loading the PHP extension tidy.so. I'll skip the specific steps; it is purely grunt work. In the end, if you can see "Tidy support enabled" in phpinfo(), you are good to go.
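Besides reading phpinfo(), the same check can be done from code. This is a minimal sketch; the helper name tidyIsAvailable is invented for illustration, while extension_loaded and function_exists are standard PHP functions.

```php
<?php
// Hypothetical helper: report whether the tidy extension is usable.
function tidyIsAvailable()
{
    return extension_loaded('tidy') && function_exists('tidy_repair_string');
}

if (tidyIsAvailable()) {
    echo "tidy is loaded\n";
} else {
    echo "tidy is missing; enable the extension in php.ini\n";
}
```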
With this module in place, almost all of the functions provided by Tidy can be used from PHP. Routine HTML cleanup becomes extremely easy. You can even generate a parse tree of the document and work with each HTML node much as you would with the DOM on the client side. Specific code follows below, and you can also consult the official PHP manual.
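To give a feel for the parse tree, here is a small sketch that walks the nodes under the body and collects their tag names. The function names collectTagNames and listBodyTags are hypothetical and the sample markup is invented; the tidy calls themselves (tidy_parse_string, tidy_get_body, and the tidyNode properties name, type, and child) come from the PHP tidy extension.

```php
<?php
// Hypothetical helper: collect element names under a tidy node, depth-first,
// much like walking the DOM in a browser.
function collectTagNames($node, array &$names)
{
    if ($node === null) {
        return;
    }
    // Record only element nodes; text and comment nodes are skipped
    if ($node->type === TIDY_NODETYPE_START || $node->type === TIDY_NODETYPE_STARTEND) {
        $names[] = $node->name;
    }
    if ($node->hasChildren()) {
        foreach ($node->child as $child) {
            collectTagNames($child, $names);
        }
    }
}

// Hypothetical helper: list the tags found in the body of a fragment.
// Returns an empty array when the tidy extension is not available.
function listBodyTags($html)
{
    $names = array();
    if (!function_exists('tidy_parse_string')) {
        return $names;
    }
    $doc = tidy_parse_string($html, array('output-xhtml' => true), 'utf8');
    if ($doc === false) {
        return $names;
    }
    collectTagNames(tidy_get_body($doc), $names);
    return $names;
}

print_r(listBodyTags('<p>Hi <a href="#">link</a></p>'));
```

The same recursive pattern is what makes per-node filtering possible: each node can be inspected before its markup is emitted.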
PHP+Tidy implementation of error correction and filtering
That was a lot of background material, and it may seem confusing; the actual code that solves the problem is the most direct explanation.
1. Simple error correction implementation
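function HtmlFix($html)
{
    // Without the tidy extension, return the input unchanged
    if (!function_exists('tidy_repair_string'))
        return $html;
    // Use tidy to repair the HTML code (input and output are UTF-8)
    $conf = array('output-xhtml' => true);
    $str = tidy_repair_string($html, $conf, 'utf8');
    if ($str === false)
        return $html;
    // Parse the repaired markup so that the body node can be read
    $doc = tidy_parse_string($str, $conf, 'utf8');
    $s = '';
    $body = $doc ? tidy_get_body($doc) : null;
    // An empty body has no child nodes, so there is nothing to output
    if (!$body || !is_array($body->child)) {
        return $s;
    }
    // Concatenate the markup of every direct child of <body>
    foreach ($body->child as $n) {
        $s .= $n->value;
    }
    return $s;
}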
The code above cleans up and corrects XHTML that may not be well-formed and outputs standard XHTML (both input and output are UTF-8 encoded). It is not the most streamlined implementation, because I wrote it out in as much detail as possible to match the filtering function below.
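For comparison, a more streamlined variant might look like the sketch below. It is not the author's version: the name HtmlFixShort is invented, and it relies on Tidy's show-body-only option to return just the body content instead of walking the child nodes by hand.

```php
<?php
// Hypothetical shorter variant of HtmlFix (the name is illustrative only).
function HtmlFixShort($html)
{
    // Without the tidy extension, return the input unchanged
    if (!function_exists('tidy_repair_string')) {
        return $html;
    }
    $conf = array(
        'output-xhtml'   => true,  // emit XHTML
        'show-body-only' => true,  // return only what is inside <body>
    );
    $fixed = tidy_repair_string($html, $conf, 'utf8');
    return ($fixed === false) ? $html : $fixed;
}

// Sample input with unclosed list items
echo HtmlFixShort('<ul><li>one<li>two</ul>');
```

The trade-off is that this form gives no chance to inspect individual nodes, which the filtering version needs.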
2. Advanced implementation: Error correction + filtering
Features:
XHTML error correction, with standard XHTML code as output.
Filtering of unsafe code without affecting how the content is displayed; only unsafe code in style/javascript is removed.
Insert the
function HtmlFixSafe($html)
{
if(!function_exists('tidy_repair_string'))
return $html;
//use tidy to repair html code
// tidy parameter settings
$conf = array(
'output-xhtml'=>true
,'drop-empty-paras'=>FALSE
,'join-classes'=>TRUE
,'show-body-only'=>TRUE
);
//repair
$str = tidy_repair_string($html,$conf ,'utf8');
//Generate parse tree
$str = tidy_parse_string($str,$conf,'utf8');
$s ='';
//Get body node
$body = @tidy_get_body($str);
//Function _dumpnode, check each node, filter and output
function _dumpnode($node,&$s ){
// Check the node name, if it is