Summary of some methods of converting HTML into text with PHP

WBOY
Release: 2016-07-13 16:57:38

PHP provides the built-in function strip_tags for converting HTML to text, but this function is not always enough. Here is a summary of some user-defined functions for reference.

The most commonly used PHP function: strip_tags

The code is as follows:


<?php
$mystr = <<<SATO
(dozens of lines of HTML code are omitted here)
SATO;
$str = strip_tags($mystr);
// At this point the HTML has been converted to plain text; strip_tags makes this very convenient.
// The rest of the plug-in (word segmentation and other operations) is not covered here.
?>
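
strip_tags removes only the tags themselves, which is one reason it is sometimes not enough. The short sketch below illustrates this; the sample string and the variable names are assumptions made up for the illustration. The body of a script element survives and entities stay encoded, so a decoding step such as html_entity_decode is typically applied afterwards.

```php
<?php
// Sample input invented for this illustration.
$html = '<p>Tom &amp; Jerry</p><script>alert(1);</script>';

// strip_tags drops the tags but keeps the text between them,
// including the JavaScript source inside <script>.
$plain = strip_tags($html);

// Entities are still encoded after strip_tags; decode them separately.
$decoded = html_entity_decode($plain, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $plain, "\n";   // Tom &amp; Jerryalert(1);
echo $decoded, "\n"; // Tom & Jerryalert(1);
```

This is why the custom functions in the rest of this article first delete script and style blocks and then deal with entities.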


Custom function

The code is as follows:

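<?php
// $document should contain an HTML document.
// This example removes HTML tags, JavaScript code
// and whitespace. It also converts some common
// HTML entities to their text equivalents.

$search = array ("'<script[^>]*?>.*?</script>'si",  // Remove JavaScript
                 "'<[\/\!]*?[^<>]*?>'si",           // Remove HTML tags
                 "'([\r\n])[\s]+'",                 // Remove whitespace
                 "'&(quot|#34);'i",                 // Replace HTML entities
                 "'&(amp|#38);'i",
                 "'&(lt|#60);'i",
                 "'&(gt|#62);'i",
                 "'&(nbsp|#160);'i",
                 "'&(iexcl|#161);'i",
                 "'&(cent|#162);'i",
                 "'&(pound|#163);'i",
                 "'&(copy|#169);'i");

$replace = array ("",
                  "",
                  "\\1",
                  "\"",
                  "&",
                  "<",
                  ">",
                  " ",
                  chr(161),   // chr() returns one byte, so these assume a single-byte encoding such as ISO-8859-1
                  chr(162),
                  chr(163),
                  chr(169));

$text = preg_replace ($search, $replace, $document);

// Remaining numeric entities (&#NNN;): the /e modifier, which ran the
// replacement as PHP code, was removed in PHP 7, so a callback does the conversion.
$text = preg_replace_callback ("'&#(\d+);'",
                               function ($m) { return chr((int) $m[1]); },
                               $text);
?>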
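
The entity list above covers only a handful of entities. As an alternative, the following is a minimal sketch that keeps the same steps (remove scripts, remove tags, tidy whitespace) but leaves entity handling to the built-in html_entity_decode. The function name strip_html_sketch, the sample input and the UTF-8 target encoding are assumptions for illustration; this is not the original snippet.

```php
<?php
// Hypothetical helper name, used only for this illustration.
function strip_html_sketch(string $document): string
{
    // Remove script and style blocks together with their contents.
    $text = preg_replace('#<(script|style)\b[^>]*>.*?</\1>#si', '', $document);
    // Remove the remaining tags.
    $text = preg_replace('#<[/!]*?[^<>]*?>#s', '', $text);
    // Decode named and numeric entities in one step (UTF-8 output assumed).
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Collapse the whitespace that follows a line break.
    $text = preg_replace('/([\r\n])\s+/', '$1', $text);
    return trim($text);
}

echo strip_html_sketch("<p>Tom &amp; Jerry</p>\n\n  <script>var x = 1;</script><b>done</b>"), "\n";
// Prints two lines: "Tom & Jerry" and "done"
```

Decoding after the tags are removed means an encoded `&lt;b&gt;` in the input ends up as literal text instead of being stripped as a tag.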


Later I found another method online, also written in PHP, that converts HTML to plain text. I find it quite practical, so I am sharing it here. The code is as follows:

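function HtmlToText($str){
  // Remove CSS styles, JS scripts and HTML comments
  $str=preg_replace("/<style.*<\/style>|<script.*<\/script>|<!--.*-->/isU","",$str);
  $alltext="";// accumulates the plain text
  $start=1;// switch: 1 = outside a tag (copy characters), 0 = inside a tag (skip them)
  for($i=0;$i<strlen($str);$i++){
    if(($start==0)&&($str[$i]==">")){// a closing > ends the tag, so set $start=1 to resume copying
      $start=1;
    }else if($start==1){// copying
      if($str[$i]=="<"){// an opening < starts a tag: stop copying and put | in its place
        $start=0;
        $alltext.="|";
      }else if(ord($str[$i])>31){// keep only bytes whose value is above 31, which drops control characters
        $alltext.=$str[$i];
      }
    }
  }
  // The lines below clean up spaces and some special characters
  $alltext = str_replace(" "," ",$alltext);// normalize space characters
  $alltext = preg_replace("/&([^;&]*)(;|&)/","",$alltext);// drop HTML entities such as &nbsp;
  $alltext = preg_replace("/[ ]+/s"," ",$alltext);// collapse runs of spaces into one
  return $alltext;
}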

The method above can also be used to convert simple HTML code to plain text.
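
To make the behaviour of the scanning loop easier to see, here is a simplified, self-contained restatement of it with sample output. The name scan_text_sketch, the $sep parameter and the sample input are assumptions for illustration. The sketch leaves out the script/style removal and the final cleanup, so it is not a replacement for the function above.

```php
<?php
// Hypothetical name; a simplified restatement of the scanning loop only.
function scan_text_sketch(string $html, string $sep = '|'): string
{
    $out = '';
    $inTag = false;              // true while between < and >
    $len = strlen($html);
    for ($i = 0; $i < $len; $i++) {
        $ch = $html[$i];
        if ($inTag) {
            if ($ch === '>') {   // end of the tag: resume copying
                $inTag = false;
            }
        } elseif ($ch === '<') { // start of a tag: emit the separator once
            $inTag = true;
            $out .= $sep;
        } elseif (ord($ch) > 31) { // skip control bytes such as \n and \t
            $out .= $ch;
        }
    }
    return $out;
}

echo scan_text_sketch("<p>Hello</p>\n<b>World</b>"), "\n"; // |Hello||World|
```

Every tag, opening or closing, leaves one separator behind, so adjacent tags produce runs such as `||`. Passing an empty string as $sep gives plain concatenated text instead.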

Example 3


function html2text($str,$encode = 'GB2312')
{

  $str = preg_replace("/