忽略标签时截断包含 HTML 的文本
尝试截断包含 HTML 的文本时,通常会遇到标签未正确关闭的问题,导致截断结果失真。为了克服这个问题,有必要有效地解析 HTML 并处理标签。
这是一种基于 PHP 的方法,可确保标签在截断期间正确关闭:
function printTruncated($maxLength, $html, $isUtf8=true) { $printedLength = 0; $position = 0; $tags = array(); // Regex pattern for matching HTML tags, entities, and UTF-8 characters $re = $isUtf8 ? '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;|[\x80-\xFF][\x80-\xBF]*}' : '{</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;}'; while ($printedLength < $maxLength && preg_match($re, $html, $match, PREG_OFFSET_CAPTURE, $position)) { // 1. Handle text leading up to the tag $str = substr($html, $position, $match[0][1] - $position); if ($printedLength + strlen($str) <= $maxLength) { print($str); $printedLength += strlen($str); } else { print(substr($str, 0, $maxLength - $printedLength)); $printedLength = $maxLength; break; } // 2. Handle the tag $tag = $match[0][0]; if ($tag[0] == '&' || ord($tag) >= 0x80) { // Pass the entity or UTF-8 character through unchanged print($tag); $printedLength++; } else { $tagName = $match[1][0]; if ($tag[1] == '/') { // Closing tag $openingTag = array_pop($tags); assert($openingTag == $tagName); // Ensure proper tag nesting print($tag); } else if ($tag[strlen($tag) - 2] == '/') { // Self-closing tag print($tag); } else { // Opening tag print($tag); $tags[] = $tagName; } } $position = $match[0][1] + strlen($tag); } // 3. Print remaining text if ($printedLength < $maxLength && $position < strlen($html)) print(substr($html, $position, $maxLength - $printedLength)); // 4. Close any open tags while (!empty($tags)) printf('</%s>', array_pop($tags)); }
为了说明其功能:
printTruncated(10, '<b><Hello></b> <img src="world.png" alt="" /> world!'); // Output: <b><Hello></b> <img printTruncated(10, '<table><tr><td>Heck, </td><td>throw</td></tr><tr><td>in a</td><td>table</td></tr></table>'); // Output: <table><tr><td>Heck printTruncated(10, "<em><b>Hello</b>&#20;w\xC3\xB8rld!</em>"); // Output: <em><b>Hello</b> w
以上是如何截断包含 HTML 的文本,同时确保正确的标签闭合?的详细内容。更多信息请关注PHP中文网其他相关文章!