


Extracting web page titles in PHP and removing irrelevant SEO keywords
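Scenario:
When extracting a web page's title, we used to take the content between <title> and </title> directly. In practice this is not ideal. For example, for the JavaEye article http://www.iteye.com/news/21643, the <title> content is "10年软件开发教会我最重要的10件事 - 非技术 - ITeye资讯" ("The 10 most important things that 10 years of software development taught me - Non-technical - ITeye News"), but the title we actually want to quote is "10年软件开发教会我最重要的10件事". The title is padded with unrelated keywords (presumably for SEO), and we want to filter them out. Two approaches are worth considering:
1. Look for h1 and similar tags. (After analysing sites such as Sina News, this did not seem workable; there is too much noise.)
2. Remove the title from the full page text, then split the <title> content on the separators _ | - into a1, a2, a3, a4. Starting from the longest segment (say a3), search for it in the full text. If it is found, extend to the left, trying a2 and then a1, until a search fails. Once the left side fails, extend to the right in the same way. (This is the approach I use here.)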
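As a rough illustration of the splitting step in approach 2, the following sketch cuts the ITeye title from the example above into segments and picks the longest one. The separator pattern and the variable names are assumptions for illustration and are not taken from the original code.

```php
<?php
// Title text from the ITeye example above.
$title = '10年软件开发教会我最重要的10件事 - 非技术 - ITeye资讯';

// Assumed separator set: "-", "_", "|" or "—", with optional surrounding whitespace.
$segments = preg_split('/\s*[-_|—]\s*/u', $title, -1, PREG_SPLIT_NO_EMPTY);

// Find the index of the longest segment (byte length via strlen).
$maxIndex = 0;
foreach ($segments as $i => $segment) {
    if (strlen($segment) > strlen($segments[$maxIndex])) {
        $maxIndex = $i;
    }
}

echo count($segments), "\n";        // 3
echo $segments[$maxIndex], "\n";    // 10年软件开发教会我最重要的10件事
```

The longest segment is only a starting guess; the search against the page body decides which neighbouring segments are kept.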
PHP code:
/**
* @author pqcc
* @date: 2011-06-18
* Description: Given the content of a web page, extract its title. The extracted title does not include SEO keywords.
* e.g.: a news title extracted directly from <title> also contains the site's keywords,
* but the result we want is: "The College English CET-4 and CET-6 exams will be held this Saturday, with 9.09 million candidates."
* Scope: title extraction for final article pages only, excluding topic pages and the like.
*/
class TitlePurify{
// Character class of title separators: "-", "_", whitespace, "|" and "—".
private $matches_preg = '[-_\s|—]';
function getTitle($contents){/*{{{*/
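// Capture the text between <title> and </title> (pattern restored; the tags were stripped from the listing).
$preg = "/<title[^>]*>(.*?)<\/title>/is";
preg_match($preg, $contents, $matches);
if(count($matches)<=1){
return "Title extraction failed";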
}
$title = $matches[1];
return $this->trimTitle($title, $contents);
}/*}}}*/
function trimMeta($contents){/*{{{*/
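// First remove the <title> element, then the <meta> tags.
// Both patterns are assumed; the originals were stripped from the listing.
$preg = "/<title[^>]*>.*?<\/title>/is";
$contents = preg_replace($preg, '', $contents);
$preg = "/<meta[^>]*>/i";
$contents = preg_replace($preg, '', $contents);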
return $contents;
}/*}}}*/
// Return the index of the longest item.
function getMaxIndex($titles){/*{{{*/
$maxItemIndex = 0;
$maxLength = 0;
$loop = 0;
foreach($titles as $item){
if(strlen($item)>$maxLength){
$maxLength = strlen($item);
$maxItemIndex = $loop;
}
$loop++;
}
return $maxItemIndex;
}/*}}}*/
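function trim($title, $titles, $contents, $maxItemIndex){/*{{{*/
// NOTE: reconstructed from a garbled, truncated listing; details are assumptions.
//@todo: $contents could be optimized here.
// Start from the longest item, then iterate to the left until the first item is reached or a match fails.
$tempTitle = $titles[$maxItemIndex];
$result = $tempTitle;
$leftIndex = $maxItemIndex-1;
while($leftIndex>=0){
// Candidate = left neighbour + separator(s) + tempTitle, as it appears in the title.
$pattern = preg_quote($titles[$leftIndex], '/') . $this->matches_preg . '+' . preg_quote($tempTitle, '/');
if(!preg_match("/({$pattern})/iu", $title, $matches)){ break; } // Under normal circumstances, this situation will not occur.
$tempTitle = $matches[1];
// Continue only while the candidate still occurs in the page content.
if(!preg_match('/' . preg_quote($tempTitle, '/') . '/iu', $contents)){ break; }
$result = $tempTitle;
$leftIndex--;
}
// The source listing is cut off here; the rightward pass and trimTitle() are not shown.
return $result;
}/*}}}*/
}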
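The class above is only partially preserved, so here is a minimal, self-contained sketch of approach 2 as a single function. It is not the author's implementation: the function name purifyTitle, the separator pattern, the case-sensitive strpos search, and the fallback of returning the full title when nothing matches are all assumptions made for illustration.

```php
<?php
/**
 * Hypothetical helper (not from the original listing): keep only those
 * <title> segments that, joined together, also occur in the page body.
 */
function purifyTitle(string $html): ?string
{
    // 1. Extract the raw <title> text.
    if (!preg_match('/<title[^>]*>(.*?)<\/title>/isu', $html, $m)) {
        return null; // no title found
    }
    $title = trim($m[1]);

    // 2. Remove the <title> element so the body search cannot match the title itself.
    $body = preg_replace('/<title[^>]*>.*?<\/title>/isu', '', $html);

    // 3. Split on separators, remembering each segment's byte offset in $title.
    $parts = preg_split('/\s*[-_|—]\s*/u', $title, -1,
        PREG_SPLIT_NO_EMPTY | PREG_SPLIT_OFFSET_CAPTURE);
    if (!$parts) {
        return $title;
    }

    // 4. Start from the longest segment.
    $start = 0;
    foreach ($parts as $i => $p) {
        if (strlen($p[0]) > strlen($parts[$start][0])) {
            $start = $i;
        }
    }
    if (strpos($body, $parts[$start][0]) === false) {
        return $title; // assumed fallback: keep the full title
    }

    // Slice of $title spanning segments $from..$to, separators included.
    $slice = function (int $from, int $to) use ($title, $parts): string {
        $begin = $parts[$from][1];
        $end = $parts[$to][1] + strlen($parts[$to][0]);
        return substr($title, $begin, $end - $begin);
    };

    // 5. Grow left, then right, while the grown candidate still occurs in the body.
    $left = $right = $start;
    while ($left > 0 && strpos($body, $slice($left - 1, $right)) !== false) {
        $left--;
    }
    while ($right < count($parts) - 1 && strpos($body, $slice($left, $right + 1)) !== false) {
        $right++;
    }

    return $slice($left, $right);
}

// Usage with a made-up page built around the ITeye title from the scenario.
$html = '<html><head><title>10年软件开发教会我最重要的10件事 - 非技术 - ITeye资讯</title></head>'
      . '<body><h3>10年软件开发教会我最重要的10件事</h3><p>正文</p></body></html>';
echo purifyTitle($html), "\n"; // 10年软件开发教会我最重要的10件事
```

Slicing the original title by byte offsets, rather than rejoining segments with a fixed separator, keeps whatever separators the page actually used between the segments that survive.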