


Extracting web page titles with PHP and removing irrelevant SEO keywords
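Scenario:
In the past, when extracting a web page's title, we would simply take the content between <title> and </title>. In practice this is not enough. Take an article on ITeye (formerly JavaEye), http://www.iteye.com/news/21643: the content of its <title> is "10年软件开发教会我最重要的10件事 - 非技术 - ITeye资讯", but the title we actually want when citing it is "10年软件开发教会我最重要的10件事". The title is padded with unrelated keywords (presumably for SEO), and we want to filter them out. The following approaches can be considered:
1. Look for h1 and similar tags. (After analyzing sites such as Sina News, I concluded this is not feasible; there is too much noise.)
2. Remove the title element from the full text, then split the content between <title> and </title> (on _ | -) into a1, a2, a3, a4, and search the full text starting from the longest segment, say a3. If that search succeeds, iterate to the left, searching for a2, then a1, until a search fails. Once the left side fails, iterate to the right in the same way. (This is the approach I use here.)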
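The splitting step in approach 2 can be sketched as follows. This is a minimal illustration, not the author's code: the function names `splitTitle`, `charLength` and `longestIndex` are hypothetical, the separator set adds the em dash to the `_ | -` listed above, and the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: split a <title> string into segments on the
// separators _ | - (plus the em dash), swallowing surrounding whitespace.
function splitTitle(string $title): array
{
    // The /u modifier makes the pattern UTF-8 aware, so the em dash is
    // treated as one character rather than three separate bytes.
    $parts = preg_split('/\s*[-_|—]+\s*/u', $title, -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : array_map('trim', $parts);
}

// Hypothetical helper: length in characters, not bytes, using PCRE only.
function charLength(string $text): int
{
    return (int) preg_match_all('/./su', $text);
}

// Index of the longest segment; ties go to the leftmost one.
function longestIndex(array $segments): int
{
    $best = 0;
    foreach ($segments as $i => $segment) {
        if (charLength($segment) > charLength($segments[$best])) {
            $best = $i;
        }
    }
    return $best;
}

$segments = splitTitle('10年软件开发教会我最重要的10件事 - 非技术 - ITeye资讯');
$start = longestIndex($segments);
echo $segments[$start], PHP_EOL; // prints 10年软件开发教会我最重要的10件事
```

Measuring in characters rather than bytes keeps a short Chinese segment from beating a longer ASCII one. Note that splitting on a bare `-` also breaks hyphenated words such as "CET-4" into separate segments; the left and right iteration is what joins them back together when the joined text appears in the page body.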
PHP code:
/**
* @author pqcc
* @date: 2011-06-18
* Description: Given the content of a web page, extract the page title. The extracted title does not include SEO keywords.
* e.g. a news title extracted directly from <title></title> still carries the trailing keywords,
* but the result we want is: "College English CET-4 and CET-6 exams begin this Saturday, with 9.09 million candidates sitting them."
* Scope: extracting titles from final article pages; topic pages and the like are excluded.
*/
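class TitlePurify{
    // Separator characters between title segments (body of a character class).
    // Assumed original value: the quotes and the backslash of \s were lost in the source.
    private $matches_preg = "[-_\s|—]";

    function getTitle($contents){/*{{{*/
        // Assumed pattern: the source line is cut off after the opening "/".
        $preg = "/<title[^>]*>(.*?)<\/title>/is";
        preg_match($preg, $contents, $matches);
        if(count($matches)<=1){
            return "Title extraction failed";
        }
        $title = $matches[1];
        return $this->trimTitle($title, $contents);
    }/*}}}*/

    // Reconstructed (assumption): called above, but its body is missing from the source.
    function trimTitle($title, $contents){/*{{{*/
        $titles = preg_split("/{$this->matches_preg}+/u", $title, -1, PREG_SPLIT_NO_EMPTY);
        if(!$titles || count($titles)<=1){
            return $title;
        }
        $contents = $this->trimMeta($contents);
        return $this->trim($title, $titles, $contents, $this->getMaxIndex($titles));
    }/*}}}*/

    function trimMeta($contents){/*{{{*/
        // First remove the <title> element, so segments are not matched against the title itself.
        $preg = "/<title[^>]*>.*?<\/title>/is";
        $contents = preg_replace($preg, '', $contents);
        // Then remove <meta> tags (assumed: only "]*>" of this pattern survives in the source).
        $preg = "/<meta[^>]*>/i";
        $contents = preg_replace($preg, '', $contents);
        return $contents;
    }/*}}}*/

    // Get the index of the longest item.
    function getMaxIndex($titles){/*{{{*/
        $maxItemIndex = 0;
        $maxLength = 0;
        $loop = 0;
        foreach($titles as $item){
            if(strlen($item)>$maxLength){
                $maxLength = strlen($item);
                $maxItemIndex = $loop;
            }
            $loop++;
        }
        return $maxItemIndex;
    }/*}}}*/

    // The source breaks off inside this method; everything after the left-hand loop
    // header is reconstructed from the description of approach 2 (assumption).
    function trim($title, $titles, $contents, $maxItemIndex){/*{{{*/
        //@todo: $contents can be optimized here
        $result = $titles[$maxItemIndex];
        // If even the longest segment is not found, keep the full title.
        // Under normal circumstances, this situation will not occur.
        if(stripos($contents, $result) === false){
            return $title;
        }
        $sep = "{$this->matches_preg}+";
        // Search succeeded: iterate to the left until the first segment is reached or a match fails.
        for($leftIndex = $maxItemIndex-1; $leftIndex>=0; $leftIndex--){
            // Candidate = left neighbour + separator(s) + current result, as written in the title.
            $preg = '/' . preg_quote($titles[$leftIndex], '/') . $sep . preg_quote($result, '/') . '/iu';
            if(!preg_match($preg, $title, $matches) || stripos($contents, $matches[0]) === false){
                break;
            }
            $result = $matches[0];
        }
        // Then iterate to the right in the same way.
        for($rightIndex = $maxItemIndex+1; $rightIndex<count($titles); $rightIndex++){
            $preg = '/' . preg_quote($result, '/') . $sep . preg_quote($titles[$rightIndex], '/') . '/iu';
            if(!preg_match($preg, $title, $matches) || stripos($contents, $matches[0]) === false){
                break;
            }
            $result = $matches[0];
        }
        return $result;
    }/*}}}*/
}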