
Extracting web page titles in PHP and removing irrelevant SEO keywords

Jul 13, 2016, 05:44 PM

Scenario:

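  In the past, when extracting the title of a web page, we simply took the content between <title> and </title>. In practice that is not enough. Take the JavaEye (ITeye) article at http://www.iteye.com/news/21643: its <title> content is "10年软件开发教会我最重要的10件事 - 非技术 - ITeye资讯" ("The 10 most important things 10 years of software development taught me - Non-technical - ITeye News"), but when citing the article the title we actually want is "10年软件开发教会我最重要的10件事". The real title is followed by a pile of unrelated keywords (presumably for SEO), and we want to filter those out. Two approaches come to mind: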


1. Look for h1 and similar tags. (After analyzing Sina News and a few other sites, I decided this is not workable: there is too much noise.)

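2. Remove the title from the full page text, then split the content between <title> and </title> (on _ | -) into a1, a2, a3, a4, and search the full text starting with the longest segment, say a3. If that search succeeds, iterate to the left, searching for a2 and then a1, until a search fails. Once the left side fails, iterate to the right in the same way. (This is the method I use here.)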
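
To make the first steps of approach 2 concrete, here is a minimal sketch of the naive extraction, the split, and the choice of starting segment. It is an illustration only: the helper names rawTitle, splitTitle and longestIndex and the page fragment in $html are made up for this example, and the source file is assumed to be UTF-8. In this particular title the longest segment happens to be a1, so only the rightward iteration would have anything to check.

```php
<?php
// Illustration only: rawTitle, splitTitle and longestIndex are hypothetical
// helper names, and $html is a made-up page fragment.

// Return the text between <title> and </title>, or null if there is none.
function rawTitle(string $html): ?string
{
    // "s" lets "." span newlines, "i" ignores tag case, "u" treats the input as UTF-8.
    if (preg_match('/<title[^>]*>(.*?)<\/title>/isu', $html, $m) !== 1) {
        return null;
    }
    return trim($m[1]);
}

// Cut a title on runs of "_", "|", "-" and whitespace.
function splitTitle(string $title): array
{
    $parts = preg_split('/[-_|\s]+/u', $title, -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : $parts;
}

// Index of the longest segment; strlen() counts bytes, which is enough for a rough comparison.
function longestIndex(array $parts): int
{
    $best = 0;
    foreach ($parts as $i => $part) {
        if (strlen($part) > strlen($parts[$best])) {
            $best = $i;
        }
    }
    return $best;
}

$html  = '<html><head><title>10年软件开发教会我最重要的10件事 - 非技术 - ITeye资讯</title></head>'
       . '<body><h3>10年软件开发教会我最重要的10件事</h3></body></html>';
$title = rawTitle($html);               // still carries the " - 非技术 - ITeye资讯" suffix
$parts = splitTitle($title);            // the segments a1, a2, a3
$start = $parts[longestIndex($parts)];  // the segment the body search starts from

echo $title, PHP_EOL;
print_r($parts);
echo $start, PHP_EOL;
```

Whitespace is part of the separator set so that " - " counts as a single separator run. As a side effect, a title made of space-separated words is split into individual words, and the left and right iteration is what joins them back together.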


PHP code:
<?php
/**
 * @author pqcc
 * @date: 2011-06-18
 * Description: Given the content of a web page, extract the title of the page. The extracted title does not include SEO keywords.
 * e.g.: the title extracted directly from <title> on a news page is "College English Band 4 and 6 exams start this Saturday with 9.09 million candidates_Sina Education_Sina.com",
 * but the result we want is "College English Band 4 and 6 exams start this Saturday with 9.09 million candidates".
 * Scope of application: title extraction for final article pages; topic pages and the like are excluded.
 */

class TitlePurify{

    // Separator characters between title segments (used inside a regex).
    private $matches_preg = "[-_\s|—]";

    function getTitle($contents){/*{{{*/
        // "s" lets "." match newlines, so a multi-line <title> is captured too.
        $preg = "/<title[^>]*>(.*?)<\/title>/is";
        preg_match($preg, $contents, $matches);  
        if(count($matches)<=1){  
            return "Title extraction failed";
        }  
        $title = $matches[1];  
        return $this->trimTitle($title, $contents);  
    }/*}}}*/ 
 
    function trimMeta($contents){/*{{{*/
        // First remove the <title> content and the <meta> tags.
        $preg       = "/<title[^>]*>(.*?)<\/title>/is";
        $contents   = preg_replace($preg, '', $contents);
        $preg       = "/<meta[^>]*>/i";
        $contents   = preg_replace($preg, '', $contents);
        return $contents;
    }/*}}}*/
 
 
    // Get the index of the longest item.
    function getMaxIndex($titles){/*{{{*/ 
        $maxItemIndex   = 0;  
        $maxLength      = 0;  
        $loop           = 0;  
        foreach($titles as $item){  
            if(strlen($item)>$maxLength){  
                $maxLength      = strlen($item);  
                $maxItemIndex   = $loop;  
            }          
            $loop++;  
        }  
        return $maxItemIndex;  
    }/*}}}*/

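    // NOTE: getTitle() calls trimTitle(), which is missing from this listing, and the
    // body of trim() is garbled and cut off in the source. Both methods below are
    // reconstructed from the algorithm described above and are an assumption, not the
    // author's exact code.
    function trimTitle($title, $contents){/*{{{*/
        $contents = $this->trimMeta($contents);
        // "u" keeps multi-byte characters intact while splitting on the separators.
        $titles   = preg_split("/{$this->matches_preg}+/u", $title, -1, PREG_SPLIT_NO_EMPTY);
        if(empty($titles)){
            return $title;
        }
        return $this->trim($title, $titles, $contents, $this->getMaxIndex($titles));
    }/*}}}*/

    function trim($title, $titles, $contents, $maxItemIndex){/*{{{*/
        //@todo: $contents could be optimized here.
        $result = $titles[$maxItemIndex];
        if(stripos($contents, $result) === false){
            // Under normal circumstances, this situation will not occur.
            return $title;
        }
        // Search succeeded: result = tempTitle. Iterate to the left from the current index
        // (stop when the first item is reached or a match fails), then to the right.
        foreach(array(-1, 1) as $step){
            for($i = $maxItemIndex + $step; $i >= 0 && $i < count($titles); $i += $step){
                // tempTitle = result plus its neighbor, cut from the original title.
                $item = preg_quote($titles[$i], "/");
                $cur  = preg_quote($result, "/");
                $preg = $step < 0
                    ? "/({$item}{$this->matches_preg}+{$cur})/iu"
                    : "/({$cur}{$this->matches_preg}+{$item})/iu";
                if(!preg_match($preg, $title, $matches)){
                    break;
                }
                $tempTitle = $matches[1];
                // Continue to match with tempTitle; stop if the search fails.
                if(stripos($contents, $tempTitle) === false){
                    break;
                }
                $result = $tempTitle;
            }
        }
        return $result;
    }/*}}}*/
}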




