PHP regular matching to obtain the hyperlink address of the specified url page

PHP regular matching to obtain the hyperlink address of the specified url page_PHP tutorial

WBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWBOYWB

Release： 2016-07-20 11:16:58

Original

1197 people have browsed it

In data collection and page analysis, it is often necessary to capture the content of a given url page, or the second and third level in-depth page content.

Here is the implementation of a test example for reference only.

The code is as follows:

/*
Match the given page link
return:array match[link,content,all]
*/
function match_links($host, $document) {
$pattern = '/(.*?)/i';
preg_match_all($pattern, $document, $m);
return $m;

preg_match_all("']+))[^ >]*>?(.*?)'isx",$document,$links);
while(list($key,$val) = each($links[2])) {
if(!empty($val))
If(preg_match("/http/",$val)){
$match['link'][] = $val;
}
else {
$match['link'][] = $host . $val;
}
}
while(list($key,$val) = each($links[3])) {
if(!empty($val))
If(preg_match("/http/",$val)){
$match['link'][] = $val;
}
else {
$match['link'][] = $host . $val;
}
}
while(list($key,$val) = each($links[4])) {
if(!empty($val))
$match['content'][] = $val;
}
while(list($key,$val) = each($links[0])) {
if(!empty($val))
$match['all'][] = $val;
}
return $match['link'];
}

/*
Get the page text content from the given url
*/
function get_content_from_url($url) {
$str = @file_get_contents($url);
if(mb_check_encoding($str, "GBK"))
$str = iconv("GBK","UTF-8", $str);
$str = strip_tags($str); // Filter html tags
/*
$str = preg_replace( "@@is", "", $str );
$str = preg_replace( "@@is", "", $str );
$str = preg_replace( "@@is", "", $str );
$str = preg_replace( "@<(.*?)>@is", "", $str );
*/
//Filter non-Chinese characters
preg_match_all('/[x{4e00}-x{9fff}]+/u', $str, $matches);
$str = join(',', $matches[0]);
if(!$str)
Return NULL;

return $str;
}

function get_content($url,$depth) {
if(!$url || $depth < 1)
return false;

while($depth > 1){
$str = @file_get_contents($url);
if(!$str)
Return false;

$parseurl = parse_url($url);
if($parseurl['host'])
$host = $parseurl[scheme] . "://" . $parseurl['host'];

$arrlink = match_links($host,$str);
$arr_url = array_unique($arrlink);

$depth--;
foreach($arr_url as $url){
$content .= get_content($url, $depth); //Recursive call
}
}

$content .= get_content_from_url($url);

return $content;
}