
Scraping Information from a Web Page (PHP Regular Expressions and Excel Output with PHP)

WBOY
Release: 2016-06-23 13:33:02

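1. Problem description

Scrape the information I need from a fixed web page and store it as a table. I practiced on a contest ranking page from wustoj. Address: wustoj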


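2. Approach

For web work I have only learned a little PHP, so this was a good chance to put it to use. My plan was:

(1) View the page source and save it to a file.

(2) Write regular expressions for the information I need, read the file, and extract that information with them. It is best to use groups in the regular expressions, which makes extraction much easier.

(3) Work with Excel: output the extracted information as an Excel file.

A fairly good open-source PHP class for working with Excel files: (link). The code in section 4 uses the PHPExcel classes.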
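
Steps (1) and (2) can be tried offline before touching the real page. The sketch below is only an illustration under assumptions, not the code used in this post: the `$html` sample, the single combined pattern, and the `$rows` array are hypothetical, and the markup of the real ranking page may differ.

```php
<?php
// Hypothetical sample standing in for the saved page source.
$html = <<<'HTML'
<tr><td>1</td><td><a href="status.php?user_id=team30&cid=1014">team30_Alice</a></td><td><a href="status.php?user_id=team30&cid=1014&jresult=4">7</a></td></tr>
<tr><td>2</td><td><a href="status.php?user_id=team12&cid=1014">team12_Bob</a></td><td><a href="status.php?user_id=team12&cid=1014&jresult=4">5</a></td></tr>
HTML;

// One group per field: rank, team id, member name, problems solved.
$pattern = '/<td>(\d+|\*)<\/td>.*?>(\*?team\d+)_([^<]+)<\/a>.*?>(\d+)<\/a>/';

$rows = [];
foreach (explode("\n", $html) as $line) {
    // preg_match fills $m[1]..$m[4] with the four groups of the first match.
    if (preg_match($pattern, $line, $m)) {
        $rows[] = [
            'rank'   => $m[1],
            'team'   => $m[2],
            'name'   => $m[3],
            'solved' => (int) $m[4],
        ];
    }
}

print_r($rows);
```

Because every field has its own group, the values can be read by index instead of being cut out of the full match by hand.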


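3. Lessons learned

^ anchors the match at the start of the subject string, and $ anchors it at the end.
A whitespace character is not necessarily a space; it can also be a tab or a newline.
Grouping with () is a good approach, as in preg_match_all("/$pattern/", $subject, $matches).
With preg_match_all, $matches is a two-dimensional array. Without _all (plain preg_match), only the first match is returned and $matches is a one-dimensional array.
$matches[0] holds all matches of the full pattern. $matches[1] holds all matches of the first subgroup, that is, the first part of every match.
To match Chinese text I used $patt_ch=chr(0x80)."-".chr(0xff), a byte range placed inside a character class, as in [$patt_ch]+.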
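
The notes above can be checked with a few small calls. This is a minimal sketch with made-up subject strings; the variable names `$first`, `$all`, and `$m` are only for illustration.

```php
<?php
// ^ and $ anchor the pattern to the start and end of the subject.
$startsWithTeam = preg_match('/^team/', 'team30');   // matches
$teamNotAtStart = preg_match('/^team/', 'my team');  // no match

// A literal space does not match a tab; \s matches any whitespace.
$spaceVsTab = preg_match('/^a b$/', "a\tb");
$classVsTab = preg_match('/^a\sb$/', "a\tb");

$subject = '<a>3</a><a>7</a>';
$pattern = '/(<a>)([0-9]+)(<\/a>)/';

// preg_match: one-dimensional array describing the first match only.
preg_match($pattern, $subject, $first);

// preg_match_all: two-dimensional array; $all[0] holds the full matches,
// $all[2] holds what group 2 captured in each match.
preg_match_all($pattern, $subject, $all);

// Byte range 0x80-0xFF inside a character class matches the bytes of
// multi-byte (for example Chinese) text when the /u modifier is not used.
$patt_ch = chr(0x80) . "-" . chr(0xff);
$name = "team30_\xE5\xBC\xA0\xE4\xB8\x89"; // UTF-8 bytes of a two-character Chinese name
preg_match("/(team[0-9]+)_([$patt_ch]+)/", $name, $m);

print_r($first);
print_r($all);
echo strlen($m[2]), " bytes captured for the name\n";
```

Note that the byte range works on raw bytes, so the captured name is counted in bytes rather than characters.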


4. Code

<?php
header("Content-Type: text/html; charset=utf-8");

// Fetch the ranking page and save its source to a local file.
$url = "http://acm.wust.edu.cn/contestrank.php?cid=1014";
$result = file_get_contents($url);
$file = fopen("content.php", "w");
fwrite($file, $result);
fclose($file);
$file = fopen("content.php", "r");

// Byte range 0x80-0xFF, used in a character class to match Chinese text.
$patt_ch = chr(0x80) . "-" . chr(0xff);

// Rank cell such as <td>1</td>; group 2 is the rank (digits or "*").
// Assumed form of the pattern: its definition is incomplete in the source.
$rankpatt = "(<td>)([0-9]+|\*)(<\/td>)";

// Name link such as <a href="status.php?user_id=team30&cid=1014">team30_姓名</a>
// Group 2 is the team id and group 4 is the name. The <a[^>]*> and <\/a>
// parts are assumed; the source shows only (<a>) and ().
$namepatt = "(<a[^>]*>)(\*{0,1}team[0-9]+)(_)([$patt_ch]+)(<\/a>)";
// $namepatt = "(team[0-9]+)(_)([$patt_ch]+)";  // also works: matches "team_name" directly

// Solved-count link such as <a href="status.php?user_id=team30&cid=1014&jresult=4">7</a>
// Group 2 is the number of problems solved (same assumption as above).
$problempatt = "(<a[^>]*>)([0-9]+)(<\/a>)";

// Include the PHPExcel classes.
require_once('Classes/PHPExcel.php');
require_once('Classes/PHPExcel/Writer/Excel2007.php');
$objPHPExcel = new PHPExcel();

// Set the document properties.
$objPHPExcel->getProperties()->setCreator("Maarten Balliauw");
$objPHPExcel->getProperties()->setLastModifiedBy("Maarten Balliauw");
$objPHPExcel->getProperties()->setTitle("Office 2007 XLSX Test Document");
$objPHPExcel->getProperties()->setSubject("Office 2007 XLSX Test Document");
$objPHPExcel->getProperties()->setDescription("Test document for Office 2007 XLSX, generated using PHP classes.");
$objPHPExcel->getProperties()->setKeywords("office 2007 openxml php");
$objPHPExcel->getProperties()->setCategory("Test result file");

// Header row.
$row = 1;
$objPHPExcel->getActiveSheet()->setCellValue('A'.$row, 'rank');
$objPHPExcel->getActiveSheet()->setCellValue('B'.$row, 'team');
$objPHPExcel->getActiveSheet()->setCellValue('C'.$row, 'solved');

// Read the saved page line by line; a rank match starts a new sheet row.
while (!feof($file)) {
	$line = fgets($file);
	if (preg_match("/$rankpatt/", $line, $match)) {
		$row++;
		//print_r($match);
		$objPHPExcel->getActiveSheet()->setCellValue('A'.$row, $match[2]);
		$objPHPExcel->getActiveSheet()->getStyle('A'.$row)->getAlignment()->setHorizontal(PHPExcel_Style_Alignment::HORIZONTAL_LEFT);
	}
	if (preg_match("/$namepatt/", $line, $match)) {
		//print_r($match);
		$objPHPExcel->getActiveSheet()->setCellValue('B'.$row, $match[2].$match[4]);
	}
	if (preg_match("/$problempatt/", $line, $match)) {
		//print_r($match);
		$objPHPExcel->getActiveSheet()->setCellValue('C'.$row, $match[2]);
		$objPHPExcel->getActiveSheet()->getStyle('C'.$row)->getAlignment()->setHorizontal(PHPExcel_Style_Alignment::HORIZONTAL_LEFT);
	}
}
fclose($file);

// Write the workbook once, after all rows are filled in, next to this script.
$objWriter = new PHPExcel_Writer_Excel2007($objPHPExcel);
$objWriter->save(str_replace('.php', '.xlsx', __FILE__));
echo "well done:)";
?>

5. Results
source:php.cn