Using PHP to scrape the posts on the Qiushibaike (糗事百科) home page


WBOY

Released: 2016-07-13 09:53:12


I suddenly felt like grabbing some data off the web for fun. I have a MySQL database on SAE, and it was doing nothing just sitting there! So I wrote a small PHP program that scrapes the posts on the Qiushibaike home page and saves all the data to MySQL. Sounds like fun!

Let's get started! First, the plan:

Get HTML source code--->Parse HTML--->Save to database

Nothing difficult.

1. Create the PHP file "getDataToDB.php"

2. Get the HTML source code of the specified URL

I use PHP's cURL functions here. For details, see the PHP manual.

The code:

```php
// Get the HTML source of the given URL
function GetHtmlCode($url) {
	$ch = curl_init (); // initialize a cURL handle
	curl_setopt ( $ch, CURLOPT_URL, $url ); // the page to fetch
	curl_setopt ( $ch, CURLOPT_RETURNTRANSFER, 1 ); // return the result as a string instead of printing it
	curl_setopt ( $ch, CURLOPT_CONNECTTIMEOUT, 1000 ); // connection timeout, in seconds
	$HtmlCode = curl_exec ( $ch ); // run the request; returns false on failure
	curl_close ( $ch ); // release the handle
	return $HtmlCode;
}
```
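The function above hands back whatever `curl_exec()` returns, and that is `false` when the request fails. As a minimal sketch of a more defensive variant (the name `fetchHtml` and the timeout values are illustrative assumptions, not part of the original program), the same request can check for errors and return `null` instead:

```php
<?php
// Hypothetical helper: fetch a page and return its HTML, or null on any failure.
function fetchHtml($url, $timeoutSeconds = 10) {
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,                // return the body as a string
        CURLOPT_FOLLOWLOCATION => true,                // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => $timeoutSeconds,     // seconds allowed for connecting
        CURLOPT_TIMEOUT        => $timeoutSeconds * 3, // seconds allowed for the whole request
    ));
    $body = curl_exec($ch);
    if ($body === false) {
        // curl_error() describes what went wrong (DNS failure, timeout, bad scheme, ...)
        error_log('fetch failed: ' . curl_error($ch));
        return null;
    }
    return $body;
}

// An unsupported scheme fails immediately, without touching the network.
var_dump(fetchHtml('nosuchscheme://example.invalid/'));
```

Returning `null` lets the caller skip the parsing step when nothing was fetched, rather than passing `false` on to the HTML parser.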


3. Include the third-party file 'simple_html_dom.php' to parse the HTML

I'm not good enough with regular expressions to parse the page that way, so I searched online and eventually found this library. It works much like Jsoup in Java (see "Using Jsoup to parse the official website of Chuzhou University to get the news list"). For details, see the blog.

The code is as follows:

```php
include_once 'simple_html_dom.php';

function getFmlDataToDB() {
	// Note: the mysql_* functions were removed in PHP 7, so this code needs PHP 5
	$link = mysql_connect ( SAE_MYSQL_HOST_M . ':' . SAE_MYSQL_PORT, SAE_MYSQL_USER, SAE_MYSQL_PASS );
	// Fetch the page source and parse it
	$html = str_get_html ( GetHtmlCode ( 'http://www.qiushibaike.com/' ) );
	
	if ($link) {
		mysql_select_db ( SAE_MYSQL_DB, $link );
		mysql_query ( 'set names utf8' );
		// Each post is a div with class="article block untagged mb15"
		foreach ( $html->find ( 'div[class=article block untagged mb15]' ) as $per ) {
			
			$z = null; // author
			$t = null; // avatar URL
			$w = null; // post content
			$d = null; // number of likes
			$p = null; // number of comments
			$ds = null; // number of up-votes
			$ps = null; // number of down-votes
			
			// Author
			$author = $per->find ( 'div[class=author]' );
			if ($author != null) {
				$a = $author [0]->find ( 'a' );
				$z = $a [1]->innertext;
			} else {
				$z = 'no author';
			}
			
			// Avatar URL (assumes the first link in the author block wraps the avatar <img>)
			if ($author != null) {
				$icon = $author [0]->find ( 'a' );
				$img = $icon [0]->find ( 'img' );
				$t = $img != null ? $img [0]->src : '...............';
			} else {
				$t = '...............';
			}
			
			// Post content
			$content = $per->find ( 'div[class=content]' );
			$w = $content [0]->innertext;
			
			// Number of likes
			$vote1 = $per->find ( 'div[class=stats]' );
			$vote2 = $vote1 [0]->find ( 'span[class=stats-vote]' );
			$vote3 = $vote2 [0]->find ( 'i[class=number]' );
			$d = $vote3 [0]->innertext;
			
			// Number of comments
			$comments1 = $vote1 [0]->find ( 'span[class=stats-comments]' );
			$comments2 = $comments1 [0]->find ( 'a[class=qiushi_comments]' );
			$comments3 = $comments2 [0]->find ( 'i[class=number]' );
			$p = $comments3 [0]->innertext;
			
			// Number of up-votes
			$up_down = $per->find ( 'div[class=stats-buttons bar clearfix]' );
			$up_down1 = $up_down [0]->find ( 'ul' );
			$li = $up_down1 [0]->find ( 'li' );
			$up = $li [0]->find ( 'span[class=number hidden]' );
			$ds = $up [0]->innertext;
			
			// Number of down-votes
			$down = $li [1]->find ( 'span[class=number hidden]' );
			$ps = $down [0]->innertext;
			
			// The INSERT statement from step 4 goes here
		}
	} else {
		echo 'Database connection failed';
	}
}
```


This code is admittedly a bit messy. I tried but couldn't get at the data in the child nodes directly, so I had to peel off the outer layers and parse level by level. If I find a better way to write it, I'll update this post. Pointers are welcome.
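One way to reach nested nodes in a single step is an XPath query. The sketch below swaps simple_html_dom for PHP's bundled DOM extension (`DOMDocument` and `DOMXPath`), so it is a different technique from the one used above. The sample markup is hypothetical, modeled on the class names in the code above, and the real page may well differ; `hasClass`, `firstText` and `parsePosts` are illustrative names.

```php
<?php
// Hypothetical sample markup, modeled on the class names used in the article.
$sample = <<<HTML
<div class="article block untagged mb15">
  <div class="author"><a href="#"><img src="http://example.com/a.jpg"></a><a href="#">someone</a></div>
  <div class="content">A short post.</div>
  <div class="stats">
    <span class="stats-vote"><i class="number">12</i></span>
    <span class="stats-comments"><a class="qiushi_comments"><i class="number">3</i></a></span>
  </div>
</div>
HTML;

// Build an XPath test that matches one class name inside a class attribute.
function hasClass($name) {
    return "contains(concat(' ', normalize-space(@class), ' '), ' $name ')";
}

// Trimmed text of the first node matching $query below $context, or null if none.
function firstText(DOMXPath $xpath, $query, DOMNode $context) {
    $nodes = $xpath->query($query, $context);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    return trim($nodes->item(0)->textContent);
}

function parsePosts($html) {
    libxml_use_internal_errors(true); // real-world HTML is rarely valid; keep warnings quiet
    $doc = new DOMDocument();
    // The XML prefix hints to the parser that the input is UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    $xpath = new DOMXPath($doc);

    $posts = array();
    foreach ($xpath->query('//div[' . hasClass('article') . ']') as $per) {
        $posts[] = array(
            // ".//" searches all descendants, so no layer-by-layer peeling is needed
            'content'  => firstText($xpath, './/div[' . hasClass('content') . ']', $per),
            'vote'     => firstText($xpath, './/span[' . hasClass('stats-vote') . ']//i[' . hasClass('number') . ']', $per),
            'comments' => firstText($xpath, './/span[' . hasClass('stats-comments') . ']//i[' . hasClass('number') . ']', $per),
        );
    }
    return $posts;
}

$posts = parsePosts($sample);
print_r($posts);
```

Because `firstText` returns `null` when nothing matches, a post with a missing field does not abort the whole loop.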

4. Create the database and insert the data

Here I use the MySQL service on SAE. For how to connect, see "Using PHP to connect to the MySQL database in SAE".

The thing to watch is the character encoding. Run this statement before executing the query, otherwise the text is stored garbled:

```php
mysql_query ( 'set names utf8' );
```


The core code is as follows:

```php
			// Avoid garbled (mis-encoded) text
			mysql_query ( 'set names utf8' );
			// Escape every value first; post content often contains quotes
			list ( $z, $t, $w, $d, $p, $ds, $ps ) = array_map ( 'mysql_real_escape_string', array ( $z, $t, $w, $d, $p, $ds, $ps ) );
			$sql = "INSERT INTO `app_bmhjqs`.`db_fml` (`id`, `author`, `icon_url`, `content`, `vote`, `comments`, `up`, `down`) VALUES (NULL, '$z', '$t', '$w', '$d', '$p', '$ds', '$ps')";
			$result = mysql_query ( $sql );
```
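Building the SQL string by hand breaks as soon as a post contains a quote character. A prepared statement avoids that. The sketch below is an illustration under stated assumptions, not the original program: it uses PDO, and an in-memory SQLite database stands in for the SAE MySQL database so that it runs without a server. The table layout copies the column names from the INSERT above; `savePost` is a hypothetical helper.

```php
<?php
// Assumption: in-memory SQLite stands in for the SAE MySQL database.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec(
    'CREATE TABLE db_fml (
        id INTEGER PRIMARY KEY,
        author TEXT, icon_url TEXT, content TEXT,
        vote INTEGER, comments INTEGER, up INTEGER, down INTEGER
    )'
);

// Hypothetical helper: insert one post; values are bound, never spliced into the SQL.
function savePost(PDO $pdo, array $post) {
    $stmt = $pdo->prepare(
        'INSERT INTO db_fml (author, icon_url, content, vote, comments, up, down)
         VALUES (:author, :icon_url, :content, :vote, :comments, :up, :down)'
    );
    return $stmt->execute(array(
        ':author'   => $post['author'],
        ':icon_url' => $post['icon_url'],
        ':content'  => $post['content'],
        ':vote'     => $post['vote'],
        ':comments' => $post['comments'],
        ':up'       => $post['up'],
        ':down'     => $post['down'],
    ));
}

savePost($pdo, array(
    'author'   => 'someone',
    'icon_url' => 'http://example.com/a.jpg',
    'content'  => 'It\'s a "quoted" post', // quotes would break a hand-built SQL string
    'vote'     => 12,
    'comments' => 3,
    'up'       => 20,
    'down'     => 8,
));

$rowCount = (int) $pdo->query('SELECT COUNT(*) FROM db_fml')->fetchColumn();
$stored   = $pdo->query('SELECT content FROM db_fml')->fetchColumn();
echo $rowCount, ': ', $stored, "\n";
```

With MySQL, only the connection line would change; PDO's MySQL DSN accepts a `charset` setting, which takes the place of the separate `set names` query.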


That completes Get--->Parse--->Insert. Each time the PHP file runs, the posts on the Qiushibaike home page are added to the database! Then I wondered whether I could write a timer to run the code at a fixed interval. I know how to do that in Java, but not in PHP. I'm still a fledgling, after all! After some searching on Baidu, I found this approach:

```php
// Timer
// ignore_user_abort (); // keep running after the browser disconnects
// set_time_limit ( 0 ); // remove the execution time limit, so the script runs forever
// $interval = 30; // seconds to wait between runs

// do {
// 	echo date ( 'Y-m-d H:i:s', time () );
// 	echo 'Writing to database';
// 	//getFmlDataToDB ();
// 	sleep ( $interval ); // without this pause the loop runs back to back
// } while ( true );
```
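An endless loop like this has no way to stop by itself. As a minimal sketch (the function name `runRepeatedly` and its parameters are illustrative assumptions), the same idea can be bounded so that it pauses between runs and ends after a fixed number of them:

```php
<?php
// Hypothetical helper: call $task $maxRuns times, sleeping $intervalSeconds between calls.
// Returns the number of runs actually performed.
function runRepeatedly($task, $intervalSeconds, $maxRuns) {
    $done = 0;
    while ($done < $maxRuns) {
        call_user_func($task);
        $done++;
        if ($done < $maxRuns) {
            sleep($intervalSeconds); // pause between runs, not after the last one
        }
    }
    return $done;
}

// Usage: a counter stands in for getFmlDataToDB(); interval 0 keeps the demo instant.
$count = 0;
$runs = runRepeatedly(function () use (&$count) {
    $count++;
}, 0, 3);
echo $runs, ' runs, counter = ', $count, "\n";
```

For real use the interval would be minutes rather than seconds, and a scheduler outside the script is usually a better fit than a sleeping loop.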


I added this code to the file and deployed it to SAE just before the school cut off the network for the night, so I never got to test it! I could only wait until the next day to check the results.

This morning I couldn't wait to turn on my computer and open the SAE database. The situation is as follows:

Oh my god! I couldn't stand it, so I quickly turned off the timer and wrote a button to trigger the job instead! If this had gone on, the database would have been packed full!
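The post does not say exactly why the table filled up. One plausible cause, assumed here, is that every timed run inserted the same home-page posts again. A minimal sketch of skipping posts that were already seen, keyed on a hash of the content (`filterNewPosts` is a hypothetical helper, and the sample posts are made up):

```php
<?php
// Hypothetical helper: keep only posts whose content has not been seen before.
// $seen maps md5(content) => true and is updated in place.
function filterNewPosts(array $posts, array &$seen) {
    $fresh = array();
    foreach ($posts as $post) {
        $key = md5($post['content']);
        if (!isset($seen[$key])) {
            $seen[$key] = true;
            $fresh[] = $post;
        }
    }
    return $fresh;
}

$seen = array();
// First run: both posts are new.
$firstRun = filterNewPosts(array(
    array('content' => 'post A'),
    array('content' => 'post B'),
), $seen);
// Second run: 'post B' is still on the home page, only 'post C' is new.
$secondRun = filterNewPosts(array(
    array('content' => 'post B'),
    array('content' => 'post C'),
), $seen);
echo count($firstRun), ' ', count($secondRun), "\n"; // prints "2 1"
```

In the database itself, the same effect could come from storing the hash in a column with a unique index, so that repeated inserts are rejected.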