
Analyzing Nanjing Housing Prices with a PHP Crawler

PHP中文网
Published: 2016-08-20 08:48:20

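   A few days ago I read an article on CSDN that used a Python crawler to analyze housing prices in Shanghai, and I found it quite interesting. I had recently been working on article scraping for the snake admin backend, so I decided to analyze Nanjing's housing prices with a PHP crawler. Let's get started.
   Dependencies for this crawler: first, the CURL class by ares333. I used an early version; the project lives on GitHub at https://github.com/ares333/php-curlmulti, and it is an impressive piece of work.
   For parsing I used phpQuery. If you have not come across this library, a quick web search will turn it up.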
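   As for the data source, I chose Anjuke (安居客), which has a decent volume of listings. Open Anjuke, switch to the Nanjing channel, and analyze the page structure. I won't go into detail here on how to use phpQuery to analyze a page and extract data from it. Once the structure is clear, create the tables. Start with the area table: listings are grouped by neighborhood, so the table is structured as follows:

CREATE TABLE `area` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(155) NOT NULL COMMENT 'Nanjing district or neighborhood name',
  `url` varchar(155) NOT NULL COMMENT 'listing area URL',
  `pid` int(2) NOT NULL COMMENT 'parent category',
  PRIMARY KEY (`id`)
) ENGINE=MyISAM  DEFAULT CHARSET=utf8;

   I first inserted the top-level district rows by hand. They could be scraped as well, but there are only a handful of districts with a fixed set of URLs, so I simply added them directly: 14 rows in total.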
   With the initial data in place, we can start collecting the entry URLs of all the neighborhoods. Here is the code:
area.php

<?php
// +----------------------------------------------------------------------
// | Area collection script
// +----------------------------------------------------------------------
// | Author: NickBai <1902822973@qq.com>
// +----------------------------------------------------------------------
set_time_limit(0);
require 'init.php';
// Visit each top-level district page and collect its neighborhoods
$sql = "select * from `area`";
$area = $db->query( $sql )->fetchAll( PDO::FETCH_ASSOC );
// Prepare the insert once and reuse it for every row
$st = $db->prepare("insert into area(name,url,pid) values(?,?,?)");
foreach($area as $key=>$vo){
    $url = $vo['url'];
    $result = $curl->read($url);
    $charset = preg_match("/<meta.+?charset=[^\w]?([-\w]+)/i", $result['content'], $temp) ? strtolower( $temp[1] ) : "";
    phpQuery::$defaultCharset = $charset;  // set the default encoding
    $html = phpQuery::newDocumentHTML( $result['content'] );
    $span = $html['.items .sub-items a'];
    foreach($span as $v){
        $v = pq( $v );
        // Append the pagination pattern now to simplify paged crawling later
        $href = trim( $v->attr('href') ) . 'p*/#filtersort';
        $st->execute([ trim( $v->text() ), $href, $vo['id']]);
    }
}
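
The charset detection in area.php can be pulled out into a small function. The following is a minimal sketch: the function name detectCharset and the fallback parameter are assumptions for illustration (the original script falls back to an empty string), while the regular expression is the one used above.

```php
<?php
// Hypothetical helper (not part of the original script): return the charset
// declared in a page's <meta> tag, lowercased, or a fallback if none is found.
function detectCharset(string $html, string $fallback = 'utf-8'): string
{
    // Same pattern as in area.php: skip an optional quote after "charset="
    if (preg_match('/<meta.+?charset=[^\w]?([-\w]+)/i', $html, $m)) {
        return strtolower($m[1]);
    }
    return $fallback;
}

echo detectCharset('<meta charset="GBK">'), PHP_EOL;
echo detectCharset('<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'), PHP_EOL;
echo detectCharset('<p>no meta tag here</p>'), PHP_EOL;
```

Falling back to a known encoding rather than an empty string is a design choice of this sketch; whether phpQuery handles an empty default charset gracefully is not something the article states.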


A single collected row looks like this: 百家湖 http://nanjing.anjuke.com/sale/baijiahu/p*/#filtersort

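Now we have all the URLs, and because I put a * where the page number goes, it can be replaced with a real page number, so the program can go on to fetch every page under each neighborhood. Now for the most important part, the main program.
Create the hdetail table to store the scraped listing details: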
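
The placeholder idea can be sketched in a few lines. The function name pageUrl is an assumption for illustration; the URL pattern is the sample row shown above.

```php
<?php
// Hypothetical helper: fill the "*" placeholder in a stored URL pattern
// with a concrete page number.
function pageUrl(string $pattern, int $page): string
{
    return str_replace('*', (string) $page, $pattern);
}

$pattern = 'http://nanjing.anjuke.com/sale/baijiahu/p*/#filtersort';
foreach ([1, 2, 3] as $page) {
    echo pageUrl($pattern, $page), PHP_EOL;
}
```

Note that str_replace substitutes every * in the string, so this only works as long as the stored pattern contains exactly one placeholder.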

CREATE TABLE `hdetail` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `pid` int(5) NOT NULL COMMENT 'area id',
  `square` int(10) DEFAULT NULL COMMENT 'floor area',
  `housetype` varchar(55) DEFAULT '' COMMENT 'house type',
  `price` int(10) DEFAULT '0' COMMENT 'unit price',
  `allprice` int(10) DEFAULT '0' COMMENT 'total price',
  `name` varchar(155) DEFAULT '' COMMENT 'residential complex name',
  `addr` varchar(155) DEFAULT '' COMMENT 'residential complex address',
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

With the database ready, here is the main program.

house.php

<?php
// +----------------------------------------------------------------------
// | Collect the listing details for each neighborhood
// +----------------------------------------------------------------------
// | Author: NickBai <1902822973@qq.com>
// +----------------------------------------------------------------------
set_time_limit(0);
require 'init.php';
// Query the neighborhood rows (ids 1-14 are the top-level districts)
$sql = "select * from `area` where id > 14";
$allarea = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC );
// Prepared insert: a quote inside a name or address can no longer break the SQL
$st = $db->prepare("insert into hdetail(pid,square,housetype,price,allprice,name,addr) values(?,?,?,?,?,?,?)");
// When a page does not exist, the site redirects to the listing home page
foreach($allarea as $key=>$vo){
    $url = $vo['url'];
    $i = 1;
    while ( true ){
        $urls = str_replace( "*" , $i, $url);
        $result = $curl->read( $urls );
        if( "http://nanjing.anjuke.com/sale/" == $result['info']['url'] ){
            break;
        }
        $charset = preg_match("/<meta.+?charset=[^\w]?([-\w]+)/i", $result['content'], $temp) ? strtolower( $temp[1] ) : "";
        phpQuery::$defaultCharset = $charset;  // set the default encoding
        $html = phpQuery::newDocumentHTML( $result['content'] );
        $p = $html['#houselist-mod li .house-details'];
        $isGet = count( $p->elements );  // nothing scraped: treat this as the end and stop
        if( 0 == $isGet ){
            break;
        }
        foreach($p as $v){
            $pid = $vo['id'];
            $square =  rtrim( trim( pq($v)->find("p:eq(1) span:eq(0)")->text() ), "平方米");
            $htype = trim( pq($v)->find("p:eq(1) span:eq(1)")->text() );
            $price = rtrim ( trim( pq($v)->find("p:eq(1) span:eq(2)")->text() ), "元/m²");
            $area = explode(" ", trim( pq($v)->find("p:eq(2) span")->text() ) );

            // trim() cannot remove the UTF-8 non-breaking space, so strip its two bytes explicitly
            $name =  str_replace( chr(194) . chr(160), "", array_shift($area) );
            $addr = rtrim( ltrim (trim( array_pop($area) ) , "["), "]" );
            $allprice = trim( pq($v)->siblings(".pro-price")->find("span strong")->text() );
            $st->execute([ $pid, $square, $htype, $price, $allprice, $name, $addr ]);
        }
        // Convert to GBK so the name displays correctly in a Windows cmd window
        echo mb_convert_encoding($vo['name'], "gbk", "utf-8") . " PAGE : ". $i . PHP_EOL;
        $i++;
    }
}
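
The field cleanup in house.php relies on rtrim with a multi-byte character list, which strips individual bytes rather than whole characters. As an alternative, here is a minimal sketch that extracts the numeric part with a regular expression. Both helper names are hypothetical and the sample strings are made up for illustration; they are not actual scraped values.

```php
<?php
// Hypothetical helpers, not part of the original script.

// Return the first run of digits in a scraped text field, or null if none.
function leadingNumber(string $text): ?int
{
    // Replace the UTF-8 non-breaking space (bytes C2 A0) with a plain space
    $text = str_replace("\xC2\xA0", ' ', $text);
    return preg_match('/\d+/', $text, $m) ? (int) $m[0] : null;
}

// Strip surrounding whitespace, non-breaking spaces and square brackets.
function stripBrackets(string $text): string
{
    $text = str_replace("\xC2\xA0", ' ', $text);
    return trim($text, " \t\n\r[]");
}

var_dump(leadingNumber('89平方米'));
var_dump(leadingNumber(' 25842元/m² '));
var_dump(leadingNumber(''));
var_dump(stripBrackets(' [sample address] '));
```

Returning null for a field without digits lets the caller decide whether to skip the row or store a default, instead of sending an empty value to the database.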

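This skips the top-level districts and crawls the neighborhoods one by one. I recommend running the script from the command line (for example, php house.php in a cmd window): it takes a long time, and running it in a browser will make the browser hang. If you don't know how to run PHP from the command line, look it up.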
If that feels slow, you can make a few copies of house.php and change this line in each copy:

$sql = "select * from `area` where id > 14";

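Give each copy a different id range, open several cmd windows to run them side by side, and you have a simple multi-process setup.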
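
Instead of editing each copy by hand, the id ranges can be computed. This is a minimal sketch under stated assumptions: the function name workerRange is hypothetical, and the maximum id of 114 is a made-up example (the article only states that neighborhood ids start after 14).

```php
<?php
// Hypothetical helper: split the inclusive id range [$minId, $maxId] into
// $workers contiguous chunks and return the [from, to] pair for chunk $index.
// If a chunk is empty, from will be greater than to.
function workerRange(int $minId, int $maxId, int $workers, int $index): array
{
    $total = $maxId - $minId + 1;
    $size  = intdiv($total + $workers - 1, $workers); // ceiling division
    $from  = $minId + $index * $size;
    $to    = min($maxId, $from + $size - 1);
    return [$from, $to];
}

// Example: ids 15..114 (assumed) split across 4 processes
foreach (range(0, 3) as $index) {
    [$from, $to] = workerRange(15, 114, 4, $index);
    echo "select * from `area` where id between {$from} and {$to}", PHP_EOL;
}
```

In a real run, each cmd window could pass its own index as a command-line argument and read it from $argv, so a single house.php serves every process.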
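After that it is just a matter of waiting. I ran the crawl on August 16 and collected 311,226 rows in total. With the data in hand, the analysis can begin. My analysis code is as follows: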

<?php
require "init.php";
$cacheFile = './data/nj.data';
// Load the cached averages if the cache file already exists
$data = is_file( $cacheFile ) ? unserialize( file_get_contents( $cacheFile ) ) : [];
if( empty( $data ) ){
    // All of Nanjing
    $sql = "select avg(price) price from hdetail";
    $nanjing = intval( $db->query($sql)->fetch( PDO::FETCH_ASSOC )['price'] );
    // Remaining values. Note: the first chart lists 16 categories but only 15 values
    // are computed here, so the last category (大厂) gets no bar.
    $data = [
        $nanjing,
        getOtherPrice('2,3,4,5,6,7,8,10'),
        getOtherPrice('1'),
        getOtherPrice('2'),
        getOtherPrice('3'),
        getOtherPrice('4'),
        getOtherPrice('5'),
        getOtherPrice('6'),
        getOtherPrice('7'),
        getOtherPrice('8'),
        getOtherPrice('9'),
        getOtherPrice('10'),
        getOtherPrice('11'),
        getOtherPrice('12'),
        getOtherPrice('13')
    ];
    // Write the cache
    file_put_contents( $cacheFile, serialize( $data ));
}
// Top 10 complexes by highest average price
$sql = "select avg(price) price,name from hdetail GROUP BY name ORDER BY price desc limit 10";
$res = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC );
$x = "";
$y = "";
foreach($res as $vo){
    $x .= "'" . $vo['name'] . "',";
    $y .= intval( $vo['price'] ). ",";
}
// Top 10 complexes by lowest average price
$sql = "select avg(price) price,name from hdetail GROUP BY name ORDER BY price asc limit 10";
$res = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC );
$xl = "";
$yl = "";
foreach($res as $vo){
    $xl .= "'" . $vo['name'] . "',";
    $yl .= intval( $vo['price'] ). ",";
}
// Listing counts by house type
$sql = "select count(0) allnum, housetype from hdetail GROUP BY housetype order by allnum desc";
$res = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC );
$htype = "";
foreach($res as $vo){
    $htype .= "[ '" . $vo['housetype'] . "', " .$vo['allnum']. "],";
}
$htype = rtrim($htype, ',');
// Listing counts by floor area
$square = ['Under 50 m²', '50-70 m²', '70-90 m²', '90-120 m²', '120-150 m²', '150-200 m²', '200-300 m²', 'Over 300 m²'];
$sql = "select count(0) allnum, square from hdetail GROUP BY square";
$squ = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC );
$p50 = 0;
$p70 = 0;
$p90 = 0;
$p120 = 0;
$p150 = 0;
$p200 = 0;
$p250 = 0;
$p300 = 0;
foreach($squ as $key=>$vo){
    if( $vo['square'] < 50 ){
        $p50 += $vo['allnum'];
    }
    if( $vo['square'] >= 50 &&  $vo['square'] < 70 ){
        $p70 += $vo['allnum'];
    }
    if( $vo['square'] >= 70 &&  $vo['square'] < 90 ){
        $p90 += $vo['allnum'];
    }
    if( $vo['square'] >= 90 &&  $vo['square'] < 120 ){
        $p120 += $vo['allnum'];
    }
    if( $vo['square'] >= 120 &&  $vo['square'] < 150 ){
        $p150 += $vo['allnum'];
    }
    if( $vo['square'] >= 150 &&  $vo['square'] < 200 ){
        $p200 += $vo['allnum'];
    }
    if( $vo['square'] >= 200 &&  $vo['square'] < 300 ){
        $p250 += $vo['allnum'];
    }
    if( $vo['square'] >= 300 ){
        $p300 += $vo['allnum'];
    }
}
$num = [ $p50, $p70, $p90, $p120, $p150, $p200, $p250, $p300 ];
$sqStr = "";
foreach($square as $key=>$vo){
    $sqStr .= "[ '" . $vo . "', " .$num[$key]. "],";
}
// Given a comma-separated string of district ids, return the average unit price
// across all neighborhoods that belong to those districts
function getOtherPrice($str){
    global $db;
    $sql = "select id from area where pid in(" . $str . ")";
    $city = $db->query($sql)->fetchAll( PDO::FETCH_ASSOC );
    $ids = "";
    foreach($city as $v){
        $ids .= $v['id'] . ",";
    }
    $sql = "select avg(price) price from hdetail where pid in (".rtrim($ids, ",").")";
    $price = intval( $db->query($sql)->fetch( PDO::FETCH_ASSOC )['price'] );
    return $price;
}
?>
<!DOCTYPE html>
<html>
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Nanjing Housing Price Analysis</title>
    <link rel="shortcut icon" href="favicon.ico"> <link href="css/bootstrap.min.css?v=3.3.6" rel="stylesheet">
    <link href="css/font-awesome.min.css?v=4.4.0" rel="stylesheet">
    <link href="css/animate.min.css" rel="stylesheet">
    <link href="css/style.min.css?v=4.1.0" rel="stylesheet">
</head>
<body class="gray-bg">
    <div class="wrapper wrapper-content">
        <div class="row">
            <div class="col-sm-12">
                <div class="row">
                    <div class="col-sm-12">
                        <div class="ibox float-e-margins">
                            <div class="ibox-title">
                                <h5>Average second-hand home price: all Nanjing and by district</h5>
                                <div class="ibox-tools">
                                    <a class="collapse-link">
                                        <i class="fa fa-chevron-up"></i>
                                    </a>
                                    <a class="close-link">
                                        <i class="fa fa-times"></i>
                                    </a>
                                </div>
                            </div>
                            <div class="ibox-content">
                               <div id="container"></div>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
        </div>
        <div class="row">
            <div class="col-sm-6">
                <div class="row">
                    <div class="col-sm-12">
                        <div class="ibox float-e-margins">
                            <div class="ibox-title">
                                <h5>Top 10 residential complexes by highest average price</h5>
                                <div class="ibox-tools">
                                    <a class="collapse-link">
                                        <i class="fa fa-chevron-up"></i>
                                    </a>
                                    <a class="close-link">
                                        <i class="fa fa-times"></i>
                                    </a>
                                </div>
                            </div>
                            <div class="ibox-content">
                               <div id="avgpriceh"></div>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
            <div class="col-sm-6">
                <div class="row">
                    <div class="col-sm-12">
                        <div class="ibox float-e-margins">
                            <div class="ibox-title">
                                <h5>Top 10 residential complexes by lowest average price</h5>
                                <div class="ibox-tools">
                                    <a class="collapse-link">
                                        <i class="fa fa-chevron-up"></i>
                                    </a>
                                    <a class="close-link">
                                        <i class="fa fa-times"></i>
                                    </a>
                                </div>
                            </div>
                            <div class="ibox-content">
                               <div id="avgpricel"></div>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
        </div>
        <div class="row">
            <div class="col-sm-6">
                <div class="row">
                    <div class="col-sm-12">
                        <div class="ibox float-e-margins">
                            <div class="ibox-title">
                                <h5>Share of listings by house type</h5>
                                <div class="ibox-tools">
                                    <a class="collapse-link">
                                        <i class="fa fa-chevron-up"></i>
                                    </a>
                                    <a class="close-link">
                                        <i class="fa fa-times"></i>
                                    </a>
                                </div>
                            </div>
                            <div class="ibox-content">
                               <div id="htype"></div>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
            <div class="col-sm-6">
                <div class="row">
                    <div class="col-sm-12">
                        <div class="ibox float-e-margins">
                            <div class="ibox-title">
                                <h5>Share of listings by floor area</h5>
                                <div class="ibox-tools">
                                    <a class="collapse-link">
                                        <i class="fa fa-chevron-up"></i>
                                    </a>
                                    <a class="close-link">
                                        <i class="fa fa-times"></i>
                                    </a>
                                </div>
                            </div>
                            <div class="ibox-content">
                               <div id="square"></div>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </div>
    <script type="text/javascript" src="js/jquery.min.js?v=2.1.4"></script>
    <script type="text/javascript" src="js/bootstrap.min.js?v=3.3.6"></script>
    <script type="text/javascript" src="http://cdn.hcharts.cn/highcharts/highcharts.js"></script>
    <script type="text/javascript">
        $(function () {
            $('#container').highcharts({
                chart: {
                    type: 'column'
                },
                title: {
                    text: 'Average second-hand home price: all Nanjing and by district'
                },
                subtitle: {
                    text: 'Source: Anjuke data from August 16'
                },
                xAxis: {
                    categories: ['全南京','江南八区','江宁区','鼓楼区','白下区','玄武区','建邺区','秦淮区','下关区','雨花台区','浦口区','栖霞区','六合区',
                    '溧水区','高淳区','大厂'],
                    crosshair: true
                },
                yAxis: {
                    min: 0,
                    title: {
                        text: 'yuan/m²'
                    }
                },
                tooltip: {
                    headerFormat: '<span style="font-size:10px">{point.key}</span><table>',
                    pointFormat: '<tr><td style="color:{series.color};padding:0">{series.name}: </td>' +
                    '<td style="padding:0"><b>{point.y:.1f} yuan/m²</b></td></tr>',
                    footerFormat: '</table>',
                    shared: true,
                    useHTML: true
                },
                plotOptions: {
                    column: {
                        pointPadding: 0.2,
                        borderWidth: 0,
                        dataLabels: {
                            enabled: true // show the value on each column
                        }
                    }
                },
                series: [{
                    name: 'Average price',
                    data: [<?php echo implode(',', $data); ?>]
                }]
            });
            // Top 10 by highest average price
            $('#avgpriceh').highcharts({
                chart: {
                    type: 'column'
                },
                title: {
                    text: 'Top 10 residential complexes by highest average price'
                },
                subtitle: {
                    text: 'Source: Anjuke data from August 16'
                },
                xAxis: {
                    categories: [<?=$x; ?>],
                    crosshair: true
                },
                yAxis: {
                    min: 0,
                    title: {
                        text: 'yuan/m²'
                    }
                },
                tooltip: {
                    headerFormat: '<span style="font-size:10px">{point.key}</span><table>',
                    pointFormat: '<tr><td style="color:{series.color};padding:0">{series.name}: </td>' +
                    '<td style="padding:0"><b>{point.y:.1f} yuan/m²</b></td></tr>',
                    footerFormat: '</table>',
                    shared: true,
                    useHTML: true
                },
                plotOptions: {
                    column: {
                        pointPadding: 0.2,
                        borderWidth: 0,
                        dataLabels: {
                            enabled: true // show the value on each column
                        }
                    }
                },
                series: [{
                    name: 'Average price',
                    data: [<?=$y; ?>]
                }]
            });
            // Top 10 by lowest average price
            $('#avgpricel').highcharts({
                chart: {
                    type: 'column'
                },
                title: {
                    text: 'Top 10 residential complexes by lowest average price'
                },
                subtitle: {
                    text: 'Source: Anjuke data from August 16'
                },
                xAxis: {
                    categories: [<?=$xl; ?>],
                    crosshair: true
                },
                yAxis: {
                    min: 0,
                    title: {
                        text: 'yuan/m²'
                    }
                },
                tooltip: {
                    headerFormat: '<span style="font-size:10px">{point.key}</span><table>',
                    pointFormat: '<tr><td style="color:{series.color};padding:0">{series.name}: </td>' +
                    '<td style="padding:0"><b>{point.y:.1f} yuan/m²</b></td></tr>',
                    footerFormat: '</table>',
                    shared: true,
                    useHTML: true
                },
                plotOptions: {
                    column: {
                        pointPadding: 0.2,
                        borderWidth: 0,
                        dataLabels: {
                            enabled: true // show the value on each column
                        }
                    }
                },
                series: [{
                    name: 'Average price',
                    data: [<?=$yl; ?>]
                }]
            });
             // Radialize the colors
            Highcharts.getOptions().colors = Highcharts.map(Highcharts.getOptions().colors, function (color) {
                return {
                    radialGradient: { cx: 0.5, cy: 0.3, r: 0.7 },
                    stops: [
                        [0, color],
                        [1, Highcharts.Color(color).brighten(-0.3).get('rgb')] // darken
                    ]
                };
            });
            // Listings by house type
            $('#htype').highcharts({
                chart: {
                    plotBackgroundColor: null,
                    plotBorderWidth: null,
                    plotShadow: false
                },
                title: {
                    text: 'Second-hand listings by house type'
                },
                tooltip: {
                    pointFormat: '{series.name}: <b>{point.percentage:.1f}%</b>'
                },
                plotOptions: {
                    pie: {
                        allowPointSelect: true,
                        cursor: 'pointer',
                        dataLabels: {
                            enabled: true,
                            format: '<b>{point.name}</b>: {point.percentage:.1f} %',
                            style: {
                                color: (Highcharts.theme && Highcharts.theme.contrastTextColor) || 'black'
                            },
                            connectorColor: 'silver'
                        }
                    }
                },
                series: [{
                    type: 'pie',
                    name: 'Share',
                    data: [
                        <?=$htype; ?>
                    ]
                }]
            });
            // Listings by floor area
            $('#square').highcharts({
                chart: {
                    plotBackgroundColor: null,
                    plotBorderWidth: null,
                    plotShadow: false
                },
                title: {
                    text: 'Second-hand listings by floor area'
                },
                tooltip: {
                    pointFormat: '{series.name}: <b>{point.percentage:.1f}%</b>'
                },
                plotOptions: {
                    pie: {
                        allowPointSelect: true,
                        cursor: 'pointer',
                        dataLabels: {
                            enabled: true,
                            format: '<b>{point.name}</b>: {point.percentage:.1f} %',
                            style: {
                                color: (Highcharts.theme && Highcharts.theme.contrastTextColor) || 'black'
                            },
                            connectorColor: 'silver'
                        }
                    }
                },
                series: [{
                    type: 'pie',
                    name: 'Share',
                    data: [
                        <?=$sqStr; ?>
                    ]
                }]
            });
        });
    </script>
</body>
</html>
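
The eight if blocks that bucket listings by floor area, and the string concatenation that builds the chart data, can also be written more compactly. The following is a minimal sketch, not the author's method: the function name bucketCounts is hypothetical, the sample rows are made up, and json_encode is swapped in for the manual string building so that quotes inside names cannot break the generated JavaScript.

```php
<?php
// Hypothetical helper: count rows into buckets defined by ascending,
// exclusive upper bounds. The last bucket is open-ended.
function bucketCounts(array $rows, array $upperBounds): array
{
    $counts = array_fill(0, count($upperBounds) + 1, 0);
    foreach ($rows as $row) {
        $i = 0;
        // Advance to the first bucket whose upper bound exceeds the area
        while ($i < count($upperBounds) && $row['square'] >= $upperBounds[$i]) {
            $i++;
        }
        $counts[$i] += $row['allnum'];
    }
    return $counts;
}

// Made-up sample rows in the shape returned by the GROUP BY square query
$rows = [
    ['square' => 45,  'allnum' => 3],
    ['square' => 50,  'allnum' => 2],
    ['square' => 89,  'allnum' => 5],
    ['square' => 300, 'allnum' => 1],
];
$labels = ['Under 50 m²', '50-70 m²', '70-90 m²', '90-120 m²',
           '120-150 m²', '150-200 m²', '200-300 m²', 'Over 300 m²'];

$counts = bucketCounts($rows, [50, 70, 90, 120, 150, 200, 300]);
// Pair each label with its count: [[label, count], ...]
$series = array_map(null, $labels, $counts);
echo json_encode($series, JSON_UNESCAPED_UNICODE), PHP_EOL;
```

The encoded string could then be echoed directly as the data value of the pie series, since a JSON array of [name, value] pairs is also valid JavaScript.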

   

The resulting page looks like this:
[Screenshots: the five charts rendered by the page above]
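  Ha, the prices really are frightening; even second-hand homes cost this much. There is plenty more of interest in the data for you to dig out yourself.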

