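Today we'll build a small movie crawler in PHP.
We'll use simple_html_dom, a PHP library that is easy to pick up, and walk through a data-scraping example.
simple_html_dom makes it straightforward to parse HTML documents in PHP. Its wrapper class lets you parse an HTML document and work with the elements inside it (requires PHP 5 or later).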
Download: https://github.com/samacs/simple_html_dom
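As an example, we take the list page http://paopaotv.com/tv-type-id-5-pg-1.html on http://www.paopaotv.com, which presents its entries in an alphabetical (A-Z) layout. We scrape the list data on that page and then the information on each entry's detail page.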
<?php
include_once 'simple_html_dom.php';

// Fetch the list page and parse it into a DOM object
$html = file_get_html('http://paopaotv.com/tv-type-id-5-pg-1.html');
if (!$html) {
    die('Failed to load the list page');
}

// Each A-Z entry is a dl with class letter-focus-item inside the div with id letter-focus;
// find() returns an array of matching element objects
$listData = $html->find('#letter-focus .letter-focus-item');
$cate = array();
foreach ($listData as $key => $eachRowData) {
    $filmName = $eachRowData->find('dd span', 0)->plaintext; // film title
    $filmUrl  = $eachRowData->find('dd a', 0)->href;         // detail-page path from the link inside dd

    // Fetch the film's detail page
    $filmInfo = file_get_html('http://paopaotv.com' . $filmUrl);
    if (!$filmInfo) {
        continue; // skip entries whose detail page cannot be loaded
    }
    foreach ($filmInfo->find('.info dl') as $film) {
        $row = array();
        foreach ($film->find('dd') as $childInfo) {
            $row[] = $childInfo->plaintext;
        }
        $cate[$key][] = join(',', $row); // store the film's details in the array
    }
    $filmInfo->clear(); // release the parsed DOM to keep memory usage down
}
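The loop above builds each detail-page URL by concatenating the site root with the `href` value, which works as long as every `href` starts with `/`. The helper below is a minimal sketch for the case where a link is already absolute or lacks the leading slash. `resolveUrl` is a hypothetical function (not part of simple_html_dom), the sample paths are made up, and treating every non-absolute `href` as root-relative is a simplifying assumption.

```php
<?php
// Hypothetical helper, not part of simple_html_dom:
// turn an href scraped from a page into an absolute URL.
function resolveUrl($base, $href)
{
    // Already absolute (http:// or https://): return it unchanged.
    if (preg_match('#^https?://#i', $href)) {
        return $href;
    }
    // Protocol-relative (//host/path): borrow the scheme from the base URL.
    if (strpos($href, '//') === 0) {
        return parse_url($base, PHP_URL_SCHEME) . ':' . $href;
    }
    // Assumption: everything else is treated as a path from the site root,
    // joined to the base with exactly one slash.
    return rtrim($base, '/') . '/' . ltrim($href, '/');
}

// Made-up example paths, for illustration only.
echo resolveUrl('http://paopaotv.com', '/example-detail.html'), "\n";
echo resolveUrl('http://paopaotv.com/', 'example-detail.html'), "\n";
```

In the crawler, `file_get_html(resolveUrl('http://paopaotv.com', $filmUrl))` could stand in for the plain string concatenation.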
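With simple_html_dom we have now scraped both the entries in the paopaotv.com film list and the detail information for each film. From here you can go on to scrape the video URLs from each film's detail page and then store everything about the film in a database.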
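The storage step could look roughly like the sketch below, which uses PDO with prepared statements. Everything specific here is an assumption made for illustration: the `films` table and its columns, the sample rows, and the in-memory SQLite database (chosen only so the snippet runs on its own, and it needs the `pdo_sqlite` extension). A real crawler would feed in the values it scraped and point the DSN at its own database.

```php
<?php
// Hypothetical sample rows in a shape the crawler could produce.
$films = array(
    array('name' => 'Example Film A', 'url' => 'http://paopaotv.com/example-a.html', 'detail' => 'info one,info two'),
    array('name' => 'Example Film B', 'url' => 'http://paopaotv.com/example-b.html', 'detail' => 'info three'),
    // Same URL as the first row: simulates scraping the same film twice.
    array('name' => 'Example Film A', 'url' => 'http://paopaotv.com/example-a.html', 'detail' => 'info one,info two'),
);

// In-memory SQLite keeps the sketch self-contained; change the DSN for a real database.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE films (
    id INTEGER PRIMARY KEY,
    name TEXT,
    url TEXT UNIQUE,
    detail TEXT
)');

// A prepared statement keeps scraped text from being interpreted as SQL.
// INSERT OR IGNORE is SQLite syntax: rows with an already-stored URL are skipped.
$stmt = $pdo->prepare(
    'INSERT OR IGNORE INTO films (name, url, detail) VALUES (:name, :url, :detail)'
);
foreach ($films as $film) {
    $stmt->execute(array(
        ':name'   => $film['name'],
        ':url'    => $film['url'],
        ':detail' => $film['detail'],
    ));
}

$count = (int) $pdo->query('SELECT COUNT(*) FROM films')->fetchColumn();
echo "Stored {$count} film(s)\n";
```

The `UNIQUE` constraint on `url` is one simple way to make repeated crawler runs idempotent. Other databases need their own equivalent of `INSERT OR IGNORE`.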
Below are the commonly used properties and methods of simple_html_dom:
$html = file_get_html('http://paopaotv.com/tv-type-id-5-pg-1.html');
$e = $html->find('div', 0);
// Tag name
$e->tag;
// Outer text (the element's HTML, including its own tag)
$e->outertext;
// Inner text (the HTML inside the element)
$e->innertext;
// Plain text (the content with tags stripped)
$e->plaintext;
// Child elements: all of them as an array, or a single one by index
$e->children();
$e->children(0);
// Parent element
$e->parent();
// First child element
$e->first_child();
// Last child element
$e->last_child();
// Next sibling element
$e->next_sibling();
// Previous sibling element
$e->prev_sibling();
// Array of all <a> tags
$ret = $html->find('a');
// The first <a> tag
$ret = $html->find('a', 0);
For more usage, see the official manual.
Simple, isn't it? If you have any questions, feel free to raise them for discussion.