How can I accurately get all the hyperlinks on a website with PHP?

WBOY
Published: 2016-06-06 20:22:03

I want to get all the hyperlinks on a website, and I am using the PHP Snoopy class.

<code>$sourceURL = $url;
$snoopy->fetchlinks($sourceURL);
$content = $snoopy->results;</code>

The result looks like this:

<code>array (size=627)
  0 => string 'http://www.alibaba.com/https://login.alibaba.com/' (length=49)
  1 => string 'http://sh.vip.alibaba.com?tracelog=nav_ma' (length=41)
  2 => string 'http://message.alibaba.com/feedback/default.htm?routeto=inbox&tracelog=nav_ma_mc' (length=80)
  3 => string 'http://www.alibaba.com//hz-favorite.alibaba.com/favorite/favorite_home.htm?tracelog=nav_ma_fav' (length=94)
  4 => string 'http://rfq.alibaba.com/form.htm?tracelog=header_myalibaba' (length=57)
  5 => string 'http://hz.sourcing.alibaba.com/rfq/request/rfq_manage_list.htm?tracelog=nav_ma_mana_rfq' (length=87)
  6 => string 'http://biz.alibaba.com/generalorders/list_orders.htm?tracelog=ma_mana_orders' (length=76)
  7 => string 'http://sh.vip.alibaba.com/product/post_product_interface.htm?tracelog=newschp_nav_madp' (length=86)
  8 => string 'http://sh.vip.alibaba.com/product/manage_products.htm?tracelog=newschp_nav_mamng' (length=80)
  9 => string 'http://hz.sourcing.alibaba.com/rfq/quotation/rfq_not_quoted_manage_list.htm?nav_ma_rec_rfqs' (length=91)
  10 => string 'http://www.alibaba.com/javascript:;' (length=35)
  11 => string 'http://www.alibaba.com/Products?tracelog=beacon_cate_140704' (length=59)
  12 => string 'http://rfq.alibaba.com/form.htm?tracelog=header_forbuyers' (length=57)
  13 => string 'http://globalexpo.alibaba.com?tracelog=beacon_expo_150820' (length=57)
  14 => string 'http://wholesale.alibaba.com?tracelog=nav_ws' (length=44)
  15 => string 'http://buyer.alibaba.com/bizid_buyer?tracelog=nav_bi' (length=52)
  16 => string 'http://tradeassurance.alibaba.com/bao/buyer_advertise.htm?tracelog=from_home_menu' (length=81)
  17 => string 'http://activities.alibaba.com/alibaba/secure-payment.php?tracelog=beacon_payment_150114' (length=87)
  18 => string 'http://ecredit.alibaba.com/ecl/buyer.htm?tracelog=beacon_credit_140704' (length=70)
  19 => string 'http://inspection.alibaba.com/?tracelog=beacon_is_140704' (length=56)
  20 => string 'http://buyer.alibaba.com/intelligence?tracelog=beacon_ti_140704' (length=63)
  21 => string 'http://buyer.alibaba.com/forum?tracelog=beacon_df_140704' (length=56)
  22 => string 'http://ask.alibaba.com/?tracelog=beacon_ta_140704' (length=49)
  23 => string 'http://www.alibaba.com/javascript:;' (length=35)
  24 => string 'http://seller.alibaba.com/memberships/index.html?tracelog=seller_channel_member_hp_header' (length=89)
  25 => string 'http://seller.alibaba.com/learningcenter?tracelog=seller_channel_lc_hp_header' (length=77)
  26 => string 'http://seller.alibaba.com/training.htm?tracelog=seller_channel_training_hp_header' (length=81)
  27 => string 'http://sourcing.alibaba.com/?tracelog=newschp_nav_narfq' (length=55)
  28 => string 'http://www.alibaba.com/javascript:;' (length=35)</code>

How can I remove URLs like "http://www.alibaba.com/javascript:;"?

Replies:


QueryList

<code class="php"><?php
// Collect all images on a page
$data = QueryList::Query('http://cms.querylist.cc/bizhi/453.html',['image' => ['img','src']])->data;
// Print the result
print_r($data);

// Collect all hyperlinks on a page
$data = QueryList::Query('http://cms.querylist.cc/google/list_1.html',['link' => ['a','href']])->data;
// Print the result
print_r($data);</code>

http://git.oschina.net/jae/QueryList
Take a look at this. It is more powerful than Snoopy and supports jQuery selector syntax.
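
If you would rather stay with Snoopy, another option is to clean up the returned list afterwards. Below is a minimal sketch, assuming PHP 7+ and assuming the bad entries all look like the ones in the question: a `javascript:` pseudo-link, an absolute `https://` URL, or a protocol-relative `//host/...` URL that ended up prefixed with the site root. `cleanLinks` is a hypothetical helper written for this post, not part of Snoopy, and the list of schemes it drops is my own guess at what you want to exclude.

```php
<?php
/**
 * Clean a link list such as the one found in $snoopy->results.
 * Heuristic sketch: $base is the site root the links were resolved against.
 */
function cleanLinks(array $links, string $base): array
{
    $base  = rtrim($base, '/');
    $clean = [];

    foreach ($links as $link) {
        if (strpos($link, $base . '/') === 0) {
            $rest = substr($link, strlen($base)); // e.g. "/javascript:;"
            $tail = ltrim($rest, '/');            // e.g. "javascript:;"

            // Pseudo-links that were prefixed with the base URL: drop them.
            if (preg_match('/^(javascript|mailto|tel|data):/i', $tail)) {
                continue;
            }

            if (preg_match('#^https?://#i', $tail)) {
                // Absolute URL that was wrongly prefixed: keep the inner URL.
                $link = $tail;
            } elseif (strpos($rest, '//') === 0) {
                // Protocol-relative URL ("//host/path"): reuse the base scheme.
                $link = parse_url($base, PHP_URL_SCHEME) . ':' . $rest;
            }
        }
        $clean[] = $link;
    }

    // Remove duplicates and reindex the array.
    return array_values(array_unique($clean));
}

// Sample entries copied from the output in the question.
$sample = [
    'http://www.alibaba.com/https://login.alibaba.com/',
    'http://sh.vip.alibaba.com?tracelog=nav_ma',
    'http://www.alibaba.com//hz-favorite.alibaba.com/favorite/favorite_home.htm?tracelog=nav_ma_fav',
    'http://www.alibaba.com/javascript:;',
    'http://www.alibaba.com/Products?tracelog=beacon_cate_140704',
];
print_r(cleanLinks($sample, 'http://www.alibaba.com'));
```

With the real data you would call it as `cleanLinks($snoopy->results, 'http://www.alibaba.com')`. Note that this only patches up the list after the fact. A path that genuinely starts with `//` or contains something like `tel:` right after the root would be misread, so treat it as a starting point rather than a complete URL resolver.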
