How can I accurately get all the hyperlinks on a website with PHP?

WBOY
Release: 2016-06-06 20:22:03

I want to get all the hyperlinks on a website, and I am using the PHP Snoopy class.

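<code>require 'Snoopy.class.php'; // assumed location of the Snoopy class file
$snoopy = new Snoopy;
$sourceURL = $url; // $url: the page whose links should be collected
$snoopy->fetchlinks($sourceURL);
$content = $snoopy->results; // array of the links Snoopy extracted</code>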

The result looks like this:

<code>array (size=627)
  0 => string 'http://www.alibaba.com/https://login.alibaba.com/' (length=49)
  1 => string 'http://sh.vip.alibaba.com?tracelog=nav_ma' (length=41)
  2 => string 'http://message.alibaba.com/feedback/default.htm?routeto=inbox&tracelog=nav_ma_mc' (length=80)
  3 => string 'http://www.alibaba.com//hz-favorite.alibaba.com/favorite/favorite_home.htm?tracelog=nav_ma_fav' (length=94)
  4 => string 'http://rfq.alibaba.com/form.htm?tracelog=header_myalibaba' (length=57)
  5 => string 'http://hz.sourcing.alibaba.com/rfq/request/rfq_manage_list.htm?tracelog=nav_ma_mana_rfq' (length=87)
  6 => string 'http://biz.alibaba.com/generalorders/list_orders.htm?tracelog=ma_mana_orders' (length=76)
  7 => string 'http://sh.vip.alibaba.com/product/post_product_interface.htm?tracelog=newschp_nav_madp' (length=86)
  8 => string 'http://sh.vip.alibaba.com/product/manage_products.htm?tracelog=newschp_nav_mamng' (length=80)
  9 => string 'http://hz.sourcing.alibaba.com/rfq/quotation/rfq_not_quoted_manage_list.htm?nav_ma_rec_rfqs' (length=91)
  10 => string 'http://www.alibaba.com/javascript:;' (length=35)
  11 => string 'http://www.alibaba.com/Products?tracelog=beacon_cate_140704' (length=59)
  12 => string 'http://rfq.alibaba.com/form.htm?tracelog=header_forbuyers' (length=57)
  13 => string 'http://globalexpo.alibaba.com?tracelog=beacon_expo_150820' (length=57)
  14 => string 'http://wholesale.alibaba.com?tracelog=nav_ws' (length=44)
  15 => string 'http://buyer.alibaba.com/bizid_buyer?tracelog=nav_bi' (length=52)
  16 => string 'http://tradeassurance.alibaba.com/bao/buyer_advertise.htm?tracelog=from_home_menu' (length=81)
  17 => string 'http://activities.alibaba.com/alibaba/secure-payment.php?tracelog=beacon_payment_150114' (length=87)
  18 => string 'http://ecredit.alibaba.com/ecl/buyer.htm?tracelog=beacon_credit_140704' (length=70)
  19 => string 'http://inspection.alibaba.com/?tracelog=beacon_is_140704' (length=56)
  20 => string 'http://buyer.alibaba.com/intelligence?tracelog=beacon_ti_140704' (length=63)
  21 => string 'http://buyer.alibaba.com/forum?tracelog=beacon_df_140704' (length=56)
  22 => string 'http://ask.alibaba.com/?tracelog=beacon_ta_140704' (length=49)
  23 => string 'http://www.alibaba.com/javascript:;' (length=35)
  24 => string 'http://seller.alibaba.com/memberships/index.html?tracelog=seller_channel_member_hp_header' (length=89)
  25 => string 'http://seller.alibaba.com/learningcenter?tracelog=seller_channel_lc_hp_header' (length=77)
  26 => string 'http://seller.alibaba.com/training.htm?tracelog=seller_channel_training_hp_header' (length=81)
  27 => string 'http://sourcing.alibaba.com/?tracelog=newschp_nav_narfq' (length=55)
  28 => string 'http://www.alibaba.com/javascript:;' (length=35)</code>

How can I remove URLs like "http://www.alibaba.com/javascript:;" from the result?

Replies:
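
Judging from the output above, the base URL appears to have been prepended to hrefs that were never relative paths (`javascript:;`, `https://...`, `//host/...`). Below is a minimal sketch of a post-filter for the `$snoopy->results` array, assuming that is what happened. `cleanLinks` is a hypothetical helper and not part of Snoopy, and the repair rules are assumptions drawn from the sample output, not a general URL resolver.

```php
<?php
// Hypothetical helper (not part of Snoopy): cleans an array of links such as
// the one found in $snoopy->results after fetchlinks().
function cleanLinks(array $links, string $base): array
{
    $base   = rtrim($base, '/') . '/';            // e.g. "http://www.alibaba.com/"
    $scheme = parse_url($base, PHP_URL_SCHEME);   // e.g. "http"
    $clean  = [];

    foreach ($links as $link) {
        if (strpos($link, $base) === 0) {
            // What is left once the (possibly wrongly) prepended base is removed.
            $rest = substr($link, strlen($base));

            if (preg_match('#^https?://#i', $rest)) {
                // An absolute URL was glued onto the base: keep the absolute URL.
                $link = $rest;
            } elseif (strpos($rest, '/') === 0) {
                // Assumption: a double slash after the host means the href was
                // protocol-relative ("//host/path"); rebuild it with the base scheme.
                $link = $scheme . ':/' . $rest;
            } elseif (preg_match('#^[a-z][a-z0-9+.-]*:#i', $rest)) {
                // javascript:, mailto:, tel: and similar pseudo-links: drop them.
                continue;
            }
        }
        $clean[] = $link;
    }

    // Remove duplicates and reindex from 0.
    return array_values(array_unique($clean));
}
```

It could be called as `$content = cleanLinks($snoopy->results, 'http://www.alibaba.com');`. Ordinary relative links such as `http://www.alibaba.com/Products?...` pass through unchanged.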

QueryList

<code class="php"><?php // Scrape all images on a page
$data = QueryList::Query('http://cms.querylist.cc/bizhi/453.html',['image' => ['img','src']])->data;
// Print the result
print_r($data);

// Scrape all hyperlinks on a page
$data = QueryList::Query('http://cms.querylist.cc/google/list_1.html',['link' => ['a','href']])->data;
// Print the result
print_r($data);</code>

http://git.oschina.net/jae/QueryList
Take a look at this one. It is somewhat more powerful than Snoopy and supports jQuery selector syntax.
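
If adding a library is not an option, the same idea of selecting `a` elements and reading their `href` can be sketched with PHP's bundled DOM extension. This is only an illustration: `extractLinks` is a hypothetical helper, the rule for which hrefs to skip is an assumption, and relative hrefs are returned as written, without being resolved against the page URL.

```php
<?php
// Hypothetical helper: collect navigable hrefs from an HTML string with DOMDocument.
function extractLinks(string $html): array
{
    if ($html === '') {
        return [];                       // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so buffer libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $href = trim($a->getAttribute('href'));

        // Skip anchors without a target and in-page fragments such as "#top".
        if ($href === '' || $href[0] === '#') {
            continue;
        }
        // Skip every explicit scheme except http and https
        // (this drops javascript:, mailto:, tel: and so on).
        if (preg_match('#^([a-z][a-z0-9+.-]*):#i', $href, $m)
            && !in_array(strtolower($m[1]), ['http', 'https'], true)) {
            continue;
        }
        $links[] = $href;
    }

    // Remove duplicates and reindex from 0.
    return array_values(array_unique($links));
}
```

The HTML can come from any fetcher, for example `file_get_contents($url)`, or Snoopy's `fetch()` followed by reading `$snoopy->results`.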
