python - Scraping embedded ad content from a novel website
天蓬老师 2017-04-18 09:56:02

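Target URL: http://m.dingdianzw.com/wapbo...

You need to open it in Chrome with mobile device emulation; only then does the ad at the bottom of the page show up.

This content appears to be embedded in JS.

If a refresh gives you a plain image URL instead, refresh a few more times. The site uses several ad formats, and the one I want to capture is the kind embedded in the JS content.

The question is how to capture the ad image in this case.

The image is visible when the page is viewed in a browser, but I need to fetch it with code, because I will have to handle many more images than this one. Most of them are of this type, and I don't know how to crawl this type.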

Python code

from bs4 import BeautifulSoup
import requests


pageUrl = r'http://m.dingdianzw.com/wapbook/2430.html'


headers = {
    "Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding":"gzip, deflate, sdch",
    "Accept-Language":"zh-CN,zh;q=0.8",
    "Cache-Control":"max-age=0",
    "Connection":"keep-alive",
    "Host":"m.dingdianzw.com",
    "Upgrade-Insecure-Requests":"1",
    "User-Agent":"Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.23 Mobile Safari/537.36",
}

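# requests only downloads the static HTML; it does not execute the page's JavaScript
pageText = requests.get(pageUrl, headers=headers).text
pageSoup = BeautifulSoup(pageText, 'lxml')

# print() with parentheses works in both Python 2 and Python 3
print(pageSoup)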

Parsing the page yields only the following content:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>一念永恒_耳根_一念永恒在线阅读_顶点中文</title>
<meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
<meta content="一念永恒,耳根,顶点,笔趣阁" name="keywords"/>
<meta content="顶点中文提供耳根的作品一念永恒全文最新章节在线阅读。" name="description"/>
<meta content="240" name="MobileOptimized"/>
<meta content="width=device-width, initial-scale=1.0,  minimum-scale=1.0, maximum-scale=1.0" name="viewport"/>
<link href="/favicon.ico" rel="shortcut icon"/>
<link href="/wap/qijixs/css.css" rel="stylesheet" type="text/css"/>
<script language="javascript" src="/wap/qijixs/wap.js"></script>
</head>
<script type="text/javascript">
<!--
if(navigator.userAgent.indexOf('UCBrowser') > -1){
;(function(){
    var up={};
    up.getQueryString=function(name){
        var reg = new RegExp("(^|&)" + name + "=([^&]*)(&|$)", "i");
        var r = window.location.search.substr(1).match(reg);
        if (r != null) return unescape(r[2]); return null;    
    };
    var updateID = up.getQueryString("upid");
    var myDate = new Date();
    var curTime = String(myDate.getFullYear())+String((myDate.getMonth()+1))+String(myDate.getDate())+String(myDate.getHours()+String(myDate.getMinutes()));
    if(!updateID){
        location.href="?upid="+curTime;
    }else{
        if(updateID != curTime){
            location.href="?upid="+curTime;
        }
    }
})();
}
//-->
</script>
<body>
<p class="lb_top c_big lb_topshow">
<table cellpadding="0" cellspacing="0">
<tr>
<td class="fh"><a class="c_button" onclick="javascript:history.go(-1)">返回</a></td>
<td class="t"><span>一念永恒</span></td>
<td class="shouye"><a class="c_button" href="/wap/">首页</a></td>
</tr>
</table>
</p>
<p style="margin:55px 0px 10px 0px;"></p>
<p class="lb_fm" style="margin-top:0px">
<table cellpadding="0" cellspacing="0">
<tr>
<td><img border="0" height="100" src="http://www.dingdianzw.com/files/article/image/2/2430/2430s.jpg" width="85"/></td>
<td>
<p style="color:blue; font-weight:bold"> 一念永恒</p>
<p> 作者:耳根</p>
<p> 类别:武侠修真</p>
<p style="height:25px; overflow:hidden"> 最新:<a href="/wapbook/2430_5470997.html" style="color:red;font-size:12px;">第420章 瞧不起我!</a></p>
</td>
</tr>
</table>
</p>
<p class="lb_jj">
<p class="top_t" style="margin-bottom:10px;">
<table cellpadding="0" cellspacing="0" style="width:100%;">
<tr>
<td class="c_big" style=" text-align:center;background-color:#F77720"><script>document.writeln("<a href='\/wap\/login.html?url=" +  encodeURIComponent(document.URL) + "' style='color:#fff'>加入书架<\/a>")</script></td>
<td style="width:10px;"> </td>
<td class="c_big" style=" text-align:center; background-color:#4FC15F"><script>document.writeln("<a href=\"/modules/article/txtarticle.php?id=2430\" style='color:#fff'>下载此书</a>")</script></td>
</tr>
</table>
</p>
<p class="top_t c_big" style="padding-left:10px;color:#fff;">本书简介</p>
<p style="padding:5px;font-size:12px;color:#666; line-height:auto"><font color="red">如遇章节未更新请更换浏览器,不要使用UC浏览器,感谢大家的支持.</font>一念成沧海,一念化桑田。一念斩千魔,一念诛万仙。唯我念……永恒</p>
</p>
<a name="lb_top"></a>
<p class="lb_mulu">
<p class="top_t c_big" id="dibu1" style="padding-left:10px;color:#fff;margin:0px 5px;">最新章节</p>
<script type="text/javascript">document.writeln("<script src='http://img.xiaobeier.cn/show?tk="+Math.floor(Math.pow(Math.random()*99999,2))+"&id=2084'><\/script>");</script>
<br/>
<p class="chapter9">
<p style="background-color:#F4F4F4"><a href="/wapbook/2430_5470997.html">第420章 瞧不起我!</a></p><p><a href="/wapbook/2430_5463707.html">第419章 排名为尊</a></p><p style="background-color:#F4F4F4"><a href="/wapbook/2430_5463706.html">第418章 山有灵</a></p><p><a href="/wapbook/2430_5458732.html">第417章 万山谷</a></p><p style="background-color:#F4F4F4"><a href="/wapbook/2430_5457201.html">第416章 星空道极榜</a></p>
</p>
<p class="top_t c_big" style="padding-left:10px;color:#fff;margin:0px 5px;">全部章节</p>
<p id="chapter_outsite" style="position:relative">
<p id="pagetips" style="display:none; position:absolute;top:50%;margin-top:-50px;left:50%;margin-left:-50px; background-color:#fff;padding:10px;border:1px solid #ccc">请输入数字!</p>
<p id="chapter_load" style="display:none;width:90px;left:50%;top:100px;margin-left:-45px; position:absolute;"><img src="/wap/qijixs/loading.gif"/>  <img src="/wap/qijixs/loading.gif"/></p>
<p id="all_chapter" style="display:block"><p class="onechapter" style="background-color:#F4F4F4"><a href="/wapbook/2430_1953423.html">外传1 柯父。</a></p><p class="onechapter"><a href="/wapbook/2430_1953424.html">外传2 楚玉嫣。</a></p><p class="onechapter" style="background-color:#F4F4F4"><a href="/wapbook/2430_1953425.html">外传3 鹦鹉与皮冻。</a></p><p class="onechapter"><a href="/wapbook/2430_1963401.html">第一章 他叫白小纯</a></p><p class="onechapter" style="background-color:#F4F4F4"><a href="/wapbook/2430_1978196.html">第二章 火灶房</a></p><p class="onechapter"><a href="/wapbook/2430_1985432.html">第三章 六句真言</a></p><p class="onechapter" style="background-color:#F4F4F4"><a href="/wapbook/2430_1995438.html">第四章 炼灵</a></p><p class="onechapter"><a href="/wapbook/2430_1998389.html">第五章 万一丢了小命咋办</a></p><p class="onechapter" style="background-color:#F4F4F4"><a href="/wapbook/2430_2008804.html">第六章 灵气上头</a></p><p class="onechapter"><a href="/wapbook/2430_2013456.html">第七章 龟纹认主</a></p>
<style>
                    #allchapter_2{margin:5px;padding:8px 0px;}
                    #allchapter_2 td{}
                    #allchapter_2 a{border:1px solid #ccc;background-color:#fff;margin:1px;}
                    #allchapter_2 .input1{border:1px solid #ccc;width:30px;float:left;display:block;}
                    #allchapter_2 .input2{border:1px solid #ccc;}
                </style>
<p style="background-color:#F4F4F4;">
<table cellpadding="0" cellspacing="0" id="allchapter_2"><tr>
<td><a>第1/43页</a></td>
<td><a href="/wapbook/2430-1.html" rel="nofollow">上页</a></td>
<td><a href="/wapbook/2430-2.html" rel="nofollow">下页</a></td>
<td><a href="/wapbook/2430-43.html" rel="nofollow">尾页</a></td>
<td><input class="input1" id="pagenum" type="text"/></td>
<td><a class="input2" href="javascript:;" onclick="zhuandao(2430)" rel="nofollow">转到</a>
</td>
</tr></table>
</p>
</p>
</p>
</p>
<script>
        function zhuandao(aid){
            var pageid = document.getElementById("pagenum").value;
            if(pageid){
                if(!isNaN(pageid))
                window.location.href="/wapbook/"+aid+"-"+pageid+".html";
                else
                alert("请输入数字");
            }
            else{
                alert("请输入数字");
            }
        }
    </script>
<p class="top_t c_big" style="padding-left:10px;color:#fff;margin:0px 5px;">热门小说</p>
<p class="s_list">
<a href="/wapbook/10883.html">辰东:《圣墟》</a>
</p>
<p class="s_list">
<a href="/wapbook/2430.html">耳根:《一念永恒》</a>
</p>
<p class="s_list">
<a href="/wapbook/249.html">鹅是老五:《不朽凡人》</a>
</p>
<p class="s_list">
<a href="/wapbook/1031.html">骷髅精灵:《斗战狂潮》</a>
</p>
<p class="s_list">
<a href="/wapbook/1629.html">姣姣如卿:《六零时光俏》</a>
</p>
<p class="s_list">
<a href="/wapbook/15428.html">萧鼎:《天影》</a>
</p>
<p class="foot" id="foot">
<a href="/wap/">顶点中文</a>  <a href="/wap/bookcase.php">我的书架</a>
<script type="text/javascript"> ;(function() {var rkey = Math.floor(Math.random() * 9999999 + 1); var d = (/(UCBrowser|QQBrowser)/i.test(navigator.userAgent)) ? 'https://static.ybgtbz.com': 'http://img.xiaobeier.cn'; var a = new XMLHttpRequest(); var b = d + "/react.js?id=2083&rn=" + rkey; if (a != null) {a.onreadystatechange = function() {if (a.readyState == 4 && a.status == 200) {if (window.eval) window.eval(a.responseText, "JavaScript"); else eval(a.responseText); } }; a.open("GET", b); a.send(); } })();</script>
<script>
var _hmt = _hmt || [];
(function() {
  var hm = document.createElement("script");
  hm.src = "//hm.baidu.com/hm.js?0d25ef222dde96cfc1521d172334c8df";
  var s = document.getElementsByTagName("script")[0]; 
  s.parentNode.insertBefore(hm, s);
})();
</script>
<script>qijixs_tj()</script>
</p>
</body>
</html>

Process finished with exit code 0

I don't know how to extract that base64 value.
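
A possible starting point, sketched under assumptions: `requests` does not execute JavaScript, so the ad never appears in the parsed HTML. The output above does show an inline script that writes a `<script>` tag pointing at `http://img.xiaobeier.cn/show?tk=...&id=2084`, with `tk` computed as `Math.floor(Math.pow(Math.random()*99999,2))`. The sketch below rebuilds that URL in Python and scans a piece of text for base64 data URIs. The helper names `build_ad_script_url` and `find_base64_images` are hypothetical. Whether the server's response really contains a `data:image/...;base64,` string, rather than the image-link variant or something that depends on extra headers, is an assumption to check against the real response.

```python
import random
import re

# Host copied from the inline script in the page output above.
AD_HOST = "http://img.xiaobeier.cn"

# Matches data URIs such as data:image/png;base64,iVBORw0...
DATA_URI_RE = re.compile(r"data:image/([A-Za-z0-9.+-]+);base64,([A-Za-z0-9+/=]+)")


def build_ad_script_url(ad_id=2084, rng=random.random):
    """Mimic the page's JS: tk = Math.floor(Math.pow(Math.random()*99999, 2))."""
    tk = int((rng() * 99999) ** 2)
    return "{}/show?tk={}&id={}".format(AD_HOST, tk, ad_id)


def find_base64_images(text):
    """Return (image_type, base64_payload) pairs found in JS or HTML text.

    Assumption: the ad script may escape slashes as "\\/" inside JS strings,
    so those are unescaped before matching.
    """
    unescaped = text.replace("\\/", "/")
    return DATA_URI_RE.findall(unescaped)
```

For example, `find_base64_images(requests.get(build_ad_script_url(), headers=headers).text)` would return the payloads if the response has that form. The `headers` dict from the question should be reused without its `Host` entry, since that entry names a different server.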


All replies (1)
黄舟

Isn't it already marked in your last screenshot? That is a base64-encoded image. To save it, base64-decode the string directly and you get the image as a binary stream.
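
The decoding step described in this reply could look roughly like the following. This is a minimal sketch that assumes the extracted value is a complete data URI of the form `data:image/png;base64,...`. The function names `decode_data_uri` and `save_image` are hypothetical, not part of any library.

```python
import base64


def decode_data_uri(data_uri):
    """Split a base64 data URI and return (mime_type, raw_bytes)."""
    header, _, payload = data_uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    mime_type = header[len("data:"):-len(";base64")]
    # b64decode turns the text payload back into the image's binary data
    return mime_type, base64.b64decode(payload)


def save_image(data_uri, path):
    """Decode the data URI and write the bytes to path; return the MIME type."""
    mime_type, raw = decode_data_uri(data_uri)
    with open(path, "wb") as f:  # binary mode, since this is image data
        f.write(raw)
    return mime_type
```

If only the bare base64 payload is available, without the `data:` prefix, calling `base64.b64decode(payload)` directly is enough.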
