When writing a scraper, we sometimes need to fetch pages that are only accessible after logging in. Yet even when the login itself succeeds, the protected pages may still refuse to load. Why is this?
The most likely reason is that the cookie set by the server after a successful login was not sent along with the later requests.
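For websites with fairly light security measures, we can log in with PHP's cURL functions. The CURLOPT_COOKIEJAR option of curl_setopt saves the cookies received at login to a file, and CURLOPT_COOKIEFILE reads that file and sends the cookies with later requests.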
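<?php
// tempnam() creates a file with a unique name in the given directory and returns its name.
// If the directory does not exist, the file is created in the system temp directory instead.
$cookie_file = tempnam('./tmp', 'cookie'); // 'cookie' is the file name prefix

$postfield = 'LoginForm[username]=admin&LoginForm[password]=admin&LoginForm[rememberMe]=0&yt0=Login';
// URL the login form submits to. You can find it with Firebug in Firefox
// or with the developer tools in Google Chrome.
$url = "http://localhost/testdrive/index.php?r=site/login";
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file); // file the cookies are saved to
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // disables certificate checks; acceptable only for local testing
curl_setopt($ch, CURLOPT_POSTFIELDS, $postfield);
$login_response = curl_exec($ch);
// cURL writes the cookie jar when the handle is released, so free it
// explicitly to make sure the file is on disk before the next request.
unset($ch);

$url = "http://localhost/testdrive/index.php"; // a page that requires login
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookie_file); // read the saved cookies and send them with this request
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookie_file); // keep saving any cookies the server updates
$page = curl_exec($ch);
echo $page;
?>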
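If the protected page still does not load, it can help to check whether the login actually stored a cookie. The following is a minimal sketch, not part of the original code: `read_cookie_jar` is a hypothetical helper that parses the cookie file written by cURL, assuming the usual Netscape cookie file layout (seven tab-separated fields per line, with HttpOnly cookies prefixed by `#HttpOnly_`).

```php
<?php
// Hypothetical helper (an illustration, not from the original post):
// returns the cookies stored in a cURL cookie jar as [name => value].
function read_cookie_jar(string $path): array
{
    $cookies = [];
    $lines = @file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        return $cookies; // jar missing or unreadable
    }
    foreach ($lines as $line) {
        $line = rtrim($line, "\r"); // tolerate Windows line endings
        if ($line === '') {
            continue;
        }
        // cURL marks HttpOnly cookies with a "#HttpOnly_" prefix;
        // these are real entries, not comments.
        if (strpos($line, '#HttpOnly_') === 0) {
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif ($line[0] === '#') {
            continue; // ordinary comment line
        }
        // Fields: domain, subdomain flag, path, secure, expiry, name, value
        $fields = explode("\t", $line);
        if (count($fields) === 7) {
            $cookies[$fields[5]] = $fields[6];
        }
    }
    return $cookies;
}
```

After the login request has finished and its handle has been released, calling `read_cookie_jar($cookie_file)` and finding an empty array suggests that the login failed or that the jar was never written. The name of the session cookie to look for depends on the site; `PHPSESSID` is PHP's default, but that is an assumption about the target application.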