How to implement headless crawling using Python + Selenium + chromedriver
迷茫
迷茫 2017-05-18 10:53:13

While using Selenium to crawl 12306, I found that PhantomJS does not work but chromedriver does. Presumably the site detects PhantomJS and blocks it. chromedriver, however, opens a visible browser window, which makes crawling slow.
I have two questions. I have searched Google for a long time without finding an effective solution.
1. How can I disguise PhantomJS as well as possible?
2. How can I configure chromedriver so that it does not display a window, or are there other ways to improve crawling efficiency?

Thanks!


All replies (2)
洪涛

You can do this with PyVirtualDisplay, which runs the browser inside a virtual X display (it relies on Xvfb, so this applies to Linux). The code looks roughly like this:

#!/usr/bin/env python

from pyvirtualdisplay import Display
from selenium import webdriver

display = Display(visible=0, size=(800, 600))
display.start()

# Chrome now runs inside the virtual display,
# so no browser window appears on screen.
browser = webdriver.Chrome()
browser.get('http://www.baidu.com')
print(browser.title)
browser.quit()

display.stop()

I don't know whether you have modified the request headers that PhantomJS sends. With Chrome, for example, you can set the language and the User-Agent through ChromeOptions:

from selenium import webdriver
options = webdriver.ChromeOptions()
# Chrome expects a locale such as zh-CN for --lang
options.add_argument('--lang=zh-CN')
# No inner quotes: they would become part of the User-Agent string itself
options.add_argument('--user-agent=Mozilla/5.0 (iPod; U; CPU iPhone OS 2_1 like Mac OS X; ja-jp) AppleWebKit/525.18.1 (KHTML, like Gecko) Version/3.1.1 Mobile/5F137 Safari/525.20')
browser = webdriver.Chrome(options=options)
url = "https://baidu.com"
browser.get(url)
browser.quit()

This approach overrides the User-Agent header, shown here for Chrome. You can try overriding the PhantomJS headers in the same way.
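For PhantomJS itself, the headers are usually overridden through desired capabilities rather than ChromeOptions. The following is only a sketch under some assumptions: it needs Selenium 3.x (PhantomJS support was removed in Selenium 4), the helper names and the User-Agent value are made up for illustration, and a changed User-Agent alone may not be enough to get past the site's detection.

```python
# Illustrative desktop Chrome User-Agent (an assumed value; use any real browser UA).
DESKTOP_CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
)


def phantomjs_capability_overrides(user_agent=DESKTOP_CHROME_UA,
                                   accept_language="zh-CN,zh;q=0.8"):
    """Return the capability entries that replace PhantomJS's default headers.

    Hypothetical helper: it only builds a dict and does not need Selenium.
    """
    return {
        # Replaces the default User-Agent, which contains the word "PhantomJS".
        "phantomjs.page.settings.userAgent": user_agent,
        # Adds an Accept-Language header to every request.
        "phantomjs.page.customHeaders.Accept-Language": accept_language,
    }


def make_phantomjs_driver(**overrides):
    """Start PhantomJS with the overridden headers (Selenium 3.x only)."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

    caps = dict(DesiredCapabilities.PHANTOMJS)
    caps.update(phantomjs_capability_overrides(**overrides))
    return webdriver.PhantomJS(desired_capabilities=caps)
```

Calling make_phantomjs_driver() then gives a driver you use exactly like the Chrome one above: browser.get(url), read what you need, and browser.quit().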

世界只因有你

You can refer to my article on running Selenium in headless mode.
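The article itself is not reproduced here, so the following is not its code, only a sketch of what headless mode with chromedriver typically looks like. It assumes Chrome 59 or newer (where the --headless switch is available) and a Selenium version that accepts the options= keyword. The helper names and the window size are illustrative.

```python
def headless_chrome_arguments(window_size=(1280, 800), user_agent=None):
    """Build the Chrome command-line switches for a windowless session.

    Hypothetical helper: it only builds a list of strings.
    """
    args = [
        "--headless",     # run Chrome without opening a window
        "--disable-gpu",  # historically recommended alongside --headless on Windows
        "--window-size={},{}".format(window_size[0], window_size[1]),
    ]
    if user_agent:
        # Optional: present a custom User-Agent to the site.
        args.append("--user-agent={}".format(user_agent))
    return args


def make_headless_chrome(**kwargs):
    """Start chromedriver with the switches above."""
    # Imported here so the helper above can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in headless_chrome_arguments(**kwargs):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

With this, browser = make_headless_chrome() followed by browser.get('http://www.baidu.com') runs without a window and without PyVirtualDisplay. Compared with the virtual display approach, it also works on Windows and macOS, since no Xvfb is required.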
