
Use selenium to capture Taobao product information

Mar 23, 2018 pm 04:38 PM

This article shows how to use selenium to capture Taobao product information and what to watch out for along the way. The following is a practical example.

Taobao pages load much of their data with JavaScript, so they are easier to crawl with selenium, which drives a real browser and sees the rendered page. Selenium is a testing tool; for crawling it is mainly used together with the headless browser PhantomJS.

import re
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from pyquery import PyQuery as pq
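'''
wait.until() is selenium's explicit wait. wait is a WebDriverWait object configured
with a timeout. If the condition is not yet met (for example, the element is not in
the DOM), it keeps waiting and checks again at a fixed interval. As soon as the
condition holds, execution moves on to the next step. If the timeout elapses first,
a TimeoutException is raised.
Expected conditions used below:
1. presence_of_element_located: the element is present in the DOM; takes a locator tuple such as (By.ID, 'p')
2. element_to_be_clickable: the element can be clicked
3. text_to_be_present_in_element: the element's text contains the given string
'''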
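# Create a headless browser (webdriver.PhantomJS is only available in selenium releases before 4.0)
browser = webdriver.PhantomJS(
    service_args=[
        '--load-images=false',
        '--disk-cache=true'])
# Give up on a condition after waiting 10 seconds
wait = WebDriverWait(browser, 10)
# A window size must still be set even though there is no visible window
browser.set_window_size(1400, 900)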
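def search():
    '''
    Run the search from the Taobao home page. Swap the selectors to reuse this on other sites.
    :return: text of the pager element that shows the total page count
    '''
    print('Searching')
    try:
        # Open the home page
        browser.get('https://www.taobao.com')
        # Locate the search box on the Taobao home page
        search_box = wait.until(
            EC.presence_of_element_located((By.CSS_SELECTOR, '#q'))
        )
        # Locate the search button
        submit = wait.until(EC.element_to_be_clickable(
            (By.CSS_SELECTOR, '#J_TSearchForm > div.search-button > button')))
        # Type the search keyword ('面条' means noodles)
        search_box.send_keys('面条')
        # Click the search button
        submit.click()
        # Wait for the pager element that shows the total number of pages
        total = wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, '#mainsrp-pager > div > div > div > div.total')))
        # Scrape the products on the first result page
        get_products()
        # Return the text that contains the total page count
        return total.text
    except TimeoutException:
        # On timeout, start the search again
        return search()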
def next_page(page_number):
    '''
    Jump to the given result page and scrape it.
    :param page_number: page to jump to
    :return:
    '''
    print('Turning to page', page_number)
    try:
        # Input box of the "jump to page" form
        page_input = wait.until(EC.presence_of_element_located(
            (By.CSS_SELECTOR, '#mainsrp-pager > div > div > div > div.form > input')))
        # Confirm button of the "jump to page" form
        submit = wait.until(
            EC.element_to_be_clickable(
                (By.CSS_SELECTOR,
                 '#mainsrp-pager > div > div > div > div.form > span.J_Submit')))
        # Clear the number already in the box
        page_input.clear()
        # Enter the new page number
        page_input.send_keys(str(page_number))
        # Click the confirm button
        submit.click()
        # Check that the highlighted page is the one we asked for
        wait.until(
            EC.text_to_be_present_in_element(
                (By.CSS_SELECTOR,
                 '#mainsrp-pager > div > div > div > ul > li.item.active > span'),
                str(page_number)))
        # Scrape the products on this page
        get_products()
    # On timeout, try the same page again
    except TimeoutException:
        next_page(page_number)
def get_products():
    '''
    Extract the fields we need from the current search result page.
    :return:
    '''
    # Wait until a product item has loaded before reading the page source
    wait.until(EC.presence_of_element_located(
        (By.CSS_SELECTOR, '#mainsrp-itemlist .items .item')))
    # Source of the whole rendered page
    html = browser.page_source
    # Parse the page source with pyquery
    doc = pq(html)
    items = doc('#mainsrp-itemlist .items .item').items()
    for item in items:
        product = {
            'image': item.find('.pic .img').attr('src'),
            'price': item.find('.price').text(),
            # Drop the 3-character suffix that follows the number
            'deal': item.find('.deal-cnt').text()[:-3],
            'title': item.find('.title').text(),
            'shop': item.find('.shop').text(),
            'location': item.find('.location').text()
        }
        print(product)
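def main():
    try:
        # Step 1: run the search
        total = search()
        # Extract the total page count as an int; it bounds the loop below
        total = int(re.compile(r'(\d+)').search(total).group(1))
        # Page 1 was scraped by search(), so continue from page 2
        for i in range(2, total + 1):
            next_page(i)
    except Exception:
        print('Something went wrong')
    finally:
        # Close the browser
        browser.close()


if __name__ == '__main__':
    main()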
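
The explicit wait described in the comment at the top of the script can be pictured as a simple polling loop. The following is a minimal sketch of that idea in plain Python, not selenium's actual implementation. The names `poll_until`, `WaitTimeout`, and `poll_interval` are hypothetical and chosen only for illustration.

```python
import time


class WaitTimeout(Exception):
    """Hypothetical stand-in for selenium's TimeoutException."""


def poll_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds
    have passed without the condition holding.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f'condition not met within {timeout} s')
        time.sleep(poll_interval)


# Usage: the condition becomes true on the third check.
attempts = []


def ready_on_third_check():
    attempts.append(1)
    return 'ready' if len(attempts) >= 3 else None


value = poll_until(ready_on_third_check, timeout=2.0, poll_interval=0.01)
```

In the script, `wait.until(...)` plays the role of `poll_until`, and each `EC.*` expected condition plays the role of the `condition` callable.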
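
In the script, search() and next_page() call themselves again whenever a TimeoutException is raised, so a page that never loads keeps retrying until Python's recursion limit is reached. One possible variation is to cap the number of attempts. The sketch below is plain Python, with the built-in `TimeoutError` standing in for selenium's TimeoutException. The names `retry`, `max_attempts`, and `retry_on` are hypothetical and not part of the original script.

```python
def retry(action, max_attempts=3, retry_on=(TimeoutError,)):
    """Call action() until it succeeds or max_attempts is reached.

    The last exception is re-raised when every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == max_attempts:
                raise


# Usage: a flaky action that times out twice before succeeding.
calls = {'count': 0}


def flaky_page_load():
    calls['count'] += 1
    if calls['count'] < 3:
        raise TimeoutError('page did not load in time')
    return 'page 2 loaded'


result = retry(flaky_page_load, max_attempts=5)
```

Applied to the script, the body of next_page() could be passed as `action` with `retry_on=(TimeoutException,)`, which keeps the retry behaviour but lets a persistent failure surface in main().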
That completes the example: the script searches Taobao, walks through every result page, and prints the fields of each product.


