Python uses the Tesseract library to implement identification verification

小云云
Release: 2018-03-29 13:31:27
Original
1878 people have browsed it

1. Introduction to Tesseract

Tesseract is an OCR library (OCR is the abbreviation of Optical Character Recognition in English). It is used to scan text data and then analyze image files. In the process of processing and obtaining text and layout information, Tesseract is currently recognized as the best OCR library with relatively accurate recognition.

2. Use of Tesseract

1. Download and install Tesseract: Click to download

2. Set environment variables under Windows system:


#根据下载安装文件的路径配置环境变量
set TESSDATA_PREFIX F:\Tesseract-OCR\
Copy after login

3. Install the pytesseract module


pip install pytesseract
Copy after login

4. How to introduce the tesseract.exe application in a Python script:


pytesseract.pytesseract.tesseract_cmd = r'F:\Tesseract-OCR\tesseract.exe'
Copy after login

5. Case demonstration

Recognize the following picture text:


import pytesseract
from PIL import Image
#1.引入Tesseract程序
pytesseract.pytesseract.tesseract_cmd = r'F:\Tesseract-OCR\tesseract.exe'
#2.使用Image模块下的Open()函数打开图片
image = Image.open('6.jpg',mode='r')
print(image)
#3.识别图片文字
code= pytesseract.image_to_string(image)
print(code)
Copy after login

Result demonstration:


Google

Note: Some verification codes cannot be recognized by the tesseract-OCR engine. For example, the verification code generated by Douban cannot recognize its content. If you need to crawl the data in Douban, you need to manually enter the verification code:

3. Simulated login Zhihu source code


import requests
import time
import pytesseract
from PIL import Image
from bs4 import BeautifulSoup

def captcha(data):
  with open('captcha.jpg','wb') as fp:
    fp.write(data)
  time.sleep(1)
  image = Image.open("captcha.jpg")
  text = pytesseract.image_to_string(image)
  print "机器识别后的验证码为:" + text
  command = raw_input("请输入Y表示同意使用,按其他键自行重新输入:")
  if (command == "Y" or command == "y"):
    return text
  else:
    return raw_input('输入验证码:')

def zhihuLogin(username,password):

  # 构建一个保存Cookie值的session对象
  sessiona = requests.Session()
  headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'}

  # 先获取页面信息,找到需要POST的数据(并且已记录当前页面的Cookie)
  html = sessiona.get('https://www.zhihu.com/#signin', headers=headers).content

  # 找到 name 属性值为 _xsrf 的input标签,取出value里的值
  _xsrf = BeautifulSoup(html ,'lxml').find('input', attrs={'name':'_xsrf'}).get('value')

  # 取出验证码,r后面的值是Unix时间戳,time.time()
  captcha_url = 'https://www.zhihu.com/captcha.gif?r=%d&type=login' % (time.time() * 1000)
  response = sessiona.get(captcha_url, headers = headers)


  data = {
    "_xsrf":_xsrf,
    "email":username,
    "password":password,
    "remember_me":True,
    "captcha": captcha(response.content)
  }

  response = sessiona.post('https://www.zhihu.com/login/email', data = data, headers=headers)
  print response.text

  response = sessiona.get('https://www.zhihu.com/people/maozhaojun/activities', headers=headers)
  print response.text


if __name__ == "__main__":
  #username = raw_input("username")
  #password = raw_input("password")
  zhihuLogin('xxxx@qq.com','ALAxxxxIME')
Copy after login

Related recommendations:

Call pytesseract under python to identify a website verification code

The above is the detailed content of Python uses the Tesseract library to implement identification verification. For more information, please follow other related articles on the PHP Chinese website!

Related labels:
source:php.cn
Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn
Popular Tutorials
More>
Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template