1. Introduction to Tesseract
Tesseract is an OCR (Optical Character Recognition) library. It analyzes image files, such as scans of printed text, and extracts the text and layout information they contain. Tesseract is widely regarded as one of the more accurate open-source OCR engines.
2. Using Tesseract
1. Download and install Tesseract.
2. Set the environment variable on Windows:
rem Set the variable according to the path where Tesseract was installed
set TESSDATA_PREFIX=F:\Tesseract-OCR\
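The same variable can also be set from inside a Python script, because pytesseract runs Tesseract as a child process that inherits the script's environment. The following is a minimal sketch rather than part of the original setup: the helper name `ensure_tessdata_prefix` is hypothetical, and the path is simply the install location assumed in this article.

```python
import os


def ensure_tessdata_prefix(default_prefix):
    """Set TESSDATA_PREFIX for this process unless it is already defined."""
    # setdefault keeps a value configured at the system level and only
    # fills in the fallback when the variable is missing.
    return os.environ.setdefault("TESSDATA_PREFIX", default_prefix)


# Assumed install location, taken from the path used in this article.
prefix = ensure_tessdata_prefix("F:\\Tesseract-OCR\\")
print("TESSDATA_PREFIX =", prefix)
```

Call the helper before the first OCR call so that the child process sees the value.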
3. Install the pytesseract module
pip install pytesseract
4. Point pytesseract at the tesseract.exe executable in a Python script:
pytesseract.pytesseract.tesseract_cmd = r'F:\Tesseract-OCR\tesseract.exe'
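A hard-coded path only works on machines that use the same install location. One possible refinement, sketched here under the assumption that Tesseract may already be on `PATH`, is to look there first and fall back to the fixed path otherwise. The helper name `resolve_tesseract_cmd` is hypothetical.

```python
import shutil


def resolve_tesseract_cmd(fallback_path, name="tesseract"):
    """Return the Tesseract executable found on PATH, else `fallback_path`."""
    found = shutil.which(name)
    return found if found is not None else fallback_path


# Fallback is the install path used in this article.
tesseract_path = resolve_tesseract_cmd(r"F:\Tesseract-OCR\tesseract.exe")
print(tesseract_path)
```

The returned value can then be assigned to `pytesseract.pytesseract.tesseract_cmd` in the same way as the fixed path.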
5. Example
Recognize the text in an image (6.jpg in this example):
import pytesseract
from PIL import Image

# 1. Point pytesseract at the Tesseract executable
pytesseract.pytesseract.tesseract_cmd = r'F:\Tesseract-OCR\tesseract.exe'

# 2. Open the image with the open() function of the Image module
image = Image.open('6.jpg', mode='r')
print(image)

# 3. Recognize the text in the image
code = pytesseract.image_to_string(image)
print(code)
Result:
Google
Note: The Tesseract OCR engine cannot recognize every CAPTCHA. For example, it fails on the CAPTCHAs generated by Douban, so to crawl data from Douban you have to enter the CAPTCHA manually.
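Because recognition can fail, it may help to check the OCR result before submitting it and to fall back to manual entry when it looks wrong. The sketch below is an illustration only: the function names are hypothetical, and the rule that a CAPTCHA has a fixed length and contains only letters and digits is an assumption that will not hold for every site.

```python
import string

# Assumption: the CAPTCHA uses only ASCII letters and digits.
ALLOWED = set(string.ascii_letters + string.digits)


def clean_ocr_text(raw):
    """Drop spaces, newlines and other whitespace that OCR output may carry."""
    return "".join(ch for ch in raw if not ch.isspace())


def is_plausible_captcha(text, expected_length=4):
    """Sanity check: expected length and only allowed characters."""
    return len(text) == expected_length and all(ch in ALLOWED for ch in text)


guess = clean_ocr_text("ab 12\n")
print(guess, is_plausible_captcha(guess))
```

When `is_plausible_captcha` returns False, the script can skip the OCR guess and ask the user directly.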
3. Source code for a simulated Zhihu login
import requests
import time
import pytesseract
from PIL import Image
from bs4 import BeautifulSoup


def captcha(data):
    # Save the CAPTCHA image to disk, then run OCR on it
    with open('captcha.jpg', 'wb') as fp:
        fp.write(data)
    time.sleep(1)
    image = Image.open('captcha.jpg')
    # Strip the trailing whitespace that OCR output may carry
    text = pytesseract.image_to_string(image).strip()
    print('CAPTCHA recognized by OCR: ' + text)
    command = input('Enter Y to accept it, or press any other key to type it yourself: ')
    if command in ('Y', 'y'):
        return text
    return input('Enter the CAPTCHA: ')


def zhihuLogin(username, password):
    # Create a session object that stores the cookies
    sessiona = requests.Session()
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0'}
    # Fetch the page first to find the data to POST (this also records the page's cookies)
    html = sessiona.get('https://www.zhihu.com/#signin', headers=headers).content
    # Find the input tag whose name attribute is _xsrf and take its value
    _xsrf = BeautifulSoup(html, 'lxml').find('input', attrs={'name': '_xsrf'}).get('value')
    # Fetch the CAPTCHA; the r parameter is a Unix timestamp in milliseconds, from time.time()
    captcha_url = 'https://www.zhihu.com/captcha.gif?r=%d&type=login' % (time.time() * 1000)
    response = sessiona.get(captcha_url, headers=headers)
    data = {
        "_xsrf": _xsrf,
        "email": username,
        "password": password,
        "remember_me": True,
        "captcha": captcha(response.content)
    }
    response = sessiona.post('https://www.zhihu.com/login/email', data=data, headers=headers)
    print(response.text)
    response = sessiona.get('https://www.zhihu.com/people/maozhaojun/activities', headers=headers)
    print(response.text)


if __name__ == "__main__":
    # username = input("username")
    # password = input("password")
    zhihuLogin('xxxx@qq.com', 'ALAxxxxIME')
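The accept-or-retype step inside captcha() can be separated from the file and network code, which makes it easier to test without a real CAPTCHA. This is a minimal sketch, assuming a hypothetical helper `confirm_captcha` whose `ask` parameter defaults to the built-in `input`; it is not part of the original script.

```python
def confirm_captcha(ocr_text, ask=input):
    """Return the OCR guess if the user accepts it, else the text they type."""
    guess = ocr_text.strip()
    answer = ask(f"OCR read {guess!r}. Enter Y to accept, anything else to retype: ")
    if answer.strip().lower() == "y":
        return guess
    # Any other answer means the user wants to enter the CAPTCHA manually.
    return ask("Enter the CAPTCHA: ").strip()
```

In a test, `ask` can be replaced with a function that returns prepared answers, so no keyboard input is needed.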