1. Preface
This experiment explains the principles of cracking a verification code (CAPTCHA) through a simple example. You will learn and practice the following knowledge points:
Basic Python knowledge
Use of the PIL (Pillow) module
2. Detailed explanation of examples
Install the Pillow (PIL) library:
$ sudo apt-get update
$ sudo apt-get install python-dev
$ sudo apt-get install libtiff5-dev libjpeg8-dev zlib1g-dev \
    libfreetype6-dev liblcms2-dev libwebp-dev tcl8.6-dev tk8.6-dev python-tk
$ sudo pip install pillow
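A quick way to confirm that Pillow installed correctly (a minimal check, not part of the original article) is to import it from the command line; no output means it worked:

$ python -c "from PIL import Image"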
Download the files for the experiment:
$ wget http://labfile.oss.aliyuncs.com/courses/364/python_captcha.zip
$ unzip python_captcha.zip
$ cd python_captcha
This is the verification code image captcha.gif used in our experiment.
Extract the text image
Create a new crack.py file in the working directory and edit it.
#-*- coding:utf8 -*-
from PIL import Image

im = Image.open("captcha.gif")
# convert the image to 8-bit palette mode
im = im.convert("P")
# print the colour histogram
print im.histogram()
Output:
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 , 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 2, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0 , 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 0, 0, 1, 2, 0, 1, 0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 3, 1, 3, 3, 0, 0, 0, 0, 0, 0, 1, 0, 3, 2, 132, 1, 1, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 15, 0 , 1, 0, 1, 0, 0, 8, 1, 0, 0, 0, 0, 1, 6, 0, 2, 0, 0, 0, 0, 18, 1, 1, 1, 1, 1, 2, 365, 115, 0, 1, 0, 0, 0, 135, 186, 0, 0, 1, 0, 0, 0, 116, 3, 0, 0, 0, 0, 0, 21, 1, 1, 0, 0, 0, 2, 10, 2, 0, 0, 0, 0, 2, 10, 0, 0, 0, 0, 1, 0, 625]
Each position in the colour histogram holds the number of pixels in the image whose colour corresponds to that position. Each pixel can take one of 256 colours. You will find that white is the most common: white is colour number 255, the last position, and you can see there are 625 white pixels. The red pixels are somewhere around 200, and we can find the useful colours by sorting.
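As a quick sanity check (not part of the original code), you can read individual histogram entries directly; index 255 is white and should hold the largest count:

his = im.histogram()
print his[255]   # 625 -> the white background pixels
print his[220]   # 186 -> one of the red shades we are interested in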
his = im.histogram()
values = {}

for i in range(256):
    values[i] = his[i]

for j, k in sorted(values.items(), key=lambda x: x[1], reverse=True)[:10]:
    print j, k
Output:
255 625
212 365
220 186
219 135
169 132
227 116
213 115
234 21
205 18
184 15
We get the 10 most common colours in the picture. Among them, 220 and 227 are the red and grey we need, and we can use this information to construct a black-and-white binary image.
#-*- coding:utf8 -*-
from PIL import Image

im = Image.open("captcha.gif")
im = im.convert("P")
im2 = Image.new("P", im.size, 255)

for x in range(im.size[1]):
    for y in range(im.size[0]):
        pix = im.getpixel((y, x))
        if pix == 220 or pix == 227:  # these are the colours we want to keep
            im2.putpixel((y, x), 0)

im2.show()
The result obtained:
(a black-and-white image of the captcha, with the characters now drawn in black on a white background)
Extract single-character images
The next step is to get the pixel set of each individual character. Since this example is relatively simple, we cut the image vertically, column by column:
inletter = False
foundletter = False
start = 0
end = 0
letters = []

for y in range(im2.size[0]):        # im2.size[0] is the width, so this scans column by column
    for x in range(im2.size[1]):    # scan down the current column
        pix = im2.getpixel((y, x))
        if pix != 255:
            inletter = True
    if foundletter == False and inletter == True:
        foundletter = True
        start = y
    if foundletter == True and inletter == False:
        foundletter = False
        end = y
        letters.append((start, end))
    inletter = False

print letters
Output:
[(6, 14), (15, 25), (27, 35), (37, 46), (48, 56), (57, 67)]
These are the starting and ending column numbers of each character.
import hashlib
import time

count = 0
for letter in letters:
    m = hashlib.md5()
    im3 = im2.crop((letter[0], 0, letter[1], im2.size[1]))
    # use an md5 of the current time plus a counter as a unique file name
    m.update("%s%s" % (time.time(), count))
    im3.save("./%s.gif" % (m.hexdigest()))
    count += 1
(This continues from the code above.)
This crops the picture and saves the part of the image where each character is located, giving one small image per character.
AI and vector-space image recognition
Here we use a vector space search engine for character recognition, which has many advantages:
Does not require a large number of training iterations
Will not overtrain
You can add or remove mislabelled data at any time and observe the effect
Easy to understand and to implement in code
Provides ranked results, so you can view the closest several matches
For things that cannot be recognized, just add them to the search engine and they will be recognized immediately.
Of course, it also has shortcomings: for example, classification is much slower than with a neural network, and it cannot find its own way to solve a problem.
The name "vector space search engine" sounds very grand, but the principle is very simple. Take the following example:
Suppose you have 3 documents. How do we calculate the similarity between them? The more words two documents have in common, the more similar they are! But what if there are too many words? Then we select a few key words; the selected words are also called features. Each feature is like a dimension in space (x, y, z, and so on), and a set of features forms a vector. We can compute such a vector for every document, and by calculating the angle between two vectors we get the similarity of the two documents.
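Concretely, if documents A and B are represented by feature-count vectors, their similarity is the cosine of the angle between those vectors: cos(θ) = (A · B) / (|A| × |B|). A tiny worked example (not from the original article): for A = (1, 2) and B = (2, 4), cos(θ) = (1×2 + 2×4) / (√5 × √20) = 10 / 10 = 1, so the two documents are maximally similar; for A = (1, 0) and B = (0, 1) the dot product is 0, so the similarity is 0.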
Use a Python class to implement the vector space comparison:
import math

class VectorCompare:
    # calculate the magnitude (length) of a vector
    def magnitude(self, concordance):
        total = 0
        for word, count in concordance.iteritems():
            total += count ** 2
        return math.sqrt(total)

    # calculate the cosine of the angle between two vectors
    def relation(self, concordance1, concordance2):
        relevance = 0
        topvalue = 0
        for word, count in concordance1.iteritems():
            if concordance2.has_key(word):
                topvalue += count * concordance2[word]
        return topvalue / (self.magnitude(concordance1) * self.magnitude(concordance2))
It compares two Python dictionaries and outputs their similarity (a number between 0 and 1).
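A minimal usage sketch (not part of the original code) with two hand-made word-count dictionaries:

v = VectorCompare()

doc1 = {'python': 2, 'captcha': 1}
doc2 = {'python': 1, 'captcha': 1}
doc3 = {'php': 3}

print v.relation(doc1, doc2)   # about 0.95 -- many words in common
print v.relation(doc1, doc3)   # 0.0 -- no words in common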
Putting it all together
There is still the work of downloading a large number of verification codes and extracting single-character images from them to build up a training set, but students who have read the above carefully will know how to do this, so it is omitted here. You can use the provided training set directly for the following steps.
The iconset directory contains our training set.
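The layout assumed by the code below is one sub-directory per character, each holding small GIFs of that character (for example ./iconset/7/xxxx.gif). If you wanted to build such a set yourself, a rough sketch (not part of the original article) is to run the cutting code on captchas whose text you already know and file each saved character image under its label:

import os
import shutil

captcha_text = "7s9t9j"                          # hypothetical: the known text of one captcha
character_files = ["c0.gif", "c1.gif", "c2.gif",
                   "c3.gif", "c4.gif", "c5.gif"]  # hypothetical filenames saved by the cutting step

for ch, fname in zip(captcha_text, character_files):
    target_dir = "./iconset/%s/" % ch
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    shutil.move(fname, os.path.join(target_dir, fname))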
Finally, append the following code:
import os

# convert an image into a vector (a dictionary mapping pixel index to pixel value)
def buildvector(im):
    d1 = {}
    count = 0
    for i in im.getdata():
        d1[count] = i
        count += 1
    return d1

v = VectorCompare()

iconset = ['0','1','2','3','4','5','6','7','8','9','0',
           'a','b','c','d','e','f','g','h','i','j','k','l','m',
           'n','o','p','q','r','s','t','u','v','w','x','y','z']

# load the training set
imageset = []
for letter in iconset:
    for img in os.listdir('./iconset/%s/' % (letter)):
        temp = []
        if img != "Thumbs.db" and img != ".DS_Store":
            temp.append(buildvector(Image.open("./iconset/%s/%s" % (letter, img))))
        imageset.append({letter: temp})

count = 0
# cut the captcha image into single characters
for letter in letters:
    m = hashlib.md5()
    im3 = im2.crop((letter[0], 0, letter[1], im2.size[1]))

    guess = []
    # compare each cut-out fragment against every training fragment
    for image in imageset:
        for x, y in image.iteritems():
            if len(y) != 0:
                guess.append((v.relation(y[0], buildvector(im3)), x))

    guess.sort(reverse=True)
    print "", guess[0]
    count += 1
Get the result
Everything is ready, try running our code:
$ python crack.py
Output:
(0.96376811594202894, '7')
(0.96234028545977002, 's')
(0.9286884286888929, '9')
(0.98350370609844473, 't')
(0.96751165072506273, '9')
(0.96989711688772628, 'j')
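Each line is the best match for the corresponding character, so the recognised text reads 7s9t9j. If you would rather get the result as a single string, a small variation of the final loop (a sketch, not part of the original code) collects the top guesses instead of printing them:

result = []
for letter in letters:
    im3 = im2.crop((letter[0], 0, letter[1], im2.size[1]))
    guess = []
    for image in imageset:
        for x, y in image.iteritems():
            if len(y) != 0:
                guess.append((v.relation(y[0], buildvector(im3)), x))
    guess.sort(reverse=True)
    result.append(guess[0][1])

print "".join(result)   # e.g. 7s9t9j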
Summary
The above is the entire content of this article. I hope it can be of some help to everyone's study or work. If you have any questions, feel free to leave a comment.