Implement verification code recognition in PHP (intermediate level)

In the previous article >, it talks about how to identify simple verification. What is simple here is that the verification code consists of numbers and letters, has a unified format, and has a fixed position every time it appears. This article will continue to study in depth the identification of verification codes. The goal of this identification is that the verification code is composed of characters and numbers. The verification code is rotated (maybe both left and right), the position is not fixed, and there is adhesion between characters. And the verification code has stronger interferon. The method explained in this article is not a universal solution, and the code provided cannot directly solve your problem. This is just a method. Readers need to solve it by themselves. It should be noted that identifying verification codes has nothing to do with specific programming languages. This is only implemented using the PHP language. You can use any language to implement it using the method introduced here.

This article explains step by step the steps in the process of identifying the verification code.

As shown in the picture above, the subsequent explanations will focus on this picture.
1: After getting a verification code, the first thing we need to do at first glance is to binarize it. The part of the verification code is represented by 1, and the background part is represented by 0. The identification method is very simple. We print out the RGB of the main picture of the verification code, and then analyze its rules. Through the RGB code, we can easily distinguish the above The R value of this picture is greater than 120, and the values of G and B are less than 80, so according to this rule we can easily binarize the above picture. Let’s look at the two pictures identified in the elementary chapter

At first glance, it felt very complicated. The background color of the verification code picture is different every time, and it is not a single color. The color of each verification code number is also different every time. It seems difficult to binarize, but in fact it is easy to find when we print out its RGB value. No matter how the color of the verification number changes, the RGB value of the number will always have a value less than 125, so the following judgment is made

$rgbarray['red'] < 125 || $rgbarray['green']<125|| $rgbarray['blue'] < 125

It is easy for us to distinguish where the numbers are and where the background is.

The reason why we can find these rules is that when making the interferon of the verification code, in order for the interferon not to affect the display effect of the numbers, the RGB and digital RGB of the interferon must be independent of each other and do not interfere with each other. . As long as we understand this rule, we can easily achieve binarization.

The 120, 80, 125 and other thresholds we found may be different from the actual RGB. Therefore, sometimes after binarization, 1 will appear in some places. For numbers displayed in fixed positions on the verification code, this This kind of interference doesn't make much sense. But for pictures with uncertain positions of verification codes, it is likely to cause interference when we cut characters. Therefore, denoising is required after binarization.

2: Next we proceed to the second step, making noise. The principle of drying is very simple, which is to remove isolated effective values. If the noise is relatively high and the required efficiency is relatively high, there is still a lot of work to be done. Fortunately, we don't require so much depth here. We can use the simplest method. If a point is 1, then determine whether the number in the 8 directions of the point is 1. If it is not 1, then If it is considered a dry point, just set it to 1.

As shown in the picture above, we use this method to easily find that 1 in the red box is the dry point, and just set it to 1 directly.

We use a trick when judging. Sometimes the noise may be two consecutive 1 s, so we

[php:collapse] + expand sourceview plaincopyprint?$num = 0;
if($data[$i][$j] == 1)
{
    // 上
    if(isset($data[$i-1][$j])){
        $num = $num + $data[$i-1][$j];
    }
    // 下
    if(isset($data[$i+1][$j])){
        $num = $num + $data[$i+1][$j];
    }
    // 左
    if(isset($data[$i][$j-1])){
        $num = $num + $data[$i][$j-1];
    }
    // 右
    if(isset($data[$i][$j+1])){
        $num = $num + $data[$i][$j+1];
    }
    // 上左
    if(isset($data[$i-1][$j-1])){
        $num = $num + $data[$i-1][$j-1];
    }
    // 上右
    if(isset($data[$i-1][$j+1])){
        $num = $num + $data[$i-1][$j+1];
    }
    // 下左
    if(isset($data[$i+1][$j-1])){
        $num = $num + $data[$i+1][$j-1];
    }
    // 下右
    if(isset($data[$i+1][$j+1])){
        $num = $num + $data[$i+1][$j+1];
    }
}
if($num == 0){
    $data[$i][$j] = 0;
}
$num = 0;
if($data[$i][$j] == 1)
{
// 上
if(isset($data[$i-1][$j])){
  $num = $num + $data[$i-1][$j];
}
// 下
if(isset($data[$i+1][$j])){
  $num = $num + $data[$i+1][$j];
}
// 左
if(isset($data[$i][$j-1])){
  $num = $num + $data[$i][$j-1];
}
// 右
if(isset($data[$i][$j+1])){
  $num = $num + $data[$i][$j+1];
}
// 上左
if(isset($data[$i-1][$j-1])){
  $num = $num + $data[$i-1][$j-1];
}
// 上右
if(isset($data[$i-1][$j+1])){
  $num = $num + $data[$i-1][$j+1];
}
// 下左
if(isset($data[$i+1][$j-1])){
  $num = $num + $data[$i+1][$j-1];
}
// 下右
if(isset($data[$i+1][$j+1])){
  $num = $num + $data[$i+1][$j+1];
}
}
if($num == 0){
$data[$i][$j] = 0;
}

我们计算这个点的 8 个方向上的值之和，最后我们判断他们的和是否小于特定的阈值
三：经过去噪后，我们就得到干净的二值化的数据，接下来要做的就是切割字符了。切割字符的方法有很多种，这里我采用最简单的一种，先垂直方向切割成为字符，然后在水平方向去掉多于的 0000 ，如下图

第一步切割红线部分，第二步切割蓝线部分，这样就可以得到独立的字符了。但是像下面这种情况

The above method will cut the dw character into one character, which is wrong cutting, so here we involve the cutting of glue characters.
Four: Adhesive character cutting. When making verification codes, the adhesion of regular characters is easy to separate. If the characters themselves are scaled, the deformation will be difficult to handle. After analysis, we can find that the above character adhesion is a very simple way. Regular character adhesion, so we also use a very simple way to deal with this situation. After completing the segmentation operation, we cannot immediately determine that the segmented part is a character. We need to verify it. The key factor of verification is whether the width of the cut character is greater than the threshold. The criterion for selecting this threshold is that no matter how a character is rotated, The deformation will not be greater than this threshold, so if the block we cut is greater than this threshold, it can be considered to be a glued character; if it is greater than the sum of the two thresholds, it is considered to be three characters glued, and so on. After knowing this rule, cutting the sticky characters is very simple. If we find that it is a block of glued characters, we can just divide the block into two or more new blocks. Of course, in order to better restore the characters, I generally use +1 and -1 to appropriately supplement the character block.
Five: After the above four steps, we can extract relatively pure character blocks. The next step is to match characters. There are many methods for establishing feature codes for rotated characters, so we will not go into in-depth study here. The simplest way I use here is to build a matching library for all situations of all characters, so I added a study operation to the code I provided. The purpose is to first manually identify the verification code of the picture, and then use the study method to write Enter the feature code library. The more image data is written in this way, the more accurate rows can be verified and identified.
Okay, after the above steps, we can basically identify most of the verification codes on the Internet today. Here we are using the simplest method without using any OCR knowledge. These methods should be at the pinnacle of the non-OCR field. If you want to identify more complex verification codes, you need more OCR knowledge. If I have the chance, I will introduce them one by one in the advanced chapter.
Below are some easy-to-identify verification codes, hoping to attract the attention of website administrators.

Some suggestions for making verification codes
For the program to identify verification codes, the most difficult part is the cutting of verification characters and the establishment of feature codes. When many domestic programmers only make verification codes, they always like to add a lot of interferons and interference lines to the verification code, which affects the effect. If you don’t say it, it still won’t achieve very good results; therefore, if you want to make your verification code difficult for this person to recognize, you only need to do the following two points
1: Character adhesion, preferably all characters have adhesion parts;
2: Do not use specification characters. Use different proportions of scaling or rotation for each part of the verification code.
As long as these two points are achieved, or the deformation of these two points is achieved, it will be difficult for the recognition program to identify. Let's take a look at the verification codes of Yahoo and Google. They have white characters on a black background, but are difficult to recognize.

Goole:

yahoo:

Under the source file: click to download http://www.BkJia.com/uploadfile/2012/0316/20120316111107739.rar

Excerpted from ugg’s column