PHP has a function similar_text() that calculates the similarity of two strings. It can get a percentage to express the similarity of the two strings. The effect is as follows:
similar_text('aaaa', 'aaaa', $percent);
var_dump($percent);
//float(100)
similar_text('aaaa', 'aaaabbbb', $percent );
var_dump($percent);
//float(66.666666666667)
similar_text('abcdef', 'aabcdefg', $percent);
var_dump($percent);
/ /float(85.714285714286)
Using this function, you can use it for fuzzy search or other functions that require fuzzy matching. Recently, I have involved this function in the feature matching step in the research on verification code recognition.
But what kind of algorithm does this function use? I studied his underlying implementation and summarized it in three steps:
(1) Find the longest segment with the same part in the two strings;
(2) Use the same method to find the longest segment with the same part in the remaining two segments. Analogize until there are no identical parts;
(3) Similarity = sum of the lengths of all identical parts * 2 / sum of the lengths of the two strings;
The source code version I studied is PHP 5.4.6, the relevant code is located in lines 2951~3031 of the file php-5.4.6/ext/standard/string.c. The following is the source code after I added comments.
//Find the longest segment with the same part in the two strings
static void php_similar_str(const char *txt1, int len1, const char *txt2, int len2, int *pos1, int * pos2, int *max)
{
char *p, *q;
char *end1 = (char *) txt1 + len1;
char *end2 = (char *) txt2 + len2;
int l;
*max = 0;
//Start traversing based on the first string
for (p = (char *) txt1; p < end1; p++) {
//Traverse the second string
for (q = (char *) txt2; q < end2; q++) {
//Find the same character, continue to loop, l is the length of the same part
for (l = 0; (p + l < end1) && (q + l < end2) && (p[l] == q[l]); l++);
//Bubble method to find the longest l and remember the starting position of the same part
if (l > *max) {
*max = l;
*pos1 = p - txt1;
*pos2 = q - txt2;
}
}
}
}
//Calculate the total length of the same part of the two strings
static int php_similar_char(const char *txt1, int len1, const char *txt2, int len2)
{
int sum;
int pos1, pos2, max;
//Find out The longest segment of the same part of the two strings
php_similar_str(txt1, len1, txt2, len2, &pos1, &pos2, &max);
//This is the initial assignment to sum and the judgment of the max value
//If max is zero, it means that the two strings do not have any same characters, and it will jump out if
if ((sum = max)) {
//Recurse for the first half, the same segment length Accumulation
if (pos1 && pos2) {
sum += php_similar_char(txt1, pos1,
txt2, pos2);
}
//Recurse for the second half, the same segment length is accumulated
if ((pos1 + max < len1) && (pos2 + max < len2)) {
sum += php_similar_char(txt1 + pos1 + max, len1 - pos1 - max,
txt2 + pos2 + max, len2 - pos2 - max);
}
}
return sum;
}
//PHP function definition
PHP_FUNCTION(similar_text)
{
char *t1, *t2;
zval **percent = NULL;
int ac = ZEND_NUM_ARGS();
int sim;
int t1_len, t2_len;
//Check parameter validity
if (zend_parse_parameters(ZEND_NUM_ARGS() TSRMLS_CC, "ss|Z", &t1, &t1_len, &t2, &t2_len, &percent) == FAILURE) {
return;
}
//If there is a third parameter
if (ac > 2) {
convert_to_double_ex(percent);
}
//If two characters The string length is 0, and 0 is returned
if (t1_len + t2_len == 0) {
if (ac > 2) {
Z_DVAL_PP(percent) = 0;
}
RETURN_LONG(0);
}
//Call the above function to calculate the similarity of two strings
sim = php_similar_char(t1, t1_len, t2, t2_len);
//You can see the calculation formula of the third parameter percent
if (ac > 2) {
Z_DVAL_PP(percent) = sim * 200.0 / (t1_len + t2_len);
}
RETURN_LONG(sim);
}
In addition, PHP also provides another function levenshtein() for calculating string similarity. It expresses string similarity by calculating the edit distance of two strings. This is also a very common algorithm. The performance of levenshtein() is better than similar_text(), because from the previous code analysis, we can see that the complexity of similar_text() is O(n^3), n represents the length of the longest string, and levenshtein() The complexity is O(m*n), m and n are the lengths of the two strings respectively.