Computing String Similarity in MySQL
Calculating the similarity between two strings is a common task in natural language processing and data analysis. MySQL does not ship a built-in Levenshtein function, but one can be defined as a stored function, LEVENSHTEIN, and used to compute string similarity directly in SQL.
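LEVENSHTEIN measures the edit distance between two strings, which is the minimum number of single-character insertions, deletions, or substitutions required to transform one string into the other. A lower Levenshtein distance indicates higher similarity between the strings. For example, turning "kitten" into "sitting" takes three edits (substitute k with s, substitute e with i, append g), so the distance is 3.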
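One way to express the similarity between two strings, @a and @b, as a percentage is to compare the words they contain:
similarity = count(words shared by @a and @b) / (count(words in @a) + count(words in @b) - count(intersection))
where the intersection is the number of words that appear in both @a and @b. When each string's words are counted without repeats, the denominator is the number of distinct words across both strings, so this ratio is the Jaccard index of the two word sets.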
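The word-based formula can be sketched directly in SQL. The query below is only an illustration under several assumptions: MySQL 8.0 or later (for common table expressions and JSON_TABLE), words separated by single spaces, and no double quotes or backslashes in the input. The sample strings and the names a_words, b_words, counts, and word_similarity_pct are hypothetical.

```sql
SET @a = 'the quick brown fox';
SET @b = 'the lazy brown dog';

WITH
  -- Split each string into distinct words by rewriting it as a JSON array.
  a_words AS (
    SELECT DISTINCT jt.word
    FROM JSON_TABLE(
           CONCAT('["', REPLACE(@a, ' ', '","'), '"]'),
           '$[*]' COLUMNS (word VARCHAR(255) PATH '$')
         ) AS jt
  ),
  b_words AS (
    SELECT DISTINCT jt.word
    FROM JSON_TABLE(
           CONCAT('["', REPLACE(@b, ' ', '","'), '"]'),
           '$[*]' COLUMNS (word VARCHAR(255) PATH '$')
         ) AS jt
  ),
  -- Word counts for each string and for their intersection.
  counts AS (
    SELECT
      (SELECT COUNT(*) FROM a_words) AS n_a,
      (SELECT COUNT(*) FROM b_words) AS n_b,
      (SELECT COUNT(*) FROM a_words JOIN b_words USING (word)) AS n_both
  )
SELECT ROUND(100 * n_both / (n_a + n_b - n_both)) AS word_similarity_pct
FROM counts;
```

For these sample strings the shared words are "the" and "brown", so the formula works out to 2 / (4 + 4 - 2), or about 33%.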
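The functions below take a simpler, character-level approach rather than counting words: they compute the Levenshtein distance and scale it by the length of the longer string, similarity = (1 - distance / max(length of @a, length of @b)) × 100: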
DELIMITER $$

-- Edit distance between s1 and s2. Each matrix cell is stored as one byte,
-- so inputs are limited to 255 characters.
CREATE FUNCTION `levenshtein`(s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
  DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
  DECLARE s1_char CHAR;
  -- cv1 holds the previous row of the distance matrix, cv0 the row being built
  DECLARE cv0, cv1 VARBINARY(256);
  SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
  IF s1 = s2 THEN
    RETURN 0;
  ELSEIF s1_len = 0 THEN
    RETURN s2_len;
  ELSEIF s2_len = 0 THEN
    RETURN s1_len;
  ELSE
    -- First row: 0, 1, ..., s2_len
    WHILE j <= s2_len DO
      SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
    END WHILE;
    WHILE i <= s1_len DO
      SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
      WHILE j <= s2_len DO
        SET c = c + 1;  -- insertion
        IF s1_char = SUBSTRING(s2, j, 1) THEN SET cost = 0; ELSE SET cost = 1; END IF;
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;  -- match or substitution
        IF c > c_temp THEN SET c = c_temp; END IF;
        SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;  -- deletion
        IF c > c_temp THEN SET c = c_temp; END IF;
        SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
      END WHILE;
      SET cv1 = cv0, i = i + 1;
    END WHILE;
  END IF;
  RETURN c;
END$$

-- Similarity as a percentage: (1 - distance / length of the longer string) * 100
CREATE FUNCTION `levenshtein_ratio`(s1 VARCHAR(255), s2 VARCHAR(255))
RETURNS INT
DETERMINISTIC
BEGIN
  DECLARE s1_len, s2_len, max_len INT;
  -- CHAR_LENGTH (characters) rather than LENGTH (bytes), to match levenshtein()
  SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2);
  IF s1_len > s2_len THEN SET max_len = s1_len; ELSE SET max_len = s2_len; END IF;
  -- Two empty strings are identical; avoid dividing by zero
  IF max_len = 0 THEN RETURN 100; END IF;
  RETURN ROUND((1 - levenshtein(s1, s2) / max_len) * 100);
END$$

DELIMITER ;
Using these functions, we can calculate the similarity between @a and @b as follows:
SELECT LEVENSHTEIN_RATIO(@a, @b);
This returns the similarity percentage as an integer between 0 and 100. Identical strings score 100, and the score falls toward 0 as more edits are needed relative to the length of the longer string.
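As a quick check, the functions can be called on a pair of short strings. This is a minimal usage sketch that assumes both functions have already been created. The customers table and its name column in the second query are hypothetical and only illustrate how the ratio might be used for fuzzy lookups.

```sql
SET @a = 'kitten', @b = 'sitting';

SELECT levenshtein(@a, @b)       AS distance,    -- 3 edits
       levenshtein_ratio(@a, @b) AS similarity;  -- ROUND((1 - 3/7) * 100) = 57

-- Hypothetical table and column: rank stored names by similarity to a search term.
SELECT name,
       levenshtein_ratio(name, 'Jon Smith') AS similarity
FROM customers
WHERE levenshtein_ratio(name, 'Jon Smith') >= 80
ORDER BY similarity DESC;
```

Because the function is evaluated for every row, a query like the second one cannot use an index on name and scans the whole table, so it is best kept to small tables or to rows already narrowed down by another condition.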