How Can I Efficiently Compute String Similarity Percentages in MySQL?

Patricia Arquette

Release: 2024-11-30 03:57:12

Computing String Similarity in MySQL

Measuring how similar two strings are is a common task in natural language processing and data analysis. MySQL has no built-in LEVENSHTEIN function, but the algorithm can be implemented as a stored function and then used to compute a similarity percentage.

LEVENSHTEIN measures the edit distance between two strings, which is the minimum number of insertions, deletions, or substitutions required to transform one string into the other. A lower Levenshtein distance indicates higher similarity between the strings.
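
To make the definition concrete, here is a small sketch using the classic pair 'kitten' and 'sitting'. The distance of 3 is counted by hand in the comments, and the query relies only on built-in functions to show how a raw distance can be scaled by the length of the longer string. The sample words are an illustrative choice, not part of the original article.

```sql
-- Hand-counted edits turning 'kitten' into 'sitting' (distance = 3):
--   kitten -> sitten    substitute k with s
--   sitten -> sittin    substitute e with i
--   sittin -> sitting   insert g at the end
SELECT CHAR_LENGTH('kitten')  AS len_a,   -- 6
       CHAR_LENGTH('sitting') AS len_b,   -- 7
       ROUND((1 - 3 / GREATEST(CHAR_LENGTH('kitten'),
                               CHAR_LENGTH('sitting'))) * 100) AS similarity_pct;  -- 57
```

Three edits against a longer string of seven characters leave roughly 57% of the characters in place, which is why the raw distance on its own is hard to compare across strings of different lengths.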

To turn the distance into a similarity percentage for two strings, @a and @b, divide it by the length of the longer string and subtract the result from 1:

similarity = (1 - levenshtein(@a, @b) / max(char_length(@a), char_length(@b))) * 100

Identical strings score 100, and the score falls toward 0 as the distance approaches the length of the longer string.

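The following two stored functions compute the Levenshtein distance and a percentage derived from it. In the mysql command-line client, set a custom delimiter (for example DELIMITER //) before creating them, because the function bodies contain semicolons.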

CREATE FUNCTION `levenshtein`(s1 TEXT, s2 TEXT) RETURNS INT
    DETERMINISTIC
BEGIN
    -- cv1 holds the previous row of the distance matrix and cv0 the row being
    -- built, one byte per cell, so inputs are limited to 255 characters.
    -- VARBINARY keeps these buffers byte-oriented regardless of the default character set.
    DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
    DECLARE s1_char CHAR;
    DECLARE cv0, cv1 VARBINARY(256);
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
    IF s1 = s2 THEN
      RETURN 0;
    ELSEIF s1_len = 0 THEN
      RETURN s2_len;
    ELSEIF s2_len = 0 THEN
      RETURN s1_len;
    ELSE
      -- Initialise the first row: transforming an empty prefix of s1 into s2[1..j] costs j.
      WHILE j <= s2_len DO
        SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
      END WHILE;
      WHILE i <= s1_len DO
        SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
        WHILE j <= s2_len DO
          -- Start from the cell to the left plus one edit.
          SET c = c + 1;
          IF s1_char = SUBSTRING(s2, j, 1) THEN
            SET cost = 0;
          ELSE
            SET cost = 1;
          END IF;
          -- Diagonal cell plus the substitution cost.
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
          IF c > c_temp THEN
            SET c = c_temp;
          END IF;
          -- Cell above plus one edit.
          SET c_temp = CONV(HEX(SUBSTRING(cv1, j + 1, 1)), 16, 10) + 1;
          IF c > c_temp THEN
            SET c = c_temp;
          END IF;
          SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
        END WHILE;
        SET cv1 = cv0, i = i + 1;
      END WHILE;
    END IF;
    RETURN c;
END

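CREATE FUNCTION `levenshtein_ratio`(s1 TEXT, s2 TEXT) RETURNS INT
    DETERMINISTIC
BEGIN
    DECLARE s1_len, s2_len, max_len INT;
    -- Count characters, not bytes, to match the units used by levenshtein().
    SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2);
    IF s1_len > s2_len THEN
      SET max_len = s1_len;
    ELSE
      SET max_len = s2_len;
    END IF;
    -- Two empty strings are identical; return early to avoid dividing by zero.
    IF max_len = 0 THEN
      RETURN 100;
    END IF;
    RETURN ROUND((1 - levenshtein(s1, s2) / max_len) * 100);
END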
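
Because the edit distance is counted in characters, the divisor in the ratio should be a character count as well. The short query below uses only built-in functions to illustrate how LENGTH (bytes) and CHAR_LENGTH (characters) diverge on multi-byte text. The sample word is an arbitrary choice, and the byte count assumes a utf8mb4 connection.

```sql
-- LENGTH counts bytes, CHAR_LENGTH counts characters.
SELECT LENGTH('café')      AS byte_len,   -- 5 on a utf8mb4 connection (é takes 2 bytes)
       CHAR_LENGTH('café') AS char_len;   -- 4
```

With a distance of 1, dividing by 5 instead of 4 would report 80 rather than 75, so a byte-based divisor overstates the similarity of strings that contain multi-byte characters.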

Using these functions, we can calculate the similarity between @a and @b as follows:

SELECT LEVENSHTEIN_RATIO(@a, @b);

This will return the similarity percentage as an integer value between 0 and 100, where 100 indicates complete similarity.
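
As a rough sketch of how the two functions might be used against real rows, the example below first checks a known pair and then ranks a small table by similarity to a search term. It assumes both functions from this article have already been created. The table demo_names, its sample rows, the variable @q, and the length threshold of 2 are all hypothetical choices made for illustration.

```sql
-- Sanity check on a known pair: distance 3, similarity 57.
SELECT levenshtein('kitten', 'sitting')       AS distance,
       levenshtein_ratio('kitten', 'sitting') AS similarity;

-- Hypothetical sample data.
CREATE TEMPORARY TABLE demo_names (name VARCHAR(50));
INSERT INTO demo_names VALUES ('kitten'), ('sitting'), ('mitten'), ('catalogue');

SET @q = 'kitten';

SELECT name,
       levenshtein_ratio(name, @q) AS similarity
FROM demo_names
-- Cheap pre-filter: the edit distance is never smaller than the difference
-- in length, so rows that differ by more than 2 characters in length
-- cannot be within 2 edits of @q and are skipped before the function runs.
WHERE ABS(CHAR_LENGTH(name) - CHAR_LENGTH(@q)) <= 2
ORDER BY similarity DESC;
```

A stored function like this is evaluated once per candidate row and cannot use an index, so on larger tables a cheap filter such as the length check above is one way to keep the number of function calls down. Note also that the character comparisons inside levenshtein follow the collation in effect, so under a case-insensitive collation 'Kitten' and 'kitten' count as identical.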
