Finding Similar Strings Efficiently in PostgreSQL
Intro: Finding similar strings in a large table is slow with conventional methods, because comparing every string against every other one means the number of comparisons grows quadratically with the row count. This article shows how PostgreSQL's pg_trgm module can speed up the search.
Using SET pg_trgm.similarity_threshold and the % Operator:
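A self-join that filters directly on the result of similarity() has to compute the similarity of every pair of rows, and that filter cannot use an index. A more efficient approach is to set the pg_trgm.similarity_threshold configuration parameter (available since PostgreSQL 9.6; older versions use set_limit() instead) and filter with the % operator, which returns true when the similarity of its two operands is greater than that threshold: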
SET pg_trgm.similarity_threshold = 0.8;
SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   names n1
JOIN   names n2 ON n1.name <> n2.name
               AND n1.name % n2.name
ORDER  BY sim DESC;
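Unlike a filter on similarity(), the % operator can be supported by a trigram GiST index on names (name), which significantly speeds up the search. The index is not created automatically; it has to exist on the column for the planner to use it.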
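The setup that the query relies on can be sketched as follows. This is a minimal example, assuming the names table and name column from the query above; the index name names_name_trgm_idx is hypothetical.

```sql
-- Install the module once per database (needs sufficient privileges).
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram GiST index on the compared column.
-- The index name is a placeholder chosen for this example.
CREATE INDEX names_name_trgm_idx ON names USING gist (name gist_trgm_ops);

-- Inspect the plan to check whether the planner actually uses the index.
SET pg_trgm.similarity_threshold = 0.8;
EXPLAIN
SELECT n1.name, n2.name
FROM   names n1
JOIN   names n2 ON n1.name <> n2.name
               AND n1.name % n2.name;
```

pg_trgm also provides the gin_trgm_ops operator class for GIN indexes. Which index type performs better depends on the data and workload, so it is worth comparing plans on your own table.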
Utilizing Functional Indexes:
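To further improve performance, consider a functional (expression) index that prefilters candidate pairs before the self-join. This reduces the number of similarity calculations, at the cost of skipping pairs that fail the prefilter: with the first-character filter used here, two names that differ in their first character are never compared. The function and index are defined as follows, and the query after them uses the prefilter: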
-- IMMUTABLE is required for a function used in an index expression.
CREATE FUNCTION first_char(text) RETURNS text AS $$ SELECT substring($1, 1, 1); $$ LANGUAGE SQL IMMUTABLE;
CREATE INDEX first_char_idx ON names (first_char(name));
SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   names n1
JOIN   names n2 ON first_char(n1.name) = first_char(n2.name)
               AND n1.name <> n2.name
ORDER  BY sim DESC;
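As written, the query above returns every pair of names that shares a first character, and it returns each pair twice, once as (a, b) and once as (b, a). One possible variant, assuming the first_char function and the threshold setting from earlier in the article, adds the % filter and replaces <> with < so that each pair is reported once:

```sql
SET pg_trgm.similarity_threshold = 0.8;

SELECT similarity(n1.name, n2.name) AS sim, n1.name, n2.name
FROM   names n1
JOIN   names n2 ON first_char(n1.name) = first_char(n2.name)  -- cheap prefilter
               AND n1.name < n2.name                           -- each unordered pair once
               AND n1.name % n2.name                           -- keep pairs above the threshold
ORDER  BY sim DESC;
```

This is a sketch rather than a tuned query. Whether the planner uses first_char_idx, a trigram index, or neither depends on the table statistics, so check the plan with EXPLAIN before relying on it.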
Conclusion:
By combining the pg_trgm module, the pg_trgm.similarity_threshold setting, the % operator backed by a trigram index, and functional indexes for prefiltering, you can substantially speed up the search for similar strings in PostgreSQL, even on large datasets.