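Elasticsearch offers flexible methods for fuzzy matching of data, including emails and phone numbers. "Fuzzy" here means partial matching (substrings and prefixes) rather than the edit-distance matching of the fuzzy query. This article explores how to optimize performance for such queries using custom analyzers and token filters.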
Custom Analyzers for Fuzzy Matching
To efficiently fuzzy match emails and phone numbers, it's recommended to create custom analyzers in Elasticsearch. These analyzers consist of a tokenizer that prepares input data for analysis and a set of filters that execute specific transformations.
Email Analyzer
The index_email_analyzer uses the standard tokenizer to split the input into tokens. It then applies three filters: lowercase converts each token to lowercase, name_ngram_filter generates ngrams of 1 to 20 characters (the min_gram and max_gram values in the sample mapping), and trim strips leading and trailing whitespace from each token.
The search_email_analyzer also uses the standard tokenizer but applies only the lowercase and trim filters. The ngram filter is not needed at search time: the substrings are already in the index, so the search text only has to match one of them as a whole token.
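To make the index-time and search-time behaviour concrete, here is a minimal Python sketch that imitates the two email analyzers. It is not Elasticsearch's implementation: the helper names are hypothetical, the standard tokenizer is only roughly approximated by splitting on "@" and whitespace, and the ngram bounds (1 and 20) are taken from the sample mapping.

```python
import re

def ngrams(token, min_gram, max_gram):
    """Return every substring of token with length in [min_gram, max_gram]."""
    out = []
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(token):
                break
            out.append(token[start:start + size])
    return out

def rough_tokenize(text):
    # Assumption: crude stand-in for the standard tokenizer,
    # which splits a simple address at the '@' sign.
    return [t for t in re.split(r"[@\s]+", text) if t]

def index_email_tokens(email, min_gram=1, max_gram=20):
    # Imitates index_email_analyzer: tokenize, lowercase, then ngram.
    tokens = set()
    for tok in rough_tokenize(email):
        tokens.update(ngrams(tok.lower(), min_gram, max_gram))
    return tokens

def search_email_tokens(text):
    # Imitates search_email_analyzer: tokenize and lowercase, no ngrams.
    return [tok.lower() for tok in rough_tokenize(text)]

indexed = index_email_tokens("John@Gmail.com")
query = search_email_tokens("@gmail.com")
print(query)                                 # ['gmail.com']
print(all(tok in indexed for tok in query))  # True
```

Under these assumptions the search text is reduced to the single token gmail.com, which is one of the ngrams stored for the address at index time.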
Phone Analyzer
For phone numbers, the index_phone_analyzer first applies the digit_only char filter, which removes every non-digit character so that only the digits are analyzed. The digit_edge_ngram_tokenizer then emits the prefixes of the remaining digit string, from 1 to 15 characters long. This allows any prefix of a phone number to be matched.
The search_phone_analyzer applies the same digit_only char filter and then the keyword tokenizer, which emits the whole input as a single token. That token is compared against the indexed prefixes, so a search matches every phone number that starts with the given digits.
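The phone analyzers can be imitated in the same way. The following Python sketch is an illustration only, with hypothetical helper names; it assumes the 1 to 15 edge ngram range and the \D+ pattern from the sample mapping.

```python
import re

def digit_only(text):
    # Imitates the pattern_replace char filter: drop every run of non-digits.
    return re.sub(r"\D+", "", text)

def edge_ngrams(token, min_gram=1, max_gram=15):
    # Prefixes of the token, from min_gram up to max_gram characters.
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]

def index_phone_tokens(phone):
    # Imitates index_phone_analyzer: strip non-digits, then emit prefixes.
    return edge_ngrams(digit_only(phone))

def search_phone_token(text):
    # Imitates search_phone_analyzer: strip non-digits, keep one whole token.
    return digit_only(text)

indexed = index_phone_tokens("136-1234-5678")
print(indexed[:3])                            # ['1', '13', '136']
print(search_phone_token("136") in indexed)   # True
print(search_phone_token("1234") in indexed)  # False
```

Only prefixes are indexed, so "136" matches while "1234", which appears in the middle of the number, does not.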
Implementing the Analyzers
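Here's a sample mapping that incorporates these custom analyzers. The syntax targets older Elasticsearch releases; newer versions use the text field type in place of string and no longer accept a mapping type name such as your_type.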
PUT myindex
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_url_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": [ "trim" ]
        },
        "index_phone_analyzer": {
          "type": "custom",
          "char_filter": [ "digit_only" ],
          "tokenizer": "digit_edge_ngram_tokenizer",
          "filter": [ "trim" ]
        },
        "search_phone_analyzer": {
          "type": "custom",
          "char_filter": [ "digit_only" ],
          "tokenizer": "keyword",
          "filter": [ "trim" ]
        },
        "index_email_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "name_ngram_filter", "trim" ]
        },
        "search_email_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "lowercase", "trim" ]
        }
      },
      "char_filter": {
        "digit_only": {
          "type": "pattern_replace",
          "pattern": "\\D+",
          "replacement": ""
        }
      },
      "tokenizer": {
        "digit_edge_ngram_tokenizer": {
          "type": "edgeNGram",
          "min_gram": "1",
          "max_gram": "15",
          "token_chars": [ "digit" ]
        }
      },
      "filter": {
        "name_ngram_filter": {
          "type": "ngram",
          "min_gram": "1",
          "max_gram": "20"
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "email": {
          "type": "string",
          "analyzer": "index_email_analyzer",
          "search_analyzer": "search_email_analyzer"
        },
        "phone": {
          "type": "string",
          "analyzer": "index_phone_analyzer",
          "search_analyzer": "search_phone_analyzer"
        }
      }
    }
  }
}
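One detail worth checking in a mapping like this is the backslash in the digit_only pattern: inside a JSON string, the regular expression \D+ has to be written with a doubled backslash, because a lone \D is not a valid JSON escape. The short Python sketch below illustrates this with the standard json module; the two sample strings are made up for the demonstration.

```python
import json
import re

# A single backslash before D is rejected by a strict JSON parser.
bad = r'{"pattern": "\D+"}'
# Doubling the backslash yields the regular expression \D+ after parsing.
good = r'{"pattern": "\\D+"}'

try:
    json.loads(bad)
    bad_ok = True
except json.JSONDecodeError:
    bad_ok = False

pattern = json.loads(good)["pattern"]
print(bad_ok)                                    # False
print(pattern)                                   # \D+
print(re.sub(pattern, "", "+86 136-1234-5678"))  # 8613612345678
```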
Performing Fuzzy Queries
To match emails ending with "@gmail.com" or phone numbers starting with "136", you can issue queries like:
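POST myindex/_search
{
  "query": {
    "match": {
      "email": "@gmail.com"
    }
  }
}

POST myindex/_search
{
  "query": {
    "match": {
      "phone": "136"
    }
  }
}

The match query runs the search text through each field's search_analyzer and compares the resulting token with the ngrams produced at index time. A term query, by contrast, skips analysis, so "@gmail.com" would be looked up verbatim and would not match, because the standard tokenizer drops the "@" when indexing. Since name_ngram_filter indexes every substring, the email query matches any address that contains "gmail.com", not only those that end with it.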
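If the requests are assembled in application code, the bodies can be built as plain dictionaries. The sketch below is a minimal Python illustration with a hypothetical helper name; it assumes a match query so that the field's search analyzer is applied, and it only builds the bodies without contacting a cluster.

```python
import json

def build_match_search(field, text):
    """Return a search request body that runs a match query on one field."""
    return {"query": {"match": {field: text}}}

email_body = build_match_search("email", "@gmail.com")
phone_body = build_match_search("phone", "136")

# Each body would be sent as the payload of POST myindex/_search.
print(json.dumps(email_body))
print(json.dumps(phone_body))
```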