To strip diacritical marks (tilde, umlaut, and so on) from Unicode text, decompose each character into its base letter plus combining marks, then delete the marks.
In Java, this can be done with the following code:
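import java.text.Normalizer;
import java.util.regex.Pattern;

// Combining marks, modifier letters (Lm), modifier symbols (Sk), Hebrew points
public static final Pattern DIACRITICS_AND_FRIENDS =
    Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}\\u0591-\\u05C7]+");

private static String stripDiacritics(String str) {
    // NFD splits an accented character into base letter + combining marks
    str = Normalizer.normalize(str, Normalizer.Form.NFD);
    // Drop the marks and keep the base letters
    str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    return str;
}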
Example:
stripDiacritics("Björn") = Bjorn
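To make the role of the normalization step visible, here is a minimal sketch that shows the string before and after decomposition. The class name `NfdDemo` and its helper methods are illustrative, and `\p{M}` (any combining mark) is used as a simpler stand-in for the pattern above, so this is an assumption rather than the exact code from the answer.

```java
import java.text.Normalizer;

class NfdDemo {
    // Decompose precomposed characters into base letter + combining marks.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Assumption: \p{M} (all combining marks) replaces the article's longer pattern.
    static String strip(String s) {
        return decompose(s).replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        String composed = "Bj\u00F6rn";            // o with diaeresis as one code point
        String decomposed = decompose(composed);   // 'o' followed by U+0308

        System.out.println(composed.length());     // 5
        System.out.println(decomposed.length());   // 6
        System.out.println(strip(composed));       // Bjorn
    }
}
```

Without the NFD step the regex has nothing to match, because the precomposed character U+00F6 is a letter, not a combining mark.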
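For a more thorough cleanup, add a second stage that handles special characters which are not diacritics and therefore survive normalization. The excerpt below is abridged; ImmutableMap is Guava's com.google.common.collect.ImmutableMap.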
public static final char DEFAULT_REPLACE_CHAR = '-';
public static final String DEFAULT_REPLACE = String.valueOf(DEFAULT_REPLACE_CHAR);

private static final ImmutableMap<String, String> NONDIACRITICS = ImmutableMap.<String, String>builder()
    // ... [List of non-diacritic characters]

public static String simplifiedString(String orig) {
    String str = orig;
    if (str == null) {
        return null;
    }
    str = stripDiacritics(str);
    str = stripNonDiacritics(str);
    if (str.length() == 0) {
        // ...
    }
    return str.toLowerCase();
}
// ... [Continued implementation]
These algorithms remove diacritics well enough for search purposes. However, some special characters are not diacritics at all: the "ł" in "Białystok", for example, has no Unicode decomposition, so NFD normalization leaves it unchanged. The enhanced algorithm handles such characters by replacing each one with its closest equivalent.
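The second stage can be sketched as a simple table lookup. The version below is an illustration under stated assumptions, not the original implementation: the class name `SimplifiedStringSketch` is made up, the six map entries are hypothetical samples (the real list is elided in the excerpt), and it uses `java.util.Map` keyed by `Character` instead of Guava's `ImmutableMap<String, String>`. It also skips the empty-string fallback, which the excerpt does not show.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.Map;

class SimplifiedStringSketch {
    // Hypothetical sample entries only; the article's actual list is not shown.
    private static final Map<Character, String> NONDIACRITICS = Map.of(
        '\u0142', "l",   // l with stroke
        '\u0141', "L",   // L with stroke
        '\u00DF', "ss",  // sharp s
        '\u00E6', "ae",  // ae ligature
        '\u00F8', "o",   // o with stroke
        '\u0111', "d");  // d with stroke

    // Stage 1: decompose, then remove combining marks (\p{M} is a simplification).
    static String stripDiacritics(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");
    }

    // Stage 2: replace characters that have no decomposition via the lookup table.
    static String stripNonDiacritics(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            String replacement = NONDIACRITICS.get(c);
            if (replacement != null) {
                out.append(replacement);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    static String simplifiedString(String orig) {
        if (orig == null) {
            return null;
        }
        return stripNonDiacritics(stripDiacritics(orig)).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(simplifiedString("Bia\u0142ystok")); // bialystok
        System.out.println(simplifiedString("Bj\u00F6rn"));     // bjorn
    }
}
```

Passing `Locale.ROOT` to `toLowerCase` is a deliberate choice in this sketch: it keeps the result independent of the JVM's default locale, which matters for search keys (for instance, under a Turkish locale "I" would otherwise lowercase to a dotless "ı").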