How Can I Efficiently Remove Diacritics from Unicode Strings in Java?

Barbara Streisand

Release: 2024-12-11 01:23:10

Remove Diacritic Marks from Unicode Characters

To strip diacritical marks (tilde, umlaut, and so on) from Unicode characters, decompose each character into its base letter plus combining marks, then delete the marks. Two variants of this approach follow.

Java Algorithm

In Java, normalize the string to NFD and remove the combining marks with a regular expression:

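import java.text.Normalizer;
import java.util.regex.Pattern;

// Matches combining diacritical marks, modifier letters (Lm), modifier symbols (Sk),
// and the Hebrew accents-and-points range U+0591 to U+05C7.
public static final Pattern DIACRITICS_AND_FRIENDS = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}\\u0591-\\u05C7]+");

private static String stripDiacritics(String str) {
    // NFD splits each accented character into a base letter followed by combining marks.
    str = Normalizer.normalize(str, Normalizer.Form.NFD);
    // Delete the marks, leaving only the base letters.
    str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    return str;
}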

Example:

stripDiacritics("Björn")  // returns "Bjorn"
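As a rough, self-contained illustration of the two steps above (decompose, then delete marks), the sketch below wraps the same pattern in a runnable class. The class name `StripDiacriticsDemo` and the extra sample inputs are assumptions made for this example; they are not part of the original answer. Inputs are written with Unicode escapes so the file compiles regardless of source encoding.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class StripDiacriticsDemo {
    // Same pattern as in the article; backslashes are doubled for the Java string literal.
    static final Pattern DIACRITICS_AND_FRIENDS = Pattern.compile(
            "[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}\\u0591-\\u05C7]+");

    static String stripDiacritics(String str) {
        // Decompose into base letters followed by combining marks.
        String decomposed = Normalizer.normalize(str, Normalizer.Form.NFD);
        // Delete the marks; base letters are left untouched.
        return DIACRITICS_AND_FRIENDS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        String bjorn = "Bj\u00f6rn";     // precomposed o with diaeresis
        String manana = "ma\u00f1ana";   // precomposed n with tilde
        String already = "Bjo\u0308rn";  // o followed by U+0308, already decomposed

        System.out.println(stripDiacritics(bjorn));   // Bjorn
        System.out.println(stripDiacritics(manana));  // manana
        System.out.println(stripDiacritics(already)); // Bjorn
        // NFD expands the 5-char input to 6 chars before the mark is removed.
        System.out.println(Normalizer.normalize(bjorn, Normalizer.Form.NFD).length()); // 6
    }
}
```

Normalizing first matters because a precomposed character such as U+00F6 is not itself a combining mark, so the pattern would not match it without the NFD step.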

Enhanced Algorithm

For a more comprehensive solution, add a second cleanup stage that handles special characters which are not diacritics and therefore survive the first stage.

import com.google.common.collect.ImmutableMap; // Guava

public static final char DEFAULT_REPLACE_CHAR = '-';
public static final String DEFAULT_REPLACE = String.valueOf(DEFAULT_REPLACE_CHAR);
// Maps non-diacritic special characters to their replacement strings.
private static final ImmutableMap<String, String> NONDIACRITICS = ImmutableMap.<String, String>builder()
        // ... [List of non-diacritic characters]
        .build();

public static String simplifiedString(String orig) {
    String str = orig;
    if (str == null) {
        return null;
    }
    str = stripDiacritics(str);
    str = stripNonDiacritics(str);
    if (str.length() == 0) {
        // ... 
    }
    return str.toLowerCase();
}

// ... [Continued implementation]

Applicability and Limitations

These algorithms remove diacritics effectively enough for search purposes. Special characters that are not a base letter plus a mark need additional handling, however: the "ł" in "Białystok" has no canonical decomposition, so NFD leaves it unchanged and the first stage cannot strip it. The enhanced algorithm attempts to replace such characters with their closest equivalent.
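To make the limitation concrete, here is a minimal sketch of what a second stage might look like. It is not the article's `stripNonDiacritics`, whose body and character list are elided above. The class name `NonDiacriticsDemo`, the map `SAMPLE_NONDIACRITICS`, and its four entries are hypothetical, and a plain `java.util.Map` stands in for Guava's `ImmutableMap` so the example needs no external dependency.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

final class NonDiacriticsDemo {
    static final Pattern DIACRITICS_AND_FRIENDS = Pattern.compile(
            "[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}\\u0591-\\u05C7]+");

    // Hypothetical sample entries; the article's full NONDIACRITICS list is not shown.
    static final Map<String, String> SAMPLE_NONDIACRITICS = new LinkedHashMap<>();
    static {
        SAMPLE_NONDIACRITICS.put("\u0142", "l");  // small l with stroke
        SAMPLE_NONDIACRITICS.put("\u0141", "L");  // capital L with stroke
        SAMPLE_NONDIACRITICS.put("\u00f8", "o");  // small o with stroke
        SAMPLE_NONDIACRITICS.put("\u00df", "ss"); // sharp s
    }

    static String stripDiacritics(String str) {
        String decomposed = Normalizer.normalize(str, Normalizer.Form.NFD);
        return DIACRITICS_AND_FRIENDS.matcher(decomposed).replaceAll("");
    }

    // Assumed shape of the second stage: look up each char and substitute if mapped.
    // Iterates by char, so it only covers characters in the Basic Multilingual Plane.
    static String stripNonDiacritics(String str) {
        StringBuilder out = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            String ch = String.valueOf(str.charAt(i));
            out.append(SAMPLE_NONDIACRITICS.getOrDefault(ch, ch));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String city = "Bia\u0142ystok"; // Polish city name containing l with stroke
        String stageOne = stripDiacritics(city);
        System.out.println(stageOne.equals(city));        // true: U+0142 survives stage one
        System.out.println(stripNonDiacritics(stageOne)); // Bialystok
    }
}
```

Which replacement counts as the "closest equivalent" is a judgment call (for example, mapping the sharp s to "ss"), so a real table would need to be tuned to the languages the search has to cover.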
