
How Can I Efficiently Remove Diacritics from Unicode Strings in Java?

Barbara Streisand
Release: 2024-12-11 01:23:10


Remove Diacritic Marks from Unicode Characters

To strip diacritical marks (tilde, umlaut, and so on) from Unicode text, use one of the two approaches below:

Java Algorithm

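The basic approach first applies NFD normalization, which splits each accented character into a base character followed by combining marks, and then deletes the marks with a regular expression: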

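import java.text.Normalizer;
import java.util.regex.Pattern;

// Combining marks, modifier letters (Lm), modifier symbols (Sk), and Hebrew accents and points (U+0591 to U+05C7).
public static final Pattern DIACRITICS_AND_FRIENDS = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}\\u0591-\\u05C7]+");

private static String stripDiacritics(String str) {
    // NFD splits each accented character into a base character plus combining marks.
    str = Normalizer.normalize(str, Normalizer.Form.NFD);
    // Remove the marks, leaving only the base characters.
    str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
    return str;
}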

Example:

stripDiacritics("Björn")  // returns "Bjorn"
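To try the basic algorithm end to end, the following is a minimal, self-contained sketch. The class name `DiacriticsDemo` and the sample inputs are assumptions made for illustration; the pattern and the method body follow the code above. Non-ASCII characters are written as `\u` escapes so the file compiles regardless of source encoding.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical wrapper class, introduced only so the snippet can be compiled and run.
final class DiacriticsDemo {
    static final Pattern DIACRITICS_AND_FRIENDS = Pattern.compile(
            "[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}\\u0591-\\u05C7]+");

    static String stripDiacritics(String str) {
        // Step 1: decompose, e.g. o-umlaut (U+00F6) becomes 'o' + U+0308.
        String decomposed = Normalizer.normalize(str, Normalizer.Form.NFD);
        // Step 2: delete the combining marks that step 1 exposed.
        return DIACRITICS_AND_FRIENDS.matcher(decomposed).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripDiacritics("Bj\u00F6rn"));                  // prints "Bjorn"
        System.out.println(stripDiacritics("S\u00E3o Paulo"));              // prints "Sao Paulo"
        System.out.println(stripDiacritics("cr\u00E8me br\u00FBl\u00E9e")); // prints "creme brulee"
    }
}
```

Note that the pattern also matches the modifier-symbol category (Sk), which includes plain ASCII characters such as `^` and the backtick, so those are removed as well.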

Enhanced Algorithm

For a more comprehensive solution, include a second cleanup stage to handle non-diacritic special characters.

// ImmutableMap comes from Guava (com.google.common.collect.ImmutableMap).
public static final char DEFAULT_REPLACE_CHAR = '-';
public static final String DEFAULT_REPLACE = String.valueOf(DEFAULT_REPLACE_CHAR);
private static final ImmutableMap<String, String> NONDIACRITICS = ImmutableMap.<String, String>builder()
        // ... [List of non-diacritic characters]
        .build();

public static String simplifiedString(String orig) {
    String str = orig;
    if (str == null) {
        return null;
    }
    str = stripDiacritics(str);
    str = stripNonDiacritics(str);
    if (str.length() == 0) {
        // ... 
    }
    return str.toLowerCase();
}

// ... [Continued implementation]
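The article elides both the contents of `NONDIACRITICS` and the body of `stripNonDiacritics`, so the following is only a rough sketch of what such a second stage might look like, not the original implementation. The class name `NonDiacriticsSketch`, the map `SAMPLE_NONDIACRITICS`, and its six entries are hypothetical, and it uses `java.util.Map` with `Character` keys instead of Guava's `ImmutableMap<String, String>` to stay dependency-free.

```java
import java.util.Map;

// Hypothetical sketch of a second cleanup stage; not the article's elided code.
final class NonDiacriticsSketch {
    // Illustrative sample entries only. The real NONDIACRITICS list is elided in the article.
    private static final Map<Character, String> SAMPLE_NONDIACRITICS = Map.of(
            '\u0142', "l",   // l with stroke
            '\u0141', "L",   // L with stroke
            '\u00DF', "ss",  // sharp s
            '\u00E6', "ae",  // ae ligature
            '\u00F8', "o",   // o with stroke
            '\u0111', "d");  // d with stroke

    static String stripNonDiacritics(String str) {
        StringBuilder out = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            String replacement = SAMPLE_NONDIACRITICS.get(c);
            // Keep characters that have no mapping unchanged.
            out.append(replacement != null ? replacement : String.valueOf(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripNonDiacritics("Bia\u0142ystok")); // prints "Bialystok"
        System.out.println(stripNonDiacritics("Stra\u00DFe"));    // prints "Strasse"
    }
}
```

A lookup table is used here because these characters have no canonical decomposition, so each replacement has to be listed explicitly.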

Applicability and Limitations

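These algorithms remove diacritics well enough for search purposes. However, some special characters are not diacritics at all: the "ł" in "Białegostok", for example, has no canonical decomposition, so NFD normalization leaves it unchanged and the regular expression finds no combining mark to remove. The enhanced algorithm handles such characters in its second stage by replacing each one with its closest equivalent.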
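The difference between the two kinds of character can be checked by listing the code points that NFD produces. The helper below is a small sketch written for this purpose; the class name `DecompositionCheck` and its methods are assumptions, not part of the original code.

```java
import java.text.Normalizer;

// Hypothetical helper that shows what NFD does to a given string.
final class DecompositionCheck {
    // Formats each code point of s as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    // Code points of s after NFD normalization.
    static String nfdCodePoints(String s) {
        return codePoints(Normalizer.normalize(s, Normalizer.Form.NFD));
    }

    public static void main(String[] args) {
        // o-umlaut decomposes into 'o' plus a combining diaeresis.
        System.out.println(nfdCodePoints("\u00F6")); // prints "U+006F U+0308"
        // l with stroke has no decomposition and stays a single code point.
        System.out.println(nfdCodePoints("\u0142")); // prints "U+0142"
    }
}
```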


source:php.cn