Removing Diacritical Marks from Unicode Characters
Diacritical marks, such as the tilde, umlaut, and circumflex, change how a character is written and pronounced. For search and comparison it is often useful to strip these marks so that, for example, "Björn" and "Bjorn" match. Here's how to remove diacritical marks from Unicode strings in Java:
Using Normalization Form NFD and Regular Expressions
Normalizer.normalize(str, Normalizer.Form.NFD) from java.text decomposes each precomposed character into its base character followed by separate combining marks. Once the string is decomposed, a regular expression that matches combining diacritical marks can strip them out, leaving only the base characters.
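import java.text.Normalizer;
import java.util.regex.Pattern;

public class DiacriticRemover {
    // Matches the Combining Diacritical Marks block (U+0300 to U+036F).
    public static final Pattern DIACRITICS_PATTERN =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    public static String removeDiacritics(String str) {
        // Decompose first so each accent becomes a separate combining character.
        String decomposed = Normalizer.normalize(str, Normalizer.Form.NFD);
        return DIACRITICS_PATTERN.matcher(decomposed).replaceAll("");
    }
}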
Sample Usage:
String withDiacritics = "Björń";
String withoutDiacritics = DiacriticRemover.removeDiacritics(withDiacritics);
System.out.println(withoutDiacritics); // Output: Bjorn
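To see why the normalization step matters, it can help to look at the code points NFD produces. The following is a minimal sketch; the class name NfdDemo and the helper describe are hypothetical names chosen for illustration. It prints the decomposed form of a few characters, showing that ö and ń split into a base letter plus a combining mark, while ł has no canonical decomposition and is left unchanged.

```java
import java.text.Normalizer;

// Hypothetical demo class (not part of the original article).
class NfdDemo {
    // Returns the NFD form of the input as space-separated U+XXXX code points.
    static String describe(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        decomposed.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe("\u00F6")); // ö: U+006F U+0308
        System.out.println(describe("\u0144")); // ń: U+006E U+0301
        System.out.println(describe("\u0142")); // ł: U+0142 (unchanged)
    }
}
```

The combining marks U+0308 and U+0301 both fall in the range matched by \p{InCombiningDiacriticalMarks}, which is why the pattern removes them only after decomposition.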
Enhanced String Simplification
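Some characters, such as the Polish ł, have no canonical decomposition, so NFD leaves them untouched. To handle these non-diacritic special characters, which can still affect search and comparison, consider keeping a table of replacements in Guava's ImmutableMap and applying it as an additional cleanup pass.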
import com.google.common.collect.ImmutableMap;
import java.util.Map;

public class StringSimplifier {
    private static final ImmutableMap<String, String> NONDIACRITICS =
            ImmutableMap.<String, String>builder()
                    // ... (define replacements here)
                    .build();

    public static String simplifiedString(String str) {
        String result = str;
        // Apply each replacement in turn; replace() treats the key as literal text, not a regex.
        for (Map.Entry<String, String> entry : NONDIACRITICS.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }
}
Sample Usage:
String withNonDiacritics = "Białystok";
String simplified = StringSimplifier.simplifiedString(withNonDiacritics);
System.out.println(simplified); // Output: Bialystok (assuming the map contains "ł" -> "l")
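The two steps can also be chained into a single pass. The sketch below is one possible way to do that, under stated assumptions: the class name SimplifyDemo and the entries in the replacement table are illustrative choices, not the original article's map, and a plain java.util.LinkedHashMap stands in for ImmutableMap so the example runs without Guava.

```java
import java.text.Normalizer;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical demo class combining both steps (not part of the original article).
class SimplifyDemo {
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Assumed replacement table: characters that NFD does not decompose.
    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put("\u0142", "l");  // ł
        REPLACEMENTS.put("\u0141", "L");  // Ł
        REPLACEMENTS.put("\u00F8", "o");  // ø
        REPLACEMENTS.put("\u00D8", "O");  // Ø
        REPLACEMENTS.put("\u00DF", "ss"); // ß
    }

    static String simplify(String input) {
        // Step 1: decompose and strip combining diacritical marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String result = MARKS.matcher(decomposed).replaceAll("");
        // Step 2: apply literal replacements for characters NFD leaves intact.
        for (Map.Entry<String, String> entry : REPLACEMENTS.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(simplify("Bia\u0142ystok"));   // Bialystok
        System.out.println(simplify("Bj\u00F6r\u0144"));  // Bjorn
        System.out.println(simplify("Stra\u00DFe"));      // Strasse
    }
}
```

Which characters belong in the table depends on the languages your data contains, so treat the entries above as a starting point rather than a complete list.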
By using these techniques, you can remove diacritical marks and simplify strings for improved search and comparison capabilities.