Introduction
Iterating through the Unicode codepoints of a Java String takes some care, because a String is a sequence of UTF-16 code units (char values) rather than a sequence of codepoints. This article explores different strategies and addresses concerns about how characters outside the Basic Multilingual Plane (BMP) are encoded.
Approaching the Problem
Initially, one might consider using String#codePointAt(int). However, this raises two concerns: the method is indexed by char offset rather than by codepoint offset, and it is not obvious how codepoints outside the BMP are stored or how far to advance after reading one.
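The indexing concern can be seen in a small sketch. The class name CharOffsetDemo and the helper at are made up for illustration; the sample string contains U+1F600, which UTF-16 encodes as the surrogate pair D83D DE00.

```java
class CharOffsetDemo {
    // Returns whatever String#codePointAt reports at the given char offset.
    static int at(String s, int charOffset) {
        return s.codePointAt(charOffset);
    }

    public static void main(String[] args) {
        // "a", then U+1F600 as a surrogate pair, then "b"
        String s = "a\uD83D\uDE00b";

        System.out.println(s.length());                      // 4 char units
        System.out.println(s.codePointCount(0, s.length())); // 3 codepoints

        // Offset 1 is the high surrogate, so the full codepoint is returned.
        System.out.printf("%X%n", at(s, 1)); // 1F600
        // Offset 2 is the low surrogate, which is returned on its own.
        System.out.printf("%X%n", at(s, 2)); // DE00
    }
}
```

Stepping the offset by one char at a time would therefore visit the second half of each pair as if it were a separate character.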
An alternative approach involves using String#charAt(int) to obtain each char and testing whether it falls in the high-surrogates range. This does reveal whether a codepoint lies outside the BMP, but it has a drawback: the surrogate-pair handling has to be written by hand, duplicating logic that String and Character already provide.
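A minimal sketch of that manual approach might look as follows. The class and method names are hypothetical, and passing unpaired surrogates through as-is is one possible policy, not the only one.

```java
import java.util.ArrayList;
import java.util.List;

class ManualSurrogateWalk {
    // Collects the codepoints of s by inspecting individual chars.
    static List<Integer> codePoints(String s) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            boolean pair = Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1));
            if (pair) {
                // Combine the two halves into one supplementary codepoint.
                out.add(Character.toCodePoint(c, s.charAt(i + 1)));
                i += 2;
            } else {
                // BMP character (or an unpaired surrogate, passed through unchanged).
                out.add((int) c);
                i += 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (int cp : codePoints("a\uD83D\uDE00b")) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

The bounds check and the low-surrogate check are easy to forget, which is part of why the library-based loop is preferable.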
The Optimal Solution
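Fortunately, Java provides a canonical way to iterate over codepoints: read each one with String#codePointAt(int), then advance the offset by Character.charCount(int), which is 1 for BMP codepoints and 2 for everything else:
<code class="java">// s is the String to iterate over
final int length = s.length();
for (int offset = 0; offset < length; ) {
    final int codepoint = s.codePointAt(offset);
    // do something with the codepoint
    offset += Character.charCount(codepoint);
}</code>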
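For reference, here is the same loop wrapped in a runnable class. The names CodePointLoop, viaLoop and viaStream are illustrative only. On Java 8 or later, CharSequence#codePoints() returns the same sequence as an IntStream, which the sketch uses as a cross-check.

```java
import java.util.Arrays;

class CodePointLoop {
    // The loop from the article, collecting codepoints into an array.
    static int[] viaLoop(String s) {
        int[] buf = new int[s.length()]; // never more codepoints than chars
        int n = 0;
        for (int offset = 0; offset < s.length(); ) {
            int codepoint = s.codePointAt(offset);
            buf[n++] = codepoint;
            offset += Character.charCount(codepoint);
        }
        return Arrays.copyOf(buf, n);
    }

    // Stream-based equivalent, assuming Java 8 or later.
    static int[] viaStream(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";
        System.out.println(Arrays.toString(viaLoop(s)));
        System.out.println(Arrays.equals(viaLoop(s), viaStream(s)));
    }
}
```

The explicit loop is useful when the char offset of each codepoint is also needed; the stream form is shorter when only the values matter.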
Addressing Concerns
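On the storage question: a Java String holds UTF-16 code units, so every codepoint above U+FFFF is stored as a high/low surrogate pair (two chars), while BMP codepoints take one char. The surrogate range is reserved for this purpose, so a valid BMP character never collides with it. The sketch below checks this with standard Character methods; the class name and the helper unitsFor are made up for illustration.

```java
class SurrogateEncodingDemo {
    // Number of char units the UTF-16 form of a codepoint occupies (1 or 2).
    static int unitsFor(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        int bmp = 0x00E9;            // inside the BMP
        int supplementary = 0x1F600; // outside the BMP

        System.out.println(unitsFor(bmp));           // 1
        System.out.println(unitsFor(supplementary)); // 2

        char[] pair = Character.toChars(supplementary);
        System.out.println(Character.isHighSurrogate(pair[0])); // true
        System.out.println(Character.isLowSurrogate(pair[1]));  // true

        // Character.charCount agrees with the encoded length.
        System.out.println(Character.charCount(supplementary) == pair.length); // true
    }
}
```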
Conclusion
To summarize, iterating through Unicode codepoints in Java Strings requires a deeper understanding of the underlying encoding. However, using the canonical approach outlined in this article provides a correct and efficient solution for this common need.