How do you iterate through Unicode codepoints in Java Strings?

Linda Hamilton
Release: 2024-10-25 14:10:02

Iterating through Unicode Codepoints in Java Strings

Introduction

A Java String is a sequence of 16-bit char values (UTF-16 code units), not of Unicode codepoints, so iterating by codepoint takes a little care. Codepoints outside the Basic Multilingual Plane (BMP), that is, above U+FFFF, occupy two char values known as a surrogate pair. This article compares a few strategies and addresses the usual concerns about how such codepoints are encoded.

Approaching the Problem

Initially, one might consider calling String#codePointAt(int) at successive indices. However, its argument is a char offset, not a codepoint offset, so simply incrementing the index by one lands in the middle of a surrogate pair whenever a codepoint lies outside the BMP.
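As a small illustration of the char-offset issue, the sketch below builds a three-codepoint string whose middle codepoint (U+1F600, chosen arbitrarily) lies outside the BMP. The class name CharOffsetDemo and the helper sample() are made up for this example.

```java
final class CharOffsetDemo {
    // Hypothetical helper: returns "a", U+1F600, "b".
    // U+1F600 is outside the BMP, so it takes two char values.
    static String sample() {
        return "a" + new String(Character.toChars(0x1F600)) + "b";
    }

    public static void main(String[] args) {
        String s = sample();
        System.out.println(s.length());                       // 4 char values
        System.out.println(s.codePointCount(0, s.length()));  // 3 codepoints
        System.out.printf("U+%04X%n", s.codePointAt(1));      // U+1F600: index 1 is the high surrogate
        System.out.printf("U+%04X%n", s.codePointAt(2));      // U+DE00: index 2 is the low surrogate alone
    }
}
```

Index 2 is a valid char offset but not the start of a codepoint, which is why a plain index++ loop over codePointAt yields a stray surrogate.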

An alternative is to read each char with String#charAt(int) and test whether it falls in the high-surrogate range (U+D800 to U+DBFF). This does detect codepoints outside the BMP, but it raises two concerns:

  • Uncertainty about whether an ordinary BMP codepoint could be mistaken for a surrogate (it cannot: the range U+D800 to U+DFFF is reserved for surrogates and assigned to no character)
  • The apparent cost and verbosity of range-testing every char by hand
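For comparison, the manual charAt approach described above might look like the following sketch. The class name ManualSurrogateScan and its codePoints method are hypothetical; the Character methods it calls are standard JDK methods. An unpaired surrogate is passed through as its own value, which is one possible policy rather than the only one.

```java
import java.util.ArrayList;
import java.util.List;

final class ManualSurrogateScan {
    // Hypothetical helper: decodes a String into codepoints by hand.
    static List<Integer> codePoints(String s) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            boolean startsPair = Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1));
            if (startsPair) {
                // Combine the two surrogates into one supplementary codepoint.
                out.add(Character.toCodePoint(c, s.charAt(i + 1)));
                i += 2;
            } else {
                // BMP codepoint (or an unpaired surrogate, passed through as-is).
                out.add((int) c);
                i += 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "a", U+1F600 written as a surrogate pair, "b"
        for (int cp : codePoints("a\uD83D\uDE00b")) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

This works, but it re-implements by hand the pairing logic that codePointAt already performs.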

The Optimal Solution

Fortunately, the idiomatic way to iterate over codepoints still uses String#codePointAt(int), paired with Character.charCount(int) to advance the char offset by the right amount:

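<code class="java">final int length = s.length();  // s is the String being scanned
for (int offset = 0; offset < length; ) {
   // Read the codepoint that starts at this char offset
   final int codepoint = s.codePointAt(offset);

   // do something with the codepoint

   // Advance by 1 char for a BMP codepoint, 2 for a surrogate pair
   offset += Character.charCount(codepoint);
}</code>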
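A self-contained version of this loop, as a sketch, is shown next to String#codePoints(), which has been available since Java 8 and returns the same codepoints as an IntStream. The class name CodePointLoop and the methods viaLoop and viaStream are names invented for this example.

```java
import java.util.Arrays;

final class CodePointLoop {
    // Hypothetical helper: collects codepoints with the offset-based loop.
    static int[] viaLoop(String s) {
        int[] result = new int[s.codePointCount(0, s.length())];
        int n = 0;
        for (int offset = 0; offset < s.length(); ) {
            int codepoint = s.codePointAt(offset);
            result[n++] = codepoint;
            offset += Character.charCount(codepoint);
        }
        return result;
    }

    // Hypothetical helper: the same result via the Java 8+ stream API.
    static int[] viaStream(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // "a", U+1F600, "b"
        System.out.println(Arrays.toString(viaLoop(s)));
        System.out.println(Arrays.equals(viaLoop(s), viaStream(s))); // true
    }
}
```

The stream form is more compact when the codepoints feed a pipeline. The explicit loop is handy when the char offset itself is needed, for example to record where each codepoint starts.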

Addressing Concerns

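  • Java's String API does expose UTF-16 code units, and a codepoint outside the BMP is encoded as a surrogate pair (two char values).
  • For a BMP-range codepoint, codePointAt returns the char value itself, so the loop above handles it correctly.
  • Character.charCount(codepoint) returns 2 for a codepoint outside the BMP and 1 otherwise, so adding it to the offset steps over a whole surrogate pair.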
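These points can be checked directly against the JDK. The sketch below uses U+00E9 and U+1F600 as arbitrary sample codepoints; the class name SurrogateFacts and the helper unitsOf are hypothetical.

```java
final class SurrogateFacts {
    // Hypothetical helper: the UTF-16 char values that encode one codepoint.
    static char[] unitsOf(int codepoint) {
        return Character.toChars(codepoint);
    }

    public static void main(String[] args) {
        int bmp = 0x00E9;            // inside the BMP
        int supplementary = 0x1F600; // outside the BMP

        System.out.println(Character.charCount(bmp));           // 1
        System.out.println(Character.charCount(supplementary)); // 2

        char[] pair = unitsOf(supplementary);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // D83D DE00
        System.out.println(Character.isHighSurrogate(pair[0]));         // true
        System.out.println(Character.isLowSurrogate(pair[1]));          // true
    }
}
```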

Conclusion

To summarize, iterating through Unicode codepoints in Java Strings requires understanding that a String is indexed by char values rather than by codepoints. The loop shown above, which pairs codePointAt with Character.charCount, handles both BMP and supplementary codepoints correctly in a single linear pass.
