What are Surrogate Pairs in Java's UTF-16 Encoding?

Linda Hamilton
Release: 2024-12-05 10:12:11

Surrogate Pairs in Java's UTF-16 Encoding

Surrogate pairs arise when working with Unicode characters whose code points lie above U+FFFF, beyond what a single 16-bit UTF-16 code unit can hold. Java's char type is one such 16-bit code unit, so these characters occupy two char values in a String.

What is a Surrogate Pair?

In UTF-16, a surrogate pair is a combination of two code units that together represent a single code point. When a character cannot be encoded in a single 16-bit code unit, it is represented as follows:

  • A high surrogate code unit is used at the beginning of the pair, with a range of U+D800 to U+DBFF.
  • A low surrogate code unit follows the high surrogate, with a range of U+DC00 to U+DFFF.

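Each surrogate carries 10 bits of payload. The high surrogate holds the upper 10 bits and the low surrogate holds the lower 10 bits of the value obtained by subtracting 0x10000 from the code point. Together, these 20 bits cover every code point from U+10000 to U+10FFFF.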
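
To make the distinction between code units and code points concrete, here is a minimal sketch using only standard java.lang methods. The class name SurrogateInspection and the helper countSurrogateUnits are made up for illustration.

```java
class SurrogateInspection {
    // Hypothetical helper: counts the UTF-16 code units in s that are surrogates.
    static int countSurrogateUnits(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 'A' followed by the supplementary character U+10400.
        String s = "A" + new String(Character.toChars(0x10400));

        System.out.println(s.length());                             // 3 code units
        System.out.println(s.codePointCount(0, s.length()));        // 2 code points
        System.out.println(countSurrogateUnits(s));                 // 2
        System.out.println(Character.isHighSurrogate(s.charAt(1))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(2)));  // true
    }
}
```

Note that length() and charAt() work in code units, while codePointCount() and codePointAt() work in code points.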

Encoding and Decoding

The process of encoding a code point beyond U+FFFF into a surrogate pair is as follows:

Subtract 0x10000 from the code point to get a 20-bit offset.
Shift the offset right by 10 bits to get its upper 10 bits.
Add U+D800 to the upper 10 bits to get the high surrogate code unit.
Mask the offset with 0x3FF to get its lower 10 bits.
Add U+DC00 to the lower 10 bits to get the low surrogate code unit.
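
The arithmetic for splitting a supplementary code point can be sketched in a few lines of Java. The class SurrogateEncoder and its two methods are illustrative names, not part of the JDK. In real code the built-in Character.highSurrogate and Character.lowSurrogate do the same job.

```java
class SurrogateEncoder {
    // Upper 10 bits of (codePoint - 0x10000), offset into the high surrogate range.
    static char highSurrogate(int codePoint) {
        requireSupplementary(codePoint);
        int offset = codePoint - 0x10000;
        return (char) (0xD800 + (offset >> 10));
    }

    // Lower 10 bits of (codePoint - 0x10000), offset into the low surrogate range.
    static char lowSurrogate(int codePoint) {
        requireSupplementary(codePoint);
        int offset = codePoint - 0x10000;
        return (char) (0xDC00 + (offset & 0x3FF));
    }

    // Only code points in U+10000..U+10FFFF need a surrogate pair.
    private static void requireSupplementary(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException(
                String.format("Not a supplementary code point: U+%04X", codePoint));
        }
    }

    public static void main(String[] args) {
        int cp = 0x1F600; // an example code point above U+FFFF
        System.out.printf("%04X %04X%n",
            (int) highSurrogate(cp), (int) lowSurrogate(cp)); // D83D DE00
        // The hand-rolled result agrees with the JDK's own methods.
        System.out.println(highSurrogate(cp) == Character.highSurrogate(cp)); // true
        System.out.println(lowSurrogate(cp) == Character.lowSurrogate(cp));   // true
    }
}
```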

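Decoding a code point from a surrogate pair reverses the process: subtract U+D800 from the high surrogate and shift the result left by 10 bits, add the low surrogate minus U+DC00, then add 0x10000.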
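
A minimal sketch of the decoding direction, assuming the input really is a valid high/low pair. SurrogateDecoder and decode are hypothetical names. The JDK offers Character.toCodePoint for the same purpose.

```java
class SurrogateDecoder {
    // Recombines the two 10-bit payloads and restores the 0x10000 offset.
    static int decode(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("Not a valid surrogate pair");
        }
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char high = (char) 0xD83D;
        char low = (char) 0xDE00;
        System.out.printf("U+%04X%n", decode(high, low));                       // U+1F600
        System.out.println(decode(high, low) == Character.toCodePoint(high, low)); // true
    }
}
```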

Example

Consider the Unicode character U+10400 (𐐀, DESERET CAPITAL LETTER LONG I). To encode this character into a surrogate pair:

Subtract 0x10000 from U+10400: 0x400
Shift 0x400 right by 10 bits: 0x1
Add U+D800 to 0x1: U+D801 (high surrogate code unit)
Mask 0x400 with 0x3FF: 0x000
Add U+DC00 to 0x000: U+DC00 (low surrogate code unit)

The character U+10400 is now represented by the surrogate pair U+D801 U+DC00.
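
One way to check a hand calculation like this is to ask the standard library for the code units of the code point. The sketch below uses Character.toChars. The class SurrogatePairCheck and the helper hexUnits are illustrative names only.

```java
class SurrogatePairCheck {
    // Hypothetical helper: formats the UTF-16 code units of a code point as "U+XXXX ...".
    static String hexUnits(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hexUnits(0x10400)); // U+D801 U+DC00
        System.out.println(hexUnits(0x41));    // U+0041 (fits in one code unit)
    }
}
```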
