What are Surrogate Pairs in Java's UTF-16 Encoding?

Linda Hamilton

Released: 2024-12-05 10:12:11

Surrogate Pairs in Java's UTF-16 Encoding

Surrogate pairs arise when a Unicode character has a code point above U+FFFF, which is too large to fit in a single 16-bit UTF-16 code unit. In Java, a char is one such code unit, so these supplementary characters occupy two char values in a String.

What is a Surrogate Pair?

In UTF-16, a surrogate pair is a combination of two code units that together represent a single code point. When a character cannot be encoded in a single 16-bit code unit, it is represented as follows:

A high surrogate code unit comes first, in the range U+D800 to U+DBFF.
A low surrogate code unit follows the high surrogate, in the range U+DC00 to U+DFFF.

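The high surrogate carries the upper 10 bits of the value obtained by subtracting 0x10000 from the code point, while the low surrogate carries the lower 10 bits. Together, the two 16-bit code units occupy 32 bits and encode one code point in the range U+10000 to U+10FFFF.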
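To make the distinction between code units and code points concrete, here is a minimal sketch using standard java.lang.Character and String methods. The class name SurrogateInspection and the helper countSurrogateUnits are hypothetical names chosen for illustration.

```java
// Illustrative sketch; SurrogateInspection and countSurrogateUnits are hypothetical names.
final class SurrogateInspection {

    /** Counts the UTF-16 code units in s that are high or low surrogates. */
    static int countSurrogateUnits(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isSurrogate(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // 'A' followed by U+10400 written as its two surrogate code units.
        String s = "A\uD801\uDC00";
        System.out.println(s.length());                       // UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));  // Unicode code points
        System.out.println(countSurrogateUnits(s));           // surrogate code units
    }
}
```

String.length() counts code units, so a string containing a supplementary character reports a length larger than its number of code points.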

Encoding and Decoding

The process of encoding a code point beyond U+FFFF into a surrogate pair is as follows:

Subtract 0x10000 from the code point to get a 20-bit offset.
Shift the offset right by 10 bits to get its upper 10 bits.
Add U+D800 to the upper 10 bits to get the high surrogate code unit.
Mask the offset with 0x3FF to get its lower 10 bits.
Add U+DC00 to the lower 10 bits to get the low surrogate code unit.
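The encoding steps can be sketched by hand as follows. This is an illustration only: the class name SurrogateEncoder and its encode method are hypothetical, and in real code the JDK's Character.toChars does the same job.

```java
// Illustrative sketch; SurrogateEncoder and encode are hypothetical names.
final class SurrogateEncoder {

    /** Encodes a supplementary code point (U+10000..U+10FFFF) as {high, low}. */
    static char[] encode(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException(
                    "not a supplementary code point: 0x" + Integer.toHexString(codePoint));
        }
        int offset = codePoint - 0x10000;               // 20-bit offset
        char high = (char) (0xD800 + (offset >> 10));   // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));  // lower 10 bits
        return new char[] { high, low };
    }
}
```

For any supplementary code point, the result should match what Character.toChars returns.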

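Decoding a code point from a surrogate pair reverses the process: subtract U+D800 from the high surrogate and shift the result left by 10 bits, add the low surrogate minus U+DC00, then add 0x10000.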
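A minimal sketch of the reverse direction, assuming the two code units have already been read from a string. The class name SurrogateDecoder and its decode method are hypothetical; the JDK equivalent is Character.toCodePoint.

```java
// Illustrative sketch; SurrogateDecoder and decode are hypothetical names.
final class SurrogateDecoder {

    /** Combines a high and a low surrogate into the code point they encode. */
    static int decode(char high, char low) {
        if (!Character.isHighSurrogate(high) || !Character.isLowSurrogate(low)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        int upper = (high - 0xD800) << 10;  // upper 10 bits of the offset
        int lower = low - 0xDC00;           // lower 10 bits of the offset
        return (upper | lower) + 0x10000;   // undo the initial subtraction
    }
}
```

The validity check matters because an unpaired or swapped surrogate does not encode any code point.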

Example

Consider the code point U+10400, the Deseret capital letter long I (𐐀). To encode this character into a surrogate pair:

Subtract 0x10000 from 0x10400: 0x400
Shift 0x400 right by 10 bits: 0x1
Add 0x1 to U+D800: U+D801 (high surrogate code unit)
Mask 0x400 with 0x3FF: 0x000
Add 0x000 to U+DC00: U+DC00 (low surrogate code unit)

The character U+10400 is therefore represented by the surrogate pair U+D801 U+DC00.
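The worked example can be cross-checked against the JDK. The sketch below uses the standard Character.toChars method to obtain the UTF-16 code units for a code point and formats them in U+ notation. The class name CodeUnitPrinter and its describe method are hypothetical.

```java
// Illustrative sketch; CodeUnitPrinter and describe are hypothetical names.
final class CodeUnitPrinter {

    /** Returns the UTF-16 code units of a code point, e.g. "U+D801 U+DC00". */
    static String describe(int codePoint) {
        char[] units = Character.toChars(codePoint); // one or two code units
        StringBuilder sb = new StringBuilder();
        for (char unit : units) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(0x10400)); // prints "U+D801 U+DC00"
        System.out.println(describe(0x41));    // prints "U+0041"
    }
}
```

Character.highSurrogate and Character.lowSurrogate return the two halves individually when only one of them is needed.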
