Surrogate Pairs in Java's UTF-16 Encoding
Surrogate pairs arise when working with Unicode characters whose code points lie above U+FFFF, beyond what a single 16-bit UTF-16 code unit can hold. Java's char type is one such 16-bit code unit, so these characters occupy two char values in a String.
What is a Surrogate Pair?
In UTF-16, a surrogate pair is a combination of two code units that together represent a single code point. When a character cannot be encoded in a single 16-bit code unit, it is represented as follows:
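The high surrogate (a code unit in the range U+D800 to U+DBFF) carries the upper 10 bits of the code point's offset from 0x10000, while the low surrogate (in the range U+DC00 to U+DFFF) carries the lower 10 bits. Together, they represent one code point in the range U+10000 to U+10FFFF, not a full 32-bit value.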
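To make the distinction between code units and code points concrete, here is a minimal sketch that labels each char of a string as a high surrogate, a low surrogate, or an ordinary BMP code unit. The class name SurrogateInspect and the method describe are hypothetical names chosen for illustration; the Character and String methods are standard JDK calls.

```java
// A minimal sketch; SurrogateInspect and describe are hypothetical names.
class SurrogateInspect {
    // Returns a short description of each UTF-16 code unit in the string.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            String kind = Character.isHighSurrogate(c) ? "high"
                        : Character.isLowSurrogate(c) ? "low"
                        : "bmp";
            sb.append(String.format("%04X:%s ", (int) c, kind));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Build a string holding the single supplementary character U+10400.
        String s = new String(Character.toChars(0x10400));
        System.out.println(s.length());                      // count of char code units
        System.out.println(s.codePointCount(0, s.length())); // count of code points
        System.out.println(describe(s));
    }
}
```

For a string holding one supplementary character, length() reports 2 while codePointCount reports 1, which is why code that indexes by char can split a character in half.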
Encoding and Decoding
The process of encoding a code point beyond U+FFFF into a surrogate pair is as follows:
Subtract 0x10000 from the code point to get a 20-bit value.
Shift that 20-bit value right by 10 bits to get the high surrogate value (its upper 10 bits).
Add 0xD800 to the high surrogate value to get the high surrogate code unit.
Take the lower 10 bits of the 20-bit value (mask with 0x3FF) to get the low surrogate value.
Add 0xDC00 to the low surrogate value to get the low surrogate code unit.
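Decoding a code point from a surrogate pair involves the reverse process: subtract 0xD800 from the high surrogate and shift the result left by 10 bits, add the low surrogate minus 0xDC00, then add 0x10000.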
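The encoding and decoding arithmetic can be sketched directly in Java. This is an illustrative sketch rather than how the JDK implements it; the class name SurrogateCodec and the methods encode and decode are hypothetical, and production code would normally rely on the built-in Character helpers instead.

```java
// A minimal sketch of the arithmetic; SurrogateCodec is a hypothetical name.
class SurrogateCodec {
    // Splits a supplementary code point (U+10000..U+10FFFF) into {high, low}.
    static char[] encode(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int v = codePoint - 0x10000;               // 20-bit offset
        char high = (char) (0xD800 + (v >> 10));   // upper 10 bits
        char low  = (char) (0xDC00 + (v & 0x3FF)); // lower 10 bits
        return new char[] { high, low };
    }

    // Recombines a high/low surrogate pair into the original code point.
    static int decode(char high, char low) {
        if (!Character.isHighSurrogate(high) || !Character.isLowSurrogate(low)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        char[] pair = encode(0x1F600);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
        System.out.printf("%X%n", decode(pair[0], pair[1]));
    }
}
```

The range check in encode matters because code points at or below U+FFFF need no surrogates, and values above U+10FFFF are not valid Unicode code points.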
Example
Consider the Unicode character U+10400, which represents the character 𐐀 (DESERET CAPITAL LETTER LONG I). To encode this character into a surrogate pair:
Subtract 0x10000 from 0x10400: 0x400
Shift 0x400 right by 10 bits: 0x1
Add 0xD800 to 0x1: U+D801 (high surrogate code unit)
Take the lower 10 bits of 0x400: 0x000
Add 0xDC00 to 0x000: U+DC00 (low surrogate code unit)
The character U+10400 is now represented by the surrogate pair U+D801 U+DC00.
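The surrogate pair for U+10400 can be checked against the JDK's own helpers. The sketch below uses Character.highSurrogate, Character.lowSurrogate, and Character.toCodePoint, which are standard methods (the first two are available since Java 7); the class name WorkedExampleCheck and the method pairHex are hypothetical.

```java
// A minimal sketch; WorkedExampleCheck and pairHex are hypothetical names.
class WorkedExampleCheck {
    // Formats the surrogate pair of a supplementary code point as hex.
    static String pairHex(int codePoint) {
        char high = Character.highSurrogate(codePoint);
        char low = Character.lowSurrogate(codePoint);
        return String.format("%04X %04X", (int) high, (int) low);
    }

    public static void main(String[] args) {
        System.out.println(pairHex(0x10400)); // prints "D801 DC00"
        // Round trip: the pair decodes back to the original code point.
        int cp = Character.toCodePoint('\uD801', '\uDC00');
        System.out.println(Integer.toHexString(cp)); // prints "10400"
    }
}
```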