What are Surrogate Pairs and How Do They Encode Characters Beyond Basic Multilingual Plane in Java Strings?-javaTutorial-php.cn

What are Surrogate Pairs and How Do They Encode Characters Beyond Basic Multilingual Plane in Java Strings?

DDD

Release： 2024-12-31 13:10:24

Original

397 people have browsed it

What are Surrogate Pairs and How Do They Encode Characters Beyond Basic Multilingual Plane in Java Strings?

Understanding Surrogate Pairs in Java's String Encoding

While exploring the documentation for StringBuffer's reverse() method, you may encounter the term "surrogate pair." This concept is crucial in the context of Unicode string encoding. Let's delve into what a surrogate pair is and how it relates to the ranges known as low and high surrogates.

Decoding Surrogate Pairs: A Deeper Look into Unicode

Unicode assigns each character a code point ranging from 0x0 to 0x10FFFF. However, Java's internal representation of Unicode strings utilizes UTF-16 encoding, which employs 16-bit code units. Given that 16-bit code units can only represent the range from 0x0 to 0xFFFF, a solution was needed to accommodate characters with code points beyond this limit. This solution came in the form of surrogate pairs.

High and Low Surrogates: Decoding Unicode's Extended Range

Surrogate pairs are constructed using two code units:

High Surrogate: Occupies the code unit range from 0xD800 to 0xDBFF and is used at the beginning of the pair.
Low Surrogate: Falls within the range of 0xDC00 to 0xDFFF and follows the high surrogate.

Together, the high and low surrogates create a 31-bit code point that can represent characters in the range from 0x10000 to 0x10FFFF. This extended range allows for the encoding of characters from various languages, symbols, and emojis.

The above is the detailed content of What are Surrogate Pairs and How Do They Encode Characters Beyond Basic Multilingual Plane in Java Strings?. For more information, please follow other related articles on the PHP Chinese website!