Understanding Surrogate Pairs in Java's String Encoding
While exploring the documentation for StringBuffer's reverse() method, you may encounter the term "surrogate pair." This concept is crucial in the context of Unicode string encoding. Let's delve into what a surrogate pair is and how it relates to the ranges known as low and high surrogates.
Decoding Surrogate Pairs: A Deeper Look into Unicode
Unicode assigns each character a code point ranging from 0x0 to 0x10FFFF. However, Java's internal representation of Unicode strings utilizes UTF-16 encoding, which employs 16-bit code units. Given that 16-bit code units can only represent the range from 0x0 to 0xFFFF, a solution was needed to accommodate characters with code points beyond this limit. This solution came in the form of surrogate pairs.
High and Low Surrogates: Decoding Unicode's Extended Range
Surrogate pairs are constructed using two code units:
Together, the high and low surrogates create a 31-bit code point that can represent characters in the range from 0x10000 to 0x10FFFF. This extended range allows for the encoding of characters from various languages, symbols, and emojis.
The above is the detailed content of What are Surrogate Pairs and How Do They Encode Characters Beyond Basic Multilingual Plane in Java Strings?. For more information, please follow other related articles on the PHP Chinese website!