Java's Internal Representation of Strings
Java uses UTF-16 for its internal text representation. A String is a sequence of 16-bit UTF-16 code units (char values), not of whole characters: a code point in the Basic Multilingual Plane takes one code unit, while a supplementary code point takes two. This representation lets Java support the full Unicode range, including non-Latin alphabets.
Modified UTF-8 for Serialization
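While Java uses UTF-16 internally, it uses a modified version of UTF-8 for string serialization, for example in DataOutput.writeUTF, object serialization, and class-file constants. Modified UTF-8 is not fully interchangeable with standard UTF-8. It encodes the null character (U+0000) as the two bytes 0xC0 0x80 rather than a single zero byte, and it encodes a supplementary character as two 3-byte surrogate halves rather than one 4-byte sequence. In other words, it follows CESU-8 strictly for everything except the null character. Data exchanged with systems that expect standard UTF-8, such as web browsers, should be encoded explicitly with the UTF-8 charset instead.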
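The difference between the two encodings can be observed directly. The following is a minimal sketch, not a definitive implementation: the class name ModifiedUtf8Demo and its helper methods are made up for illustration, and it relies on DataOutputStream.writeUTF, which writes a 2-byte length prefix followed by the modified UTF-8 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {

    // Hypothetical helper: encodes s with writeUTF and drops the
    // 2-byte length prefix so only the modified UTF-8 payload remains.
    public static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withLength = buffer.toByteArray();
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    // Hypothetical helper: encodes s as standard UTF-8.
    public static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String nul = "\u0000";          // the null character
        String emoji = "\uD83D\uDE00";  // U+1F600 as a surrogate pair

        System.out.println(standardUtf8(nul).length);    // 1
        System.out.println(modifiedUtf8(nul).length);    // 2 (0xC0 0x80)
        System.out.println(standardUtf8(emoji).length);  // 4
        System.out.println(modifiedUtf8(emoji).length);  // 6 (two 3-byte halves)
    }
}
```

For strings that contain neither U+0000 nor supplementary characters, both helpers return identical bytes.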
Character Representation in Memory
A char primitive occupies two bytes in memory and holds one UTF-16 code unit. Every code point up to U+FFFF (65535) fits in a single char. Code points higher than 65535 require a surrogate pair of two char values, resulting in a 4-byte representation.
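As a rough illustration of the char-versus-code-point distinction, the sketch below compares String.length(), which counts char values, with codePointCount, which counts code points. The class name CodeUnitDemo and the helper utf16Bytes are hypothetical, and the byte figure is only the logical UTF-16 size, not the actual heap footprint of the String object.

```java
public class CodeUnitDemo {

    // Hypothetical helper: logical UTF-16 size in bytes (2 bytes per char).
    public static int utf16Bytes(String s) {
        return s.length() * Character.BYTES;
    }

    public static void main(String[] args) {
        String bmp = "A";                                            // U+0041
        String supplementary = new String(Character.toChars(0x1F600)); // U+1F600

        System.out.println(bmp.length());                            // 1
        System.out.println(utf16Bytes(bmp));                         // 2

        System.out.println(supplementary.length());                  // 2 char values
        System.out.println(
            supplementary.codePointCount(0, supplementary.length())); // 1 code point
        System.out.println(utf16Bytes(supplementary));               // 4

        // The two char values form a high/low surrogate pair.
        System.out.println(Character.isHighSurrogate(supplementary.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(supplementary.charAt(1)));  // true
    }
}
```

Code that iterates over user-visible characters is usually safer working with code points (for example via String.codePoints()) than with individual char values.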
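Some JVMs store strings more compactly than this. In Java 6, HotSpot offered an optional flag called UseCompressedStrings that used a byte-based representation for strings that did not require UTF-16; it was not enabled by default and was later dropped. Since Java 9, the Compact Strings feature (JEP 254) stores strings that contain only ISO-8859-1 characters at one byte per character, and it is enabled by default. Both are implementation-specific optimizations that do not change the API: a String still behaves as a sequence of UTF-16 char values.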