Determining UTF-8 Encoded String Length
In C++, std::string does not track the encoding of its contents, and length() returns the number of bytes (char elements) rather than the number of characters. For a UTF-8 encoded string that contains multi-byte characters, the two values differ. To find the number of code points, consider the following byte sequence patterns:
0x00000000 - 0x0000007F: 0xxxxxxx
0x00000080 - 0x000007FF: 110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
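The patterns above mean that the first byte of a sequence reveals how many bytes the sequence occupies. As a minimal sketch, the check can be written as a small helper. The name utf8_sequence_length is hypothetical and does not come from any library. The helper only inspects the leading bits, so it does not reject overlong or out-of-range encodings.

```cpp
#include <cstddef>

// Hypothetical helper (not a standard library function): returns the number
// of bytes in the UTF-8 sequence that starts with `lead`, or 0 if `lead` is
// a continuation byte (10xxxxxx) or matches none of the lead-byte patterns.
std::size_t utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1; // 0xxxxxxx
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 0;                            // 10xxxxxx or no matching pattern
}
```

For instance, the byte 0xC3 matches 110xxxxx and so starts a two-byte sequence, while 0xA9 matches 10xxxxxx and can only appear inside a sequence.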
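To calculate the actual length of a UTF-8 encoded string, count only the bytes that start a sequence, that is, every byte that does not match the continuation pattern 10xxxxxx.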
The following code snippet illustrates the implementation:
<code class="cpp">int len = 0;
const char *s = str.c_str(); // convert to C-style string
// Count every byte that is not a continuation byte (10xxxxxx).
while (*s) len += (*s++ & 0xc0) != 0x80;</code>
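The same counting rule can be wrapped in a reusable function. The sketch below uses a hypothetical name, utf8_length, and assumes the input is already valid UTF-8. It iterates over the std::string directly instead of going through c_str(), so an embedded '\0' byte does not end the count early. The result is the number of code points, which is not always the number of characters a user sees on screen (combining marks, for example, are counted separately).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts the code points in a string that is assumed
// to hold valid UTF-8. Every byte that is not a continuation byte
// (10xxxxxx) marks the start of a new code point.
std::size_t utf8_length(const std::string& str) {
    std::size_t count = 0;
    for (unsigned char byte : str) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "h\xC3\xA9" "llo" (the word "héllo" written as explicit bytes), size() reports 6 bytes while utf8_length reports 5 code points.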