Determining the Actual Length of UTF-8 Encoded Strings in C++
UTF-8 is a variable-width character encoding: each character occupies between one and four bytes, so the length of a string in bytes does not necessarily equal the number of characters it contains. This matters when working with UTF-8 strings in C++, because std::string::length() returns the number of bytes in the string, not the number of characters.
To determine the character count of a UTF-8 encoded string in C++, you can use the following approach:
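Count the number of first bytes in the string. A first byte is any byte that does not match the bit pattern 10xxxxxx. Bytes of that form are continuation bytes, so every other byte is either a single-byte (ASCII) character or the lead byte of a multi-byte sequence.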
Here is an example implementation:
<code class="cpp">int len = 0; while (*s) len += (*s++ & 0xc0) != 0x80;</code>
In this code, the s pointer iterates through the string, and the & 0xc0 operation keeps only the top two bits of each byte. If those bits are 10 (the masked value equals 0x80, which marks a continuation byte), the comparison yields 0 and the count is unchanged; otherwise it yields 1 and the count is incremented. The pointer advances to the next byte in either case. The loop stops at the terminating null byte, at which point len holds the number of characters (Unicode code points) in the string, assuming the input is valid UTF-8.
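The same idea can be wrapped in a small function. This is only a minimal sketch: the name utf8_length is hypothetical, and it assumes a non-null, null-terminated string that already holds well-formed UTF-8 (no validation is performed).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: counts UTF-8 code points in a null-terminated string.
// Assumption: `s` is non-null and contains well-formed UTF-8; malformed input
// is not detected and simply yields a count of non-continuation bytes.
std::size_t utf8_length(const char* s) {
    assert(s != nullptr);
    std::size_t len = 0;
    for (; *s != '\0'; ++s) {
        // Work on an unsigned value so the mask behaves the same whether
        // plain char is signed or unsigned on the target platform.
        const unsigned char byte = static_cast<unsigned char>(*s);
        // Continuation bytes look like 10xxxxxx; any other byte starts a code point.
        if ((byte & 0xC0) != 0x80) {
            ++len;
        }
    }
    return len;
}
```

For example, the string "h\xC3\xA9llo" ("héllo" with é encoded as two bytes) is 6 bytes long according to std::strlen, while utf8_length returns 5. Note that this counts code points, which may differ from what a user perceives as characters when combining marks are involved.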