Get Multibyte Character Count before Match with preg_match() (PREG_OFFSET_CAPTURE Parameter is Unhelpfully Counting Bytes)
When preg_match() is called with the PREG_OFFSET_CAPTURE flag, the offsets it returns for each match are byte offsets, not character offsets. This holds even when the "u" modifier makes the pattern and subject be treated as UTF-8. As a result, whenever multibyte characters precede a match, the reported offset is larger than the match's character position.
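To see the discrepancy concretely, the short sketch below compares the offset reported by preg_match() with the character position reported by mb_strpos(). The variable names are only illustrative, and the mbstring extension is assumed to be available.

```php
<?php
// "¡" is two bytes in UTF-8 (0xC2 0xA1), so "H" is character 1 but byte 2.
$str = "\xC2\xA1Hola!";
preg_match('/H/u', $str, $m, PREG_OFFSET_CAPTURE);

$byteOffset = $m[0][1];                          // offset as reported by preg_match()
$charOffset = mb_strpos($str, 'H', 0, 'UTF-8');  // position counted in characters

var_dump($byteOffset); // int(2)
var_dump($charOffset); // int(1)
```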
Solution:
To get the character offset of a match, use substr() to take the part of the subject that precedes the reported byte offset, then count its characters with mb_strlen():
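$str = "\xC2\xA1Hola!"; // "¡Hola!" as UTF-8 bytes
preg_match('/H/u', $str, $a_matches, PREG_OFFSET_CAPTURE);
// $a_matches[0][1] is the byte offset of the match (2 here)
echo mb_strlen(substr($str, 0, $a_matches[0][1]), 'UTF-8'); // prints 1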
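If the pattern has several capture groups, the same conversion can be applied to every entry in the match array. The wrapper below is a minimal sketch of that idea. The function name preg_match_char_offsets() is hypothetical (it is not part of PHP), and the code assumes the subject is valid UTF-8 and that mbstring is loaded.

```php
<?php
/**
 * Hypothetical helper, not a built-in: runs preg_match() with
 * PREG_OFFSET_CAPTURE and rewrites each captured offset from a byte
 * offset into a UTF-8 character offset.
 */
function preg_match_char_offsets(string $pattern, string $subject, array &$matches = [])
{
    $result = preg_match($pattern, $subject, $matches, PREG_OFFSET_CAPTURE);
    if ($result !== 1) {
        return $result; // 0 (no match) or false (error); nothing to convert
    }
    foreach ($matches as $key => $match) {
        $byteOffset = $match[1];
        // Groups that did not participate in the match carry offset -1; skip them.
        if ($byteOffset >= 0) {
            $matches[$key][1] = mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
        }
    }
    return $result;
}

// Usage: "¡Hola, señor!" written as UTF-8 bytes.
$subject = "\xC2\xA1Hola, se\xC3\xB1or!";
preg_match_char_offsets('/se(.)or/u', $subject, $found);

var_dump($found[0][1]); // int(7): "señor" starts at character 7 (byte 8)
var_dump($found[1][1]); // int(9): "ñ" is character 9 (byte 10)
```

Counting the prefix again for each group is simple but takes time proportional to the offset. For very long subjects with many groups, it may be worth converting the offsets in a single pass.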