This article shows how to truncate strings containing Chinese characters in Python without splitting a multi-byte character, with worked examples for the UTF-8 and GB18030 encodings.
When a string contains multi-byte characters, truncation has to take into account how many bytes the character at the cut point occupies. A multi-byte character must not be split in the middle; otherwise the truncated text ends in garbled output.
Implementations for UTF-8 and GB18030 are given below; either one can be used. If your data is in the other encoding, transcode it first with decode and encode.
Method 1: UTF-8 encoding:
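def subString(string, length):
    # string: UTF-8 encoded bytes (Python 3); length: maximum number of bytes to keep
    if length >= len(string):
        return string
    i = 0  # start of the current character; everything before i is kept
    while i < length:
        ch = string[i]  # indexing bytes yields an int (in Python 2 use ord(string[i]))
        if ch >= 252:    # 1111110x: 6-byte sequence (legacy form)
            n = 6
        elif ch >= 248:  # 111110xx: 5-byte sequence (legacy form)
            n = 5
        elif ch >= 240:  # 11110xxx: 4-byte sequence
            n = 4
        elif ch >= 224:  # 1110xxxx: 3-byte sequence
            n = 3
        elif ch >= 192:  # 110xxxxx: 2-byte sequence
            n = 2
        else:            # single-byte (ASCII) character
            n = 1
        if i + n > length:
            break        # this character would cross the cut point
        i += n
    return string[:i]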
Method 2: GB18030 encoding:
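def cut_string_off(string, s_len):
    # string: GB18030 encoded bytes (Python 3); s_len: maximum number of bytes to keep
    if len(string) == 0 or s_len <= 0:
        return string  # empty input or non-positive limit: return the input unchanged
    if s_len >= len(string):
        return string
    len_num = 0  # start of the current character; everything before len_num is kept
    while len_num < s_len:
        tmp_c = string[len_num]  # indexing bytes yields an int (in Python 2 use ord(...))
        step = 1  # ASCII byte, or a byte that does not start a known sequence
        if 0x81 <= tmp_c <= 0xFE:
            # Safe to read: len_num < s_len < len(string)
            tmp_nextc = string[len_num + 1]
            if 0x40 <= tmp_nextc <= 0xFE:
                step = 2  # two-byte character
            elif 0x30 <= tmp_nextc <= 0x39:
                step = 4  # four-byte character
        if len_num + step > s_len:
            break  # this character would cross the cut point
        len_num += step
    return string[:len_num]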