How to use python to batch modify the encoding format of text files

WBOY
Release: 2023-05-01 19:13:11

Use python to batch modify the encoding format of text files

Text files can be converted in batches between encodings such as ASCII, GB2312, and UTF-8. In terms of character coverage, UTF-8 > GB2312 > ASCII, so converting from GB2312 to UTF-8 is the safe direction. Converting the other way can fail or produce garbled text whenever a character has no GB2312 equivalent.
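To make the direction of conversion concrete, here is a minimal sketch that uses only Python's built-in codecs. The helper name `can_encode` and the sample strings are made up for illustration. It checks whether a piece of text fits into a target encoding, which is the condition that decides whether a conversion is safe.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the target encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Illustrative samples (assumptions, not taken from the article's files)
english = "hello"   # plain ASCII
chinese = "编码"     # simplified Chinese, part of GB2312
korean = "한글"      # Hangul, which GB2312 does not contain

for label, text in (("english", english), ("chinese", chinese), ("korean", korean)):
    fits = {enc: can_encode(text, enc) for enc in ("ascii", "gb2312", "utf-8")}
    print(label, fits)
```

Text that fits in GB2312 always fits in UTF-8, but not the reverse, which is why the script below only converts towards UTF-8 by default.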

The main differences between GB2312 and UTF-8:

Character coverage: UTF-8 > GB2312 (UTF-8 can encode every Unicode character, while GB2312 covers mainly simplified Chinese characters plus ASCII and a limited set of other symbols).

Storage size: UTF-8 > GB2312 for Chinese text (a Chinese character usually takes 3 bytes in UTF-8 and 2 bytes in GB2312, so Chinese-heavy files are larger in UTF-8; ASCII characters take 1 byte in both).

Scope of application: GB2312 is a localized character set used mainly in mainland China. UTF-8 is an international encoding that covers the characters needed by all countries, so UTF-8 text displays correctly in any browser that supports UTF-8, regardless of region.
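The storage difference can be measured directly. The sketch below is only an illustration with made-up sample strings and a hypothetical helper `encoded_sizes`. It counts the bytes each encoding needs for the same text.

```python
def encoded_sizes(text, encodings=("gb2312", "utf-8")):
    """Return the number of bytes text occupies in each encoding."""
    return {enc: len(text.encode(enc)) for enc in encodings}


# Illustrative samples (assumptions): one ASCII-only string, one Chinese string
ascii_text = "hello"
chinese_text = "你好世界"

print("ASCII text:  ", encoded_sizes(ascii_text))
print("Chinese text:", encoded_sizes(chinese_text))
```

For ASCII-only content the two encodings give the same size, so the overhead of UTF-8 only shows up in files that contain a lot of Chinese text.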

import sys
import chardet
 
def get_encoding_type(file_name):
    '''Return chardet's guess for the encoding of a text file.'''
    with open(file_name, 'rb') as f:
        data = f.read()
    # Example result: {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
    return chardet.detect(data)
 
def convert_encoding_type(filename_in, filename_out, encode_in="gb2312", encode_out="utf-8"):
    '''Convert a text file from encode_in to encode_out.'''
    # Example: convert_encoding_type('flash.c', 'flash_gb2312.c',
    #                                encode_in='utf-8', encode_out='gb2312')
    # The input is read completely before the output is opened, so
    # filename_in and filename_out may be the same path.
    # newline='' keeps the original line endings unchanged.
    with open(filename_in, mode='r', encoding=encode_in, newline='') as fi:
        data = fi.read()
    with open(filename_out, mode='w', encoding=encode_out, newline='') as fo:
        fo.write(data)
 
if __name__ == "__main__":
    # sys.argv[1] is a text file that lists one full file path per line
    filename_of_files = sys.argv[1]
    with open(filename_of_files, 'r') as f:
        lines = f.readlines()
    for line in lines:
        file_name = line.strip()  # drop the trailing newline (and any '\r')
        if not file_name:
            continue  # skip blank lines
        encoding_type = get_encoding_type(file_name)
        # Only files detected as GB2312 are converted, in place, to UTF-8
        if encoding_type['encoding'] == 'GB2312':
            print(encoding_type)
            convert_encoding_type(file_name, file_name)
            print(file_name)
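Because the script passes the same path as input and output, a failure while writing (for example, a character that the target encoding cannot represent) can leave the original file truncated. One possible safeguard, sketched here with a hypothetical helper `convert_in_place` that is not part of the original script, is to write the converted text to a temporary file in the same directory and swap it in only after the write succeeds.

```python
import os
import tempfile


def convert_in_place(path, encode_in="gb2312", encode_out="utf-8"):
    """Re-encode path, replacing it only after the new copy is fully written."""
    with open(path, "r", encoding=encode_in, newline="") as src:
        data = src.read()  # raises UnicodeDecodeError if encode_in is wrong

    # Write to a temporary file next to the original, then swap it in.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding=encode_out, newline="") as dst:
            dst.write(data)
        os.replace(tmp_path, path)
    except BaseException:
        # Conversion failed: discard the partial copy, keep the original intact.
        os.remove(tmp_path)
        raise
```

In the main loop above, `convert_in_place(file_name)` could stand in for `convert_encoding_type(file_name, file_name)`. The default arguments mirror the original function's GB2312 to UTF-8 direction.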

Supplement: batch conversion of files to UTF-8 with Python

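import glob
import os
from chardet import detect

xml_path = './'  # directory to scan
# Assumption: the variable name suggests XML files, so only *.xml is matched here.
for path in glob.glob(os.path.join(xml_path, '*.xml')):
    with open(path, 'rb+') as f:
        content = f.read()
        codeType = detect(content)['encoding'] or 'utf8'  # fall back if detection fails
        content = content.decode(codeType, "ignore").encode("utf8")
        f.seek(0)
        f.truncate()  # clear the old bytes before writing the converted data
        f.write(content)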
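Note that the snippet above decodes with the "ignore" error policy, which silently drops any bytes that do not fit the detected encoding. The following sketch compares Python's three common policies on a made-up byte string. The helper `decode_with_policy` and the sample data are assumptions for illustration only.

```python
def decode_with_policy(data, encoding, errors):
    """Decode bytes under the given error policy; return (text, error)."""
    try:
        return data.decode(encoding, errors), None
    except UnicodeDecodeError as exc:
        return None, exc


# Assumed sample: valid GB2312 bytes with one stray invalid byte in the middle
raw = "编码".encode("gb2312") + b"\xff" + b"ok"

for policy in ("strict", "replace", "ignore"):
    text, error = decode_with_policy(raw, "gb2312", policy)
    status = "failed" if error else "decoded"
    print(policy, status, repr(text))
```

"strict" stops on the bad byte, "replace" marks it with U+FFFD so the damage stays visible, and "ignore" removes it without a trace. For a batch job it may be worth logging files that fail under "strict" before falling back to a lossy policy.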

source:yisu.com