CSV files, commonly used for data exchange, often contain accented characters, which must be stored in an encoding such as UTF-8 to survive intact. In Python 2, however, the csv module does not accept Unicode input; its reader works on byte strings only.
Reading a UTF-8 CSV file that contains accented French or Spanish characters can therefore fail, even when the code tries to handle the encoding, with an exception such as:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 68: ordinal not in range(128)
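The byte 0xc3 in that message is typically the first byte of a two-byte UTF-8 sequence for an accented letter such as é. As a minimal sketch (written for Python 3, where text and bytes are separate types; the sample word and the helper name `try_ascii_decode` are assumptions made for illustration), the same failure can be reproduced by decoding UTF-8 bytes with the ASCII codec:

```python
# Python 3 sketch: encode() goes from text to bytes, decode() from bytes to text.
text = "café"                        # Unicode text (sample value, not from the article)
utf8_bytes = text.encode("utf-8")    # text -> bytes; é becomes the two bytes c3 a9


def try_ascii_decode(data):
    """Return the decoded text, or the error message if ASCII decoding fails."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)


# The ASCII codec cannot represent 0xc3, so this yields the error message.
message = try_ascii_decode(utf8_bytes)

# Decoding with the codec that actually produced the bytes succeeds.
round_trip = utf8_bytes.decode("utf-8")
```

In Python 2 the same decode step happens implicitly when `encode` is called on a byte string, which is why the error can appear in code that never mentions ASCII.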
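The solution lies in understanding the direction of the encode method: it converts Unicode strings into byte strings, not the other way round. Calling it on data that is already a byte string makes Python 2 first decode the bytes implicitly with the ASCII codec, which is what raises the error. For reading UTF-8 text files in general, the codecs module, and codecs.open in particular, is worth a look. The csv module, however, expects UTF-8 byte strings as input, which is exactly what a plain open already provides, so the code can be simplified to decode each cell after parsing: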
<code class="python">import csv

# Python 2: csv.reader parses byte strings, so decode each cell afterwards.
def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        # Turn every UTF-8 byte string into a unicode object.
        yield [unicode(cell, 'utf-8') for cell in row]

filename = 'da.csv'
reader = unicode_csv_reader(open(filename))
for field1, field2, field3 in reader:
    print field1, field2, field3
</code>
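For comparison, here is a rough Python 3 sketch of the same task. It assumes Python 3, where `open` accepts an `encoding` argument and `csv.reader` yields `str` values directly; the function name `read_utf8_csv` and the sample row are hypothetical and only serve to make the example self-contained.

```python
import csv
import os
import tempfile


def read_utf8_csv(path):
    """Read a UTF-8 encoded CSV file and return its rows as lists of str."""
    # newline="" leaves line-ending handling to the csv module.
    with open(path, encoding="utf-8", newline="") as handle:
        return list(csv.reader(handle))


# Write a small sample file so the sketch can run on its own.
fd, sample_path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
    csv.writer(handle).writerows([["café", "niño", "déjà vu"]])

rows = read_utf8_csv(sample_path)
os.remove(sample_path)

for field1, field2, field3 in rows:
    print(field1, field2, field3)
```

Because decoding happens when the file is opened, no per-cell conversion is needed in this variant.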
If the input data is not in UTF-8 but in another encoding, such as ISO-8859-1, each line must first be transcoded:
<code class="python">line.decode('whateverweirdcodec').encode('utf-8')</code>
However, this is often unnecessary, since the csv module can directly handle ISO-8859-* encoded byte strings; in that case the cells should be decoded with that same codec rather than 'utf-8'.
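The transcoding step can be sketched as a small helper. This is an illustration only, written for Python 3, where the input must be a `bytes` object; the function name `transcode_to_utf8` and the sample bytes are assumptions, not part of the original code.

```python
def transcode_to_utf8(raw, source_encoding):
    """Re-encode bytes from source_encoding to UTF-8.

    raw is decoded to text first, then encoded again; a wrong
    source_encoding either raises UnicodeDecodeError or silently
    produces the wrong characters, so it has to be known in advance.
    """
    return raw.decode(source_encoding).encode("utf-8")


# In ISO-8859-1 the letter é is the single byte e9; in UTF-8 it is c3 a9.
latin1_line = b"caf\xe9,ni\xf1o\n"
utf8_line = transcode_to_utf8(latin1_line, "iso-8859-1")
```

The result can then be decoded as UTF-8 like any other line of the file.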