Detailed explanation of the character encoding problem when lxml processes xml-XML/RSS Tutorial-php.cn

Home

Backend Development

XML/RSS Tutorial

Detailed explanation of the character encoding problem when lxml processes xml

黄舟

Mar 17, 2017 pm 04:53 PM

In order to simplify the problem, the content of xml is simplified into the following form:

<?xml version="1.0" encoding="gbk"?><DOCUMENT><da><![CDATA[中文，就是任性]]></da></DOCUMENT>

Copy after login

Its encoding is gbk, and one of the nodes is a Chinese character
Use lxml The following exception occurred when extracting the value of the node

lxml.etree.XMLSyntaxError: Extra content at the end of the document

Copy after login

The corresponding Python script at this time is:

tst = u'<?xml version="1.0" encoding="gbk"?><DOCUMENT><da><![CDATA[中文，就是任性]]></da></DOCUMENT>'
for event,element in etree.iterparse(BytesIO(tst.encode('utf-8'))):
    print("%s, %s" % (element.tag, element.text))

Copy after login

But before simplification, another exception was reported

lxml.etree.XMLSyntaxError: input conversion failed due to input error, bytes 0x8B 0x2C 0xE6 0x9D

Copy after login

No matter which exception it is, it is probably related to the encoding form of the characters.
After various attempts to no avail, I later saw this article on stackoverflow. The problem mentioned in the article is related to the encoding value in xml. I tried adding a piece of code

tst = u'<?xml version="1.0" encoding="gbk"?><DOCUMENT><da><![CDATA[中文，就是任性]]></da></DOCUMENT>'
tst = tst.replace('encoding="gbk"', 'encoding="utf-8"')
for event,element in etree.iterparse(BytesIO(tst.encode('utf-8'))):
    print("%s, %s" % (element.tag, element.text))

Copy after login

Added a replacement statement, replace the previous encoding="gbk" with encoding:"utf-8"
So we finally got the result:

da, 中文，就是任性
DOCUMENT, None

Copy after login

The above is the detailed content of Detailed explanation of the character encoding problem when lxml processes xml. For more information, please follow other related articles on the PHP Chinese website!

Statement of this Website

The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress images for free

Clothoff.io

AI clothes remover

Video Face Swap

Swap faces in any video effortlessly with our completely free AI face swap tool!

Hot Article

Assassin's Creed Shadows: Seashell Riddle Solution

3 weeks ago By DDD

What's New in Windows 11 KB5054979 & How to Fix Update Issues

2 weeks ago By DDD

Where to find the Crane Control Keycard in Atomfall

3 weeks ago By DDD

Saving in R.E.P.O. Explained (And Save Files)

1 months ago By 尊渡假赌尊渡假赌尊渡假赌

Assassin's Creed Shadows - How To Find The Blacksmith And Unlock Weapon And Armour Customisation

4 weeks ago By DDD

Hot Tools

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

God-level code editing software (SublimeText3)

Hot Topics

Where is the login entrance for gmail email?

7571

CakePHP Tutorial

1386

What is the format of the account name of steam

win11 activation key permanent

nyt connections hints and answers

110

Related knowledge

Can I open an XML file using PowerPoint? Feb 19, 2024 pm 09:06 PM

Can XML files be opened with PPT? XML, Extensible Markup Language (Extensible Markup Language), is a universal markup language that is widely used in data exchange and data storage. Compared with HTML, XML is more flexible and can define its own tags and data structures, making the storage and exchange of data more convenient and unified. PPT, or PowerPoint, is a software developed by Microsoft for creating presentations. It provides a comprehensive way of

Convert XML data to CSV format in Python Aug 11, 2023 pm 07:41 PM

Convert XML data in Python to CSV format XML (ExtensibleMarkupLanguage) is an extensible markup language commonly used for data storage and transmission. CSV (CommaSeparatedValues) is a comma-delimited text file format commonly used for data import and export. When processing data, sometimes it is necessary to convert XML data to CSV format for easy analysis and processing. Python is a powerful

How to solve the problem of garbled characters in tomcat logs? Dec 28, 2023 pm 01:50 PM

What are the methods to solve the problem of garbled tomcat logs? Tomcat is a popular open source JavaServlet container that is widely used to support the deployment and running of JavaWeb applications. However, sometimes garbled characters appear when using Tomcat to record logs, which causes a lot of trouble to developers. This article will introduce several methods to solve the problem of garbled Tomcat logs. Adjust Tomcat's character encoding settings. Tomcat uses ISO-8859-1 character encoding by default.

Handling errors and exceptions in XML using Python Aug 08, 2023 pm 12:25 PM

Handling Errors and Exceptions in XML Using Python XML is a commonly used data format used to store and represent structured data. When we use Python to process XML, sometimes we may encounter some errors and exceptions. In this article, I will introduce how to use Python to handle errors and exceptions in XML, and provide some sample code for reference. Use try-except statement to catch XML parsing errors When we use Python to parse XML, sometimes we may encounter some

Python parsing special characters and escape sequences in XML Aug 08, 2023 pm 12:46 PM

Python parses special characters and escape sequences in XML XML (eXtensibleMarkupLanguage) is a commonly used data exchange format used to transfer and store data between different systems. When processing XML files, you often encounter situations that contain special characters and escape sequences, which may cause parsing errors or misinterpretation of the data. Therefore, when parsing XML files using Python, we need to understand how to handle these special characters and escape sequences. 1. Special characters and

How to handle XML and JSON data formats in C# development Oct 09, 2023 pm 06:15 PM

How to handle XML and JSON data formats in C# development requires specific code examples. In modern software development, XML and JSON are two widely used data formats. XML (Extensible Markup Language) is a markup language used to store and transmit data, while JSON (JavaScript Object Notation) is a lightweight data exchange format. In C# development, we often need to process and operate XML and JSON data. This article will focus on how to use C# to process these two data formats, and attach

Using Python to implement data verification in XML Aug 10, 2023 pm 01:37 PM

Using Python to implement data validation in XML Introduction: In real life, we often deal with a variety of data, among which XML (Extensible Markup Language) is a commonly used data format. XML has good readability and scalability, and is widely used in various fields, such as data exchange, configuration files, etc. When processing XML data, we often need to verify the data to ensure the integrity and correctness of the data. This article will introduce how to use Python to implement data verification in XML and give the corresponding

Convert POJO to XML using Jackson library in Java? Sep 18, 2023 pm 02:21 PM

Jackson is a Java-based library that is useful for converting Java objects to JSON and JSON to Java objects. JacksonAPI is faster than other APIs, requires less memory area, and is suitable for large objects. We use the writeValueAsString() method of the XmlMapper class to convert the POJO to XML format, and the corresponding POJO instance needs to be passed as a parameter to this method. Syntax publicStringwriteValueAsString(Objectvalue)throwsJsonProcessingExceptionExampleimp

See all articles