Home Backend Development Python Tutorial python用于url解码和中文解析的小脚本(python url decoder)

python用于url解码和中文解析的小脚本(python url decoder)

Jun 16, 2016 am 08:46 AM
url

复制代码 代码如下:

# -*- coding: utf8 -*-
#! python
print(repr("测试报警,xxxx是大猪头".decode("UTF8").encode("GBK")).replace("\\x","%"))


注意第一个 decode("UTF8") 要与文件声明的编码一样。

最开始对这个问题的接触,来自于一个Javascript解谜闯关的小游戏,某一关的提示如下:

刚开始的几关都是很简单很简单的哦~~这一关只是简单的字符串变形而已…..

后面是一大长串开头是%5Cu4e0b%5Cu4e00%5Cu5173%5Cu7684这样的字符串。
这种东西以前经常在浏览器的地址栏见到,就是一直不知道怎么转换成能看懂的东东,
网上google了一下,结合python的url解码和unicode解码,解决方式如下:

复制代码 代码如下:

import urllib escaped_str="%5Cu4e0b%5Cu4e00%5Cu5173%5Cu7684%5Cu9875%5Cu9762%5Cu540d%5Cu5b57%5Cu662f%5Cx20%5Cx69%5Cx32%5Cx6a%5Cx62%5Cx6a%5Cx33%5Cx69%5Cx34%5Cx62%5Cx62%5Cx35%5Cx34%5Cx62%5Cx35%5Cx32%5Cx69%5Cx62%5Cx33%5Cx2e%5Cx68%5Cx74%5Cx6d"
print urllib.unquote(escaped_str).decode('unicode-escape')

最近,我对firefox的autoproxy插件中的gfwlist中的中文词汇(用过代理的同学们,你们懂的)产生了兴趣,然而这些网址都是用url编码的,比如http://zh.wikipedia.org/wiki/%E9%97%A8,需要使用正则表达式将被url编码的中文字符提取出来,写了个小脚本如下:

复制代码 代码如下:

import urllib
import re
with open("listfile","r") as f:
    for url_str in f:
        match=re.compile("((%\w{2}){3,})").findall(url_str)
        #汉字url编码的样式是:百分号+2个十六进制数,重复3次

        if match!=None:
            #如果匹配成功,则将提取出的部分转换为中文
            for trans in match:
                print urllib.unquote(trans[0]),

然而这个脚本仍有一些缺点,对于列表文件中的某些中文字符仍然不能正常解码,比如下面这几行测试代码

复制代码 代码如下:

import urllib
a="http://zh.wikipedia.org/wiki/%BD%F0%B6"
b="http://zh.wikipedia.org/wiki/%E9%97%A8"
de=urllib.unquote
print de(a),de(b)

输出结果就是前者可以正确解码,而后者不可以,个人觉得原因可能和big5编码有关,如果谁知道什么解决办法,还请告诉我一下~

以下是补充:

de(a).decode(“gbk”,”ignore”)
de(b).decode(“utf8″,”ignore”)

這樣你可以得到這些字串的unicode編碼。

你用的unquote不是decoder, 你需要作必要的decode和encode。我一直用utf8作我默認環境的,我覺得你大概用的gbk吧,所以後者的解碼你那邊失敗了。猜編碼是很累的事情,如果大家都用utf8倒也好,但是有些人習慣了gb。

http://yac163.svn.sourceforge.net/viewvc/yac163/trunk/yac163-nox/Pic.py?revision=198&view=markup

參考我這個很古老code裡面的#102-147行 給每個decode和encode調用加上(…,”ignore”)。

复制代码 代码如下:

def strdecode( string,charset=None ):
     if isinstance(string,unicode):
         return string
     if charset:
         try:
             return string.decode(charset)
         except UnicodeDecodeError:
             return _strdecode(string)
     else:
         return _strdecode(string)

 def _strdecode(string):
     try:

         return string.decode('utf8')
     except UnicodeDecodeError:
         try:
             return string.decode('gb2312')
         except UnicodeDecodeError:
             try:

                 return string.decode('gbk')
             except UnicodeDecodeError:
                 return string.decode('gb18030')

 def strencode( string,charset=None ):
     if isinstance(string,str):
         return string
     if charset:
         try:
             return string.encode(charset)
         except UnicodeEncodeError:
             return _strencode(string)
     else:
         return _strencode(string)
 def _strencode(string):

     try:
         return string.encode('utf8')
     except UnicodeEncodeError:
         try:
             return string.encode('gb2312')
         except UnicodeEncodeError:
             try:
                 return string.encode('gbk')
             except UnicodeEncodeError:
                 return string.encode('gb18030')

Statement of this Website
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn

Hot AI Tools

Undresser.AI Undress

Undresser.AI Undress

AI-powered app for creating realistic nude photos

AI Clothes Remover

AI Clothes Remover

Online AI tool for removing clothes from photos.

Undress AI Tool

Undress AI Tool

Undress images for free

Clothoff.io

Clothoff.io

AI clothes remover

AI Hentai Generator

AI Hentai Generator

Generate AI Hentai for free.

Hot Article

R.E.P.O. Energy Crystals Explained and What They Do (Yellow Crystal)
4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Best Graphic Settings
4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. How to Fix Audio if You Can't Hear Anyone
4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌
R.E.P.O. Chat Commands and How to Use Them
4 weeks ago By 尊渡假赌尊渡假赌尊渡假赌

Hot Tools

Notepad++7.3.1

Notepad++7.3.1

Easy-to-use and free code editor

SublimeText3 Chinese version

SublimeText3 Chinese version

Chinese version, very easy to use

Zend Studio 13.0.1

Zend Studio 13.0.1

Powerful PHP integrated development environment

Dreamweaver CS6

Dreamweaver CS6

Visual web development tools

SublimeText3 Mac version

SublimeText3 Mac version

God-level code editing software (SublimeText3)

PHP function introduction—get_headers(): Get the response header information of the URL PHP function introduction—get_headers(): Get the response header information of the URL Jul 25, 2023 am 09:05 AM

PHP function introduction—get_headers(): Overview of obtaining the response header information of the URL: In PHP development, we often need to obtain the response header information of the web page or remote resource. The PHP function get_headers() can easily obtain the response header information of the target URL and return it in the form of an array. This article will introduce the usage of get_headers() function and provide some related code examples. Usage of get_headers() function: get_header

Why NameResolutionError(self.host, self, e) from e and how to solve it Why NameResolutionError(self.host, self, e) from e and how to solve it Mar 01, 2024 pm 01:20 PM

The reason for the error is NameResolutionError(self.host,self,e)frome, which is an exception type in the urllib3 library. The reason for this error is that DNS resolution failed, that is, the host name or IP address attempted to be resolved cannot be found. This may be caused by the entered URL address being incorrect or the DNS server being temporarily unavailable. How to solve this error There may be several ways to solve this error: Check whether the entered URL address is correct and make sure it is accessible Make sure the DNS server is available, you can try using the "ping" command on the command line to test whether the DNS server is available Try accessing the website using the IP address instead of the hostname if behind a proxy

How to get your Steam ID in a few steps? How to get your Steam ID in a few steps? May 08, 2023 pm 11:43 PM

Nowadays, many Windows users who love games have entered the Steam client and can search, download and play any good games. However, many users' profiles may have the exact same name, making it difficult to find a profile or even link a Steam profile to other third-party accounts or join Steam forums to share content. The profile is assigned a unique 17-digit id, which remains the same and cannot be changed by the user at any time, whereas the username or custom URL can. Regardless, some users don't know their Steamid, and it's important to know this. If you don't know how to find your account's Steamid, don't panic. In this article

What is the difference between html and url What is the difference between html and url Mar 06, 2024 pm 03:06 PM

Differences: 1. Different definitions, url is a uniform resource locator, and html is a hypertext markup language; 2. There can be many urls in an html, but only one html page can exist in a url; 3. html refers to is a web page, and url refers to the website address.

How to use URL encoding and decoding in Java How to use URL encoding and decoding in Java May 08, 2023 pm 05:46 PM

Use url to encode and decode the class java.net.URLDecoder.decode(url, decoding format) decoder.decoding method for encoding and decoding. Convert into an ordinary string, URLEncoder.decode(url, encoding format) turns the ordinary string into a string in the specified format packagecom.zixue.springbootmybatis.test;importjava.io.UnsupportedEncodingException;importjava.net.URLDecoder;importjava.net. URLEncoder

Scrapy optimization tips: How to reduce crawling of duplicate URLs and improve efficiency Scrapy optimization tips: How to reduce crawling of duplicate URLs and improve efficiency Jun 22, 2023 pm 01:57 PM

Scrapy is a powerful Python crawler framework that can be used to obtain large amounts of data from the Internet. However, when developing Scrapy, we often encounter the problem of crawling duplicate URLs, which wastes a lot of time and resources and affects efficiency. This article will introduce some Scrapy optimization techniques to reduce the crawling of duplicate URLs and improve the efficiency of Scrapy crawlers. 1. Use the start_urls and allowed_domains attributes in the Scrapy crawler to

How to add URL prefix to SpringBoot multiple controllers How to add URL prefix to SpringBoot multiple controllers May 12, 2023 pm 06:37 PM

Preface In some cases, the prefixes in the service controller are consistent. For example, the prefix of all URLs is /context-path/api/v1, and a unified prefix needs to be added to some URLs. The conceivable solution is to modify the context-path of the service and add api/v1 to the context-path. Modifying the global prefix can solve the above problem, but there are disadvantages. If the URL has multiple prefixes, for example, some URLs require prefixes. If it is api/v2, it cannot be distinguished. If you do not want to add api/v1 to some static resources in the service, it cannot be distinguished. The following uses custom annotations to uniformly add certain URL prefixes. one,

what does url mean what does url mean Aug 04, 2023 am 11:43 AM

URL is the abbreviation of "Uniform Resource Locator", which means "Uniform Resource Locator" in Chinese. A URL is an address used to locate and access specific resources through the Internet. It is commonly seen in web browsing and HTTP requests. The main function of URL is to locate and access resources on the Internet. These resources can be web pages, pictures, videos, documents or other files.

See all articles