
How to detect rare characters with Python

Mar 11, 2017 am 10:53 AM

I recently ran into a requirement at work: detect whether a field contains rare characters or illegal characters such as ~!@#$%^&*. I worked out a solution after some research, and this article shares the process and the sample code for anyone who needs them.

Solution idea

My first thought was to match illegal characters with Python regular expressions and flag the offending records. The plan sounded fine, but reality was harsher: during implementation I found that I knew too little about character encodings and Python's internal string representation. I hit many pitfalls along the way, and although a few points are still unclear to me, I ended up with a reasonably clear overall picture. I am recording the experience here to avoid falling into the same traps again.

The test environment is the Python 2.7.8 that ships with ArcGIS 10.3. There is no guarantee that the code behaves the same way in other Python environments.

Python regular expression

Regular expression support in Python is provided by the standard-library re module, and three functions do most of the work. re.compile() builds a reusable pattern object, and the pattern's match() and search() methods return the match result. The difference between the two is that match() only matches at the specified start position, whereas search() scans forward from that position until it finds a match. In the following code, match_result tries to match at the first character, f, fails, and is therefore None; search_result scans forward from f until it reaches the first matching character, a, and group() then returns that character.

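import re

# Compile once so the pattern can be reused.
pattern = re.compile('[abc]')

# match() only tries the start of the string; 'f' is not in [abc], so this is None.
match_result = pattern.match('fabc')
if match_result:
    print(match_result.group())

# search() scans forward and stops at the first match, the character 'a'.
search_result = pattern.search('fabc')
if search_result:
    print(search_result.group())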

The code above compiles a pattern first and then matches with it. The module-level function re.match(pattern, string) does the same thing in one call, but it is less flexible. First, the compiled pattern is not kept in a variable for reuse: when a large amount of data is matched against the same pattern, every call has to obtain the compiled pattern again. The re module caches recently used patterns, so this is usually a lookup rather than a full recompilation, but it is still extra work on each call. Second, re.match() is less capable than pattern.match(): the latter accepts a position from which to start matching.
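As a small sketch of the start-position argument described above (the sample string is reused from the earlier example and the variable names are only illustrative), the snippet below compares the module-level call with a compiled pattern that starts matching at index 1. It is written to run on Python 2.7 or Python 3.

```python
import re

pattern = re.compile('[abc]')
text = 'fabc'

# Module-level call: always starts at index 0, where 'f' does not match.
direct = re.match('[abc]', text)

# Compiled pattern: the optional second argument is the start position.
from_start = pattern.match(text)    # tries index 0, no match
from_one = pattern.match(text, 1)   # tries index 1, matches 'a'

print(direct)
print(from_start)
print(from_one.group())
```

The first two results are None, while the third match object holds the character a.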

Encoding issues

With the basics of Python regular expressions covered, what remains is finding a suitable pattern for rare characters and illegal characters. Illegal characters are simple. The following pattern matches them (note that the character class also contains a space):

pattern = re.compile(r'[~!@#$%^&* ]')
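To see which characters such a pattern catches, a minimal sketch could collect them with findall(). The helper name find_illegal_chars and the sample field values are hypothetical and not part of the original solution.

```python
import re

# Same character class as in the article, compiled once for reuse.
ILLEGAL = re.compile(u'[~!@#$%^&* ]')

def find_illegal_chars(text):
    """Return every illegal character found in text, in order of appearance."""
    return ILLEGAL.findall(text)

print(find_illegal_chars(u'road#12 *'))   # contains '#', a space and '*'
print(find_illegal_chars(u'road12'))      # contains nothing illegal
```

Returning the offending characters rather than a plain boolean makes it easier to report why a record was rejected.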

Matching rare characters, however, stumped me. The first problem is the definition: what counts as a rare character? After consulting the project manager, we settled on treating every character outside GB2312 as rare. The next question is how to match GB2312 characters.

According to what I found, a GB2312 character is encoded as two bytes in the range [\xA1-\xF7][\xA1-\xFE], and the Chinese character area is [\xB0-\xF7][\xA1-\xFE]. With rare character matching added, the expression becomes:

pattern = re.compile(r'[~!@#$%^&* ]|[^\xA1-\xF7][^\xA1-\xFE]')
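The byte ranges quoted above can be checked against Python's own gb2312 codec. The following is only a sketch: the helper names gb2312_bytes and in_hanzi_area are hypothetical, and the sample character U+554A is the first character of the GB2312 Chinese character area. It is written to run on Python 2.7 or Python 3.

```python
def gb2312_bytes(ch):
    """Encode one character to GB2312 and return its bytes as integers."""
    return list(bytearray(ch.encode('gb2312')))

def in_hanzi_area(ch):
    """True if ch encodes to a two-byte code inside the Chinese character area."""
    data = gb2312_bytes(ch)
    return (len(data) == 2
            and 0xB0 <= data[0] <= 0xF7
            and 0xA1 <= data[1] <= 0xFE)

first_hanzi = u'\u554a'   # first character of the Chinese character area
print([hex(b) for b in gb2312_bytes(first_hanzi)])
print(in_hanzi_area(first_hanzi))
print(in_hanzi_area(u'A'))   # ASCII stays a single byte, so this is False
```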

Logically the problem seemed solved, but I was still too naive. The strings to be checked are all read from layer files, and arcpy helpfully decodes what it reads into unicode objects. I would therefore need the range that the GB2312 character set occupies in Unicode. In reality, GB2312 characters are not laid out contiguously in Unicode, so a regular expression describing that range would be very complicated. Matching rare characters with a regular expression looked like a dead end.
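To illustrate the point about non-contiguity, here is a minimal sketch that decodes five consecutive GB2312 codes, starting at 0xB0A1, and looks at the Unicode code points they map to. The helper name gb2312_to_code_points is hypothetical, and the code is written to run on Python 2.7 or Python 3.

```python
def gb2312_to_code_points(lead, first_trail, count):
    """Decode `count` consecutive GB2312 codes and return their Unicode code points."""
    points = []
    for trail in range(first_trail, first_trail + count):
        ch = bytearray([lead, trail]).decode('gb2312')
        points.append(ord(ch))
    return points

points = gb2312_to_code_points(0xB0, 0xA1, 5)
print([hex(p) for p in points])
# Adjacent GB2312 codes do not map to ascending Unicode code points.
print(points == sorted(points))
```

Because neighbouring GB2312 codes land far apart in Unicode, a character class covering them would have to enumerate a large number of separate ranges.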

Solution

Since the input is a unicode string, could I convert it to GB2312 first and then match? Not always: the Unicode character set is much larger than GB2312, so GB2312 => Unicode always succeeds, whereas Unicode => GB2312 may fail.

That gave me another idea. If converting a string from Unicode to GB2312 fails, doesn't that mean it contains a character outside the GB2312 character set? So I call unicode_string.encode('GB2312') to attempt the conversion and catch the UnicodeEncodeError exception to identify rare characters.

The final code is as follows:

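import re

# Compile once at module level so repeated calls reuse the pattern.
ILLEGAL_CHARS = re.compile(u"[~!@#$%^&* ]")

def is_rare_name(string):
    """Return True if the unicode string contains an illegal character
    or a character that cannot be encoded as GB2312."""
    if ILLEGAL_CHARS.search(string):
        return True

    try:
        string.encode("gb2312")
    except UnicodeEncodeError:
        # At least one character lies outside the GB2312 character set.
        return True

    return False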
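The same encode-and-catch idea can be applied one character at a time when it is useful to know which characters are rare, not just whether any exist. This is only a sketch under assumptions: the helper name find_rare_chars and the sample strings are hypothetical, with U+9F98 chosen as a character that GB2312 does not contain. It is written to run on Python 2.7 or Python 3.

```python
def find_rare_chars(text, encoding='gb2312'):
    """Return the characters of text that cannot be encoded in `encoding`."""
    rare = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            rare.append(ch)
    return rare

common = u'\u5f20\u4e09'   # two characters that GB2312 contains
mixed = u'\u5f20\u9f98'    # the second character (U+9F98) is outside GB2312
print(len(find_rare_chars(common)))
print(len(find_rare_chars(mixed)))
```

Encoding character by character is slower than encoding the whole string once, so a reasonable design is to run the whole-string check first and fall back to this helper only for records that fail.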
