How can I analyze and process Chinese text with NLTK in Python?
I used nltk to analyze my own diary and got the following results (excerpt):
'\xb8\xb0', '\xe5\xbc\xba\xe8\xba', '\xe5\xbd\xbc\xe5', '\xb8\xb4', '\xb8\x8a', '\xb8\x8b', '\xb8\x88', '\xb8\x89', '\xb8\x8e', '\xb8\x8f', '\xb8\x8d', '\xb8\x82', '\xb8\x83', '\xb8\x80', '\xb8\x81', '\xb8\x87', 'tend', '\xb8\x9a',
What methods and tools would you recommend for natural language analysis of Chinese?
Replies:
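I've recently been using nltk to classify Chinese online product reviews as positive or negative, and to compute the reviews' entropy, pointwise mutual information (PMI), and perplexity (I admit I don't understand these concepts deeply yet; nltk simply provides methods for them).
In my experience nltk is perfectly usable for Chinese. The key points are word segmentation and how the text is represented.
The main difference between Chinese and English is that Chinese needs word segmentation. nltk generally works at the word level, so the text must be segmented first and then handed to nltk (you don't need nltk for the segmentation itself; a dedicated segmentation package is enough. I strongly recommend jieba, which is very easy to use).
After segmentation, the text is one long list of words: [word1, word2, word3, ..., wordn]. From there you can apply nltk's methods to it, for example FreqDist to count word frequencies, or bigrams to turn the text into word pairs: [(word1, word2), (word2, word3), (word3, word4), ..., (wordn-1, wordn)].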
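A minimal sketch of this step, assuming the text has already been segmented. The token list below is made up for illustration; in practice it would come from a segmenter such as the jieba package recommended above (for example `jieba.lcut(text)`).

```python
import nltk

# Pre-segmented tokens (illustrative data, not from the original post).
# With jieba installed, a list like this could be produced by jieba.lcut(text).
tokens = ["这", "本", "书", "很", "好", "这", "本", "书", "不", "贵"]

# FreqDist counts how often each token occurs.
freq = nltk.FreqDist(tokens)

# nltk.bigrams yields adjacent token pairs; wrap in list() to materialize them.
pairs = list(nltk.bigrams(tokens))

print(freq.most_common(3))
print(pairs[:3])
```

Neither call needs any nltk corpus download, because both operate on a plain Python list.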
These counts can then be used to compute the entropy, mutual information, and similar measures for the words in the text.
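nltk ships its own implementations of these measures, but the definitions are short enough to sketch by hand. The example below uses only the standard library and simple maximum-likelihood estimates; the token list and the function names `entropy` and `pmi` are illustrative assumptions, not part of nltk.

```python
import math
from collections import Counter

# Illustrative pre-segmented review text.
tokens = ["质量", "很", "好", "物流", "很", "快", "质量", "很", "好"]

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
n_tokens = len(tokens)
n_bigrams = len(tokens) - 1


def entropy(counts):
    """Shannon entropy (in bits) of the distribution given by raw counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def pmi(w1, w2):
    """Pointwise mutual information log2(p(w1, w2) / (p(w1) * p(w2))).

    Raises ValueError if the bigram never occurs, since log2(0) is undefined.
    """
    p_xy = bigram_counts[(w1, w2)] / n_bigrams
    p_x = unigram_counts[w1] / n_tokens
    p_y = unigram_counts[w2] / n_tokens
    return math.log2(p_xy / (p_x * p_y))


print(entropy(unigram_counts))
print(pmi("很", "好"))
```

A positive PMI means the two words occur together more often than their individual frequencies would suggest, which is why it is often used to find words associated with sentiment terms.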
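After that, they can be used to select features for machine learning, build a classifier, and classify texts (a set of product reviews is a nested list made up of independent reviews, each of which is itself a list of words; many sentiment classification examples online use the product review corpus that ships with nltk, which is in English, but the overall approach carries over).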
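As a rough sketch of that last step, the example below trains nltk's Naive Bayes classifier on a handful of made-up, already segmented reviews. The training data and the `word_features` helper are hypothetical, and the simple bag-of-words features stand in for whatever feature selection (for example by PMI) one would do on a real corpus.

```python
import nltk


def word_features(tokens):
    """Bag-of-words features: every token present in the review maps to True."""
    return {tok: True for tok in tokens}


# Tiny illustrative training set of (segmented review, label) pairs.
train_reviews = [
    (["质量", "很", "好"], "pos"),
    (["非常", "喜欢", "好"], "pos"),
    (["质量", "很", "差"], "neg"),
    (["非常", "失望", "差"], "neg"),
]

train_set = [(word_features(toks), label) for toks, label in train_reviews]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Classify a new, already segmented review.
prediction = classifier.classify(word_features(["东西", "好"]))
print(prediction)
```

Words the classifier has never seen (here "东西") are ignored at prediction time, so the decision rests on the known words.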
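There is also the Python Chinese-encoding problem, which trips up many people. After many failures I've distilled some rules of thumb.
In Python, Chinese encoding problems can generally be handled with the following logic:
utf8 (input) --> unicode (processing) --> utf8 (output)
Text should be handled as unicode inside Python, so the fix is to decode the input text (whatever its encoding) to unicode, and then encode it to the required encoding on output.
Since the input is usually a txt file, the simplest approach is to save the txt file as utf-8, decode it to unicode when processing it in Python (sometexts.decode('utf8')), and encode it back to utf8 when writing the results to a txt file (sometexts.encode('utf8'); note that in Python 2, str() on a unicode object uses the ASCII codec by default and fails on Chinese text).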
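The same decode, process, encode pipeline can be sketched in Python 3, where `str` is already Unicode and `bytes` is a separate type (the answer above uses Python 2 idioms). The sample string and the `read_text` helper are illustrative assumptions.

```python
# Stand-in for the bytes read from a UTF-8 encoded txt file.
raw_bytes = "强大的中文分词".encode("utf-8")

# 1. Decode the input bytes to text as early as possible.
text = raw_bytes.decode("utf-8")

# 2. Process at the character or word level, never at the byte level.
#    Working on undecoded bytes is what produces fragments like '\xb8\xb0'.
chars = list(text)

# 3. Encode back to bytes only when writing the output.
out_bytes = "".join(chars).encode("utf-8")


def read_text(path, encoding="utf-8"):
    """Hypothetical helper: let open() decode the file while reading."""
    with open(path, encoding=encoding) as f:
        return f.read()


print(len(raw_bytes), len(chars))
```

Each of these Chinese characters occupies three bytes in UTF-8, so the byte length is three times the character count. For a file saved in GBK, passing `encoding="gbk"` to `open()` plays the role of `text.decode('gbk')`.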
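The original poster's issue is just an encoding problem...
There are many good packages for processing Chinese:
Jieba: word segmentation, part-of-speech tagging, TextRank
HanLP: word segmentation, named entity recognition, dependency parsing; there are also FudanNLP and NLPIR
Personally, I find all of them easier to use than NLTK.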
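For Chinese word segmentation, jieba is all you need. I wrote a small example: "nltk-比较中文文档相似度" (comparing the similarity of Chinese documents with nltk).
What you describe has nothing to do with NLTK. Switch to Python 3 and these gremlins go away! Chinese text still has to be UTF-8, though!
I love NLTK! As for other packages, apart from the ones built for a fixed task, I'd skip the Java ones.
Usage: text.decode('gbk')
Segmentation: find a suitable Chinese segmentation package, e.g. https://github.com/fxsjy/jieba
It's probably because nltk cannot segment Chinese. I've been studying this area recently too; I'd recommend one tool, 中文處理工具 (a Chinese text processing tool), which is worth looking into.
I ran into the same problem. I'm reading the book "Natural Language Processing with Python" (《Python自然语言处理》); after successfully loading my own document, the Chinese in it looked just like what you show. It is probably an encoding setting, but I don't know where to set it. There is too little material on this.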
