Python text statistics: counting character frequencies in Journey to the West


不言

Released: 2018-05-07 13:53:27


This article shows how to use Python to count the characters in the text of Journey to the West. Through a worked example, it covers reading a text file, iterating over its characters, counting them, sorting the counts, and writing the result to a file. The details are as follows:

1. Data

xyj.txt, the text of "Journey to the West", 2.2 MB

A tribute to the master Wu Cheng'en. The file contains 4020 lines (paragraphs).

2. Goal

Compute the following statistics for "Journey to the West":

1. How many different Chinese characters appear in total;
2. How many times each Chinese character appears;
3. Which Chinese characters appear most frequently.
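
The three goals map directly onto a frequency table. The following is a minimal sketch, assuming Python 3 and using `collections.Counter` from the standard library rather than a hand-built dictionary. The sample string is made up for illustration and stands in for the contents of xyj.txt.

```python
from collections import Counter

# Made-up sample text standing in for xyj.txt (an assumption for illustration)
sample = "山中有山山外有水"

# Counter maps each character to the number of times it occurs
counts = Counter(sample)

distinct = len(counts)           # goal 1: number of different characters
times_shan = counts["山"]         # goal 2: occurrences of one given character
top_two = counts.most_common(2)  # goal 3: the most frequent characters

print(distinct)
print(times_shan)
print(top_two)
```

With real text, punctuation and whitespace would still need to be filtered out before counting.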

3. Techniques involved:

1. Reading files;
2. Using dictionaries;
3. Sorting a dictionary;
4. Writing files.
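
Of these steps, sorting a dictionary is the least obvious, because a dict has no sort method of its own. The sketch below, assuming Python 3 and using hypothetical counts rather than real results, shows that `sorted()` applied to `stat.items()` returns a list of (character, count) tuples, and how a tuple key can make the order of tied counts deterministic.

```python
# Hypothetical counts, not taken from the real text
stat = {"道": 3, "行": 5, "者": 5, "了": 1}

# sorted() returns a list of (character, count) tuples, not a dict.
# reverse=True gives descending counts; tied entries keep their original order.
by_count = sorted(stat.items(), key=lambda kv: kv[1], reverse=True)

# Negating the count sorts descending, and ties fall back to
# ascending code point order of the character.
by_count_then_char = sorted(stat.items(), key=lambda kv: (-kv[1], kv[0]))

print(by_count)
print(by_count_then_char)
```

The second form is useful when the output file should be identical from run to run regardless of the order in which characters were first seen.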

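4. Result

The script prints the number of distinct characters and writes each character with its count to result.csv, one "character,count" pair per line, sorted by descending count.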

5. Source code

# coding: utf-8
# Python 3 version; xyj.txt is assumed to be UTF-8 encoded.

# Punctuation and whitespace to exclude from the statistics
skip = [' ', '\t', '\n', '。', '，', '(', ')', '（', '）', '：', '□', '？', '！', '《', '》', '、', '；', '“', '”', '…']

characters = []
stat = {}
with open('xyj.txt', 'r', encoding='utf-8') as fr:
    for line in fr:
        # Strip whitespace from both ends of the line
        line = line.strip()
        # Skip empty lines
        if len(line) == 0:
            continue
        # Visit every character in the line
        for ch in line:
            # Ignore punctuation and whitespace
            if ch in skip:
                continue
            # Not yet recorded in characters
            if ch not in characters:
                characters.append(ch)
            # Not yet recorded in stat
            if ch not in stat:
                stat[ch] = 0
            # Increment the count for this character
            stat[ch] += 1

# Both values are the number of distinct characters
print(len(characters))
print(len(stat))

# The lambda is a small anonymous function:
# d is one (key, value) pair, d[0] is the key and d[1] is the value.
# reverse=True sorts in descending order.
stat = sorted(stat.items(), key=lambda d: d[1], reverse=True)

with open('result.csv', 'w', encoding='utf-8') as fw:
    for item in stat:
        # The int count must be converted to str before concatenation
        fw.write(item[0] + ',' + str(item[1]) + '\n')
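
The source code above excludes characters with a fixed list of punctuation marks, so any symbol missing from that list (Latin letters, digits, other punctuation) is still counted. One alternative, sketched here and not part of the original script, is to keep only characters inside a Unicode range. The sketch assumes Python 3 and that the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is sufficient; rare characters in the extension blocks would be missed. The names `is_han` and `count_han` and the sample lines are hypothetical.

```python
from collections import Counter


def is_han(ch):
    """True for characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    return "\u4e00" <= ch <= "\u9fff"


def count_han(lines):
    """Count only Han characters across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(ch for ch in line if is_han(ch))
    return counts


# Made-up sample lines; with the real file, pass the open file object instead
sample_lines = ["山中有山，山外有水。\n", "  \n", "abc 123 ……\n"]
counts = count_han(sample_lines)

print(len(counts))
print(counts.most_common(1))
```

For the real text, something like `count_han(open('xyj.txt', encoding='utf-8'))` would replace the sample list, again assuming the file is UTF-8 encoded.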
