Python text statistics: counting character usage in Journey to the West

不言

Published: 2018-05-07 13:53:27

This article shows how to gather text statistics in Python by counting how often each Chinese character is used in Journey to the West. Working through one example script, it covers reading a text file, iterating over its lines and characters, tallying the counts, and writing out the sorted result. The details are as follows.

1. Data

xyj.txt, the text of "Journey to the West", 2.2 MB

With respect to the master Wu Chengen; 4020 lines (paragraphs)
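
The size and line count quoted above can be checked on your own copy of the file. The following is a minimal sketch, assuming Python 3 and a UTF-8 encoded xyj.txt; the helper name describe_file is hypothetical and not part of the original script, and the figures it reports depend on the edition of the text you have.

```python
import os

def describe_file(path, encoding="utf8"):
    """Return (size in bytes, number of non-empty lines) for a text file."""
    size = os.path.getsize(path)
    with open(path, "r", encoding=encoding) as f:
        # A line counts as non-empty if anything remains after stripping whitespace
        non_empty = sum(1 for line in f if line.strip())
    return size, non_empty

# Only run the check when the data file is actually present
if os.path.exists("xyj.txt"):
    print(describe_file("xyj.txt"))
```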

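2. Goal

Statistics to gather for "Journey to the West":

1. How many distinct Chinese characters appear in total, and which characters appear most often.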
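
Both questions can be tried on a short string before running anything on the full text. This is a small sketch, assuming Python 3; it uses collections.Counter from the standard library rather than the plain dict of the original script, and the sample string and the helper name count_chars are made up for illustration.

```python
from collections import Counter

def count_chars(text, skip=" \t\n。，"):
    """Return a Counter of the characters in text, ignoring those in skip."""
    return Counter(ch for ch in text if ch not in skip)

sample = "行者笑道，行者道，行"  # made-up sample, not taken from xyj.txt
counts = count_chars(sample)

print(len(counts))            # number of distinct characters: 4
print(counts.most_common(1))  # most frequent character: [('行', 3)]
```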

3. Related topics

1. Reading files

3. Sorting
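
The sorting step relies on sorted() with a key function applied to the dict's (key, value) pairs. Here is a minimal sketch, assuming Python 3.7 or later (where dicts keep insertion order); the counts are invented for illustration. The second variant, which breaks ties by key, is an optional refinement and not something the original script does.

```python
stat = {'c': 2, 'b': 5, 'a': 2}  # made-up counts for illustration

# Sort (key, value) pairs by value, largest first.
# sorted() is stable, so pairs with equal counts keep their original order.
by_count = sorted(stat.items(), key=lambda kv: kv[1], reverse=True)

# Same ordering by count, but ties are broken alphabetically by key
by_count_then_key = sorted(stat.items(), key=lambda kv: (-kv[1], kv[0]))

print(by_count)           # [('b', 5), ('c', 2), ('a', 2)]
print(by_count_then_key)  # [('b', 5), ('a', 2), ('c', 2)]
```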

5. Source code

# Runs on Python 3, where str is already Unicode
# Punctuation and whitespace that should not be counted
skip = [' ', '\t', '\n', '。', '，', '(', ')', '（', '）', '：', '□', '？', '！', '《', '》', '、', '；', '“', '”', '…']
fr = open('xyj.txt', 'r', encoding='utf8')
characters = []
stat = {}
for line in fr:
  # Strip whitespace from both ends of the line
  line = line.strip()
  # Skip this iteration if the line is empty
  if len(line) == 0:
    continue
  # Walk through every character in the line
  for ch in line:
    # Drop punctuation and whitespace
    if ch in skip:
      continue
    # Not yet recorded in characters
    if ch not in characters:
      characters.append(ch)
    # Not yet recorded in stat
    if ch not in stat:
      stat[ch] = 0
    # Increment the character's count
    stat[ch] += 1
fr.close()
print(len(characters))
print(len(stat))
# The lambda creates a temporary function
# d is one (key, value) pair from the dict: d[0] is the key, d[1] the value
# reverse=True sorts in descending order
stat = sorted(stat.items(), key=lambda d: d[1], reverse=True)
fw = open('result.csv', 'w', encoding='utf8')
for item in stat:
  # The int must be converted to str before string concatenation
  fw.write(item[0] + ',' + str(item[1]) + '\n')
fw.close()
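
The script above decides what to count by excluding a hand-written list of punctuation marks, so any symbol missing from that list is counted as a character. One possible alternative, sketched below under the assumption of Python 3, is to keep only characters in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF). This is a swapped-in technique, not the original author's method, and it ignores the rarer characters in the CJK extension blocks. The names is_han, count_han and write_counts are hypothetical helpers, and the sample lines are made up.

```python
import csv
from collections import Counter

def is_han(ch):
    """True if ch lies in the basic CJK Unified Ideographs block."""
    return '\u4e00' <= ch <= '\u9fff'

def count_han(lines):
    """Count Han characters across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(ch for ch in line if is_han(ch))
    return counts

def write_counts(counts, path):
    """Write character,count rows to path, most frequent first."""
    # newline='' lets the csv module control line endings itself
    with open(path, 'w', encoding='utf8', newline='') as f:
        csv.writer(f).writerows(counts.most_common())

# Made-up sample lines; punctuation and quotes fall outside the Han range
sample_lines = ["行者笑道：“师父！”", "行者道。"]
counts = count_han(sample_lines)
print(len(counts))  # number of distinct Han characters in the sample
```

To run this on the full text, an open file object can be passed straight to count_han, since iterating over a file yields its lines, and the result can then be handed to write_counts together with an output path such as result.csv.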
