Method to implement Chinese word frequency statistics in Java (with code)

Sep 19, 2018 pm 01:56 PM
java

This article describes how to implement Chinese word frequency statistics in Java (with code). It may be a useful reference for anyone who needs to do the same.

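Yesterday I needed to compute Chinese word frequencies. After searching Baidu for a while, I found piles of clickbait articles whose content had little to do with their titles, so here is a brief record of how I implemented it myself.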

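Unlike word frequency statistics for English, the hard part for Chinese is word segmentation. Fortunately, there are many excellent ready-made libraries available; here I use the ansj_seg library.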
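To make the segmentation problem concrete, here is a minimal sketch using only the standard library. The class and method names (`WhitespaceSplitDemo`, `splitOnWhitespace`) are hypothetical and not part of the original article. It shows that splitting on whitespace yields usable tokens for English but returns a Chinese sentence as one unsplit token, which is why a segmenter is needed.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class: compares whitespace tokenization on English and Chinese text.
public class WhitespaceSplitDemo {

    // Splits on runs of whitespace, a common first step for English word counts.
    static List<String> splitOnWhitespace(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // English words are separated by spaces, so this yields four tokens.
        System.out.println(splitOnWhitespace("the quick brown fox"));
        // Chinese text has no spaces between words, so the whole sentence stays one token.
        System.out.println(splitOnWhitespace("欢迎使用中文分词"));
    }
}
```

The first call yields four tokens and the second yields a single token, so a frequency count built on this split would treat the entire Chinese sentence as one "word".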

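First, add the dependency:

Download the jar

Visit http://maven.nlpcn.org/org/ansj/ and download ansj_seg/, preferably the latest version.

Also download nlp-lang.jar, which must match your ansj_seg version. The matching versions can be found in the Maven dependencies inside the jar; in general, pairing the latest ansj_seg with the latest nlp-lang works fine.

Import both jars into Eclipse and start writing your program.

Maven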

<dependency>
    <groupId>org.ansj</groupId>
    <artifactId>ansj_seg</artifactId>
    <version>5.1.1</version>
</dependency>

Basic usage:

// Requires: import org.ansj.splitWord.analysis.ToAnalysis;
String str = "欢迎使用ansj_seg,(ansj中文分词)在这里如果你遇到什么问题都可以联系我.我一定尽我所能.帮助大家.ansj_seg更快,更准,更自由!";
System.out.println(ToAnalysis.parse(str));

Output:

欢迎/v,使用/v,ansj/en,_,seg/en,,,(,ansj/en,中文/nz,分词/n,),在/p,这里/r,如果/c,你/r,遇到/v,什么/r,问题/n,都/d,可以/v,联系/v,我/r,./m,我/r,一定/d,尽我所能/l,./m,帮助/v,大家/r,./m,ansj/en,_,seg/en,更快/d,,,更/d,准/a,,,更/d,自由/a,!
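In the output above, each token is printed as `word/tag`, where the part after the slash is the part-of-speech ("nature") tag that ansj assigns, for example `v` for verbs and `n` for nouns. As a rough illustration of the format only, the sketch below pulls words with a given tag out of such a string using just the standard library. The class and method names are hypothetical, and the naive split on `,` would mishandle tokens that are themselves commas, so this is not a substitute for the library's own API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class: reads the "word/tag" text format shown in the output above.
public class TaggedTokenDemo {

    // Splits one "word/tag" token into {word, tag}; the tag is "" when no slash is present.
    static String[] splitToken(String token) {
        int slash = token.lastIndexOf('/');
        if (slash < 0) {
            return new String[] {token, ""};
        }
        return new String[] {token.substring(0, slash), token.substring(slash + 1)};
    }

    // Returns, in order, the words whose tag equals the requested tag.
    static List<String> wordsWithTag(String tagged, String tag) {
        List<String> words = new ArrayList<>();
        for (String token : tagged.split(",")) {
            String[] parts = splitToken(token);
            if (parts[1].equals(tag)) {
                words.add(parts[0]);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // A shortened sample in the same format as the output above.
        String tagged = "欢迎/v,使用/v,ansj/en,_,seg/en,中文/nz,分词/n";
        System.out.println(wordsWithTag(tagged, "v"));   // prints [欢迎, 使用]
        System.out.println(wordsWithTag(tagged, "en"));  // prints [ansj, seg]
    }
}
```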

Here is the code:

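import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.ansj.splitWord.analysis.ToAnalysis;

// The class name is illustrative; any enclosing class will do.
public class WordFrequency {

    public static void wordFrequency() throws IOException {
        Map<String, Integer> map = new HashMap<>();

        String article = getString();
        // Segment the text and drop the part-of-speech tags; tokens come back comma-separated
        String result = ToAnalysis.parse(article).toStringWithOutNature();
        String[] words = result.split(",");

        for (String word : words) {
            String str = word.trim();
            // Skip blank tokens
            if (str.equals(""))
                continue;
            // Skip some high-frequency punctuation
            else if (str.matches("[)(.,。+\\-“”:?|\\s]"))
                continue;
            // Skip tokens of length 1
            else if (str.length() < 2)
                continue;

            // Count the trimmed token
            if (!map.containsKey(str)) {
                map.put(str, 1);
            } else {
                int n = map.get(str);
                map.put(str, ++n);
            }
        }

        // Print every word with its count (unordered)
        for (Map.Entry<String, Integer> entry : map.entrySet()) {
            System.out.println(entry.getKey() + ": " + entry.getValue());
        }

        // Repeatedly pull out the most frequent entry to build a list sorted by count.
        // Note that this empties the map.
        List<Map.Entry<String, Integer>> list = new ArrayList<>();
        Map.Entry<String, Integer> entry;

        while ((entry = getMax(map)) != null) {
            list.add(entry);
        }

        System.out.println(list);
    }

    /**
     * Finds the entry with the largest value in the map, returns it,
     * and removes it from the map.
     *
     * @param map word counts; modified by this call
     * @return the entry with the largest value, or null if the map is empty
     */
    public static Map.Entry<String, Integer> getMax(Map<String, Integer> map) {
        if (map.size() == 0) {
            return null;
        }
        Map.Entry<String, Integer> maxEntry = null;
        for (Map.Entry<String, Integer> entry : map.entrySet()) {
            if (maxEntry == null || entry.getValue() > maxEntry.getValue()) {
                maxEntry = entry;
            }
        }
        // Copy the entry before removing it so the result does not depend on the map's internals
        Map.Entry<String, Integer> result = new AbstractMap.SimpleEntry<>(maxEntry);
        map.remove(result.getKey());
        return result;
    }

    /**
     * Reads the article text to be segmented from a file.
     * The file content comes from a popular Jianshu article: https://www.jianshu.com/p/5b37403f6ba6
     *
     * @return the full text of the file
     * @throws IOException if the file cannot be read
     */
    public static String getString() throws IOException {
        StringBuilder strBuilder = new StringBuilder();

        // Assumes the file is UTF-8 encoded; change the charset if it is not.
        // try-with-resources closes the reader and the underlying stream.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream("/home/as_/IdeaProjects/SpringMaven/article-txt"),
                StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Keep a line break so words at line boundaries are not glued together
                strBuilder.append(line).append('\n');
            }
        }
        return strBuilder.toString();
    }
}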
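The listing above builds its sorted list by scanning the whole map once per word and removing the maximum each time, which leaves the map empty afterwards. As an alternative sketch (not the article's method), the counting and ordering steps can be written with `Map.merge` and a single sort, keeping the map intact. The class and method names below are hypothetical; the token filter is simplified to the length check, which already covers blank and single-character tokens.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class: counts tokens and orders them by frequency without emptying the map.
public class FrequencySortDemo {

    // Counts tokens, skipping blank and single-character tokens.
    static Map<String, Integer> count(String[] tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            String word = token.trim();
            if (word.length() < 2) {
                continue;
            }
            // merge inserts 1 for a new word, otherwise adds 1 to the existing count
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    // Returns the entries ordered by count, highest first; ties are ordered by key.
    static List<Map.Entry<String, Integer>> sortByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        // Tokens as they might look after splitting the segmenter's output on ","
        String[] tokens = {"中文", "分词", "中文", "的", " ", "统计", "中文", "分词"};
        Map<String, Integer> counts = count(tokens);
        System.out.println(sortByCount(counts)); // prints [中文=3, 分词=2, 统计=1]
    }
}
```

Sorting once costs O(n log n), whereas the repeated maximum scan is O(n²) in the number of distinct words; for a single article either approach is fast enough, so the main practical difference is that the map survives for later use.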

Finally, as usual, a screenshot of the result:

[Screenshot of the program output]