
Detailed explanation of how to use Node.js to segment text content and extract keywords

黄舟
Release: 2017-05-28 10:36:08

This article introduces how to use Node.js to segment text content and extract keywords. Readers who need this may find it a useful reference.

Before discussing the technology, a bit of fun first: you just don't understand the world of foodies~~

Articles on Zhongcheng Translation carry tags. Users can quickly filter articles of interest by tag, and related articles can be recommended based on tag associations. At present, however, tags are set when an article is recommended, they are all in English, and manually set tags are inevitably inconsistent and incomplete. Although tags can be edited by hand after publishing, we cannot expect users or administrators to keep them appropriate all the time, so we need a tool that generates tags automatically.

Among current open-source word segmentation tools, jieba is a powerful component with excellent performance. Fortunately, it has a Node.js version, nodejieba.

Installing and using nodejieba is very simple:

npm install nodejieba
var nodejieba = require("nodejieba");
var result = nodejieba.cut("帝国主义要把我们的地瓜分掉");
console.log(result);
//[ '帝国主义', '要', '把', '我们', '的', '地', '瓜分', '掉' ]
result = nodejieba.cut('土地,俺老孙的金箍棒在哪里?');
console.log(result);
//[ '土地', ',', '俺', '老', '孙', '的', '金箍棒', '在', '哪里', '?' ]
result = nodejieba.cut('大圣,您的金箍棒就棒在特别配您的头型!');
console.log(result); 
//[ '大圣',',','您','的','金箍棒','就','棒','在','特别','配','您','的','头型','!' ]
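
When the segments are meant to feed a tagging step, punctuation tokens such as ',' and '?' are usually noise. The following is a minimal sketch of dropping them. `dropPunctuation` is a hypothetical helper, not part of nodejieba; it works on any array of strings, so the sample input is simply the second result shown above.

```javascript
// Hypothetical helper (not part of nodejieba): remove tokens that consist
// only of punctuation, symbols, or whitespace.
function dropPunctuation(tokens) {
  const onlyPunctuation = /^[\p{P}\p{S}\s]+$/u;
  return tokens.filter((token) => !onlyPunctuation.test(token));
}

// Result of nodejieba.cut('土地,俺老孙的金箍棒在哪里?') as shown above.
const tokens = ['土地', ',', '俺', '老', '孙', '的', '金箍棒', '在', '哪里', '?'];
const words = dropPunctuation(tokens);
console.log(words);
// [ '土地', '俺', '老', '孙', '的', '金箍棒', '在', '哪里' ]
```

The regular expression relies on Unicode property escapes, so both ASCII and full-width punctuation are matched.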

We can load our own dictionary and set the weight and part of speech for each word in the dictionary:

Edit user.utf8:
地瓜 9999 n
金箍 9999 n
棒就棒在 9999
Then load the dictionary through nodejieba.load.

var nodejieba = require("nodejieba");
nodejieba.load({
 userDict: './user.utf8',
});
var result = nodejieba.cut("帝国主义要把我们的地瓜分掉");
console.log(result);
//[ '帝国主义', '要', '把', '我们', '的', '地瓜', '分', '掉' ]
result = nodejieba.cut('土地,俺老孙的金箍棒在哪里?');
console.log(result);
//[ '土地', ',', '俺', '老', '孙', '的', '金箍棒', '在', '哪里', '?' ]
result = nodejieba.cut('大圣,您的金箍棒就棒在特别配您的头型!');
console.log(result); 
//[ '大圣', ',', '您', '的', '金箍', '棒就棒在', '特别', '配', '您', '的', '头型', '!' ]
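
If the dictionary entries come from a database or an admin page rather than a hand-edited file, the file can be generated. The sketch below assumes the one-entry-per-line layout used in this article (word, then an optional weight, then an optional part of speech). `formatUserDict` is a hypothetical helper, the entries mirror the segments '地瓜', '金箍', and '棒就棒在' seen in the output above, and the temporary output path is only an assumption for illustration.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper: turn entries into "word weight tag" lines.
// Weight and tag are optional and are skipped when undefined.
function formatUserDict(entries) {
  return entries
    .map(({ word, weight, tag }) =>
      [word, weight, tag].filter((field) => field !== undefined).join(" "))
    .join("\n") + "\n";
}

const dictText = formatUserDict([
  { word: "地瓜", weight: 9999, tag: "n" },
  { word: "金箍", weight: 9999, tag: "n" },
  { word: "棒就棒在", weight: 9999 },
]);

// Written to the system temporary directory here (an assumed location);
// the article itself keeps the file at ./user.utf8.
const dictPath = path.join(os.tmpdir(), "user.utf8");
fs.writeFileSync(dictPath, dictText, "utf8");
console.log(dictText);
```

The resulting path could then be passed as `userDict` to `nodejieba.load`, as in the example above.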

In addition to word segmentation, we can use nodejieba to extract keywords:

const content = `

HTTP, HTTP/2 and performance optimization

The purpose of this article is to tell you through comparison why you should migrate from HTTP to HTTPS, and why support for HTTP/2 should be added. Before comparing HTTP and HTTP/2, let’s first look at what HTTP is.

What is HTTP

HTTP is a set of rules for communication on the World Wide Web. HTTP is an application layer protocol and runs on top of the TCP/IP layer. When a user requests a web page through a browser, HTTP is responsible for processing the request and establishing a connection between the web server and the client.

With HTTP/2, performance can be improved without using sprite images, compression, or splicing. However, this does not mean that these techniques should not be used. But this has clearly demonstrated the necessity for us to move from HTTP/1.1 to HTTP/2.
`;

const nodejieba = require("nodejieba");
const result = nodejieba.extract(content, 20);
console.log(result);

The output is similar to the following:

[ { word: 'HTTP', weight: 140.8704516850025 },
 { word: '请求', weight: 14.23018001394 },
 { word: '应该', weight: 14.052171126120001 },
 { word: '万维网', weight: 12.2912397395 },
 { word: 'TCP', weight: 11.739204307083542 },
 { word: '1.1', weight: 11.739204307083542 },
 { word: 'Web', weight: 11.739204307083542 },
 { word: '雪碧图', weight: 11.739204307083542 },
 { word: 'HTTPS', weight: 11.739204307083542 },
 { word: 'IP', weight: 11.739204307083542 },
 { word: '应用层', weight: 11.2616203224 },
 { word: '客户端', weight: 11.1926274509 },
 { word: '浏览器', weight: 10.8561552143 },
 { word: '拼接', weight: 9.85762638414 },
 { word: '比较', weight: 9.5435285574 },
 { word: '网页', weight: 9.53122979951 },
 { word: '服务器', weight: 9.41204128224 },
 { word: '使用', weight: 9.03259988558 },
 { word: '必要性', weight: 8.81927328699 },
 { word: '添加', weight: 8.0484751722 } ]
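
Several entries above share the weight 11.739204307083542, and the weight of 'HTTP' is that value times 12. This is what one would expect if each weight were a term frequency multiplied by a per-word IDF value, with a shared fallback IDF for words missing from the IDF dictionary. That reading is an assumption drawn from the numbers, not a statement about nodejieba's documented behavior. The sketch below only checks the arithmetic.

```javascript
// Assumption: weight = termFrequency * idf, with one shared fallback IDF
// for words that are not in the IDF dictionary.
const fallbackIdf = 11.739204307083542; // weight of 'TCP', 'Web', 'IP', ...
const httpWeight = 140.8704516850025;   // weight of 'HTTP' in the output above

// If the assumption holds, this ratio is the number of times 'HTTP' was counted.
const impliedFrequency = httpWeight / fallbackIdf;
console.log(impliedFrequency.toFixed(6));
// 12.000000
```

Under this reading, a frequent word that is absent from the IDF dictionary, such as 'HTTP', can easily dominate the ranking.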

We add some new keywords to the dictionary:

性能
HTTP/2

The output results are as follows:

[ { word: 'HTTP', weight: 105.65283876375187 },
 { word: 'HTTP/2', weight: 58.69602153541771 },
 { word: '请求', weight: 14.23018001394 },
 { word: '应该', weight: 14.052171126120001 },
 { word: '性能', weight: 12.61259281884 },
 { word: '万维网', weight: 12.2912397395 },
 { word: 'IP', weight: 11.739204307083542 },
 { word: 'HTTPS', weight: 11.739204307083542 },
 { word: '1.1', weight: 11.739204307083542 },
 { word: 'TCP', weight: 11.739204307083542 },
 { word: 'Web', weight: 11.739204307083542 },
 { word: '雪碧图', weight: 11.739204307083542 },
 { word: '应用层', weight: 11.2616203224 },
 { word: '客户端', weight: 11.1926274509 },
 { word: '浏览器', weight: 10.8561552143 },
 { word: '拼接', weight: 9.85762638414 },
 { word: '比较', weight: 9.5435285574 },
 { word: '网页', weight: 9.53122979951 },
 { word: '服务器', weight: 9.41204128224 },
 { word: '使用', weight: 9.03259988558 } ]

On this basis, we use a whitelist to keep only the words that can serve as tags:

const content = `

HTTP, HTTP/2 and performance optimization

The purpose of this article is to tell you through comparison why you should migrate from HTTP to HTTPS, and why support for HTTP/2 should be added. Before comparing HTTP and HTTP/2, let’s first look at what HTTP is.

What is HTTP

HTTP is a set of rules for communication on the World Wide Web. HTTP is an application layer protocol that runs on top of the TCP/IP layer. When a user requests a web page through a browser, HTTP is responsible for processing the request and establishing a connection between the web server and the client.

With HTTP/2, performance can be improved without using sprite images, compression, or splicing. However, this does not mean that these techniques should not be used. But this has clearly demonstrated the necessity for us to move from HTTP/1.1 to HTTP/2.
`;

const nodejieba = require("nodejieba");
nodejieba.load({
 userDict: './user.utf8',
});
const result = nodejieba.extract(content, 20);
const tagList = ['HTTPS', 'HTTP', 'HTTP/2', 'Web', '浏览器', '性能'];
console.log(result.filter(item => tagList.indexOf(item.word) >= 0));

Finally we get:

[ { word: 'HTTP', weight: 105.65283876375187 },
 { word: 'HTTP/2', weight: 58.69602153541771 },
 { word: '性能', weight: 12.61259281884 },
 { word: 'HTTPS', weight: 11.739204307083542 },
 { word: 'Web', weight: 11.739204307083542 },
 { word: '浏览器', weight: 10.8561552143 } ]

This is the result we want.
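
For reuse across many articles, the whitelist step might be wrapped in a small function. The following is a sketch only: `pickTags` and its `limit` parameter are hypothetical, and the sample keywords are a few entries copied from the extract output shown earlier, so the block runs without nodejieba installed.

```javascript
// Hypothetical helper: keep whitelisted keywords in the order given
// and return at most `limit` tag names.
function pickTags(keywords, whitelist, limit = 5) {
  const allowed = new Set(whitelist); // Set lookup instead of indexOf
  return keywords
    .filter((item) => allowed.has(item.word))
    .slice(0, limit)
    .map((item) => item.word);
}

// A few entries copied from the extract output shown earlier.
const keywords = [
  { word: "HTTP", weight: 105.65283876375187 },
  { word: "HTTP/2", weight: 58.69602153541771 },
  { word: "请求", weight: 14.23018001394 },
  { word: "性能", weight: 12.61259281884 },
  { word: "HTTPS", weight: 11.739204307083542 },
  { word: "浏览器", weight: 10.8561552143 },
];
const tagList = ["HTTPS", "HTTP", "HTTP/2", "Web", "浏览器", "性能"];
const tags = pickTags(keywords, tagList, 3);
console.log(tags);
// [ 'HTTP', 'HTTP/2', '性能' ]
```

Using a `Set` keeps the lookup cheap as the whitelist grows, and the limit caps how many tags an article receives.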

The above is the basic method of using the word segmentation library nodejieba. In the future, we can use it to automatically analyze and add corresponding tags to the translations published by Zhongcheng Translation, so as to provide translators and readers with a better user experience.


source:php.cn