word2vec in Practice: Clustering Keywords
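Query processing is becoming increasingly important in search, and classification is a key part of it. Classifying queries is hard because they are usually short and carry little information (entropy). The common approach is to expand the query, for example by fetching search engine results for it or by mapping the query directly to its corresponding docs, and then to classify the docs instead, which is easier and fairly accurate. word2vec has been very popular recently. It uses unsupervised machine learning, so no labeled data is needed, and I decided to look into whether its output can be used to expand queries for classification.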
Where is word2vec?
https://code.google.com/p/word2vec/
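The source code can be downloaded from that page and compiled to produce the analysis tools. The C code there is somewhat "abstract"; the C++ version below is easier to follow: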
https://github.com/jdeng/word2vec
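Obtaining the training corpus
News data is available from Sogou Labs. It is fairly old, but good enough to work with. I suspect Weibo data would be better, both because there is more of it and because it is richer in information (more fresh material). The news corpus can be found at
http://www.sogou.com/labs/dl/ca.html after a simple registration. Downloading on Windows is a bit awkward because it requires an FTP tool, but the ftp.exe that ships with Windows is enough.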
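1. In a cmd window, run ftp ftp.labs.sogou.com
2. Enter the user name generated at registration
3. Enter the password generated at registration; you are then connected to the FTP server
4. cd to the relevant directory and run dir or ls to list the files
5. Run get news_tensite_xml.full.tar.gz to download the file into your personal documents directory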
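The manual steps above could also be scripted. The following is a minimal sketch using Python's standard ftplib module. The function name download_corpus is made up for illustration, and remote_dir is left as a parameter because the directory on the server is not named here. The user name and password are whatever your registration produced.

```python
import ftplib


def download_corpus(host, user, password, remote_dir, filename, local_path):
    """Log in, change to remote_dir, and fetch one file in binary mode."""
    with ftplib.FTP(host) as ftp:
        ftp.login(user=user, passwd=password)   # steps 2 and 3
        ftp.cwd(remote_dir)                      # step 4
        with open(local_path, "wb") as out:
            # step 5: RETR streams the file; each chunk is written to disk
            ftp.retrbinary("RETR " + filename, out.write)


# Example call (placeholders, not real credentials or a confirmed path):
# download_corpus("ftp.labs.sogou.com", "<user>", "<password>",
#                 "<remote dir>", "news_tensite_xml.full.tar.gz",
#                 "news_tensite_xml.full.tar.gz")
```

Binary mode matters here because the archive is a .tar.gz file; a text-mode transfer could corrupt it.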
Processing the corpus and word segmentation
The corpus is in XML, so the news content has to be extracted first:
cat news_tensite_xml.dat | iconv -f gbk -t utf-8 -c | grep "<content>" | sed 's/<content>//' | sed 's/<\/content>//' > news.txt
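For readers who prefer a script, the same extraction can be sketched in Python. This is an illustration rather than the author's tool. It assumes, as the pipeline above does, that each <content> element sits on a single line and that the file is GBK-encoded. The name extract_contents is hypothetical.

```python
import re

# Matches the text between <content> and </content> on one line.
CONTENT_RE = re.compile(r"<content>(.*?)</content>")


def extract_contents(raw_lines):
    """Yield the article text from each line that holds a <content> element.

    raw_lines is an iterable of bytes; undecodable bytes are dropped,
    mirroring what `iconv -c` does.
    """
    for raw in raw_lines:
        line = raw.decode("gbk", errors="ignore")
        match = CONTENT_RE.search(line)
        if match:
            yield match.group(1)


# Usage sketch: one article per output line, written as UTF-8.
# with open("news_tensite_xml.dat", "rb") as src, \
#         open("news.txt", "w", encoding="utf-8") as dst:
#     for text in extract_contents(src):
#         dst.write(text + "\n")
```

Unlike the shell pipeline, this version skips a line that opens <content> without closing it on the same line.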
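This extracts the news content, one article per line. The next step is to segment the corpus into words. I tried several open-source segmenters. Some of the Java ones are hard to use, and inexplicable garbled-text (encoding) problems can eat up half a day. I settled on ICTCLAS, the segmenter from the Chinese Academy of Sciences, in its C++ version, which is simple to run on Linux. I have written a segmentation program and put it on CSDN for anyone who needs it; the package includes the library, the segmentation dictionary, the binary, and the segmentation tool. More material on the ICTCLAS segmenter is available at http://hi.baidu.com/drkevinzhang/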
The corpus contains 1,143,394 articles in total, and the segmented data file is 2.2 GB.
Running word2vec
./word2vec -train out.txt -output vectors.bin -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 1

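This can take a while. When it finishes, it produces the file vectors.bin, and the bundled cosine-distance tool can then be used to look up the words related to a keyword.
Run ./distance vectors.bin and type in the query word you are interested in to see the results.
For entity names the results are quite reliable, and preprocessing the corpus would probably improve them further.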
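To make the lookup concrete, here is a minimal pure-Python sketch of reading vectors.bin and ranking words by cosine similarity. It is not the distance tool's code. It assumes the binary layout written by the original word2vec tool: a text header "vocab_size dim", then for each word the word itself, a space, dim 32-bit floats in native byte order, and a newline. The function names are made up for illustration.

```python
import array
import math


def load_word2vec_binary(stream):
    """Read {word: [floats]} from a binary stream in word2vec format."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        word = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b" " or ch == b"":
                break
            if ch != b"\n":  # skip the newline that ends the previous entry
                word.extend(ch)
        values = array.array("f")  # C float, matching the tool's output
        values.frombytes(stream.read(4 * dim))
        vectors[word.decode("utf-8", errors="ignore")] = list(values)
    return vectors


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(vectors, query, topn=10):
    """Return the topn (word, similarity) pairs closest to query."""
    target = vectors[query]
    scored = [(cosine(target, vec), word)
              for word, vec in vectors.items() if word != query]
    scored.sort(reverse=True)
    return [(word, score) for score, word in scored[:topn]]


# Usage sketch:
# with open("vectors.bin", "rb") as f:
#     vecs = load_word2vec_binary(f)
# print(nearest(vecs, "<some word from the vocabulary>"))
```

For a vocabulary of realistic size a NumPy-based version would be far faster. The plain-Python form is used here only to keep the sketch dependency-free.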
Running
./word2vec -train out.txt -output classes.txt -cbow 0 -size 200 -window 5 -negative 0 -hs 1 -sample 1e-3 -threads 12 -classes 500
clusters the word vectors, and the clusters can then be used for query classification.
After filtering out single words, the results look quite promising.
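As a rough illustration of how the cluster file might feed query classification, the sketch below parses classes.txt and assigns a segmented query to the cluster that most of its words fall into. It assumes each line of classes.txt has the form "word class_id", which is how the word2vec tool writes its -classes output. The majority-vote rule and all function names are assumptions for illustration only, not something the post prescribes.

```python
from collections import Counter, defaultdict


def load_classes(lines):
    """Parse 'word class_id' lines into a {word: class_id} dict."""
    word_to_class = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, class_id = parts
        word_to_class[word] = int(class_id)
    return word_to_class


def group_by_class(word_to_class):
    """Invert the mapping: {class_id: [words in that cluster]}."""
    groups = defaultdict(list)
    for word, class_id in word_to_class.items():
        groups[class_id].append(word)
    return dict(groups)


def classify_query(tokens, word_to_class):
    """Majority vote over the clusters of the query's known words.

    Returns None when no token appears in the vocabulary.
    """
    votes = Counter(word_to_class[t] for t in tokens if t in word_to_class)
    return votes.most_common(1)[0][0] if votes else None


# Usage sketch:
# with open("classes.txt", encoding="utf-8") as f:
#     w2c = load_classes(f)
# print(classify_query(["<token>", "<token>"], w2c))
```

The query has to be segmented with the same segmenter as the training corpus, otherwise its tokens will not match the vocabulary.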
References:
http://blog.csdn.net/zhaoxinfan/article/details/11069485
https://code.google.com/p/word2vec/