This article introduces how to use the Word2Vec model from the gensim library in Python, covering training, saving and loading, and querying word similarities.
Install the library first:
pip install gensim
After installation, you can import and use it:
1. Defining and training the model
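from gensim.models import Word2Vec

# sentences: an iterable of tokenized sentences (each one a list of word strings)
# Note: in gensim 4.0 and later, the size parameter is named vector_size
model = Word2Vec(sentences, sg=1, size=100, window=5, min_count=5, negative=3, sample=0.001, hs=1, workers=4)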
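The sentences argument is not defined in the snippet above. Word2Vec expects an iterable of tokenized sentences, so a minimal sketch of preparing one might look like the following. The names raw_lines and tokenize, and the naive whitespace splitting, are illustrative assumptions rather than part of the original article.

```python
# Hypothetical raw corpus: one sentence per string.
raw_lines = [
    "The king rules the land",
    "",
    "The queen rules the land too",
]


def tokenize(line):
    """Lowercase a line and split it on whitespace (a deliberately naive tokenizer)."""
    return line.lower().split()


# Skip blank lines; each remaining line becomes a list of word tokens.
sentences = [tokenize(line) for line in raw_lines if line.strip()]
```

A real corpus would normally need a more careful tokenizer (punctuation handling, language-specific word segmentation), but the resulting shape, a list of lists of strings, stays the same.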
Parameter explanation:
1. sg=1 selects the skip-gram algorithm, which is more sensitive to low-frequency words; the default sg=0 selects the CBOW algorithm.
2. size is the dimension of the output word vectors (this parameter is named vector_size in gensim 4.0 and later). If the value is too small, different words collide in the vector space and the results suffer; if it is too large, training consumes more memory and runs more slowly. Values between 100 and 200 are typical.
3. window is the maximum distance between the current word and the target word within a sentence. For example, window=3 means that at most 3 words on each side of the target word are considered; during training the effective window is reduced at random for each target word, so it varies between 1 and 3.
4. min_count is used to filter words. Words with a frequency lower than min_count are ignored. The default value is 5.
5. negative and sample can be fine-tuned based on the training results. negative is the number of noise words drawn for negative sampling; sample is the threshold above which higher-frequency words are randomly downsampled. The default value of sample is 1e-3.
6. hs=1 means hierarchical softmax is used. With the default hs=0 and a non-zero negative, negative sampling is used.
7. workers controls the parallelism of training. This parameter only takes effect when Cython is installed; otherwise only a single core can be used.
For a detailed description of all parameters, see the word2vec source code.
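To make min_count and window more concrete, here is a simplified pure-Python sketch of the two ideas. It is not gensim's actual implementation: build_vocab and skipgram_pairs are hypothetical helper names, and the random window reduction is modeled as drawing an effective window between 1 and window for each center word.

```python
import random
from collections import Counter


def build_vocab(sentences, min_count=5):
    """Return the set of words that occur at least min_count times in the corpus."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, count in counts.items() if count >= min_count}


def skipgram_pairs(sentence, window=5, rng=None):
    """Yield (center, context) pairs with a randomly reduced window per center word."""
    rng = rng or random.Random()
    for pos, center in enumerate(sentence):
        # Effective window for this center word: an integer from 1 to window.
        effective = rng.randint(1, window)
        start = max(0, pos - effective)
        end = min(len(sentence), pos + effective + 1)
        for ctx_pos in range(start, end):
            if ctx_pos != pos:
                yield center, sentence[ctx_pos]


# Toy corpus: only "a" appears at least twice, so only "a" survives min_count=2.
vocab = build_vocab([["a", "b", "a"], ["a", "c"]], min_count=2)

# With window=1 the effective window is always 1: each word pairs with its neighbours.
pairs = list(skipgram_pairs(["x", "y", "z"], window=1))
```

Because closer words fall inside the reduced window more often than distant ones, they end up contributing more training pairs, which is the practical effect of the random reduction.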
2. Saving and loading the model after training
model.save(fname)  # fname is the path of the file to write
model = Word2Vec.load(fname)
3. Model use (word similarity calculation, etc.)
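# The query methods are accessed through model.wv (the trained word vectors);
# calling them directly on the model no longer works in gensim 4.0 and later.
model.wv.most_similar(positive=['woman', 'king'], negative=['man'])
# output: [('queen', 0.50882536), ...]
model.wv.doesnt_match("breakfast cereal dinner lunch".split())
# output: 'cereal'
model.wv.similarity('woman', 'man')
# output: 0.73723527
model.wv['computer']  # raw numpy vector of a word
# output: array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)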
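The similarity scores above are cosine similarities between word vectors, and the positive/negative query combines the normalized vectors of the given words before ranking the rest of the vocabulary. The following pure-Python sketch illustrates that idea on hand-made toy vectors. The vectors, the function names, and the simplified scoring are assumptions for illustration only, not gensim's code or real trained embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def most_similar(vectors, positive, negative, topn=1):
    """Rank words by cosine similarity to (sum of positives) - (sum of negatives)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    # Words used in the query are excluded from the results.
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:topn]


# Hand-made 3-dimensional toy vectors (hypothetical, not trained).
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.1, 0.5, 0.4],
}

best = most_similar(toy_vectors, positive=["woman", "king"], negative=["man"])
```

With these toy vectors, best[0][0] is "queen", mirroring the analogy query shown above.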