Key methods for training with gensim:
model = gensim.models.Word2Vec(iter=1) # an empty model, no training yet
model.build_vocab(some_sentences) # can be a non-repeatable, 1-pass generator
model.train(other_sentences) # can be a non-repeatable, 1-pass generator
Training word vectors with gensim: a tutorial
1. Background
Once text has been converted into vectors, methods such as SVMs, logistic regression, and deep learning can easily be applied to practical tasks like text classification, tagging, and sentiment analysis. Obtaining good word vectors is therefore an important piece of groundwork. This post covers the basics of training word vectors with the gensim package.
The training methods are described in the following papers:
The two main training methods are CBOW and Skip-gram.
Deep learning via word2vec's "skip-gram and CBOW models", using either hierarchical softmax or negative sampling [1] [2].
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.
2. Key classes, methods, and parameter meanings
(1) The main model class
sg defines the training algorithm. By default (sg=0), CBOW is used. Otherwise (sg=1), skip-gram is employed.
size is the dimensionality of the feature vectors.
window is the maximum distance between the current and predicted word within a sentence.
alpha is the initial learning rate (will linearly drop to zero as training progresses).
min_count = ignore all words with total frequency lower than this (words below this threshold are dropped).
workers = use this many worker threads to train the model (=faster training with multicore machines).
hs = if 1, hierarchical softmax will be used for model training. If set to 0 (default), and negative is non-zero, negative sampling will be used.
negative = if > 0, negative sampling will be used; the int for negative specifies how many "noise words" should be drawn (usually between 5-20). Default is 5. If set to 0, no negative sampling is used.
cbow_mean = if 0, use the sum of the context word vectors. If 1 (default), use the mean. Only applies when CBOW is used.
iter = number of iterations (epochs) over the corpus.
Example:
model = Word2Vec(sentences, size=100, window=5, min_count=5, workers=4)
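As a rough sketch (the toy corpus is an assumption; the keyword names follow the older gensim API used throughout this post, e.g. size= and iter=), the parameters above can be combined like this to train skip-gram with negative sampling:

from gensim.models import Word2Vec

# toy corpus: each sentence is a list of tokens (illustrative data only)
sentences = [['human', 'machine', 'interface'],
             ['machine', 'learning', 'for', 'text']]

# skip-gram (sg=1) with negative sampling (hs=0, negative=5),
# 100-dimensional vectors, window of 5, 5 epochs over the corpus
model = Word2Vec(sentences, sg=1, hs=0, negative=5,
                 size=100, window=5, min_count=1, iter=5, workers=4)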
(2) After the model object is constructed (i.e. the class is instantiated), the vocabulary tree must be built before calling the class's train method.
build_vocab(sentences, keep_raw_vocab=False, trim_rule=None)
Build vocabulary from a sequence of sentences (can be a once-only generator stream). Each sentence must be a list of unicode strings.
The vocabulary tree is a Huffman tree built from each word's frequency in the corpus, so that frequently occurring words are retrieved faster during training, saving search time.
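A minimal sketch of this build-then-train workflow, assuming a toy corpus and a hypothetical trim_rule that drops Twitter @mentions (the RULE_* constants and the trim_rule callback signature come from gensim.utils; everything else here is illustrative):

import gensim.utils
from gensim.models import Word2Vec

# hypothetical rule: never keep bare @mentions in the vocabulary; defer all other words to min_count
def trim_mentions(word, count, min_count):
    if word.startswith('@'):
        return gensim.utils.RULE_DISCARD
    return gensim.utils.RULE_DEFAULT

sentences = [['@bob', 'word', 'vectors', 'are', 'useful']]   # assumed toy corpus
model = Word2Vec(min_count=1)                                # empty model, no training yet
model.build_vocab(sentences, trim_rule=trim_mentions)        # first pass: build the vocabulary tree
model.train(sentences)                                       # second pass: actual training (old gensim API)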
(3)class gensim.models.word2vec.LineSentence(source, max_sentence_length=10000, limit=None)
Each line is one sentence, with words separated by spaces; tokens such as @XXX mentions therefore need to be stripped or replaced in the raw text beforehand.
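A small sketch of that preprocessing step (the file names and the regex are assumptions; the only thing LineSentence actually requires is one space-separated sentence per line):

import re
from gensim.models import word2vec

# assumed raw/clean file names; strip @mentions so each line is just space-separated tokens
with open("tweets_raw.txt") as fin, open("tweets_clean.txt", "w") as fout:
    for line in fin:
        fout.write(re.sub(r'@\w+', '', line).strip() + "\n")

sentences = word2vec.LineSentence("tweets_clean.txt")  # streams one tokenized sentence at a time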
(4) Once the model is trained, it can be saved and loaded (load/save)
model.save(fname)
model = Word2Vec.load(fname) # you can continue training with the loaded model!
Alternatively, save only the output word vectors:
save_word2vec_format(fname, fvocab=None, binary=False)
Store the input-hidden weight matrix in the same format used by the original C word2vec tool, for compatibility.
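For example (the file name is an assumption; load_word2vec_format is the matching loader in the older gensim API):

# save the vectors in the C word2vec text format, then reload them later
model.save_word2vec_format("vectors.txt", binary=False)
vectors = Word2Vec.load_word2vec_format("vectors.txt", binary=False)
print(vectors.similarity('woman', 'man'))  # reloaded vectors still answer similarity queries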
(5) Using the trained model's results
model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]
model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'  # returns the word in the list that fits least well with the others
model.similarity('woman', 'man')
0.73723527  # cosine similarity between the two words
model['computer'] # raw numpy vector of a word
array([-0.00449447, -0.00310097, 0.02421786, ...], dtype=float32)
model.n_similarity(['restaurant', 'japanese'], ['japanese', 'restaurant'])  # cosine similarity between two sets of words
1.0000000000000004
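To see where these numbers come from, here is a quick sketch that recomputes the cosine similarity from the raw vectors (assuming the trained model above is in scope):

import numpy as np

v1, v2 = model['woman'], model['man']
cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos)  # should match model.similarity('woman', 'man')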
(6) Other related notes
Model parameters are stored as matrices (NumPy arrays). Each matrix is #vocabulary (the vocabulary count) times #size (the size parameter) floats.
Memory estimate:
Main memory usage: with 100,000 unique words and 200-dimensional vectors,
the model parameters take 100,000 * 200 * 4 (a float occupies 4 bytes) * 3 ≈ 229 MB,
plus a few extra MB for storing the vocabulary Huffman tree (which saves search time).
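A quick back-of-the-envelope check of that figure (plain arithmetic, nothing gensim-specific):

# 3 matrices of 100,000 x 200 single-precision floats
vocab, dim, bytes_per_float, matrices = 100000, 200, 4, 3
print(vocab * dim * bytes_per_float * matrices / (1024.0 ** 2))  # ~228.9 MB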
Speeding up training:
model = Word2Vec(sentences, workers=4)
The workers parameter only has an effect if you have Cython installed.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import gensim.models
import time
import pandas as pd
from nltk.tokenize import TweetTokenizer
time1 = time.time()
import logging
import numpy as np
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

def loaddata(inputfile):
    file = open(inputfile)
    tknzr = TweetTokenizer()
    sentences = []
    while 1:
        line = file.readline().strip()
        if not line:
            break
        sentences.append(tknzr.tokenize(line))
    return sentences

def WordFrequencyAnalysi():
    # load data
    # sentences = [['first', 'sentence'], ['second', 'sentence']]
    sentences = gensim.models.word2vec.LineSentence("Tweets10")  # load the corpus
    print sentences
    modelbase = gensim.models.Word2Vec(min_count=1)
    modelbase.build_vocab(sentences)  # build the vocabulary Huffman tree to cut word-lookup time during training
    # use the frequency distribution to pick min_count, e.g. keep the top 1000 words
    wordCount = []
    for i in modelbase.vocab.keys():
        wordCount.append((i, modelbase.vocab[i].count))
    print wordCount

def trainModel(inputfile, outVectorFile):
    # load data
    sentences = loaddata(inputfile)
    modelbase = gensim.models.Word2Vec(min_count=1)
    modelbase.build_vocab(sentences)
    modelbase.train(sentences)  # the vocabulary Huffman tree must be built before the model can be trained
    # model save
    modelbase.save_word2vec_format(outVectorFile)  # save as word vectors

# model using
def transformVectorToGraphTable(outVectorFile, yuzhi):
    print "load and transform data"
    file = open(outVectorFile)
    Vectors = []
    while 1:
        line = file.readline().strip()
        if not line:
            break
        if len(line.split(" ")) != 101:
            print len(line.split(" ")), line
        Vectors.append(line.split(" "))
    matrix = np.matrix(Vectors[1:]).T
    Vectors = pd.DataFrame(matrix[1:], dtype='float64')
    Vectors.columns = matrix[0].tolist()[0]
    print "---Compute Euclidean Result---"
    # hand-rolled Euclidean distance for node similarity
    distance = lambda column1, column2: pd.np.linalg.norm(column1 - column2)
    EuclideanResult = Vectors.apply(lambda col1: Vectors.apply(lambda col2: distance(col1, col2)))
    print "---output graph edge-------"
    GraphEdge = []
    index = 1
    for idx, row in EuclideanResult.iterrows():
        for col in Vectors.columns[:index]:
            # assumed completion: the original snippet breaks off after "if row[col]";
            # a plausible reading is to keep an edge when the distance is below the threshold yuzhi
            if row[col] < yuzhi:
                GraphEdge.append((idx, col, row[col]))
        index += 1
    return GraphEdge