
Gensim Word2vec Usage Guide

Natural Language Processing   Python   2014-02-08 22:19:02

The content of this article is largely translated from the Word2vec Tutorial: https://rare-technologies.com/word2vec-tutorial/

Installing Gensim
Quick install

    easy_install -U gensim
    pip install --upgrade gensim

Dependencies

    Python >= 2.6
    NumPy >= 1.3
    SciPy >= 0.7

Input

Gensim's Word2vec takes a sequence of sentences as its input, where each sentence is a list of words.

import gensim

sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)

Keeping the input as a built-in Python list is convenient, but it can consume a lot of RAM when the input is large. Gensim only requires that the input yield sentences sequentially when iterated over; there is no need to keep everything in memory. In short, you can supply one sentence, process it, discard it, and then load the next one.

For example, if the input is spread across several files on disk, with one sentence per line, there is no need to load everything into memory first: Word2vec can process the input file by file, line by line.

import os
import gensim

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname
 
    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()
 
sentences = MySentences('/some/directory') # a memory-friendly iterator
model = gensim.models.Word2Vec(sentences)

If you want to preprocess the file contents, for example to convert encodings, lowercase the text, or strip out numbers, all of that can be done inside the MySentences iterator, completely independently of Word2vec. Word2vec only consumes whatever the iterator yields.
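
For example, a variant of MySentences that lowercases each line and drops purely numeric tokens could look like the sketch below; the class name and the exact cleaning rules are only assumptions for illustration:

import os
import gensim

class CleanedSentences(object):
    """Like MySentences, but lowercases words and skips purely numeric tokens."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                words = [w.lower() for w in line.split() if not w.isdigit()]
                if words:  # skip lines that end up empty after cleaning
                    yield words

cleaned_sentences = CleanedSentences('/some/directory')
model = gensim.models.Word2Vec(cleaned_sentences)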

Note for advanced users: calling Word2Vec(sentences, iter=1) runs the sentence iterator twice (in general, iter+1 times; the default is iter=5). The first pass collects the words and their frequencies to build an internal dictionary tree structure. The second and subsequent passes train the neural model. These iter+1 passes can also be started manually, which is useful when the input stream cannot be repeated, using the calls below.

model = gensim.models.Word2Vec(iter=1)  # an empty model, no training yet
model.build_vocab(some_sentences)  # can be a non-repeatable, 1-pass generator
model.train(other_sentences)  # can be a non-repeatable, 1-pass generator

If you are not familiar with the concepts of iterators, iterables, and generators in Python, see the article below.

An explanation of Python's yield keyword: http://pyzh.readthedocs.io/en/latest/the-python-yield-keyword-explained.html#id8


Training

Word2vec has several parameters that affect both training speed and quality.

One of them is used to prune the internal dictionary. Words that appear only once or twice in a corpus of billions of words are most likely noise or of no interest, and there is not enough data to train anything meaningful for them anyway, so it is best to simply ignore them:

model = Word2Vec(sentences, min_count=10) # default value is 5

A reasonable value for min_count is between 0 and 100, depending on the size of your dataset.

Another parameter is the size of the NN layers, which corresponds to the degrees of freedom the training algorithm has:

model = Word2Vec(sentences, size=200) # default value is 100

Larger size values require more training data, but can also yield a more accurate model. Reasonable values range from the tens to the hundreds.

The last of the major parameters controls training parallelization, which speeds up training:

model = Word2Vec(sentences, workers=4) # default = 1 worker = no parallelization

The workers parameter only takes effect if Cython is installed on the machine. Without Cython, only a single core can be used.

The full parameter list can be found in the gensim API documentation.
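
Putting the three parameters above together (the values here are purely illustrative, and sentences is the iterator built earlier), a typical call looks like this:

from gensim.models import Word2Vec

model = Word2Vec(sentences,
                 min_count=10,  # ignore words rarer than this
                 size=200,      # dimensionality of the word vectors
                 workers=4)     # number of parallel training threads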


Memory

Internally, the parameters of a Word2vec model are stored as matrices (NumPy arrays). Each array has #vocabulary rows times #size columns of floats (single precision, 4 bytes each).

Three such matrices are held in RAM (work is underway to reduce that number to two, or even one). So if the input contains 100,000 unique words and the layer size is set to 200, the model will require roughly
100,000 * 200 * 3 * 4 bytes = ~229 MB.

On top of that, a little extra memory is needed to store the vocabulary tree, but unless the words themselves are extremely long strings, memory usage is still dominated by the matrices mentioned above.
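
The estimate can be reproduced with a couple of lines of plain Python; the numbers are the same illustrative ones used above:

vocab_size = 100000       # unique words kept after min_count pruning
vector_size = 200         # the size parameter
n_matrices = 3            # matrices the model keeps in RAM
bytes_per_float = 4       # single-precision floats

total_bytes = vocab_size * vector_size * n_matrices * bytes_per_float
print("~%.0f MB" % (total_bytes / (1024.0 * 1024.0)))  # prints ~229 MB
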
Evaluating

Word2vec training is an unsupervised task, so it is hard to objectively evaluate the result; evaluation depends on your end application. Google has released a test set of about 20,000 syntactic and semantic examples of the form "A is to B as C is to D".

Note that good performance on this test set does not necessarily mean Word2vec will work well in your application, and vice versa.
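
The translated text stops here, but with the gensim release this tutorial was written against, the downloaded questions-words.txt file can be scored directly with the model's accuracy method (newer gensim versions expose this functionality under a different name):

# The file path is an assumption; download questions-words.txt first.
model.accuracy('/tmp/questions-words.txt')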


Storing and loading models

Models can be stored and loaded with the standard gensim methods:

model.save('/tmp/mymodel')
new_model = gensim.models.Word2Vec.load('/tmp/mymodel')

This loads the model's internal NumPy matrices from disk into virtual memory (via mmap). In addition, the following calls load models produced by the original C tool, in both text and binary format:

model = Word2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False)
# using gzipped/bz2 input works too, no need to unzip:
model = Word2Vec.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)

Online training and resuming training

Advanced users can load a model and continue training it on more sentences. You may need to adjust the total_words parameter of train(), depending on the learning-rate decay you want to simulate.

model = gensim.models.Word2Vec.load('/tmp/mymodel')
model.train(more_sentences)

A model loaded from the original C tool cannot be trained further: you can still query it and compute similarities, but the vocabulary tree needed for training is missing, so resuming training is not possible.


Using the model

Word2vec supports several word-similarity tasks out of the box:

model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
model.similarity('woman', 'man')
0.73723527

Individual word vectors can be queried like this:

model['computer']  # raw NumPy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)

If you need all of the word vectors at once, model.syn0 returns them as a 2D NumPy matrix.
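
As a small illustration (attribute names are those of the gensim version this tutorial targets; later releases moved them under model.wv), the rows of that matrix line up with model.index2word:

vectors = model.syn0              # shape: (vocabulary size, vector size)
print(vectors.shape)
print(model.index2word[0])        # the word stored in row 0 of the matrix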

References

Word2vec Tutorial: https://rare-technologies.com/word2vec-tutorial/


I never got round to writing a tutorial on how to use word2vec in gensim. It’s simple enough and the API docs are straightforward, but I know some people prefer more verbose formats. Let this post be a tutorial and a reference example.

UPDATE: the complete HTTP server code for the interactive word2vec demo below is now open sourced on Github. For a high-performance similarity server for documents, see ScaleText.com.

Preparing the Input

Starting from the beginning, gensim’s word2vec expects a sequence of sentences as its input. Each sentence is a list of words (utf8 strings):

# import modules & set up logging
import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
 
sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)

Keeping the input as a Python built-in list is convenient, but can use up a lot of RAM when the input is large.

Gensim only requires that the input must provide sentences sequentially, when iterated over. No need to keep everything in RAM: we can provide one sentence, process it, forget it, load another sentence…

For example, if our input is strewn across several files on disk, with one sentence per line, then instead of loading everything into an in-memory list, we can process the input file by file, line by line:

class MySentences(object):
    def __init__(self, dirname):
        self.dirname = dirname
 
    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()
 
sentences = MySentences('/some/directory') # a memory-friendly iterator
model = gensim.models.Word2Vec(sentences)

Say we want to further preprocess the words from the files — convert to unicode, lowercase, remove numbers, extract named entities… All of this can be done inside the MySentences iterator and word2vec doesn’t need to know. All that is required is that the input yields one sentence (list of utf8 words) after another.

Note to advanced users: calling Word2Vec(sentences, iter=1) will run two passes over the sentences iterator (or, in general iter+1 passes; default iter=5). The first pass collects words and their frequencies to build an internal dictionary tree structure. The second and subsequent passes train the neural model. These two (or, iter+1) passes can also be initiated manually, in case your input stream is non-repeatable (you can only afford one pass), and you’re able to initialize the vocabulary some other way:

model = gensim.models.Word2Vec(iter=1)  # an empty model, no training yet
model.build_vocab(some_sentences)  # can be a non-repeatable, 1-pass generator
model.train(other_sentences)  # can be a non-repeatable, 1-pass generator

In case you’re confused about iterators, iterables and generators in Python, check out our tutorial on Data Streaming in Python.

Training

Word2vec accepts several parameters that affect both training speed and quality.

One of them is for pruning the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage. In addition, there’s not enough data to make any meaningful training on those words, so it’s best to ignore them:

model = Word2Vec(sentences, min_count=10)  # default value is 5

A reasonable value for min_count is between 0-100, depending on the size of your dataset.

Another parameter is the size of the NN layers, which correspond to the “degrees” of freedom the training algorithm has:

model = Word2Vec(sentences, size=200)  # default value is 100

Bigger size values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.

The last of the major parameters (full list here) is for training parallelization, to speed up training:

model = Word2Vec(sentences, workers=4) # default = 1 worker = no parallelization

The workers parameter has only effect if you have Cython installed. Without Cython, you’ll only be able to use one core because of the GIL (and word2vec training will be miserably slow).

Memory

At its core, word2vec model parameters are stored as matrices (NumPy arrays). Each array is #vocabulary (controlled by min_count parameter) times #size (size parameter) of floats (single precision aka 4 bytes).

Three such matrices are held in RAM (work is underway to reduce that number to two, or even one). So if your input contains 100,000 unique words, and you asked for layer size=200, the model will require approx. 100,000*200*4*3 bytes = ~229MB.

There’s a little extra memory needed for storing the vocabulary tree (100,000 words would take a few megabytes), but unless your words are extremely loooong strings, memory footprint will be dominated by the three matrices above.

Evaluating

Word2vec training is an unsupervised task, so there’s no good way to objectively evaluate the result. Evaluation depends on your end application.

Google have released their testing set of about 20,000 syntactic and semantic test examples, following the “A is to B as C is to D” task: https://raw.githubusercontent.com/RaRe-Technologies/gensim/develop/gensim/test/test_data/questions-words.txt.

Gensim supports the same evaluation set, in exactly the same format:

model.accuracy('/tmp/questions-words.txt')
2014-02-01 22:14:28,387 : INFO : family: 88.9% (304/342)
2014-02-01 22:29:24,006 : INFO : gram1-adjective-to-adverb: 32.4% (263/812)
2014-02-01 22:36:26,528 : INFO : gram2-opposite: 50.3% (191/380)
2014-02-01 23:00:52,406 : INFO : gram3-comparative: 91.7% (1222/1332)
2014-02-01 23:13:48,243 : INFO : gram4-superlative: 87.9% (617/702)
2014-02-01 23:29:52,268 : INFO : gram5-present-participle: 79.4% (691/870)
2014-02-01 23:57:04,965 : INFO : gram7-past-tense: 67.1% (995/1482)
2014-02-02 00:15:18,525 : INFO : gram8-plural: 89.6% (889/992)
2014-02-02 00:28:18,140 : INFO : gram9-plural-verbs: 68.7% (482/702)
2014-02-02 00:28:18,140 : INFO : total: 74.3% (5654/7614)

This accuracy method takes an optional parameter restrict_vocab, which limits which test examples are considered.
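
For example (the value below is only an illustration), restricting the evaluation to the most frequent vocabulary entries makes it run much faster:

# Only use analogy questions whose words are among the 30,000 most frequent
# vocabulary entries; the exact default has varied between gensim versions.
model.accuracy('/tmp/questions-words.txt', restrict_vocab=30000)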

Once again, good performance on this test set doesn’t mean word2vec will work well in your application, or vice versa. It’s always best to evaluate directly on your intended task.

Storing and loading models

You can store/load models using the standard gensim methods:

model.save('/tmp/mymodel')
new_model = gensim.models.Word2Vec.load('/tmp/mymodel')

which uses pickle internally, optionally mmap'ing the model’s internal large NumPy matrices into virtual memory directly from disk files, for inter-process memory sharing.

In addition, you can load models created by the original C tool, both using its text and binary formats:

model = Word2Vec.load_word2vec_format('/tmp/vectors.txt', binary=False)
# using gzipped/bz2 input works too, no need to unzip:
model = Word2Vec.load_word2vec_format('/tmp/vectors.bin.gz', binary=True)

Online training / Resuming training

Advanced users can load a model and continue training it with more sentences:

model = gensim.models.Word2Vec.load('/tmp/mymodel')
model.train(more_sentences)

You may need to tweak the total_words parameter to train(), depending on what learning rate decay you want to simulate.
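
A sketch of what that might look like; the word count below is purely hypothetical and the exact train() signature depends on your gensim version:

# Tell train() roughly how many words the new material contains, so the
# learning-rate schedule decays over the intended amount of data.
model.train(more_sentences, total_words=2000000)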

Note that it’s not possible to resume training with models generated by the C tool, load_word2vec_format(). You can still use them for querying/similarity, but information vital for training (the vocab tree) is missing there.

Using the model

Word2vec supports several word similarity tasks out of the box:

model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
model.similarity('woman', 'man')
0.73723527

If you need the raw output vectors in your application, you can access these either on a word-by-word basis

model['computer']  # raw NumPy vector of a word
array([-0.00449447, -0.00310097,  0.02421786, ...], dtype=float32)

…or en-masse as a 2D NumPy matrix from model.syn0.

Bonus app

As before with finding similar articles in the English Wikipedia with Latent Semantic Analysis, here’s a bonus web app for those who managed to read this far. It uses the word2vec model trained by Google on the Google News dataset, on about 100 billion words:


The model contains 3,000,000 unique phrases built with layer size of 300.
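
For readers who want to experiment locally, Google's published News vectors can be loaded with the same load_word2vec_format call shown earlier; the file name below is the commonly distributed archive and the local path is an assumption:

from gensim.models import Word2Vec

# Loading ~3 million 300-dimensional vectors needs several GB of RAM.
model = Word2Vec.load_word2vec_format('GoogleNews-vectors-negative300.bin.gz', binary=True)
print(model['queen'].shape)  # -> (300,)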

Note that the similarities were trained on a news dataset, and that Google did very little preprocessing there. So the phrases are case sensitive: watch out! Especially with proper nouns.

On a related note, I noticed about half the queries people entered into the LSA@Wiki demo contained typos/spelling errors, so they found nothing. Ouch.

To make it a little less challenging this time, I added phrase suggestions to the forms above. Start typing to see a list of valid phrases from the actual vocabulary of Google News’ word2vec model.

The “suggested” phrases are simply ten phrases starting from whatever bisect_left(all_model_phrases_alphabetically_sorted, prefix_you_typed_so_far) from Python’s built-in bisect module returns.
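
The same lookup fits in a few lines; the tiny phrase list below is a stand-in for the model's real vocabulary, which would be sorted once up front:

from bisect import bisect_left

# Stand-in for the alphabetically sorted vocabulary of the Google News model.
all_model_phrases_alphabetically_sorted = sorted(
    ['Berlin', 'Berlin_Wall', 'Boston', 'king', 'queen'])

def suggest(prefix_you_typed_so_far, k=10):
    # Find the first phrase sorting at or after the typed prefix, then return
    # the next k phrases as suggestions.
    start = bisect_left(all_model_phrases_alphabetically_sorted, prefix_you_typed_so_far)
    return all_model_phrases_alphabetically_sorted[start:start + k]

print(suggest('ki'))  # -> ['king', 'queen']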

See the complete HTTP server code for this “bonus app” on github (using CherryPy).

Outro

Full word2vec API docs here; get gensim here. Original C toolkit and word2vec papers by Google here.

And here’s me talking about the optimizations behind word2vec at PyData Berlin 2014



