SGNS-PyTorch

Skip-gram with negative sampling (SGNS), implemented in PyTorch.

Usage

See test.ipynb

Requirements

Python 3
PyTorch 1.x

Paper

Efficient Estimation of Word Representations in Vector Space (original word2vec paper)
Distributed Representations of Words and Phrases and their Compositionality (negative sampling paper)

Notes

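Word2Vec is a model that learns word vectors from text in an unsupervised way to represent semantic information; semantically similar words lie close together in the embedding space. As with an auto-encoder, the network trained by Word2Vec is not itself used on new tasks. What is actually needed is the model parameters, namely the weight matrix of the hidden layer.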

Skip-gram predicts the context words given a target word.
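
To make the prediction task concrete, here is a minimal sketch of how (target, context) training pairs can be generated from a token sequence. The function name `skipgram_pairs` and the fixed symmetric window are illustrative assumptions, not necessarily how this repository builds its batches.

```python
def skipgram_pairs(tokens, window_size=2):
    """Return (target, context) pairs for every token and its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        # Clamp the window to the sequence boundaries.
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs


pairs = skipgram_pairs(["the", "quick", "brown", "fox"], window_size=1)
print(pairs)
```

Each pair is one positive training example: the model is asked to assign high probability to the context word given the target word.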

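The model uses two word matrices: W, the target-word vector matrix ($V \times N$), and W', the context-word vector matrix ($N \times V$), where N is the word-vector dimension and V is the vocabulary size.

Model (with $x_k$ the one-hot vector of target word k):

Projection: $h=W^\top x_k$
Similarity scores: $z=W'^\top h$
Conversion to a probability distribution: $\hat y=\text{softmax}(z)$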
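
The three steps above can be sketched in plain Python (deliberately without PyTorch, to keep the example dependency-free). The toy sizes, the random initialisation, and the names `W_ctx` and `forward` are assumptions made for illustration only.

```python
import math
import random

V, N = 5, 3  # toy vocabulary size and embedding dimension (illustrative values)
random.seed(0)

# W: V x N target-word vectors; W_ctx: N x V context-word vectors (W' in the text).
W = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_ctx = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]


def forward(k):
    """Return the predicted context distribution for target word index k."""
    # Projection: with a one-hot input x_k this reduces to looking up row k of W.
    h = W[k]
    # Similarity scores: one dot product between h and each column of W'.
    z = [sum(h[n] * W_ctx[n][j] for n in range(N)) for j in range(V)]
    # Softmax, shifted by max(z) for numerical stability.
    m = max(z)
    exps = [math.exp(s - m) for s in z]
    total = sum(exps)
    return [e / total for e in exps]


y_hat = forward(2)
print(y_hat, sum(y_hat))
```

The softmax normalises over all V words, which is the cost that negative sampling is meant to avoid.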

Three tricks for efficient training (from the second paper):

subsampling of frequent words
negative sampling (an alternative to hierarchical softmax)
treating word pairs / phrases as a single word

Subsampling

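Frequent words occur far more often than training requires, so they are subsampled: each word is discarded with a probability based on its frequency $f(w_i)$ (the formula given in the paper): $$ P\left(w_{i}\right)=1-\sqrt{\frac{t}{f\left(w_{i}\right)}} $$

The formula the authors actually use (t defaults to 0.0001), which gives the probability of keeping a word rather than discarding it: $$P_{\text{keep}}\left(w_{i}\right)=\sqrt{\frac{t}{f\left(w_{i}\right)}} + \frac{t}{f\left(w_{i}\right)}$$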
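
As a rough illustration of the second formula, the sketch below treats it as a keep probability (as the original word2vec C code does) and clips it to 1, since the expression exceeds 1 for rare words. The helper names `keep_probability` and `subsample` are hypothetical and not taken from this repository.

```python
import math
import random
from collections import Counter


def keep_probability(freq, t=1e-4):
    """Keep probability sqrt(t/f) + t/f, clipped to 1; freq is a relative frequency."""
    return min(1.0, math.sqrt(t / freq) + t / freq)


def subsample(tokens, t=1e-4, rng=random):
    """Return the tokens that survive subsampling."""
    counts = Counter(tokens)
    total = len(tokens)
    keep = {w: keep_probability(c / total, t) for w, c in counts.items()}
    # Each occurrence is kept independently with its word's keep probability.
    return [w for w in tokens if rng.random() < keep[w]]


print(keep_probability(0.05))  # a very frequent word is mostly dropped
print(keep_probability(1e-5))  # a rare word is always kept
```

Words whose relative frequency is at or below roughly t are never dropped, while the most frequent words lose the large majority of their occurrences.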

Negative Sampling

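Negative sampling makes each training sample update only a small fraction of the weights. A negative word is a word whose expected probability is 0. Negative words are drawn with the following probability, where $Z=\sum_j f(w_j)^{3/4}$ is the normalization constant: $$ P_n(w_i)=f(w_i)^{3 / 4} / Z $$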
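
A minimal sketch of this noise distribution, using raw counts for f (the scale cancels in the normalisation). The toy counts, the function names, and the choice to exclude the positive context word from the draw are assumptions for illustration.

```python
import random
from collections import Counter


def noise_distribution(counts, power=0.75):
    """Return {word: P_n(word)} with P_n(w) = f(w)^power / Z."""
    weights = {w: c ** power for w, c in counts.items()}
    z = sum(weights.values())  # normalisation constant Z
    return {w: wt / z for w, wt in weights.items()}


def sample_negatives(dist, k, exclude, rng=random):
    """Draw k negative words from dist, never returning a word in `exclude`."""
    words = [w for w in dist if w not in exclude]
    probs = [dist[w] for w in words]
    return rng.choices(words, weights=probs, k=k)


counts = Counter({"the": 1000, "cat": 50, "sat": 30, "zygote": 1})
p_n = noise_distribution(counts)
negatives = sample_negatives(p_n, k=5, exclude={"cat"}, rng=random.Random(0))
print(p_n)
print(negatives)
```

Raising counts to the 3/4 power flattens the distribution: frequent words are drawn less often than their raw frequency would suggest, and rare words more often.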

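Training

The model is trained on the text8 corpus. By default, the word-vector dimension is 100, the vocabulary size is 50000, window_size is 5, and the number of negative samples is 10.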

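Evaluation

Based on linguistic properties of the word vectors
- similarity task: word similarity
- analogy task: word analogy (A-B=C-D)
Task-specific
- performance improvement on a concrete task

Here the evaluation is based on word similarity, using Spearman's rank correlation coefficient on WordSim-353, Stanford Rare Word (RW), and SimLex-999.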
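
The word-similarity evaluation can be sketched as follows: score each word pair by cosine similarity and compare the resulting ranking with the human ratings via Spearman's rank correlation. The embeddings and ratings below are made-up toy values, the function names are hypothetical, and skipping out-of-vocabulary pairs is an assumed convention; the actual procedure is the one in test.ipynb.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def spearman(xs, ys):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate(embeddings, pairs):
    """pairs: (word1, word2, human_score); out-of-vocabulary pairs are skipped."""
    model, human = [], []
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:
            model.append(cosine(embeddings[w1], embeddings[w2]))
            human.append(score)
    return spearman(model, human)


# Toy data, invented for this example only.
embeddings = {"cat": [1.0, 0.1], "dog": [0.9, 0.2], "car": [0.1, 1.0], "bus": [0.3, 0.8]}
pairs = [("cat", "dog", 9.0), ("car", "bus", 8.5), ("cat", "car", 2.0),
         ("dog", "bus", 3.0), ("cat", "unicorn", 5.0)]
print(evaluate(embeddings, pairs))
```

Because only the ranking matters, the scale of the human ratings (for example 0 to 10) does not need to match the range of cosine similarity.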

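Results

After 1 hour of training (4 epochs), with the model not yet fully converged, the results are as follows. For comparison, the table also lists Gensim Word2vec trained with default settings and GoogleNews-vectors-negative300: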

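| Model | WordSim353 | RW | SimLex-999 | Corpus | embed_dim | vocab_size | Time |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Gensim | 0.624 | 0.320 | 0.250 | text8 | 100 | 71290 | 1 min |
| SGNS-PyTorch | 0.661 | 0.343 | 0.265 | text8 | 100 | 50000 | 1 h |
| GoogleNews | 0.659 | 0.553 | 0.436 | GoogleNews | 300 | 3000000 | - |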

The test procedure and results are in test.ipynb.
