What is Word2vec?
There are some keywords about the model for you to consider: unsupervised learning, LSTM, encode(translate) a word to a vector. Actually, I don’t know exactly about the theory of word2vec. What I can tell you is that, using models training from a corpus by this method, you can find related words a specific word or a words list. In addition, the most famous point of this theory is that the vectors of retlated words translated by this model can be caculated. For example, the (vector of) king - man = queen - woman. That is, you can caculate Taiwan’s vector by France - Paris + Taipei.
To be installed
- pandas
- gensim
nltk (using , I am using python3.5 32bit, and I am not sure if it works for other edition of python)
1$ python -m pip install nltk
Test gensim word2cev function
Preprocess nltk predownload corpus(5 mins)
|
|
train them and see the result:
|
|
Result
0 | 1 | 2 | 3 | 4 | |
---|---|---|---|---|---|
mon_b | (‘imagin’, ‘0.9841’) | (‘job’, ‘0.9712’) | (‘learn’, ‘0.9674’) | (‘chant’, ‘0.9655’) | (‘child’, ‘0.9646’) |
mon_mr | (‘away’, ‘0.8631’) | (‘someon’, ‘0.8479’) | (‘rewind’, ‘0.8439’) | (‘wait’, ‘0.8361’) | (‘try’, ‘0.8236’) |
mon_t | (‘reg’, ‘0.9999’) | (‘ag’, ‘0.9999’) | (‘may’, ‘0.9999’) | (‘institut’, ‘0.9999’) | (‘stat’, ‘0.9999’) |
com_b | (‘county’, ‘0.9796’) | (‘highway’, ‘0.9754’) | (‘railroad’, ‘0.9752’) | (‘district’, ‘0.9744’) | (‘redevelop’, ‘0.969’) |
com_mr | (‘england’, ‘0.9799’) | (‘london’, ‘0.9774’) | (‘californ’, ‘0.9738’) | (‘mexico’, ‘0.9723’) | (‘chicago’, ‘0.9712’) |
com_t | (‘ad’, ‘0.9999’) | (‘could’, ‘0.9999’) | (‘may’, ‘0.9999’) | (‘ag’, ‘0.9999’) | (‘also’, ‘0.9999’) |
Start train Wiki corpus
Download wiki Corpus
- Download the same corpus with me from this link(about 13 GB )
- ps. I download this corpus at 2017/06/02. Also, you can download the latested one, find the link named: enwiki-latest-pages-articles.xml.bz2
Notice
Before doing the following works, I have to notice you. The following tasks will take you about more than 15 hours, during which the cpu of your computer will run up to 90% to 100%. Take my computer for example, my CPU is Intel(R) Core(TM) i7-6500 CPU, and Ram is 8.00GB. And all following tasks took me up to 15 hours, during which I put two fans around my computer to help cold down. As a result, you have to check your working environment first, then do the following tasks.
Preprocess Wiki Corpus(3.5 hours)
process the xml format wikipedia to text format using process_wiki.py, download it and put it into the directory where you save ‘enwiki-xxxx-pages-articles.xml.bz2’. Then, Open your command line: go to the directory, and type:
Output will be:
2017-06-03 07:06:22,577: INFO: Saved 10000 articles
2017-06-03 07:07:54,501: INFO: Saved 20000 articles
2017-06-03 07:09:07,157: INFO: Saved 30000 articles
2017-06-03 07:10:08,793: INFO: Saved 40000 articles
2017-06-03 07:10:58,331: INFO: Saved 50000 articles
2017-06-03 07:11:30,901: INFO: Saved 60000 articles
2017-06-03 07:12:02,807: INFO: Saved 70000 articles
...
2017-06-03 10:38:51,985: INFO: Saved 4270000 articles
2017-06-03 10:39:08,530: INFO: finished iterating over Wikipedia corpus of 4275675 documents with 2346581879 positions (total 17544907 articles, 2410007372 positions before pruning articles shorter than 50 words)
2017-06-03 10:39:08,538: INFO: Finished Saved 4275675 articles
Train the word2vec model(13.5 hours)
download train_word2vec_model.py and put it into the directory where you save ‘wiki.en.text’. Then, open your command line: go to the directory, and type:
I forget to record the output message, but it takes about 13 hours.
Use the word2vec model
|
|
result:
[('schoolteacher', 0.683679461479187),
('instructor', 0.6685980558395386),
('tutor', 0.667843759059906),
('educator', 0.6145995855331421),
('teachers', 0.6062737107276917),
('teaching', 0.5911847352981567),
('lecturer', 0.5808757543563843),
('librarian', 0.5734823346138),
('pupil', 0.571779727935791),
('professor', 0.5539975166320801),
('schoolmaster', 0.547014057636261),
('student', 0.545426607131958),
('pedagogue', 0.5427858233451843),
('headmaster', 0.5372023582458496),
('mentor', 0.5237945318222046),
('taught', 0.5192644596099854),
('headmistress', 0.5052664279937744),
('school', 0.500705361366272),
('scholar', 0.4987880289554596),
('classmate', 0.49570029973983765)]
Note
I modify the code from this page because there are some encoding problems in windows system when using chinese as first language.