Our BengWord2Vec is technically a Word2Vec model trained on 250 million Bengali articles. It is based on the https://github.com/Kyubyong/wordvectors implementation, with additional modifications and updates.
- nltk >= 1.11.1
- regex >= 2016.6.24
- lxml >= 3.3.3
- numpy >= 1.11.2
- gensim >= 0.13.1 (for Word2Vec)
- Check this to learn how to prepare a Word2Vec model for other languages such as Korean, Japanese, and French.
- STEP 1. Download these raw Bengali texts in case you don't want to collect all the raw text yourself.
- STEP 2. Extract the running texts into the `data/` folder.
- STEP 3. Run `build_corpus.py` using `python3 build_corpus.py`.
- STEP 4. Run `sudo ./make_wordvector.sh` to get the BengWord2Vec word vectors, but make sure it is executable first.
| Language | ISO 639-1 | Vector Size | Corpus Size | Vocabulary Size |
|---|---|---|---|---|
| Bengali (BengWord2Vec), Bengali (f) | bn | 300 | 250M | 30059 |
- Check this for the pre-trained BengWord2Vec model.