Wikipedia link points to "latest"

#1 by AngledLuffa - opened

Thank you for providing these word vectors. I tried using them for training an NER model using Stanza, and they are a huge improvement in terms of accuracy over the original fasttext vectors.

One minor nitpick: the link to Wikipedia points to the "latest" Wikipedia dump, which means that the link no longer represents the data used to build these vectors.

Hello,
Thanks for the feedback.
You are right. I built these vectors three years ago, using the Wikipedia dump that was the "latest" at that time, and I simply put the "latest" link in the card. The dump has grown a lot since then. I don't think training on the newer dump would hurt your results; if anything, it should improve them. Sorry for not keeping track of the exact data version.

Thanks for the explanation!

Can I ask a question in a slightly different direction? In terms of tokenization, would it be sufficient to separate words using whitespace and separate punctuation, or is there a more complicated technique needed for tokenizing Bangla? For example, English and German have single tokens which are effectively multiple words connected to each other, and Chinese of course requires segmentation with almost no guiding whitespace.

In terms of tokenization, whitespace- and punctuation-based splitting is enough for simple tasks. However, you will need a fairly large vocabulary, because Bengali has a very large number of unique word forms. You can check this library for different tokenization options:
https://github.com/sagorbrur/bnlp
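
For reference, here is a minimal sketch of the whitespace-plus-punctuation splitting described above, using only the Python standard library (the regex and the sample sentence are my own illustration, not something taken from bnlp or from these vectors):

```python
import re

def simple_bn_tokenize(text):
    """Split on whitespace and peel punctuation off as separate tokens.

    The Bengali danda (।) is included alongside the usual ASCII
    punctuation, since it terminates sentences.
    """
    # First alternative: runs of non-space, non-punctuation characters.
    # Second alternative: a single punctuation mark as its own token.
    return re.findall(r"[^\s.,;:!?।()-]+|[.,;:!?।()-]", text)

print(simple_bn_tokenize("আমি বাংলায় কথা বলি।"))
# ['আমি', 'বাংলায়', 'কথা', 'বলি', '।']
```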

You could also look at a subword-based tokenizer such as SentencePiece, but then you have to handle subwords consistently both when training the vectors and when looking them up.
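
As a rough illustration of the subword route (the file names, vocabulary size, and sample sentence below are placeholders I chose, not anything shipped with these vectors):

```python
import sentencepiece as spm

# Train a small SentencePiece model on a raw Bengali text file.
# "bn_corpus.txt" and vocab_size=8000 are placeholder choices.
spm.SentencePieceTrainer.train(
    input="bn_corpus.txt",
    model_prefix="bn_sp",
    vocab_size=8000,
    character_coverage=0.9995,  # keep rare Bengali characters
)

# The same model must be applied both when training subword vectors
# and when splitting words at lookup time, so the pieces line up.
sp = spm.SentencePieceProcessor(model_file="bn_sp.model")
print(sp.encode("আমি বাংলায় কথা বলি।", out_type=str))
```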

Thank you! Again, very helpful.

AngledLuffa changed discussion status to closed