Wikipedia link points to "latest"

#1 by AngledLuffa - opened

Thank you for providing these word vectors. I tried using them for training an NER model using Stanza, and they are a huge improvement in terms of accuracy over the original fasttext vectors.

One minor nitpick: the link to Wikipedia points to the "latest" Wikipedia dump, which means that the link no longer represents the data used to build these vectors.

Hello,
Thanks for the feedback.
You are right. I built these vectors three years ago, using the Wikipedia dump that was the "latest" at that time, and I simply put the "latest" link in the card. The dump has grown a lot since then. I don't think training on the newer dump would hurt your results; if anything, it should improve them. Sorry for not keeping track of the exact data version.

Thanks for the explanation!

Can I ask a question in a slightly different direction? In terms of tokenization, would it be sufficient to separate words using whitespace and separate punctuation, or is there a more complicated technique needed for tokenizing Bangla? For example, English and German have single tokens which are effectively multiple words connected to each other, and Chinese of course requires segmentation with almost no guiding whitespace.

In terms of tokenization, whitespace- and punctuation-based splitting is enough for simple tasks. However, you will need a fairly large vocabulary, because Bengali has a very large number of unique word forms. You can check this library for different tokenization options:
https://github.com/sagorbrur/bnlp
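
For reference, here is a minimal sketch of the whitespace-plus-punctuation splitting described above, using only the Python standard library (the regex and the sample sentence are my own illustration, not something taken from bnlp or from these vectors):

```python
import re

def simple_bn_tokenize(text):
    """Split on whitespace and peel punctuation off as separate tokens.

    The Bengali danda (।) is included alongside the usual ASCII
    punctuation, since it terminates sentences.
    """
    # First alternative: runs of non-space, non-punctuation characters.
    # Second alternative: a single punctuation mark as its own token.
    return re.findall(r"[^\s.,;:!?।()-]+|[.,;:!?।()-]", text)

print(simple_bn_tokenize("আমি বাংলায় কথা বলি।"))
# ['আমি', 'বাংলায়', 'কথা', 'বলি', '।']
```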

You could also look at a subword-based tokenizer such as SentencePiece, but then you have to handle subwords consistently both when training the vectors and when looking them up.
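
As a rough illustration of the subword route (the file names, vocabulary size, and sample sentence below are placeholders I chose, not anything shipped with these vectors):

```python
import sentencepiece as spm

# Train a small SentencePiece model on a raw Bengali text file.
# "bn_corpus.txt" and vocab_size=8000 are placeholder choices.
spm.SentencePieceTrainer.train(
    input="bn_corpus.txt",
    model_prefix="bn_sp",
    vocab_size=8000,
    character_coverage=0.9995,  # keep rare Bengali characters
)

# The same model must be applied both when training subword vectors
# and when splitting words at lookup time, so the pieces line up.
sp = spm.SentencePieceProcessor(model_file="bn_sp.model")
print(sp.encode("আমি বাংলায় কথা বলি।", out_type=str))
```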

Thank you! Again, very helpful.

AngledLuffa changed discussion status to closed