Indic-Transformers Bengali DistilBERT
This is a DistilBERT language model pre-trained on ~6 GB of monolingual training corpus. The pre-training data was majorly taken from OSCAR. This model can be fine-tuned on various downstream tasks like text-classification, POS-tagging, question-answering, etc. Embeddings from this model can also be used for feature-based training.
Intended uses & limitations
How to use
from transformers import AutoTokenizer, AutoModel tokenizer = AutoTokenizer.from_pretrained('neuralspace-reverie/indic-transformers-bn-distilbert') model = AutoModel.from_pretrained('neuralspace-reverie/indic-transformers-bn-distilbert') text = "আপনি কেমন আছেন?" input_ids = tokenizer(text, return_tensors='pt')['input_ids'] out = model(input_ids) print(out.shape) # out = [1, 5, 768]
Limitations and bias
The original language model has been trained using
PyTorch and hence the use of
pytorch_model.bin weights file is recommended. The h5 file for
Tensorflow has been generated manually by commands suggested here.
- Downloads last month