Datasets for Base and Instruction Models
#1 opened by meherajj
I noticed from the titles of the models that you’ve used datasets like uonlp-culturax and Bangla 2B+ BERT for base models, and the bangla-alpaca-orca dataset for instruction models. Could you specify the portions of these datasets that were used for training? Also, did you rely solely on the bangla-alpaca-orca dataset for instruction tuning?
I believe we have only one model so far that uses Bangla 2B+ BERT; we are primarily focusing on uonlp/CulturaX. For pretraining, I think we used 100% of the bn subset (about 12.4 million text rows). For fine-tuning with bangla-alpaca-orca, we used 5% for validation and the rest for instruction fine-tuning/training.
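
For reference, here is a minimal sketch of how that setup could look with the Hugging Face `datasets` library. The exact Hub repo id for bangla-alpaca-orca, the seed, and the split call are assumptions for illustration, not our actual training code:

```python
from datasets import load_dataset

# Pretraining corpus: the full Bengali ("bn") subset of uonlp/CulturaX
# (roughly 12.4M text rows).
pretrain_ds = load_dataset("uonlp/CulturaX", "bn", split="train")

# Instruction tuning: bangla-alpaca-orca, with 5% held out for
# validation and the rest used for training.
# NOTE: the Hub namespace below is an assumption; check the actual repo id.
instruct_ds = load_dataset("BanglaLLM/bangla-alpaca-orca", split="train")
split = instruct_ds.train_test_split(test_size=0.05, seed=42)
train_ds, val_ds = split["train"], split["test"]
```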