Difference between indolem/indobert-base-uncased and indobenchmark/indobert-base-p1

#1
by 99sbr - opened

I see two IndoBERT models on HF and wanted to understand how indobenchmark/indobert-base-p1 differs from indolem/indobert-base-uncased. I want to use one of them for my pre-training task.

Indo Benchmark org

Hi there, so basically the pretraining datasets for indobenchmark/indobert-base-p1 and indolem/indobert-base-uncased are completely different.
indobenchmark/indobert-base-p1 is trained on the Indo4B corpus, which contains around 3.6B words, while indolem/indobert-base-uncased is trained on Wikipedia, news articles, and the Indonesian Web Corpus, with a total of 220M words. The pretraining hyperparameters of each model are also different.

We also have an ongoing project that compares several Indonesian models on multiple downstream tasks (since the project is still ongoing, we cannot share the detailed results yet). Based on our results so far, we found that indobenchmark/indobert-base-p1 outperforms indolem/indobert-base-uncased on most of the tasks. Additionally, there is indobenchmark/indobert-large-p1, which is a larger version of indobenchmark/indobert-base-p1. So, if better evaluation results are critical for your use case, I would suggest using indobenchmark/indobert-base-p1. A minimal loading sketch is shown below.
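If it's useful, here is a minimal sketch of loading indobenchmark/indobert-base-p1 with the transformers library; the masked-LM head and the example sentence are just illustrative assumptions for a continued-pretraining setup, not part of the official recipe:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the IndoBERT base model pretrained on the Indo4B corpus (~3.6B words).
model_name = "indobenchmark/indobert-base-p1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Quick sanity check: tokenize an Indonesian sentence and run a forward pass.
inputs = tokenizer("Ibu kota Indonesia adalah Jakarta.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.logits.shape)  # (batch_size, sequence_length, vocab_size)
```

Swapping in "indolem/indobert-base-uncased" as `model_name` works the same way, so you can compare both checkpoints on your own task before committing to one.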

Hope it helps!

Thanks a lot!! Looking forward to the details of the comparative analysis project being published soon.
