English-Doc-Topic-BERT
English-Doc-Topic-BERT is a BERT-Base-uncased model fine-tuned on English documents from the L3Cube-IndicNews Corpus ([dataset link](https://github.com/l3cube-pune/indic-nlp)).
The dataset comprises three sub-datasets, LDC (Long Document Classification), LPC (Long Paragraph Classification), and SHC (Short Headlines Classification), each with a different document length.
This model is trained on a combination of all three variants and works well across different document sizes.
More details on the dataset, models, and baseline results can be found in our [paper](https://arxiv.org/abs/2401.02254).
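Since the model is described as handling headlines, paragraphs, and full documents alike, a short usage sketch may help. The following is a minimal example, not an official snippet: it assumes the Hugging Face `transformers` library is installed, and the repository ID in `MODEL_ID` is a guess that this card does not state, so replace it with the ID shown on the model page. The helper names `load_classifier` and `classify_topics` are hypothetical, and the label names returned depend on the model's own configuration.

```python
from typing import Callable, Dict, List

# ASSUMPTION: hypothetical repository ID, not confirmed by this card.
# Replace it with the ID shown on the model page.
MODEL_ID = "l3cube-pune/english-topic-all-doc"


def load_classifier(model_id: str = MODEL_ID):
    """Build a text-classification pipeline for the given model ID."""
    # Imported lazily so the rest of this module works without transformers.
    from transformers import pipeline

    return pipeline("text-classification", model=model_id)


def classify_topics(
    texts: List[str],
    classifier: Callable[..., List[Dict]],
) -> List[str]:
    """Return the top predicted label for each input text."""
    if not texts:
        return []
    # BERT-Base accepts at most 512 tokens, so longer documents are truncated.
    predictions = classifier(texts, truncation=True, max_length=512)
    return [prediction["label"] for prediction in predictions]


# Example usage (downloads the model on first run):
#   classifier = load_classifier()
#   labels = classify_topics(
#       ["Stock markets close higher after budget announcement"],
#       classifier,
#   )
#   print(labels)
```

Truncating at 512 tokens is the simplest way to feed long documents to a BERT-Base model. Whether the authors truncated or chunked long inputs during training is not stated here, so check the paper before relying on this for LDC-length documents.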
Citing:
@article{mirashi2024l3cube,
title={L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages},
author={Mirashi, Aishwarya and Sonavane, Srushti and Lingayat, Purva and Padhiyar, Tejas and Joshi, Raviraj},
journal={arXiv preprint arXiv:2401.02254},
year={2024}
}
Other document topic models for different Indic languages are listed below:
Hindi-Doc-Topic-BERT
Marathi-Doc-Topic-BERT
Bengali-Doc-Topic-BERT
Telugu-Doc-Topic-BERT
Tamil-Doc-Topic-BERT
Gujarati-Doc-Topic-BERT
Kannada-Doc-Topic-BERT
Odia-Doc-Topic-BERT
Malayalam-Doc-Topic-BERT
Punjabi-Doc-Topic-BERT
English-Doc-Topic-BERT