Update README.md
README.md
CHANGED
@@ -1,3 +1,7 @@
+---
+language: ta
+---
+
 # TaMillion
 
 This is a first attempt at a Tamil language model trained with
@@ -7,7 +11,7 @@ Tokenization and pre-training CoLab: https://colab.research.google.com/drive/1Gn
 
 V2 (current): 190,000 steps; (V1 was 100,000 steps)
 
-##
+## Classification
 
 Sudalai Rajkumar's Tamil-NLP page contains classification and regression tasks:
 https://www.kaggle.com/sudalairajkumar/tamil-nlp
@@ -22,6 +26,11 @@ The model slightly outperformed mBERT on movie reviews:
 
 Equivalent accuracy on the Tirukkural topic task.
 
+## Question Answering
+
+I didn't find a Tamil-language question answering dataset, but this model could be used
+to train a QA model. See Hindi and Bengali examples here: https://colab.research.google.com/drive/1i6fidh2tItf_-IDkljMuaIGmEU6HT2Ar
+
 ## Corpus
 
 Trained on a web crawl from https://oscar-corpus.com/ (deduped version, 5.1GB) and 1 July 2020 dump of ta.wikipedia.org (476MB)
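
Note on the first hunk: the `---` / `language: ta` / `---` lines added at the top of README.md form a front-matter metadata block. A minimal sketch of how such a block could be read is below. It is not the hosting site's own parser: `read_front_matter` is a hypothetical helper, it uses only the Python standard library, and it assumes flat `key: value` pairs between the two `---` lines (no nested YAML).

```python
def read_front_matter(text):
    """Return flat key/value pairs from a leading '---' block, or {} if none."""
    lines = text.splitlines()
    # The block must open on the very first line of the file.
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta  # closing delimiter reached
        key, sep, value = line.partition(":")
        if sep:  # skip lines that have no colon
            meta[key.strip()] = value.strip()
    return {}  # no closing delimiter: treat as no front matter


# First lines of README.md as they stand after this commit.
readme_head = "---\nlanguage: ta\n---\n\n# TaMillion\n"
print(read_front_matter(readme_head))
```

Run against the pre-commit README, which starts directly with `# TaMillion`, the same helper returns an empty dict, which is one way to confirm what this commit changes.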