system HF staff committed on
Commit
5229414
1 Parent(s): ecf0125

Update README.md

Files changed (1)
  1. README.md +10 -1
README.md CHANGED
@@ -1,3 +1,7 @@
+---
+language: ta
+---
+
 # TaMillion
 
 This is a first attempt at a Tamil language model trained with
@@ -7,7 +11,7 @@ Tokenization and pre-training CoLab: https://colab.research.google.com/drive/1Gn
 
 V2 (current): 190,000 steps; (V1 was 100,000 steps)
 
-## Usage
+## Classification
 
 Sudalai Rajkumar's Tamil-NLP page contains classification and regression tasks:
 https://www.kaggle.com/sudalairajkumar/tamil-nlp
@@ -22,6 +26,11 @@ The model slightly outperformed mBERT on movie reviews:
 
 Equivalent accuracy on the Tirukkural topic task.
 
+## Question Answering
+
+I didn't find a Tamil-language question answering dataset, but this model could be used
+to train a QA model. See Hindi and Bengali examples here: https://colab.research.google.com/drive/1i6fidh2tItf_-IDkljMuaIGmEU6HT2Ar
+
 ## Corpus
 
 Trained on a web crawl from https://oscar-corpus.com/ (deduped version, 5.1GB) and 1 July 2020 dump of ta.wikipedia.org (476MB)