# Releasing Hindi ELECTRA model

This is a first attempt at a Hindi language model trained with Google Research's [ELECTRA](https://github.com/google-research/electra). **I don't modify ELECTRA until we get into fine-tuning.**

Tokenization and training CoLab: https://colab.research.google.com/drive/1R8TciRSM7BONJRBc9CBZbzOmz39FTLl_

Blog post: https://medium.com/@mapmeld/teaching-hindi-to-electra-b11084baab81

I was greatly influenced by: https://huggingface.co/blog/how-to-train

## Corpus

Download: https://drive.google.com/drive/u/1/folders/1WikYHHMI72hjZoCQkLPr45LDV8zm9P7p

The corpus is two files:
- Hindi CommonCrawl, deduplicated by OSCAR: https://traces1.inria.fr/oscar/
- the latest Hindi Wikipedia dump ( https://dumps.wikimedia.org/hiwiki/20200420/ ), converted to plain text with WikiExtractor (see the sketch after this list)
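
Roughly, the Wikipedia step looks like this (a sketch only; the file names, output directory, and single-file concatenation are my assumptions, not the exact commands used):

```python
# Sketch: extract the hiwiki dump with WikiExtractor, then join the shards into one txt file.
# File names and the output directory are placeholders.
import glob
import subprocess

subprocess.run([
    "python", "WikiExtractor.py",              # from the cloned wikiextractor repo
    "-o", "hiwiki_extracted",                  # directory of extracted wiki_* shards
    "hiwiki-20200420-pages-articles.xml.bz2",  # the dump linked above
], check=True)

# Concatenate the extracted shards into a single corpus file
with open("hiwiki.txt", "w", encoding="utf-8") as out:
    for path in sorted(glob.glob("hiwiki_extracted/*/wiki_*")):
        with open(path, encoding="utf-8") as shard:
            out.write(shard.read())
```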

Bonus notes:
- Adding English wiki text or a parallel corpus could help with cross-lingual tasks and training

## Vocabulary

https://drive.google.com/file/d/1-02Um-8ogD4vjn4t-wD2EwCE-GtBjnzh/view?usp=sharing

Bonus notes:
- Created with HuggingFace Tokenizers; the vocab could be longer or shorter, so review ELECTRA's vocab_size param (see the sketch after this list)
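
Roughly, the vocab step with HuggingFace Tokenizers could look like this (a sketch assuming a BERT-style WordPiece vocab, which is what ELECTRA's vocab.txt expects; corpus file names and vocab_size are placeholders):

```python
# Sketch of the vocab step; corpus file names and vocab_size are placeholders.
from tokenizers import BertWordPieceTokenizer

tokenizer = BertWordPieceTokenizer(lowercase=False)  # leave the Devanagari text unmodified
tokenizer.train(
    files=["hi_dedup.txt", "hiwiki.txt"],  # the two corpus files described above
    vocab_size=30000,                      # must match ELECTRA's vocab_size param
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.save_model(".")                  # writes vocab.txt
```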

## Pretrain TF Records

[build_pretraining_dataset.py](https://github.com/google-research/electra/blob/master/build_pretraining_dataset.py) splits the corpus into training documents.

Set the ELECTRA model size and whether to split the corpus by newlines. This process can take hours on its own.
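
A sketch of how this step is invoked from the cloned electra/ directory (paths are placeholders; the flag names come from the upstream script, so double-check them against the current repo):

```python
# Sketch only: paths are placeholders, flag names come from the upstream ELECTRA script.
import subprocess

subprocess.run([
    "python", "build_pretraining_dataset.py",
    "--corpus-dir", "../trainer/corpus",              # directory holding the corpus .txt files
    "--vocab-file", "../trainer/vocab.txt",
    "--output-dir", "../trainer/pretrain_tfrecords",
    "--max-seq-length", "128",                        # typical for a small model
    "--num-processes", "4",
    "--blanks-separate-docs", "True",                 # the newline-split option mentioned below
], check=True)
```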

https://drive.google.com/drive/u/1/folders/1--wBjSH59HSFOVkYi4X-z5bigLnD32R5

Bonus notes:
- I am not sure what the corpus newline split actually does (what is the alternative?), or which setting produces better training documents for this corpus

## Training

Structure your files, with the data-dir named "trainer" here:

```
trainer
- vocab.txt
- pretrain_tfrecords
-- (all .tfrecord... files)
- models
-- modelname
--- checkpoint
--- graph.pbtxt
--- model.*
```

The CoLab notebook gives examples of GPU vs. TPU setup.

Pretraining hyperparameters are set in [configure_pretraining.py](https://github.com/google-research/electra/blob/master/configure_pretraining.py).
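
A sketch of kicking off pretraining from the cloned electra/ directory (the model name and hparam values are illustrative, not the exact settings used here; see configure_pretraining.py for the full list and defaults):

```python
# Sketch only: the model name and hparams below are illustrative.
import json
import subprocess

hparams = {
    "model_size": "small",      # "small", "base", or "large"
    "vocab_size": 30000,        # must match vocab.txt
    "num_train_steps": 1000000,
}

subprocess.run([
    "python", "run_pretraining.py",
    "--data-dir", "../trainer",          # the data-dir structured above
    "--model-name", "hindi-electra",     # placeholder name; checkpoints land in models/<model-name>
    "--hparams", json.dumps(hparams),
], check=True)
```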

Model: https://drive.google.com/drive/folders/1cwQlWryLE4nlke4OixXA7NK8hzlmUR0c?usp=sharing

## Using this model with Transformers

Sample movie reviews classifier: https://colab.research.google.com/drive/1mSeeSfVSOT7e-dVhPlmSsQRvpn6xC05w

It slightly outperforms Multilingual BERT on the Hindi movie reviews from https://github.com/sid573/Hindi_Sentiment_Analysis
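
As a quick start, loading the checkpoint with Transformers for classification could look like the sketch below; the model ID is a placeholder for this repository's ID, and the classification head still needs fine-tuning on labeled reviews (as in the notebook above) before its outputs mean anything:

```python
# Sketch: the model ID is a placeholder for this repo's ID on the Hub;
# the classification head is randomly initialized until fine-tuned.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "namespace/hindi-electra"  # placeholder: replace with this repository's model ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("यह फिल्म बहुत अच्छी थी", return_tensors="pt")  # "This movie was very good"
logits = model(**inputs).logits
print(logits)
```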