rahular committed
Commit 9f4c911
1 Parent(s): 46108ea

Update README.md

Files changed (1)
  1. README.md +15 -19
README.md CHANGED
@@ -11,21 +11,17 @@
 ### Model Description
 
 <!-- Provide a longer summary of what this model is. -->
- Varta BERT is a model pre-trained on full training set from Varta on English and 14 Indic languages (Assamese, Bhojpuri, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Oriya, Punjabi, Tamil, Telugu, and Urdu) using a masked language modeling (MLM) objective from scratch.
- Varta is a large-scale headline-generation dataset for Indic languages, including 41.8 million news articles in 14 different Indic languages (and English), which come from a variety of high-quality sources.
 
-
- The dataset and the model were introduced in [this paper](https://arxiv.org/pdf/2305.05858.pdf). The code was released in [this repository](https://github.com/rahular/varta). The data was released in [this bucket](https://console.cloud.google.com/storage/browser/varta-eu/data-release).
 
 
 ## Uses
-
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 
- You can use the raw model for masked language modelling, but it's mostly intended to be fine-tuned on a downstream task. <br>
-
-
- Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at our Varta T5 model.
 
 ## Bias, Risks, and Limitations
 
@@ -55,22 +51,22 @@ model = AutoModelForMaskedLM.from_pretrained("rahular/varta-bert")
 
 ### Training Data
 Varta contains 41.8 million high-quality news articles in 14 Indic languages and English.
- With 34.5 million non-English article-headline pairs, it is the largest headline-generation dataset of its kind.
 
 ### Pretraining
- We pretrain the Varta-BERT model using the standard BERT-Base architecture with 12 encoder layers.
- We train with a maximum sequence length of 512 tokens with an embedding dimension of 768.
- We use 12 attention heads with feed-forward width of 3072.
- To support all the 15 languages in dataset we use a wordpiece vocabulary of size 128K.
- In total, the model has 184M parameters. The model is trained with AdamW optimizer with alpha=0.9 and beta=0.98.
- We use an initial learning rate of 1e-4 with a warm-up of 10K steps and linearly decay the learning rate till the end of training.
- We train the model for a total of 1M steps which takes 10 days to finish.
- We use an effective batch size of 4096 and train the model on TPU v3-128 chips.
 
 <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
 ### Evaluation Results
- To come.
 
 ## Citation
 
 
 ### Model Description
 
 <!-- Provide a longer summary of what this model is. -->
+ Varta-BERT is a model pre-trained on the `full` training set of Varta in 14 Indic languages (Assamese, Bhojpuri, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Oriya, Punjabi, Tamil, Telugu, and Urdu) and English, using a masked language modeling (MLM) objective.
 
+ Varta is a large-scale news corpus for Indic languages, including 41.8 million news articles in 14 different Indic languages (and English), which come from a variety of high-quality sources.
+ The dataset and the model are introduced in [this paper](https://arxiv.org/abs/2305.05858). The code is released in [this repository](https://github.com/rahular/varta). The data is released in [this bucket](https://console.cloud.google.com/storage/browser/varta-eu/data-release).
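As a quick illustration (not part of the original card), the pre-trained checkpoint can be queried for masked language modelling with the Hugging Face `transformers` API. The sketch below assumes the standard BERT `[MASK]` token and uses a Hindi example sentence of our own.

```python
from transformers import pipeline

# Fill-mask sketch with the rahular/varta-bert checkpoint named in this card.
# The example sentence ("Delhi is the capital of India.") is illustrative only.
fill_mask = pipeline("fill-mask", model="rahular/varta-bert")

for prediction in fill_mask("दिल्ली भारत की [MASK] है।"):
    print(prediction["token_str"], round(prediction["score"], 3))
```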
 
 
 ## Uses
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+ You can use the raw model for masked language modelling, but it is mostly intended to be fine-tuned on a downstream task.
 
+ Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation, you should look at our [Varta-T5](https://huggingface.co/rahular/varta-t5) model.
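For illustration only (our sketch, not from the card), fine-tuning would typically start by loading the checkpoint with a task-specific head; here a hypothetical 3-label sequence classification setup:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hypothetical downstream task: 3-way sequence classification.
# The encoder weights come from Varta-BERT; the classification head is
# randomly initialised and must be fine-tuned on task data.
tokenizer = AutoTokenizer.from_pretrained("rahular/varta-bert")
model = AutoModelForSequenceClassification.from_pretrained(
    "rahular/varta-bert", num_labels=3
)

inputs = tokenizer("यह एक उदाहरण वाक्य है।", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 3])
```

The actual training loop (for example with `transformers.Trainer`) is left out for brevity.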
 
 ## Bias, Risks, and Limitations
 
 
 
 ### Training Data
 Varta contains 41.8 million high-quality news articles in 14 Indic languages and English.
+ With 34.5 million non-English article-headline pairs, it is the largest document-level dataset of its kind.
 
 ### Pretraining
+ - We pretrain the Varta-BERT model using the standard BERT-Base architecture with 12 encoder layers.
+ - We train with a maximum sequence length of 512 tokens and an embedding dimension of 768.
+ - We use 12 attention heads with a feed-forward width of 3072.
+ - To support all 15 languages in the dataset, we use a WordPiece vocabulary of size 128K.
+ - In total, the model has 184M parameters. The model is trained with the AdamW optimizer with β1=0.9 and β2=0.98.
+ - We use an initial learning rate of 1e-4 with a warm-up of 10K steps and linearly decay the learning rate until the end of training.
+ - We train the model for a total of 1M steps, which takes 10 days to complete.
+ - We use an effective batch size of 4096 and train the model on TPU v3-128 chips.
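For orientation, here is a rough `transformers` `BertConfig` mirroring the architecture numbers in the list above. This is our reconstruction for illustration only; the authoritative values live in the checkpoint's own `config.json`.

```python
from transformers import BertConfig

# Approximate Varta-BERT architecture as described above (illustrative only;
# load rahular/varta-bert's config.json for the exact values).
config = BertConfig(
    vocab_size=128_000,           # ~128K WordPiece vocabulary covering 15 languages
    hidden_size=768,              # embedding dimension
    num_hidden_layers=12,         # BERT-Base encoder depth
    num_attention_heads=12,
    intermediate_size=3072,       # feed-forward width
    max_position_embeddings=512,  # maximum sequence length
)
print(config)
```

With these sizes the embedding table alone accounts for roughly 98M (128K × 768) of the reported 184M parameters, so the large multilingual vocabulary dominates the model size.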
 
 <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
 ### Evaluation Results
+ Please see [the paper](https://arxiv.org/pdf/2305.05858.pdf).
 
 ## Citation