Mavkif committed on
Commit
2f43805
·
verified ·
1 Parent(s): eb8e2d8

Update README.md

  1. README.md +123 -4
README.md CHANGED
@@ -1,7 +1,7 @@
1
  ---
2
- metrics: null
3
- Recall @10: 0.438
4
- MRR @10: 0.247
5
  base_model:
6
  - unicamp-dl/mt5-base-mmarco-v2
7
  tags:
@@ -9,4 +9,123 @@ tags:
9
  - Natural Language Processing
10
  - Question Answering
11
  license: apache-2.0
12
- ---
1
  ---
2
+ metrics:
3
+ - Recall @10 0.438
4
+ - MRR @10 0.247
5
  base_model:
6
  - unicamp-dl/mt5-base-mmarco-v2
7
  tags:
 
9
  - Natural Language Processing
10
  - Question Answering
11
  license: apache-2.0
12
+ ---
13
+
14
+ # Urdu-mT5-mmarco: Fine-Tuned mT5 Model for Urdu Information Retrieval
15
+
16
+ As part of ongoing efforts to make Information Retrieval (IR) more inclusive, this model addresses the needs of low-resource languages, focusing specifically on Urdu.
17
+ We created this model by translating the MS MARCO dataset into Urdu using the IndicTrans2 model.
18
+ To establish a baseline, we first evaluated the unicamp-dl/mt5-base-mmarco-v2 model zero-shot on Urdu IR,
19
+ and then fine-tuned it on the translated dataset following the mMARCO multilingual IR methodology.
20
+
21
+ ## Model Details
22
+
23
+ ### Model Description
24
+
25
+ <!-- Provide a longer summary of what this model is. -->
26
+
27
+
28
+
29
+ - **Developed by:** Umer Butt
30
+ - **Model type:** MT5ForConditionalGeneration
31
+ - **Language(s) (NLP):** Urdu (implemented in Python/PyTorch)
32
+
33
+
34
+
35
+ ## Uses
36
+
37
+
38
+
39
+ ### Direct Use
40
+
41
+
42
+
43
+
44
+ ## Bias, Risks, and Limitations
45
+
46
+ Although this model performs well and is, to our knowledge, currently state of the art for Urdu IR, it is fine-tuned from the mMARCO model on a machine-translated dataset (created with the IndicTrans2 model). The limitations of both the base model and the translation therefore apply here too.
47
+
48
+ ### Recommendations
49
+
50
+
51
+ ## How to Get Started with the Model
52
+
53
+ Use the code below to get started with the model.
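The following is a minimal sketch of scoring a query-passage pair with the `transformers` library, not the author's own usage code. It assumes the checkpoint follows the monoT5-style setup of its base model: the prompt template `Query: ... Document: ... Relevant:` and the target tokens `▁yes` / `▁no` are assumptions carried over from that setup and should be checked against the training configuration. The helper names `build_prompt`, `load_reranker`, and `relevance_score` are hypothetical.

```python
def build_prompt(query: str, passage: str) -> str:
    """monoT5-style input text. The template is an assumption for this checkpoint."""
    return f"Query: {query} Document: {passage} Relevant:"


def load_reranker(name: str = "Mavkif/urdu-mt5-mmarco"):
    """Download the tokenizer and model from the Hugging Face Hub."""
    # Imported lazily so build_prompt stays usable without the heavy dependencies.
    from transformers import AutoTokenizer, MT5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(name)
    model = MT5ForConditionalGeneration.from_pretrained(name).eval()
    return tokenizer, model


def relevance_score(query, passage, tokenizer, model,
                    token_true="▁yes", token_false="▁no") -> float:
    """Log-probability of the 'relevant' token at the first decoding step.

    token_true / token_false are assumed to match the base model's targets.
    """
    import torch

    true_id = tokenizer.convert_tokens_to_ids(token_true)
    false_id = tokenizer.convert_tokens_to_ids(token_false)
    inputs = tokenizer(build_prompt(query, passage), return_tensors="pt",
                       truncation=True, max_length=512)
    # Feed only the decoder start token and read the logits for the next token.
    start = torch.full((1, 1), model.config.decoder_start_token_id, dtype=torch.long)
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=start).logits[0, 0]
    # Normalise over the two target tokens only, as monoT5-style rerankers do.
    pair = torch.stack([logits[false_id], logits[true_id]])
    return torch.log_softmax(pair, dim=0)[1].item()
```

With `tokenizer, model = load_reranker()`, candidate passages for an Urdu query could then be reranked by sorting them on `relevance_score(query, passage, tokenizer, model)` in descending order.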
54
+
55
+
56
+
57
+ ## Evaluation
58
+
59
+ Evaluation used the following scripts from the pygaggle library:
60
+ - `evaluate_monot5_reranker.py`
61
+ - `ms_marco_eval.py`
62
+
63
+ #### Metrics
64
+ Following the mMARCO work, the same two metrics were used:
65
+
66
+ - Recall@10: 0.438
67
+ - MRR@10: 0.247
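For readers unfamiliar with the two metrics, here is a small sketch of their usual definitions. It is not the code in `ms_marco_eval.py`, whose details may differ; the function names and the toy passage ids are illustrative only. MRR@10 averages the reciprocal rank of the first relevant passage within the top 10 (0 if none appears), and Recall@10 averages the fraction of relevant passages found in the top 10.

```python
def mrr_at_k(ranked, relevant, k=10):
    """Mean reciprocal rank of the first relevant passage in the top k."""
    total = 0.0
    for qid, rel in relevant.items():
        for rank, pid in enumerate(ranked.get(qid, [])[:k], start=1):
            if pid in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(relevant)


def recall_at_k(ranked, relevant, k=10):
    """Mean fraction of each query's relevant passages found in the top k."""
    total = 0.0
    for qid, rel in relevant.items():
        top = set(ranked.get(qid, [])[:k])
        total += len(top & rel) / len(rel)
    return total / len(relevant)


# Toy data: q1 has its relevant passage at rank 2, q2 misses it entirely.
ranked = {"q1": ["p3", "p1", "p2"], "q2": ["p9", "p8"]}
relevant = {"q1": {"p1"}, "q2": {"p7"}}

print(mrr_at_k(ranked, relevant))     # 0.25
print(recall_at_k(ranked, relevant))  # 0.5
```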
68
+
69
+
70
+ ### Results
71
+ #### Detailed Results
72
+
73
+ | Model | Description | Data | Recall@10 | MRR@10 | Queries Ranked |
74
+ |---------------------------------------|---------------------------------------|--------------|-----------|--------|----------------|
75
+ | bm25 (k = 1000) | BM25 baseline from the mMARCO paper | English data | 0.391 | 0.187 | 6980 |
76
+ | unicamp-dl/mt5-base-mmarco-v2 | mMARCO reranker baseline from the paper | English data | | 0.370 | 6980 |
77
+ | bm25 (k = 1000) | BM25 | Urdu data | 0.2675 | 0.129 | 6980 |
78
+ | unicamp-dl/mt5-base-mmarco-v2 | Zero-shot mMARCO reranker | Urdu data | 0.408 | 0.204 | 6980 |
79
+ | Mavkif/urdu-mt5-mmarco | This work (fine-tuned) | Urdu data | 0.438 | 0.247 | 6980 |
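As a quick reading aid for the table, the sketch below computes the relative gain of the fine-tuned model over the zero-shot baseline on the Urdu data, using only the numbers reported above. The gains come out to roughly 7.4% for Recall@10 and 21.1% for MRR@10. The dictionary layout and the `relative_gain` helper are illustrative.

```python
# Urdu-data rows from the results table.
results = {
    "zero_shot": {"recall@10": 0.408, "mrr@10": 0.204},
    "fine_tuned": {"recall@10": 0.438, "mrr@10": 0.247},
}


def relative_gain(new: float, old: float) -> float:
    """Percentage improvement of new over old."""
    return 100.0 * (new - old) / old


gains = {
    metric: relative_gain(results["fine_tuned"][metric], results["zero_shot"][metric])
    for metric in results["fine_tuned"]
}

for metric, gain in gains.items():
    print(f"{metric}: +{gain:.1f}%")
```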
80
+
81
+
82
+ #### Summary
83
+
84
+
85
+
86
+ ### Model Architecture and Objective
87
+ From `config.json`:
88
+
89
+ {
90
+ "_name_or_path": "unicamp-dl/mt5-base-mmarco-v2",
91
+ "architectures": [
92
+ "MT5ForConditionalGeneration"
93
+ ],
94
+ "classifier_dropout": 0.0,
95
+ "d_ff": 2048,
96
+ "d_kv": 64,
97
+ "d_model": 768,
98
+ "decoder_start_token_id": 0,
99
+ "dense_act_fn": "gelu_new",
100
+ "dropout_rate": 0.1,
101
+ "eos_token_id": 1,
102
+ "feed_forward_proj": "gated-gelu",
103
+ "initializer_factor": 1.0,
104
+ "is_encoder_decoder": true,
105
+ "is_gated_act": true,
106
+ "layer_norm_epsilon": 1e-06,
107
+ "model_type": "mt5",
108
+ "num_decoder_layers": 12,
109
+ "num_heads": 12,
110
+ "num_layers": 12,
111
+ "output_past": true,
112
+ "pad_token_id": 0,
113
+ "relative_attention_max_distance": 128,
114
+ "relative_attention_num_buckets": 32,
115
+ "tie_word_embeddings": false,
116
+ "tokenizer_class": "T5Tokenizer",
117
+ "torch_dtype": "float32",
118
+ "transformers_version": "4.38.2",
119
+ "use_cache": true,
120
+ "vocab_size": 250112
121
+ }
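To make a couple of these fields concrete, the sketch below reads a subset of the values quoted above and derives two quantities: the attention inner dimension (`num_heads * d_kv`, which here equals `d_model`) and the number of parameters tied to the vocabulary. It assumes the usual T5/mT5 layout, in which `tie_word_embeddings: false` means the output head has its own `vocab_size x d_model` matrix in addition to the shared input embeddings.

```python
import json

# Subset of the config.json fields quoted above.
config = json.loads("""
{
  "d_kv": 64,
  "d_model": 768,
  "num_heads": 12,
  "vocab_size": 250112,
  "tie_word_embeddings": false
}
""")

# Attention projects d_model into num_heads heads of width d_kv.
attn_inner_dim = config["num_heads"] * config["d_kv"]

# One vocab x d_model matrix for the input embeddings, plus a second one
# for the output head when the embeddings are not tied (assumed layout).
embedding_matrices = 1 if config["tie_word_embeddings"] else 2
vocab_params = embedding_matrices * config["vocab_size"] * config["d_model"]

print(attn_inner_dim)  # 768
print(vocab_params)    # 384172032
```

Under that assumption, the large multilingual vocabulary accounts for a substantial share of the model's parameters.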
122
+
123
+
124
+ ## Model Card Authors
125
+
126
+ Umer Butt
127
+
128
+
129
+ ## Model Card Contact
130
+
131
+ mumertbutt@gmail.com