---
language:
- en
- ar
- zh
- nl
- fr
- de
- hi
- in
- it
- ja
- pt
- ru
- es
- vi
datasets:
- unicamp-dl/mmarco
widget:
- text: "Python ist eine universelle, üblicherweise interpretierte, höhere Programmiersprache. Sie hat den Anspruch, einen gut lesbaren, knappen Programmierstil zu fördern. So werden beispielsweise Blöcke nicht durch geschweifte Klammern, sondern durch Einrückungen strukturiert."

license: apache-2.0
---

# doc2query/msmarco-14langs-mt5-base-v1

This is a [doc2query](https://arxiv.org/abs/1904.08375) model based on mT5 (also known as [docT5query](https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf)). It was trained on all 14 languages of the [mMARCO dataset](https://github.com/unicamp-dl/mMARCO), i.e. you can input a passage in any of the 14 languages, and it will generate a query in the same language.

It can be used for:
- **Document expansion**: You generate 20-40 queries for your paragraphs and index the paragraphs together with the generated queries in a standard BM25 index like Elasticsearch, OpenSearch, or Lucene (see the sketch after this list). The generated queries help to close the lexical gap of lexical search, as they contain synonyms. Further, they re-weight words, giving important words a higher weight even if they appear seldom in a paragraph. In our [BEIR](https://arxiv.org/abs/2104.08663) paper we showed that BM25+docT5query is a powerful search engine. The [BEIR repository](https://github.com/beir-cellar/beir) contains an example of how to use docT5query with Pyserini.
- **Domain Specific Training Data Generation**: It can be used to generate training data to learn an embedding model. Our [GPL paper](https://arxiv.org/abs/2112.07577) and the [GPL example on SBERT.net](https://www.sbert.net/examples/domain_adaptation/README.html#gpl-generative-pseudo-labeling) show how to use the model to generate (query, text) pairs for a given collection of unlabeled texts. These pairs can then be used to train powerful dense embedding models.

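The sketch below illustrates the document-expansion step. It is an addition to this card, not code from the BEIR repository: `expand_for_bm25` and the `generate_queries` helper are hypothetical names, and the actual query generation with this model is shown in the Usage section below.

```python
# Illustrative sketch of document expansion: each passage is stored together
# with the queries generated for it, so BM25 also matches the synonyms and
# re-weighted terms those queries introduce.

def expand_for_bm25(passages, generate_queries, num_queries=20):
    """passages: dict {doc_id: passage_text}.
    generate_queries: hypothetical helper returning a list of query strings,
    e.g. built from the sampling code in the Usage section below."""
    expanded = []
    for doc_id, passage in passages.items():
        queries = generate_queries(passage, num_queries)
        expanded.append({
            "id": doc_id,
            # passage text plus its generated queries is what gets indexed
            "contents": passage + " " + " ".join(queries),
        })
    return expanded

# The resulting documents can be indexed as-is with Elasticsearch, OpenSearch,
# Lucene, or Pyserini and searched with plain BM25.
```
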
## Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_name = 'doc2query/msmarco-14langs-mt5-base-v1'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "Python ist eine universelle, üblicherweise interpretierte, höhere Programmiersprache. Sie hat den Anspruch, einen gut lesbaren, knappen Programmierstil zu fördern. So werden beispielsweise Blöcke nicht durch geschweifte Klammern, sondern durch Einrückungen strukturiert."


def create_queries(para):
    input_ids = tokenizer.encode(para, return_tensors='pt')
    with torch.no_grad():
        # Here we use top_p / top_k random sampling. It generates more diverse queries, but of lower quality
        sampling_outputs = model.generate(
            input_ids=input_ids,
            max_length=64,
            do_sample=True,
            top_p=0.95,
            top_k=10,
            num_return_sequences=5
        )

        # Here we use beam search. It generates better quality queries, but with less diversity
        beam_outputs = model.generate(
            input_ids=input_ids,
            max_length=64,
            num_beams=5,
            no_repeat_ngram_size=2,
            num_return_sequences=5,
            early_stopping=True
        )

    print("Paragraph:")
    print(para)

    print("\nBeam Outputs:")
    for i in range(len(beam_outputs)):
        query = tokenizer.decode(beam_outputs[i], skip_special_tokens=True)
        print(f'{i + 1}: {query}')

    print("\nSampling Outputs:")
    for i in range(len(sampling_outputs)):
        query = tokenizer.decode(sampling_outputs[i], skip_special_tokens=True)
        print(f'{i + 1}: {query}')


create_queries(text)
```

**Note:** `model.generate()` is non-deterministic for top_k / top_p sampling. It produces different queries each time you run it.

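The snippet below is an addition to this card: a minimal sketch of how to make the sampled queries reproducible by fixing the seed with the `set_seed` utility from `transformers` before calling `create_queries`.

```python
from transformers import set_seed

set_seed(42)          # seeds the Python, NumPy, and PyTorch RNGs used for sampling
create_queries(text)  # re-seeding with the same value before each call makes the sampled queries repeatable
```
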
## Training
This model was fine-tuned from [google/mt5-base](https://huggingface.co/google/mt5-base) for 525k training steps on all 14 languages of the [mMARCO dataset](https://github.com/unicamp-dl/mMARCO). For the training script, see `train_script.py` in this repository.

The input text was truncated to 320 word pieces. The output text was generated with up to 64 word pieces.

This model was trained on (query, passage) pairs from the [mMARCO dataset](https://github.com/unicamp-dl/mMARCO).
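
Since the inputs were truncated to 320 word pieces during training, it is reasonable to truncate long passages the same way at inference time. The snippet below is an addition to this card, a minimal sketch reusing the `tokenizer`, `model`, and `text` from the Usage section:

```python
# Truncate long passages to the 320 word pieces seen during training
# before generating a query for them.
input_ids = tokenizer(
    text,
    max_length=320,      # matches the training-time input truncation
    truncation=True,
    return_tensors='pt'
).input_ids

outputs = model.generate(input_ids=input_ids, max_length=64, num_beams=5, early_stopping=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```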