ccsasuke committed on
Commit
cdf4184
1 Parent(s): ce2716d

Initial commit

Files changed (5)
  1. README.md +36 -1
  2. config.json +21 -0
  3. pytorch_model.bin +3 -0
  4. tokenizer.json +0 -0
  5. vocab.json +0 -0
README.md CHANGED
@@ -1,3 +1,38 @@
  ---
- license: cc-by-nc-4.0
+ tags:
+ - feature-extraction
+ pipeline_tag: feature-extraction
  ---
+ DRAGON-RoBERTa is a BERT-base-sized dense retriever initialized from [RoBERTa](https://huggingface.co/roberta-base) and further trained on data augmented from the MS MARCO corpus, following the approach described in [How to Train Your DRAGON: Diverse Augmentation Towards Generalizable Dense Retrieval](https://arxiv.org/abs/2302.07452). The associated GitHub repository is available at https://github.com/facebookresearch/dpr-scale/tree/dragon. We use an asymmetric dual encoder, with two distinctly parameterized encoders.
+ The following models are also available:
+ Model | Initialization | Query Encoder Path | Context Encoder Path
+ |---|---|---|---
+ DRAGON-RoBERTa | roberta-base | facebook/dragon-roberta-query-encoder | facebook/dragon-roberta-context-encoder
+
+ ## Usage (HuggingFace Transformers)
+ The model can be used directly through the HuggingFace Transformers library:
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModel
+ tokenizer = AutoTokenizer.from_pretrained('facebook/dragon-roberta-query-encoder')
+ query_encoder = AutoModel.from_pretrained('facebook/dragon-roberta-query-encoder')
+ context_encoder = AutoModel.from_pretrained('facebook/dragon-roberta-context-encoder')
+
+ # We use msmarco query and passages as an example
+ query = "Where was Marie Curie born?"
+ contexts = [
+     "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
+     "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
+ ]
+ # Apply tokenizer
+ query_input = tokenizer(query, return_tensors='pt')
+ ctx_input = tokenizer(contexts, padding=True, truncation=True, return_tensors='pt')
+ # Compute embeddings: take the last-layer hidden state of the [CLS] token
+ query_emb = query_encoder(**query_input).last_hidden_state[:, 0, :]
+ ctx_emb = context_encoder(**ctx_input).last_hidden_state[:, 0, :]
+ # Compute similarity scores using dot product
+ score1 = query_emb @ ctx_emb[0]  # 385.1422
+ score2 = query_emb @ ctx_emb[1]  # 383.6051
+ ```
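
Since the retriever ranks passages by dot-product score, the passage with the highest score is the best match. Below is a minimal sketch (not part of the released model card) of scoring every context in one batched matrix product and picking the top passage; it reuses the `query_emb`, `ctx_emb`, and `contexts` variables from the snippet above.

```python
# Sketch: batched scoring and ranking of all contexts for one query.
# Assumes query_emb ([1, hidden]) and ctx_emb ([num_ctx, hidden]) from the snippet above.
scores = query_emb @ ctx_emb.T                          # shape [1, num_ctx]
ranking = torch.argsort(scores, dim=-1, descending=True)
best_idx = ranking[0, 0].item()
print(scores[0].tolist())   # e.g. [385.1422, 383.6051]
print(contexts[best_idx])   # the Marie Curie passage scores highest
```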
config.json ADDED
@@ -0,0 +1,21 @@
+ {
+   "architectures": [
+     "RobertaForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "type_vocab_size": 1,
+   "vocab_size": 50265
+ }
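
This configuration matches a standard RoBERTa-base encoder (12 layers, 12 attention heads, hidden size 768). As a quick, hedged check, the same values can be read back through Transformers' `AutoConfig`, using the query-encoder id listed in the README table above.

```python
from transformers import AutoConfig

# Load the published config and confirm it matches the JSON above.
config = AutoConfig.from_pretrained('facebook/dragon-roberta-query-encoder')
print(config.model_type)               # roberta
print(config.hidden_size)              # 768
print(config.num_hidden_layers)        # 12
print(config.max_position_embeddings)  # 514
```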
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:05670fab6852730bdfdf0f810fe2abfa5bc8dacfe54bf8f57990dfa24bfc82e2
+ size 498649201
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
vocab.json ADDED
The diff for this file is too large to render. See raw diff