bconsolvo, eduardo-alvarez committed
Commit 869bc25
1 Parent(s): 506766e

Enriching model card for improved discoverability and consumption (#1)

- Enriching model card for improved discoverability and consumption (8939575c7b45e53fa780ce22d1dcf114dc47d510)

Co-authored-by: Eduardo Alvarez <eduardo-alvarez@users.noreply.huggingface.co>

Files changed (1):
  1. README.md +47 -10
README.md CHANGED
@@ -6,24 +6,52 @@ language:
 - en
 tags:
 - colbert
 ---
-# ColBERT NQ Checkpoint
-
-This trained model is based on the [ColBERT](https://github.com/stanford-futuredata/ColBERT) model, trained on the [Natural Questions](https://huggingface.co/datasets/natural_questions) dataset.
-
-# Model Details
-
-Model is based on ColBERT, which in turn is based around a BERT encoder. The model is trained for text retrieval using a contrastive loss; given a query there's a relevant and non relevant passages.
-
-The corpus is based on [Wikipeida](https://huggingface.co/datasets/wiki_dpr).
-
-# Uses
-
-Model can be used by the [ColBERT](https://github.com/stanford-futuredata/ColBERT) codebase to initiate a retriever; one needs to build a vector index and then queries can be ran.
-
 # Evaluation
-
-Evaluation results on NQ dev:

 <table>
 <colgroup>
@@ -64,4 +92,13 @@ Evaluation results on NQ dev:
 <td class="org-right">52.5</td>
 </tr>
 </tbody>
-</table>
 
 - en
 tags:
 - colbert
+- natural questions
+- checkpoint
+- text retrieval
+metrics:
+- type: NQ 10 Recall
+  value: 71.1
+- type: NQ 20 Recall
+  value: 76.3
+- type: NQ 50 Recall
+  value: 80.4
+- type: NQ 100 Recall
+  value: 82.7
+- type: NQ 10 MRR
+  value: 52.1
+- type: NQ 20 MRR
+  value: 52.3
+- type: NQ 50 MRR
+  value: 52.5
+- type: NQ 100 MRR
+  value: 52.5
 ---
 
+# ColBERT NQ Checkpoint
+
+The ColBERT NQ Checkpoint is a trained model based on the ColBERT architecture, which itself leverages a BERT encoder. The model has been trained specifically on the Natural Questions (NQ) dataset, focusing on text retrieval tasks.
+
+| Model Detail | Description |
+| ----------- | ----------- |
+| Model Authors | ? |
+| Date | Feb 7, 2023 |
+| Version | Checkpoint |
+| Type | Text retrieval |
+| Paper or Other Resources | Base Model: [ColBERT](https://github.com/stanford-futuredata/ColBERT); Dataset: [Natural Questions](https://huggingface.co/datasets/natural_questions) |
+| License | Other |
+| Questions or Comments | [Community Tab](https://huggingface.co/Intel/ColBERT-NQ/discussions) and [Intel DevHub Discord](https://discord.gg/rv2Gp55UJQ) |
+
+| Intended Use | Description |
+| ----------- | ----------- |
+| Primary intended uses | This model is designed for text retrieval tasks, allowing users to submit queries and receive relevant passages from a corpus, in this case Wikipedia. It can be integrated into applications requiring efficient and accurate retrieval of information based on user queries. |
+| Primary intended users | Researchers, developers, and organizations looking for a powerful text retrieval solution that can be integrated into their systems or workflows, especially those requiring retrieval from large, diverse corpora like Wikipedia. |
+| Out-of-scope uses | The model is not intended for tasks beyond text retrieval, such as text generation, sentiment analysis, or other forms of natural language processing not related to retrieving relevant text passages. |
 # Evaluation
+The ColBERT NQ Checkpoint model has been evaluated on the NQ dev dataset with the following results, showcasing its effectiveness in retrieving relevant passages across varying numbers of retrieved documents:

 <table>
 <colgroup>

 <td class="org-right">52.5</td>
 </tr>
 </tbody>
+</table>
+
+These metrics demonstrate the model's ability to accurately retrieve relevant information from a corpus, with both recall and mean reciprocal rank (MRR) improving as more passages are considered.
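For readers who want to reproduce this style of evaluation on their own retrieval runs, a minimal sketch of recall@k and MRR@k follows. This is a generic illustration, not the ColBERT evaluation harness; the function names and data shapes are hypothetical, and relevance is assumed to be supplied as a set of gold passage ids per query (the NQ evaluation itself derives relevance from gold answer spans).

```python
# Sketch of recall@k and MRR@k over ranked retrieval results.
# `ranked` maps each query id to its ranked list of retrieved passage ids;
# `relevant` maps each query id to the set of gold (relevant) passage ids.
# These names are illustrative and not part of the ColBERT codebase.

def recall_at_k(ranked, relevant, k):
    """Fraction of queries with at least one relevant passage in the top k."""
    hits = sum(1 for q, docs in ranked.items()
               if relevant[q] & set(docs[:k]))
    return hits / len(ranked)

def mrr_at_k(ranked, relevant, k):
    """Mean reciprocal rank of the first relevant passage within the top k.

    Queries with no relevant passage in the top k contribute 0.
    """
    total = 0.0
    for q, docs in ranked.items():
        for rank, d in enumerate(docs[:k], start=1):
            if d in relevant[q]:
                total += 1.0 / rank
                break
    return total / len(ranked)
```

Both metrics are averaged over queries, which is why the table reports a single number per cutoff k.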
+
+# Ethical Considerations
+While not specifically mentioned in the original card, ethical considerations for using the ColBERT NQ Checkpoint model should include awareness of potential biases present in the training corpus (Wikipedia) and the implications of those biases for retrieved results. Users should also consider the privacy and data-use implications of deploying this model in applications.
+
+# Caveats and Recommendations
+- Index Creation: Users need to build a vector index from their corpus using the ColBERT codebase before running queries. This process requires computational resources and expertise in setting up and managing search indices.
+- Data Bias and Fairness: Given the Wikipedia-based training corpus, users should be mindful of potential biases and the representation of information within Wikipedia, adjusting their use case or implementation as necessary to address these concerns.
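The "Index Creation" caveat above can be made concrete with a toy sketch of the underlying idea. The code below is not the ColBERT codebase and builds no real index; it brute-forces ColBERT-style late-interaction (MaxSim) scoring over precomputed per-token embeddings, with all names, shapes, and data purely illustrative.

```python
import numpy as np

def maxsim_score(query_emb, passage_emb):
    """ColBERT-style late interaction: for each query token embedding,
    take the max similarity over the passage's token embeddings, then sum.

    query_emb: (num_query_tokens, dim); passage_emb: (num_doc_tokens, dim)
    """
    sim = query_emb @ passage_emb.T        # pairwise token similarities
    return float(sim.max(axis=1).sum())    # max over doc tokens, sum over query tokens

def search(query_emb, index, k=3):
    """Brute-force stand-in for a vector index: a list of
    (passage_id, token_embedding_matrix) pairs.
    Returns the top-k passage ids by MaxSim score."""
    scores = [(pid, maxsim_score(query_emb, emb)) for pid, emb in index]
    scores.sort(key=lambda t: t[1], reverse=True)
    return [pid for pid, _ in scores[:k]]
```

In practice the ColBERT codebase compresses and indexes these token embeddings so that queries can be answered without scoring every passage; the brute-force loop here only shows the score the index is organized to retrieve by, which is why index construction is a required, resource-intensive step before any query can run.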