Sheshera committed on
Commit
0379fd1
1 Parent(s): 340520d

Add language and domain.

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -26,7 +26,7 @@ Model included in a paper for modeling fine grained similarity between documents
26
  >
27
  > "Distant supervision [31, 43, 21, 49] generates training data automatically by aligning texts and a knowledge base (KB) (see Fig. 1 )."
28
 
29
- **Training procedure:** The model was trained with the Adam Optimizer and a learning rate of 2e-5 with 1000 warm-up steps followed by linear decay of the learning rate. The model training convergence is checked with the loss on a held out dev set consisting of co-citation context pairs.
30
 
31
  **Intended uses & limitations:** This model is trained for sentence similarity tasks in scientific text and is best used as a sentence encoder. However, with appropriate fine-tuning the model can also be used for other tasks such as classification. Note that about 50% of the training data consists of biomedical text, and performance may be superior on text from biomedicine and similar domains.
32
 
@@ -54,7 +54,7 @@ clsrep_sb = sentbert_model.encode([s])
54
  ```
55
 
56
  **Variable and metrics:**
57
- Since the paper this model was trained for proposes methods for similarity of scientific abstracts, this model is evaluated on information retrieval datasets with document-level queries. The datasets used for the paper include RELISH, TRECCOVID, and CSFCube. These are all detailed on [github](https://github.com/allenai/aspire) and in our [paper](https://arxiv.org/abs/2111.08366). RELISH and TRECCOVID represent an abstract-level retrieval task, where given a query scientific abstract the task requires the retrieval of relevant candidate abstracts. CSFCube presents a slightly different task and presents a set of finer-grained sentences in the abstract based on which a finer-grained retrieval must be made. This task represents the closest task to a sentence similarity task.
58
 
59
  In using this sentence level model for abstract level retrieval we rank documents by the minimal L2 distance between the sentences in the query and candidate abstract.
60
 
 
26
  >
27
  > "Distant supervision [31, 43, 21, 49] generates training data automatically by aligning texts and a knowledge base (KB) (see Fig. 1 )."
28
 
29
+ **Training procedure:** The model was trained with the Adam Optimizer and a learning rate of 2e-5 with 1000 warm-up steps followed by linear decay of the learning rate. The model training convergence is checked with the loss on a held out dev set consisting of co-citation context pairs. All the training data used was in English.
30
 
31
  **Intended uses & limitations:** This model is trained for sentence similarity tasks in scientific text and is best used as a sentence encoder. However, with appropriate fine-tuning the model can also be used for other tasks such as classification. Note that about 50% of the training data consists of biomedical text, and performance may be superior on text from biomedicine and similar domains.
32
 
 
54
  ```
55
 
56
  **Variable and metrics:**
57
+ Since the paper this model was trained for proposes methods for similarity of scientific abstracts, this model is evaluated on information retrieval datasets with document-level queries. The datasets used for the paper include RELISH (biomedical/English), TRECCOVID (biomedical/English), and CSFCube (computer science/English). These are all detailed on [github](https://github.com/allenai/aspire) and in our [paper](https://arxiv.org/abs/2111.08366). RELISH and TRECCOVID represent an abstract-level retrieval task, where given a query scientific abstract the task requires the retrieval of relevant candidate abstracts. CSFCube presents a slightly different task and presents a set of finer-grained sentences in the abstract based on which a finer-grained retrieval must be made. This task represents the closest task to a sentence similarity task.
58
 
59
  In using this sentence level model for abstract level retrieval we rank documents by the minimal L2 distance between the sentences in the query and candidate abstract.
60
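
The ranking rule described in the README's last context line (rank candidate abstracts by the minimal L2 distance between query and candidate sentences) can be sketched as follows. This is a minimal illustration, not the repository's evaluation code: the function names `min_l2_distance` and `rank_candidates` are hypothetical, and the toy 2-d vectors stand in for sentence embeddings that would in practice come from the encoder (for example, the `sentbert_model.encode` call shown in the README).

```python
import math


def min_l2_distance(query_sents, cand_sents):
    """Smallest L2 distance between any query sentence vector and
    any candidate sentence vector (hypothetical helper)."""
    return min(math.dist(q, c) for q in query_sents for c in cand_sents)


def rank_candidates(query_sents, candidates):
    """Return candidate ids sorted from closest to farthest.

    `candidates` maps an id to the list of sentence vectors of that abstract.
    """
    scores = {
        cand_id: min_l2_distance(query_sents, sents)
        for cand_id, sents in candidates.items()
    }
    return sorted(scores, key=scores.get)


# Toy data: each abstract is a list of sentence vectors (assumed, 2-d for brevity).
query = [(0.0, 0.0), (1.0, 1.0)]
candidates = {
    "doc_a": [(5.0, 5.0), (4.0, 4.0)],
    "doc_b": [(1.0, 1.5), (9.0, 9.0)],
    "doc_c": [(0.0, 2.0)],
}

ranking = rank_candidates(query, candidates)
print(ranking)  # ['doc_b', 'doc_c', 'doc_a']
```

Because only the single closest sentence pair determines the score, one well-matched sentence is enough to rank an abstract highly, regardless of how far apart its other sentences are.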