davebulaval committed (verified)
Commit 131851c · Parent(s): e2eaeef

Update README.md

Files changed (1): README.md (+65, -9)

README.md CHANGED
@@ -18,38 +18,84 @@ Its goal is to assess meaning preservation between two sentences that correlate

- > of an average of 10 models) for a more extended period (1,000 epochs instead of 250). We have observed later that the
- > model can further reduce dev loss and increase performance.
- However, it is inherently subjective, since it uses human judgment as a gold standard, and expensive, since it requires
- measure. They represent a trivial and minimal threshold a good automatic meaning preservation metric should be able to
- ### Identical sentences
- it by the number of sentences to create a ratio of the number of times the metric gives the expected rating. To account
- ### Unrelated sentences
- Again, to account for computer floating-point inaccuracy, we round the ratings to the nearest integer and do not use a

@@ -67,7 +113,17 @@ ISSN={2624-8212},

- the [LICENSE file](https://github.com/GRAAL-Research/risc/blob/main/LICENSE).

------------------
 
checks. For more details, refer to our publicly available article.

> This public version of our model uses the best model trained (whereas, in our article, we report the performance
> averaged over 10 models) for a more extended period (500 epochs instead of 250). We later observed that the model can
> further reduce the dev loss and increase performance. We have also replaced the data augmentation technique used in
> the article with a more robust one that incorporates the commutative property of the meaning function, namely
> Meaning(Sent_a, Sent_b) = Meaning(Sent_b, Sent_a).

- [HuggingFace Model Card](https://huggingface.co/davebulaval/MeaningBERT)
 
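As a quick illustration of that symmetry, the following minimal sketch (not the training or augmentation code from the article; the sentence pair is invented, and it assumes, as in the usage example further below, that the checkpoint exposes a single regression logit as the meaning-preservation rating) scores a pair in both orders and prints the two ratings, which should be nearly identical:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("davebulaval/MeaningBERT")
scorer = AutoModelForSequenceClassification.from_pretrained("davebulaval/MeaningBERT")
scorer.eval()

# Illustrative sentence pair (not taken from the article's datasets)
sent_a = "He wanted to make them pay."
sent_b = "He wanted them to pay for what they did."

with torch.no_grad():
    # Score the pair in both orders; Meaning(Sent_a, Sent_b) should be close to Meaning(Sent_b, Sent_a)
    score_ab = scorer(**tokenizer(sent_a, sent_b, return_tensors="pt")).logits.item()
    score_ba = scorer(**tokenizer(sent_b, sent_a, return_tensors="pt")).logits.item()

print(score_ab, score_ba)
```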

## Sanity Check

Correlation to human judgment is one way to evaluate the quality of a meaning preservation metric. However, it is
inherently subjective, since it uses human judgment as a gold standard, and expensive, since it requires a large dataset
annotated by several humans. As an alternative, we designed two automated tests: evaluating meaning preservation between
identical sentences (which should be 100% preserving) and between unrelated sentences (which should be 0% preserving).
In these tests, the meaning preservation target value is not subjective and does not require human annotation to be
measured. They represent a trivial and minimal threshold that a good automatic meaning preservation metric should be
able to achieve. Namely, a metric should minimally be able to return a perfect score (i.e., 100%) if two identical
sentences are compared and return a null score (i.e., 0%) if two sentences are completely unrelated.

### Identical Sentences

The first test evaluates meaning preservation between identical sentences. To analyze the metrics' capabilities to pass
this test, we count the number of times a metric rating was greater than or equal to a threshold value X∈[95, 99] and
divide it by the number of sentences to create a ratio of the number of times the metric gives the expected rating. To
account for computer floating-point inaccuracy, we round the ratings to the nearest integer and do not use a threshold
value of 100%.
 
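For concreteness, here is a minimal sketch of how this ratio can be computed; the ratings below are invented placeholders, not results reported in the article:

```python
# Hypothetical MeaningBERT ratings (0-100 scale) for sentences compared against themselves
ratings = [99.2, 100.0, 97.8, 94.6]

threshold = 95  # any X in [95, 99]

# Round to the nearest integer to absorb floating-point inaccuracy, then count the passes
passed = sum(round(rating) >= threshold for rating in ratings)
ratio = passed / len(ratings)

print(f"{passed}/{len(ratings)} identical pairs rated >= {threshold} (ratio = {ratio:.2f})")
```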

### Unrelated Sentences

Our second test evaluates meaning preservation between a source sentence and an unrelated sentence generated by a large
language model. The idea is to verify that the metric finds a meaning preservation rating of 0 when given a completely
irrelevant sentence mainly composed of irrelevant words (also known as word soup). Since this test's expected rating is
0, we check that the metric rating is lower than or equal to a threshold value X∈[1, 5]. Again, to account for computer
floating-point inaccuracy, we round the ratings to the nearest integer and do not use a threshold value of 0%.
 
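The same kind of sketch applies to this second check, with the comparison direction reversed (again, the ratings are invented placeholders):

```python
# Hypothetical MeaningBERT ratings (0-100 scale) for source sentences paired with unrelated "word soup"
ratings = [0.3, 1.8, 0.0, 6.4]

threshold = 5  # any X in [1, 5]

# Round to the nearest integer to absorb floating-point inaccuracy, then count the passes
passed = sum(round(rating) <= threshold for rating in ratings)
ratio = passed / len(ratings)

print(f"{passed}/{len(ratings)} unrelated pairs rated <= {threshold} (ratio = {ratio:.2f})")
```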

## Use MeaningBERT

You can use MeaningBERT as a [model](https://huggingface.co/davebulaval/MeaningBERT) that you can retrain or use for
inference with HuggingFace as follows:

```python
# Load the model and its tokenizer directly from the HuggingFace Hub
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("davebulaval/MeaningBERT")
model = AutoModelForSequenceClassification.from_pretrained("davebulaval/MeaningBERT")
```

Or you can use MeaningBERT as a metric for evaluation (no retraining) with HuggingFace, as in the code example below.

## Code Examples

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("davebulaval/MeaningBERT")
scorer = AutoModelForSequenceClassification.from_pretrained("davebulaval/MeaningBERT")
scorer.eval()

documents = ["He wanted to make them pay.", "This sandwich looks delicious.", "He wants to eat."]
simplifications = ["He wanted to make them pay.", "This sandwich looks delicious.",
                   "Whatever, whenever, this is a sentence."]

# Tokenize the texts as sentence pairs and return PyTorch tensors
tokenize_text = tokenizer(documents, simplifications, truncation=True, padding=True, return_tensors="pt")

with torch.no_grad():
    # Compute the meaning preservation score for each pair
    scores = scorer(**tokenize_text)

print(scores.logits.tolist())
```
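Each entry of the printed logits is the predicted meaning-preservation rating for one sentence pair, on (approximately) the 0-100 scale used in the sanity checks above. As an optional post-processing step (a suggestion, not part of the official usage), you can flatten the logits and clamp stray predictions into that range, continuing from the snippet above:

```python
# Continues from the previous snippet: one rating per pair, clamped into [0, 100]
ratings = scores.logits.squeeze(-1).clamp(min=0.0, max=100.0)
print(ratings.tolist())
```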

------------------

  ## Cite

Use the following citation to cite MeaningBERT

```
[...]
}
```


------------------

## Contributing to MeaningBERT

We welcome user input, whether it concerns bugs found in the library or feature proposals! Make sure to have a
look at our [contributing guidelines](https://github.com/GRAAL-Research/MeaningBERT/blob/main/.github/CONTRIBUTING.md)
for more details on this matter.

## License

MeaningBERT is MIT licensed, as found in
the [LICENSE file](https://github.com/GRAAL-Research/risc/blob/main/LICENSE).

------------------