Its goal is to assess meaning preservation between two sentences in a way that correlates highly with human judgments
and sanity checks. For more details, refer to our publicly available article.

> This public version of our model uses the best model trained (whereas, in our article, we present the performance
> results of an average of 10 models) for a more extended period (500 epochs instead of 250). We have since observed
> that the model can further reduce dev loss and increase performance. We have also changed the data augmentation
> technique used in the article for a more robust one that also includes the commutative property of the meaning
> function, namely, Meaning(Sent_a, Sent_b) = Meaning(Sent_b, Sent_a).

- [HuggingFace Model Card](https://huggingface.co/davebulaval/MeaningBERT)

## Sanity Check

Correlation to human judgment is one way to evaluate the quality of a meaning preservation metric. However, it is
inherently subjective, since it uses human judgment as a gold standard, and expensive, since it requires a large
dataset annotated by several humans. As an alternative, we designed two automated tests: evaluating meaning
preservation between identical sentences (which should be 100% preserving) and between unrelated sentences (which
should be 0% preserving). In these tests, the meaning preservation target value is not subjective and does not require
human annotation to be measured. They represent a trivial and minimal threshold that a good automatic meaning
preservation metric should be able to achieve. Namely, a metric should, at a minimum, return a perfect score
(i.e., 100%) when two identical sentences are compared and a null score (i.e., 0%) when two sentences are completely
unrelated.

### Identical Sentences

The first test evaluates meaning preservation between identical sentences. To analyze the metrics' capability to pass
this test, we count the number of times a metric rating is greater than or equal to a threshold value X ∈ [95, 99] and
divide that count by the number of sentences, which gives the ratio of times the metric returns the expected rating.
To account for computer floating-point inaccuracy, we round the ratings to the nearest integer and do not use a
threshold value of 100%.

### Unrelated Sentences

Our second test evaluates meaning preservation between a source sentence and an unrelated sentence generated by a large
language model. The idea is to verify that the metric finds a meaning preservation rating of 0 when given a completely
irrelevant sentence mainly composed of irrelevant words (also known as word soup). Since this test's expected rating is
0, we check that the metric rating is lower than or equal to a threshold value X ∈ [1, 5]. Again, to account for
computer floating-point inaccuracy, we round the ratings to the nearest integer and do not use a threshold value of 0%.
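
The following minimal sketch shows how these two pass ratios could be computed from a list of metric ratings. It is
illustrative only: the `pass_ratio` helper and the example rating values are hypothetical, and the thresholds follow
the X ∈ [95, 99] and X ∈ [1, 5] ranges described above.

```python
def pass_ratio(ratings, threshold, expect_high):
    """Share of sentence pairs whose rounded rating clears the sanity-check threshold."""
    if expect_high:
        # Identical-sentences test: the rounded rating should be >= the threshold (e.g., 95).
        passed = sum(1 for rating in ratings if round(rating) >= threshold)
    else:
        # Unrelated-sentences test: the rounded rating should be <= the threshold (e.g., 5).
        passed = sum(1 for rating in ratings if round(rating) <= threshold)
    return passed / len(ratings)

# Hypothetical ratings a metric could return on identical and on unrelated sentence pairs.
identical_ratings = [99.6, 97.2, 100.0, 94.4]
unrelated_ratings = [0.3, 4.7, 12.0]

print(pass_ratio(identical_ratings, threshold=95, expect_high=True))   # 0.75
print(pass_ratio(unrelated_ratings, threshold=5, expect_high=False))   # 0.666...
```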

## Use MeaningBERT

You can use MeaningBERT as a [model](https://huggingface.co/davebulaval/MeaningBERT) that you can retrain or use for
inference with HuggingFace:

```python
# Load the model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("davebulaval/MeaningBERT")
model = AutoModelForSequenceClassification.from_pretrained("davebulaval/MeaningBERT")
```

Or you can use MeaningBERT as a metric for evaluation (no retraining) with HuggingFace, as shown in the example below.

## Code Examples

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("davebulaval/MeaningBERT")
scorer = AutoModelForSequenceClassification.from_pretrained("davebulaval/MeaningBERT")
scorer.eval()

documents = ["He wanted to make them pay.", "This sandwich looks delicious.", "He wants to eat."]
simplifications = ["He wanted to make them pay.", "This sandwich looks delicious.",
                   "Whatever, whenever, this is a sentence."]

# We tokenize the texts as sentence pairs and return PyTorch tensors
tokenize_text = tokenizer(documents, simplifications, truncation=True, padding=True, return_tensors="pt")

with torch.no_grad():
    # We score each document/simplification pair
    scores = scorer(**tokenize_text)

print(scores.logits.tolist())
```
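
Continuing from the example above, `scores.logits` holds the raw meaning preservation ratings, one per sentence pair.
The short post-processing sketch below is an assumption rather than an official API guarantee: it presumes a single
regression output per pair and clamps values to the 0 to 100 rating scale used by the sanity checks.

```python
# Assumption: one regression output per pair, so logits has shape (batch_size, 1).
# Clamping to [0, 100] mirrors the 0-100 rating scale described above.
ratings = [min(max(pair_logits[0], 0.0), 100.0) for pair_logits in scores.logits.tolist()]
print(ratings)  # one rating per document/simplification pair
```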

------------------

## Cite
Use the following citation to cite MeaningBERT
------------------

## Contributing to MeaningBERT

We welcome user input, whether it regards bugs found in the library or feature propositions! Make sure to have a
look at our [contributing guidelines](https://github.com/GRAAL-Research/MeaningBERT/blob/main/.github/CONTRIBUTING.md)
for more details on this matter.

## License

MeaningBERT is MIT licensed, as found in
the [LICENSE file](https://github.com/GRAAL-Research/risc/blob/main/LICENSE).

------------------