---
license: apache-2.0
library_name: transformers
---

i wanted to learn more about exposure bias mitigation in language models and came across [ReMask](https://huggingface.co/euclaise/ReMask-3B).
it's a neat idea, and i wanted to give it a go. 

- during training, the model processes each input sequence twice: once with the full sequence and once with a masked copy.
- model outputs are computed for both passes.
- the divergence loss is the average of the forward and backward KL divergences between the two outputs.
- the final loss is a weighted sum of the cross-entropy losses and the divergence loss.
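
a rough sketch of how the loss described above could be put together, in plain python so the arithmetic is easy to follow. this is not the actual training code: the function names, the `ce_weight` / `div_weight` parameters and their defaults, and the per-token averaging are all assumptions made for illustration. it takes the logits from the two passes as given and leaves the masking itself out.

```python
import math

def log_softmax(logits):
    # numerically stable log-softmax over one token's vocabulary logits
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_div(log_p, log_q):
    # KL(p || q) = sum_i p_i * (log p_i - log q_i)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def cross_entropy(log_probs, target):
    # negative log-likelihood of the target token id
    return -log_probs[target]

def remask_style_loss(full_logits, masked_logits, targets,
                      ce_weight=1.0, div_weight=1.0):
    # full_logits / masked_logits: one list of vocab logits per token position,
    # taken from the full-sequence pass and the masked-sequence pass.
    # ce_weight and div_weight are hypothetical names and values.
    ce_full = ce_masked = div = 0.0
    for f, m, t in zip(full_logits, masked_logits, targets):
        lp_full, lp_masked = log_softmax(f), log_softmax(m)
        ce_full += cross_entropy(lp_full, t)
        ce_masked += cross_entropy(lp_masked, t)
        # average of forward and backward KL between the two passes
        div += 0.5 * (kl_div(lp_full, lp_masked) + kl_div(lp_masked, lp_full))
    n = len(targets)
    # weighted sum of both cross-entropy terms and the divergence term
    return (ce_weight * (ce_full + ce_masked) + div_weight * div) / n

full = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
masked = [[1.0, 1.0, 0.0], [0.3, 0.2, 0.1]]
targets = [0, 2]
print(remask_style_loss(full, masked, targets))
```

when both passes produce identical logits the divergence term is zero, so only the cross-entropy terms are left.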

impl on github

```
<|user|>
Could Moulin Rouge have been hypothetically used as Spain's Spanish American War triage center?
<|logic|>
The Moulin Rouge cabaret in France had a capacity of 850 people. Spain had 700-800 injured during Spanish American War.
<|answer|>
```