---
license: apache-2.0
library_name: transformers
---

i wanted to learn more about exposure bias mitigation in language models and came across [ReMask](https://huggingface.co/euclaise/ReMask-3B). it's a neat idea, and i wanted to give it a go.

- during training, the model processes each input sequence twice: once with the full sequence & once with a masked copy of it.
- model outputs are computed for both passes.
- a divergence loss is computed as the average of the forward and backward KL divergences between the two sets of outputs.
- the final loss is a weighted sum of the cross-entropy losses and the divergence loss (see the sketch at the bottom of this card).

impl on github

prompt format:

```
<|user|>
Could Moulin Rouge have been hypothetically used as Spain's Spanish American War triage center?
<|logic|>
The Moulin Rouge cabaret in France had a capacity of 850 people. Spain had 700-800 injured during Spanish American War.
<|answer|>
```
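
to make the list above concrete, here's a minimal sketch of what the two-pass loss could look like for a causal LM. the function name, `mask_prob`, `mask_token_id`, and the weight `alpha` are placeholder names and values i picked for illustration, not ReMask's actual hyperparameters - check the github impl for the real thing.

```python
import torch
import torch.nn.functional as F

def remask_style_loss(model, input_ids, labels, mask_token_id, mask_prob=0.15, alpha=1.0):
    # hypothetical sketch: names and default values are illustrative, not ReMask's
    # pass 1: the full, unmasked sequence
    full_logits = model(input_ids=input_ids).logits

    # pass 2: the same sequence with tokens randomly replaced by a mask token
    mask = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    masked_ids = torch.where(mask, torch.full_like(input_ids, mask_token_id), input_ids)
    masked_logits = model(input_ids=masked_ids).logits

    # cross-entropy for both passes (standard next-token shift)
    def ce(logits):
        return F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            labels[:, 1:].reshape(-1),
            ignore_index=-100,
        )
    ce_loss = ce(full_logits) + ce(masked_logits)

    # divergence loss: average of the forward and backward KL divergences
    # between the two output distributions
    log_p = F.log_softmax(full_logits, dim=-1)    # full-sequence pass
    log_q = F.log_softmax(masked_logits, dim=-1)  # masked pass
    kl_fwd = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")  # KL(p || q)
    kl_bwd = F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")  # KL(q || p)
    div_loss = 0.5 * (kl_fwd + kl_bwd)

    # final loss: weighted sum of the cross-entropy losses and the divergence loss
    return ce_loss + alpha * div_loss
```

the symmetric KL is the part that targets exposure bias: it pushes the model to produce consistent output distributions whether its context is clean or corrupted.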