arXiv:2010.04245

Query-Key Normalization for Transformers

Published on Oct 8, 2020
Authors:

Abstract

Low-resource language translation is a challenging but socially valuable NLP task. Building on recent work adapting the Transformer's normalization to this setting, we propose QKNorm, a normalization technique that modifies the attention mechanism to make the softmax function less prone to arbitrary saturation without sacrificing expressivity. Specifically, we apply ℓ2 normalization along the head dimension of each query and key matrix prior to multiplying them and then scale up by a learnable parameter instead of dividing by the square root of the embedding dimension. We show improvements averaging 0.928 BLEU over state-of-the-art bilingual benchmarks for 5 low-resource translation pairs from the TED Talks corpus and IWSLT'15.
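The abstract specifies the attention change precisely enough to sketch. Below is a minimal PyTorch sketch of QKNorm-style attention, not the authors' released code: the tensor layout, the function name, and the scalar initialization of the learnable gain g are assumptions made here for illustration (the paper describes its own choice for initializing g).

```python
import torch
import torch.nn.functional as F

def qknorm_attention(q, k, v, g):
    """Dot-product attention with Query-Key Normalization (sketch).

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim)
    g: learnable scalar that replaces division by sqrt(head_dim)
    """
    # L2-normalize queries and keys along the head dimension, so each
    # query-key dot product becomes a cosine similarity in [-1, 1],
    # keeping the softmax away from arbitrary saturation.
    q = F.normalize(q, p=2.0, dim=-1)
    k = F.normalize(k, p=2.0, dim=-1)

    # Scale up by the learnable parameter g instead of dividing by
    # sqrt(d_k); raw cosine similarities alone would be too flat.
    scores = g * torch.matmul(q, k.transpose(-2, -1))
    attn = torch.softmax(scores, dim=-1)
    return torch.matmul(attn, v)

# Usage sketch (shapes and the init value of g are placeholders):
batch, heads, seq_len, head_dim = 2, 8, 16, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)
g = torch.nn.Parameter(torch.tensor(1.0))  # placeholder init
out = qknorm_attention(q, k, v, g)  # shape (2, 8, 16, 64)
```

Because the normalized dot products are bounded in [-1, 1], the learnable gain g restores enough dynamic range for the softmax to remain expressive while avoiding the saturation the abstract describes.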


Models citing this paper: 81


Datasets citing this paper: 0


Spaces citing this paper: 182

Collections including this paper: 0
