
Why clamp qkv_states? Is it common?

#44
by jay68 - opened

In line 318 of modeling_dbrx.py, with the configuration "clip_qkv": 8, DBRX clamps the values of qkv_states to the range [-8, 8].
Does this clamping apply only at inference, or during both training and inference?
Why does DBRX do this, and is there any published work that motivates it?
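
For context, here is a minimal sketch (not the actual DBRX source) of the pattern the question refers to: the fused QKV projection output is clamped elementwise to [-clip_qkv, clip_qkv] before being split into queries, keys, and values. The layer sizes here are toy values chosen for illustration.

```python
import torch

clip_qkv = 8.0     # matches the "clip_qkv": 8 config value
hidden_size = 16   # toy size for illustration

# Fused projection producing Q, K, V in one matmul, as in the DBRX attention code
qkv_proj = torch.nn.Linear(hidden_size, 3 * hidden_size, bias=False)
hidden_states = torch.randn(2, 4, hidden_size)  # (batch, seq_len, hidden)

qkv_states = qkv_proj(hidden_states)
# The clamp in question: bound activations before splitting into Q, K, V
qkv_states = qkv_states.clamp(min=-clip_qkv, max=clip_qkv)

query, key, value = qkv_states.split(hidden_size, dim=-1)
print(query.shape, key.shape, value.shape)
```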
