Update modeling_moe_mistral.py

#1
by bjoernp - opened
Disco Research org
No description provided.

Hi, I originally implemented it to follow https://github.com/stanford-futuredata/megablocks/blob/main/megablocks/layers/router.py#L57, which applies softmax first and then top-k. I'm not sure which one is correct. Do you get better results with this change?
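For anyone following along, here is a minimal sketch of the two orderings in question. It is plain Python rather than the actual code in modeling_moe_mistral.py, all function names are hypothetical, and it assumes the alternative to "softmax then top-k" is "top-k then softmax" (the thread does not spell this out).

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_indices(xs, k):
    # Indices of the k largest values, largest first.
    return sorted(range(len(xs)), key=lambda i: xs[i], reverse=True)[:k]


def softmax_then_topk(logits, k):
    # Softmax over ALL experts, then keep the k largest probabilities.
    # The kept weights generally sum to less than 1.
    probs = softmax(logits)
    idx = top_k_indices(probs, k)
    return idx, [probs[i] for i in idx]


def topk_then_softmax(logits, k):
    # Keep the k largest logits, then softmax over ONLY those.
    # The kept weights always sum to 1.
    idx = top_k_indices(logits, k)
    return idx, softmax([logits[i] for i in idx])


# Hypothetical router logits for one token over four experts.
logits = [2.0, 1.0, 0.5, -1.0]
idx_a, w_a = softmax_then_topk(logits, 2)
idx_b, w_b = topk_then_softmax(logits, 2)
```

Because softmax is monotonic, both orderings select the same experts. They differ only in the weights: renormalizing the softmax-then-top-k weights so they sum to 1 gives exactly the top-k-then-softmax weights, so the practical difference is whether the expert outputs are scaled down by the probability mass assigned to the dropped experts.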

Disco Research org

So far the scores look better (before -> after):

winogrande: 0.8019 -> 0.824
truthfulqa_mc2: 0.4406 -> 0.4855
arc_challenge: 0.6314 -> 0.6638
bjoernp changed pull request status to merged
