UNAversal - Uniform Neural Alignment (MoE)

This is just a beta, a first release so people can start working on franksteins and so. It does achieve high GSM/Math and TQA, so ideally you can merge it with other mixtrals and see what coming out of it

UNA Details

For this model we came out with the most obvious, placing UNA on the router_logit. It does work, but we saw a much better performance on SFT by doing so. So this model DOES have UNA-SFT phase, its highly experimental and it was merely using LLaMA-Factory datasets by example alpaca.

As the others:

Can be finetuned further, try 2e-5 or 1e-4 (since its MOE)
Can be merged, here you will have to improvise and please report findings on a discussion thread.

REMINDER: please.. cite, it does help on the research and the lab itself, seriously.

NEED YOUR HELP!!

I need a multi-turn trainloop for the Mixtral, that can squeeze the juice out of 8xH100's properly. Please feel free to reach @fblgit either discord or twitter. thanks!

Evals

Here there are some, but we also submitted it to the HF eval queue....

GSM8k 5-Shot

|Tasks|Version|  Filter  |n-shot|  Metric   |Value |   |Stderr|
|-----|-------|----------|-----:|-----------|-----:|---|-----:|
|gsm8k|Yaml   |get-answer|     5|exact_match|0.6603|±  | 0.013|

ARC 25-Shot

|    Tasks    |Version|Filter|n-shot| Metric |Value |   |Stderr|
|-------------|-------|------|-----:|--------|-----:|---|-----:|
|arc_challenge|Yaml   |none  |    25|acc     |0.6621|±  |0.0138|
|             |       |none  |    25|acc_norm|0.6962|±  |0.0134|

TruthfulQA 0-Shot (MC2)

|    Tasks     |Version|Filter|n-shot|Metric|Value |   |Stderr|
|--------------|-------|------|-----:|------|-----:|---|-----:|
|truthfulqa_mc2|Yaml   |none  |     0|acc   |0.7122|±  |0.0141|

0-Shots Evals

|    Tasks     |Version|Filter|n-shot|  Metric  |Value |   |Stderr|
|--------------|-------|------|-----:|----------|-----:|---|-----:|
|arc_challenge |Yaml   |none  |     0|acc       |0.6101|±  |0.0143|
|              |       |none  |     0|acc_norm  |0.6425|±  |0.0140|
|arc_easy      |Yaml   |none  |     0|acc       |0.8615|±  |0.0071|
|              |       |none  |     0|acc_norm  |0.8375|±  |0.0076|
|boolq         |Yaml   |none  |     0|acc       |0.8624|±  |0.0060|
|lambada_openai|Yaml   |none  |     0|perplexity|2.8318|±  |0.0507|
|              |       |none  |     0|acc       |0.7650|±  |0.0059|
|mathqa        |Yaml   |none  |     0|acc       |0.4472|±  |0.0091|
|              |       |none  |     0|acc_norm  |0.4436|±  |0.0091|
|piqa          |Yaml   |none  |     0|acc       |0.8292|±  |0.0088|
|              |       |none  |     0|acc_norm  |0.8422|±  |0.0085|
|pubmedqa      |Yaml   |none  |     0|acc       |0.7920|±  |0.0182|
|sciq          |Yaml   |none  |     0|acc       |0.9630|±  |0.0060|
|              |       |none  |     0|acc_norm  |0.9370|±  |0.0077|