metadata

license: apache-2.0
tags:
  - alignment-handbook
  - generated_from_trainer
  - juanako
  - mistral
  - UNA
datasets:
  - HuggingFaceH4/ultrafeedback_binarized
model-index:
  - name: juanako-7b-UNA
    results:
      - task:
          type: text-generation
          name: TruthfulQA (MC2)
        dataset:
          type: text-generation
          name: truthful_qa
          config: multiple_choice
          split: validation
        metrics:
          - type: accuracy
            value: 65.49
      - task:
          type: text-generation
          name: ARC-Challenge
        dataset:
          type: text-generation
          name: ai2_arc
          config: ARC-Challenge
          split: test
        metrics:
          - type: accuracy
            value: 68.09
      - task:
          type: text-generation
          name: HellaSwag
        dataset:
          type: text-generation
          name: Rowan/hellaswag
          split: test
        metrics:
          - type: accuracy
            value: 85.2
      - task:
          type: text-generation
          name: GSM8k
        dataset:
          type: text-generation
          name: gsm8k
          config: main
          split: test
        metrics:
          - type: accuracy
            value: 48.98
      - task:
          type: text-generation
          name: Winogrande
        dataset:
          type: text-generation
          name: winogrande
          config: winogrande_debiased
          split: test
        metrics:
          - type: accuracy
            value: 76.8
      - task:
          type: text-generation
          name: MMLU
        dataset:
          type: text-generation
          name: cais/mmlu
          config: all
          split: test
        metrics:
          - type: accuracy
            value: 61.37
      - task:
          type: text-generation
          name: PiQA
        dataset:
          type: text-generation
          name: piqa
          split: test
        metrics:
          - type: accuracy
            value: 83.57
      - task:
          type: text-generation
          name: DROP
        dataset:
          type: text-generation
          name: drop
          split: validation
        metrics:
          - type: accuracy
            value: 49.8
      - task:
          type: text-generation
          name: PubMedQA
        dataset:
          type: text-generation
          name: bigbio/pubmed_qa
          config: pubmed_qa_artificial_bigbio_qa
          split: validation
        metrics:
          - type: accuracy
            value: 76

juanako-7b-UNA-v2

This model is a fine-tuned version of fblgit/juanako-7b-UNA-v2-phase-1 on the HuggingFaceH4/ultrafeedback_binarized dataset. It outperforms most current Mistral-based models in many respects and is the latest and most capable juanako version to date.

Scoring and records (26-November-2023)

Here are some results:

Scores #1 among 7B models
Scores #4 in GSM8k
Scores #2 in TruthfulQA
Scores #6 in CoPa
Scores #2 in PiQA
Scores #9 in BoolQ

Many evaluations were performed, and the model behaves in a well-balanced way across multiple fields. Feel free to submit more evaluation results.

It scores 65.1 according to the HuggingFace LLM Leaderboard.
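The card does not say how the 65.1 figure is derived. As a sanity check only, the sketch below assumes the leaderboard took an unweighted mean over seven benchmarks (ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8k, DROP). That benchmark set and the equal weighting are assumptions, and the leaderboard's own runs may use different settings than the local evaluations reported in this card.

```python
# Scores as reported in this card, in percent.
# ASSUMPTION: the leaderboard average is an unweighted mean over these
# seven benchmarks; the card itself does not state this.
reported = {
    "ARC (acc_norm, 25-shot)": 68.09,
    "HellaSwag (acc_norm, 10-shot)": 85.20,
    "MMLU (acc, 5-shot)": 61.37,
    "TruthfulQA (mc2, 0-shot)": 65.49,
    "Winogrande (acc, 5-shot)": 76.80,
    "GSM8k (exact_match, 5-shot)": 48.98,
    "DROP (3-shot)": 49.80,
}


def unweighted_mean(scores):
    """Return the plain arithmetic mean of the score values."""
    return sum(scores.values()) / len(scores)


average = unweighted_mean(reported)
print(f"mean over {len(reported)} benchmarks: {average:.2f}")
```

Under these assumptions the locally reported numbers average to roughly the same value as the leaderboard score, which suggests the two sets of results are broadly consistent.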

Author Xavier M. @fblgit

Model description

juanako uses UNA (Uniform Neural Alignment), a training technique, yet to be published, that eases alignment between transformer layers.

TruthfulQA 0-Shot

|    Tasks     |Version|Filter|Metric|Value |   |Stderr|
|--------------|-------|------|------|-----:|---|-----:|
|truthfulqa_mc2|Yaml   |none  |acc   |0.6549|±  |0.0153|
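The evaluation tables in this card report each metric as a value plus a standard error. A minimal sketch of how to read such a pair as an approximate 95% interval is shown here, assuming a normal approximation (z = 1.96). The helper name `confidence_interval` is hypothetical and not part of any evaluation tool.

```python
def confidence_interval(value, stderr, z=1.96):
    """Approximate interval value +/- z * stderr (normal approximation, an assumption)."""
    half_width = z * stderr
    return value - half_width, value + half_width


# TruthfulQA MC2 row from the table above: 0.6549 +/- 0.0153
low, high = confidence_interval(0.6549, 0.0153)
print(f"truthfulqa_mc2 acc: {low:.4f} to {high:.4f}")
```

The same helper applies to any row in the tables that follow, which is useful when comparing two models whose scores differ by less than the interval width.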

ARC 25-Shot

|    Tasks    |Version|Filter| Metric |Value |   |Stderr|
|-------------|-------|------|--------|-----:|---|-----:|
|arc_challenge|Yaml   |none  |acc     |0.6476|±  |0.0140|
|             |       |none  |acc_norm|0.6809|±  |0.0136|

HellaSwag 10-Shot

|  Tasks  |Version|Filter| Metric |Value |   |Stderr|
|---------|-------|------|--------|-----:|---|-----:|
|hellaswag|Yaml   |none  |acc     |0.6703|±  |0.0047|
|         |       |none  |acc_norm|0.8520|±  |0.0035|

GSM8k 5-Shot

|Tasks|Version|  Filter  |  Metric   |Value |   |Stderr|
|-----|-------|----------|-----------|-----:|---|-----:|
|gsm8k|Yaml   |get-answer|exact_match|0.4898|±  |0.0138|

GPT Evaluations 0-Shot

|    Tasks     |Version|Filter|  Metric  |Value |   |Stderr|
|--------------|-------|------|----------|-----:|---|-----:|
|boolq         |Yaml   |none  |acc       |0.8703|±  |0.0059|
|lambada_openai|Yaml   |none  |perplexity|3.2598|±  |0.0705|
|              |       |none  |acc       |0.7336|±  |0.0062|
|piqa          |Yaml   |none  |acc       |0.8254|±  |0.0089|
|              |       |none  |acc_norm  |0.8292|±  |0.0088|
|sciq          |Yaml   |none  |acc       |0.9580|±  |0.0063|
|              |       |none  |acc_norm  |0.9130|±  |0.0089|

MathQA 0-Shot

|Tasks |Version|Filter| Metric |Value |   |Stderr|
|------|-------|------|--------|-----:|---|-----:|
|mathqa|Yaml   |none  |acc     |0.3752|±  |0.0089|
|      |       |none  |acc_norm|0.3772|±  |0.0089|

PiQA 1-Shot

|Tasks|Version|Filter| Metric |Value |   |Stderr|
|-----|-------|------|--------|-----:|---|-----:|
|piqa |Yaml   |none  |acc     |0.8308|±  |0.0087|
|     |       |none  |acc_norm|0.8357|±  |0.0086|

Winogrande 5-Shot

|  Tasks   |Version|Filter|Metric|Value|   |Stderr|
|----------|-------|------|------|----:|---|-----:|
|winogrande|Yaml   |none  |acc   |0.768|±  |0.0119|

PubMedQA 0-Shot

| Tasks  |Version|Filter|Metric|Value|   |Stderr|
|--------|-------|------|------|----:|---|-----:|
|pubmedqa|Yaml   |none  |acc   | 0.76|±  |0.0191|

RACE 1-Shot

|Tasks|Version|Filter|Metric|Value |   |Stderr|
|-----|-------|------|------|-----:|---|-----:|
|race |Yaml   |none  |acc   |0.5282|±  |0.0154|

MMLU 5-Shot (8-Bit)

|      Groups      |Version|Filter|Metric|Value |   |Stderr|
|------------------|-------|------|------|-----:|---|-----:|
|mmlu              |N/A    |none  |acc   |0.6137|±  |0.1243|
| - humanities     |N/A    |none  |acc   |0.5671|±  |0.1101|
| - other          |N/A    |none  |acc   |0.6859|±  |0.1164|
| - social_sciences|N/A    |none  |acc   |0.7195|±  |0.0713|
| - stem           |N/A    |none  |acc   |0.5087|±  |0.1297|

DROP 3-Shot (8-Bit) (Instruct-Eval)

{'score': 0.49801113762927607}
{'drop': 49.8}
drop: 49.8

CRASS 0-Shot (Instruct-Eval)

{'score': 0.8357664233576643}
{'crass': 83.58}
crass: 83.58

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 1
eval_batch_size: 1
seed: 42
distributed_type: multi-GPU
num_devices: 14
gradient_accumulation_steps: 16
total_train_batch_size: 224
total_eval_batch_size: 14
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.01
num_epochs: 1
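The total batch sizes listed above follow from the per-device values. The short check below reproduces them from the reported hyperparameters. The variable names mirror the list, and the formulas are the usual ones for data-parallel training with gradient accumulation.

```python
# Values copied from the hyperparameter list above.
train_batch_size = 1             # per device
eval_batch_size = 1              # per device
num_devices = 14
gradient_accumulation_steps = 16

# Effective number of examples per optimizer step.
total_train_batch_size = (
    train_batch_size * num_devices * gradient_accumulation_steps
)

# Evaluation does not accumulate gradients, so only devices multiply.
total_eval_batch_size = eval_batch_size * num_devices

print(total_train_batch_size, total_eval_batch_size)
```

Reproducing the run on fewer GPUs would require raising `gradient_accumulation_steps` accordingly to keep the same effective batch size.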

Training results

|Training Loss|Epoch|Step|Validation Loss|Rewards/chosen|Rewards/rejected|Rewards/accuracies|Rewards/margins|Logps/rejected|Logps/chosen|Logits/rejected|Logits/chosen|
|------------:|----:|---:|--------------:|-------------:|---------------:|-----------------:|--------------:|-------------:|-----------:|--------------:|------------:|
|0.4795       |0.2  |56  |0.4958         |-1.3684       |-2.6385         |0.7552            |1.2701         |-265.3887     |-241.2612   |-2.2572        |-2.4922      |
|0.4642       |0.4  |112 |0.4859         |-1.0380       |-1.9769         |0.7273            |0.9389         |-258.7718     |-237.9569   |-2.2414        |-2.4751      |
|0.4758       |0.61 |168 |0.4808         |-1.2594       |-2.3704         |0.7343            |1.1110         |-262.7074     |-240.1708   |-2.2305        |-2.4633      |
|0.4549       |0.81 |224 |0.4768         |-1.1906       |-2.3201         |0.7552            |1.1295         |-262.2044     |-239.4827   |-2.2284        |-2.4610      |
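The reward columns above are related: in preference-optimization logs of this shape, Rewards/margins is the mean difference between the chosen and rejected rewards. The card does not name the trainer, so this reading is an assumption, but the reported numbers are consistent with it, as this small check shows.

```python
# (step, rewards_chosen, rewards_rejected, reported_margin) from the table above.
rows = [
    (56, -1.3684, -2.6385, 1.2701),
    (112, -1.0380, -1.9769, 0.9389),
    (168, -1.2594, -2.3704, 1.1110),
    (224, -1.1906, -2.3201, 1.1295),
]


def margin(chosen, rejected):
    """Reward margin: how much more the chosen answer is preferred."""
    return chosen - rejected


# ASSUMPTION: margins are chosen minus rejected; verify against the log.
computed = [round(margin(c, r), 4) for _, c, r, _ in rows]
for (step, _, _, reported), value in zip(rows, computed):
    print(f"step {step}: computed {value:.4f}, reported {reported:.4f}")
```

A positive margin means the model assigns a higher implicit reward to the preferred completion than to the rejected one.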

Framework versions

Transformers 4.35.0-UNA
Pytorch 2.1.0
Datasets 2.14.6
Tokenizers 0.14.1

Citations

If you find juanako useful, please cite:

@misc{juanako7buna,
  title={Juanako: Uniform Neural Alignment}, 
  author={Xavier Murias},
  year={2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/fblgit/juanako-7b-UNA}},
}

@misc{lin2021truthfulqa,
  title={TruthfulQA: Measuring How Models Mimic Human Falsehoods},
  author={Stephanie Lin and Jacob Hilton and Owain Evans},
  year={2021},
  eprint={2109.07958},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
@misc{tunstall2023zephyr,
      title={Zephyr: Direct Distillation of LM Alignment}, 
      author={Lewis Tunstall and Edward Beeching and Nathan Lambert and Nazneen Rajani and Kashif Rasul and Younes Belkada and Shengyi Huang and Leandro von Werra and Clémentine Fourrier and Nathan Habib and Nathan Sarrazin and Omar Sanseviero and Alexander M. Rush and Thomas Wolf},
      year={2023},
      eprint={2310.16944},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
@article{cobbe2021gsm8k,
  title={Training Verifiers to Solve Math Word Problems},
  author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John},
  journal={arXiv preprint arXiv:2110.14168},
  year={2021}
}
@inproceedings{Bisk2020,
  author = {Yonatan Bisk and Rowan Zellers and
            Ronan Le Bras and Jianfeng Gao
            and Yejin Choi},
  title = {PIQA: Reasoning about Physical Commonsense in
           Natural Language},
  booktitle = {Thirty-Fourth AAAI Conference on
               Artificial Intelligence},
  year = {2020},
}
@software{eval-harness,
  author       = {Gao, Leo and
                  Tow, Jonathan and
                  Biderman, Stella and
                  Black, Sid and
                  DiPofi, Anthony and
                  Foster, Charles and
                  Golding, Laurence and
                  Hsu, Jeffrey and
                  McDonell, Kyle and
                  Muennighoff, Niklas and
                  Phang, Jason and
                  Reynolds, Laria and
                  Tang, Eric and
                  Thite, Anish and
                  Wang, Ben and
                  Wang, Kevin and
                  Zou, Andy},
  title        = {A framework for few-shot language model evaluation},
  month        = sep,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {v0.0.1},
  doi          = {10.5281/zenodo.5371628},
  url          = {https://doi.org/10.5281/zenodo.5371628}
}
@misc{rafailov2023direct,
    title={Direct Preference Optimization: Your Language Model is Secretly a Reward Model}, 
    author={Rafael Rafailov and Archit Sharma and Eric Mitchell and Stefano Ermon and Christopher D. Manning and Chelsea Finn},
    year={2023},
    eprint={2305.18290},
    archivePrefix={arXiv},
}