---
license: apache-2.0
tags:
- alignment-handbook
- generated_from_trainer
- juanako
- mistral
- UNA
datasets:
- HuggingFaceH4/ultrafeedback_binarized
model-index:
- name: juanako-7b-UNA
  results:
  - task:
      type: text-generation
      name: TruthfulQA (MC2)
    dataset:
      type: text-generation
      name: truthful_qa
      config: multiple_choice
      split: validation
    metrics:
    - type: accuracy
      value: 65.49
  - task:
      type: text-generation
      name: ARC-Challenge
    dataset:
      type: text-generation
      name: ai2_arc
      config: ARC-Challenge
      split: test
    metrics:
    - type: accuracy
      value: 68.09
  - task:
      type: text-generation
      name: HellaSwag
    dataset:
      type: text-generation
      name: Rowan/hellaswag
      split: test
    metrics:
    - type: accuracy
      value: 85.20
  - task:
      type: text-generation
      name: GSM8k
    dataset:
      type: text-generation
      name: gsm8k
      config: main
      split: test
    metrics:
    - type: accuracy
      value: 48.98
  - task:
      type: text-generation
      name: Winogrande
    dataset:
      type: text-generation
      name: winogrande
      config: winogrande_debiased
      split: test
    metrics:
    - type: accuracy
      value: 76.8
  - task:
      type: text-generation
      name: MMLU
    dataset:
      type: text-generation
      name: cais/mmlu
      config: all
      split: test
    metrics:
    - type: accuracy
      value: 61.37
  - task:
      type: text-generation
      name: PiQA
    dataset:
      type: text-generation
      name: piqa
      split: test
    metrics:
    - type: accuracy
      value: 83.57
  - task:
      type: text-generation
      name: DROP
    dataset:
      type: text-generation
      name: drop
      split: validation
    metrics:
    - type: accuracy
      value: 49.8
  - task:
      type: text-generation
      name: PubMedQA
    dataset:
      type: text-generation
      name: bigbio/pubmed_qa
      config: pubmed_qa_artificial_bigbio_qa
      split: validation
    metrics:
    - type: accuracy
      value: 76.0
---

# juanako-7b-UNA-v2

This model is a fine-tuned version of [fblgit/juanako-7b-UNA-v2-phase-1](https://huggingface.co/fblgit/juanako-7b-UNA-v2-phase-1) on the HuggingFaceH4/ultrafeedback_binarized dataset.

It outperforms most current Mistral-based models in many respects and is the **latest and most powerful juanako version as of now**.

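The benchmark accuracies declared in this card can be cross-checked against the HuggingFace LLM Leaderboard score of 65.1 quoted in this card. The sketch below assumes that the leaderboard average was the unweighted mean of seven benchmarks (ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8k and DROP). That selection is an assumption, not something the card states, and the helper name `leaderboard_average` is hypothetical.

```python
# Accuracies (in percent) as reported in this card.
# Assumption: these seven benchmarks made up the leaderboard average.
SCORES = {
    "ARC-Challenge": 68.09,
    "HellaSwag": 85.20,
    "MMLU": 61.37,
    "TruthfulQA (MC2)": 65.49,
    "Winogrande": 76.8,
    "GSM8k": 48.98,
    "DROP": 49.8,
}


def leaderboard_average(scores: dict[str, float]) -> float:
    """Return the unweighted mean of the given benchmark scores."""
    return sum(scores.values()) / len(scores)


if __name__ == "__main__":
    # Prints the mean rounded to one decimal place.
    print(f"average: {leaderboard_average(SCORES):.1f}")
```

Under that assumption the mean rounds to 65.1, which agrees with the quoted leaderboard score. Dropping DROP from the dictionary gives a noticeably higher mean, so the result depends on which benchmarks are included.
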
## Scoring and records (26-November-2023)

Here are some results:
* Scores #1 7B Model
* Scores #4 GSM8k
* Scores #2 in TruthfulQA
* Scores #6 in CoPa
* Scores #2 in PiQA
* Scores #9 in BoolQ

Many evaluations were performed, and the model behaves in a well-balanced way across multiple fields. Feel free to submit more evaluation results.

It scores **65.1** according to the HuggingFace LLM Leaderboard.

Author [Xavier M.](mailto:xavi@juanako.ai) @fblgit

## Model description

juanako uses UNA (Uniform Neural Alignment), a training technique, yet to be published, that eases alignment between transformer layers.

## TruthfulQA 0-Shot
```
|    Tasks     |Version|Filter|Metric|Value |   |Stderr|
|--------------|-------|------|------|-----:|---|-----:|
|truthfulqa_mc2|Yaml   |none  |acc   |0.6549|±  |0.0153|
```

## ARC 25-Shot
```
|    Tasks    |Version|Filter| Metric |Value |   |Stderr|
|-------------|-------|------|--------|-----:|---|-----:|
|arc_challenge|Yaml   |none  |acc     |0.6476|±  |0.0140|
|             |       |none  |acc_norm|0.6809|±  |0.0136|
```

## HellaSwag 10-Shot
```
|  Tasks  |Version|Filter| Metric |Value |   |Stderr|
|---------|-------|------|--------|-----:|---|-----:|
|hellaswag|Yaml   |none  |acc     |0.6703|±  |0.0047|
|         |       |none  |acc_norm|0.8520|±  |0.0035|
```

## GSM8k 5-Shot
```
|Tasks|Version|  Filter  |  Metric   |Value |   |Stderr|
|-----|-------|----------|-----------|-----:|---|-----:|
|gsm8k|Yaml   |get-answer|exact_match|0.4898|±  |0.0138|
```

## GPT Evaluations 0-Shot
```
|    Tasks     |Version|Filter|  Metric  |Value |   |Stderr|
|--------------|-------|------|----------|-----:|---|-----:|
|boolq         |Yaml   |none  |acc       |0.8703|±  |0.0059|
|lambada_openai|Yaml   |none  |perplexity|3.2598|±  |0.0705|
|              |       |none  |acc       |0.7336|±  |0.0062|
|piqa          |Yaml   |none  |acc       |0.8254|±  |0.0089|
|              |       |none  |acc_norm  |0.8292|±  |0.0088|
|sciq          |Yaml   |none  |acc       |0.9580|±  |0.0063|
|              |       |none  |acc_norm  |0.9130|±  |0.0089|
```

## MathQA 0-Shot
```
|Tasks |Version|Filter| Metric |Value |   |Stderr|
|------|-------|------|--------|-----:|---|-----:|
|mathqa|Yaml   |none  |acc     |0.3752|±  |0.0089|
|      |       |none  |acc_norm|0.3772|±  |0.0089|
```

## PiQa 1-Shot
```
|Tasks|Version|Filter| Metric |Value |   |Stderr|
|-----|-------|------|--------|-----:|---|-----:|
|piqa |Yaml   |none  |acc     |0.8308|±  |0.0087|
|     |       |none  |acc_norm|0.8357|±  |0.0086|
```

## Winogrande 5-Shot
```
|  Tasks   |Version|Filter|Metric|Value|   |Stderr|
|----------|-------|------|------|----:|---|-----:|
|winogrande|Yaml   |none  |acc   |0.768|±  |0.0119|
```

## PubMedQA 0-Shot
```
| Tasks  |Version|Filter|Metric|Value|   |Stderr|
|--------|-------|------|------|----:|---|-----:|
|pubmedqa|Yaml   |none  |acc   | 0.76|±  |0.0191|
```

## RACE 1-Shot
```
|Tasks|Version|Filter|Metric|Value |   |Stderr|
|-----|-------|------|------|-----:|---|-----:|
|race |Yaml   |none  |acc   |0.5282|±  |0.0154|
```

## MMLU 5-Shot (8-Bit)
```
|      Groups      |Version|Filter|Metric|Value |   |Stderr|
|------------------|-------|------|------|-----:|---|-----:|
|mmlu              |N/A    |none  |acc   |0.6137|±  |0.1243|
| - humanities     |N/A    |none  |acc   |0.5671|±  |0.1101|
| - other          |N/A    |none  |acc   |0.6859|±  |0.1164|
| - social_sciences|N/A    |none  |acc   |0.7195|±  |0.0713|
| - stem           |N/A    |none  |acc   |0.5087|±  |0.1297|
```

## DROP 3-Shot (8-Bit) (Instruct-Eval)
```
{'score': 0.49801113762927607}
{'drop': 49.8}
drop: 49.8
```

## CRASS 0-Shot (Instruct-Eval)
```
{'score': 0.8357664233576643}
{'crass': 83.58}
crass: 83.58
```

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 14
- gradient_accumulation_steps: 16
- total_train_batch_size: 224
- total_eval_batch_size: 14
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.01
- num_epochs: 1

### Training results

| Training Loss | Epoch | Step | Validation Loss | Rewards/chosen | Rewards/rejected | Rewards/accuracies | Rewards/margins | Logps/rejected | Logps/chosen | Logits/rejected | Logits/chosen |
|:-------------:|:-----:|:----:|:---------------:|:--------------:|:----------------:|:------------------:|:---------------:|:--------------:|:------------:|:---------------:|:-------------:|
| 0.4795 | 0.2 | 56 | 0.4958 | -1.3684 | -2.6385 | 0.7552 | 1.2701 | -265.3887 | -241.2612 | -2.2572 | -2.4922 |
| 0.4642 | 0.4 | 112 | 0.4859 | -1.0380 | -1.9769 | 0.7273 | 0.9389 | -258.7718 | -237.9569 | -2.2414 | -2.4751 |
| 0.4758 | 0.61 | 168 | 0.4808 | -1.2594 | -2.3704 | 0.7343 | 1.1110 | -262.7074 | -240.1708 | -2.2305 | -2.4633 |
| 0.4549 | 0.81 | 224 | 0.4768 | -1.1906 | -2.3201 | 0.7552 | 1.1295 | -262.2044 | -239.4827 | -2.2284 | -2.4610 |

### Framework versions

- Transformers 4.35.0-UNA
- Pytorch 2.1.0
- Datasets 2.14.6
- Tokenizers 0.14.1

## Citations

If you find juanako useful, please cite:

```
@misc{juanako7buna,
  title={Juanako: Uniform Neural Alignment},
  author={Xavier Murias},
  year={2023},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/fblgit/juanako-7b-UNA}},
}
```

```
@misc{lin2021truthfulqa,
  title={TruthfulQA: Measuring How Models Mimic Human Falsehoods},
  author={Stephanie Lin and Jacob Hilton and Owain Evans},
  year={2021},
  eprint={2109.07958},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
@misc{tunstall2023zephyr,
  title={Zephyr: Direct Distillation of LM Alignment},
  author={Lewis Tunstall and Edward Beeching and Nathan Lambert and Nazneen Rajani and Kashif Rasul and Younes Belkada and Shengyi Huang and Leandro von Werra and Clémentine Fourrier and Nathan Habib and Nathan Sarrazin and Omar Sanseviero and Alexander M. Rush and Thomas Wolf},
  year={2023},
  eprint={2310.16944},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
@article{cobbe2021gsm8k,
  title={Training Verifiers to Solve Math Word Problems},
  author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John},
  journal={arXiv preprint arXiv:2110.14168},
  year={2021}
}
@inproceedings{Bisk2020,
  author = {Yonatan Bisk and Rowan Zellers and Ronan Le Bras and Jianfeng Gao and Yejin Choi},
  title = {PIQA: Reasoning about Physical Commonsense in Natural Language},
  booktitle = {Thirty-Fourth AAAI Conference on Artificial Intelligence},
  year = {2020},
}
@software{eval-harness,
  author = {Gao, Leo and Tow, Jonathan and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and McDonell, Kyle and Muennighoff, Niklas and Phang, Jason and Reynolds, Laria and Tang, Eric and Thite, Anish and Wang, Ben and Wang, Kevin and Zou, Andy},
  title = {A framework for few-shot language model evaluation},
  month = sep,
  year = 2021,
  publisher = {Zenodo},
  version = {v0.0.1},
  doi = {10.5281/zenodo.5371628},
  url = {https://doi.org/10.5281/zenodo.5371628}
}
@misc{rafailov2023direct,
  title={Direct Preference Optimization: Your Language Model is Secretly a Reward Model},
  author={Rafael Rafailov and Archit Sharma and Eric Mitchell and Stefano Ermon and Christopher D. Manning and Chelsea Finn},
  year={2023},
  eprint={2305.18290},
  archivePrefix={arXiv},
}
```
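
The training hyperparameters and training results reported in this card can be sanity-checked with a little arithmetic. The sketch below assumes two common conventions: that the total train batch size is the per-device batch size times the number of devices times the gradient accumulation steps, and that the reward margin is the chosen reward minus the rejected reward. Both conventions are assumptions about the trainer; the numbers are copied from the card, and the variable and function names are illustrative.

```python
# Values copied from the "Training hyperparameters" list in this card.
train_batch_size = 1
num_devices = 14
gradient_accumulation_steps = 16

# Assumed convention: effective batch = per-device batch * devices * accumulation.
total_train_batch_size = (
    train_batch_size * num_devices * gradient_accumulation_steps
)

# (step, rewards/chosen, rewards/rejected, rewards/margins) from "Training results".
rows = [
    (56, -1.3684, -2.6385, 1.2701),
    (112, -1.0380, -1.9769, 0.9389),
    (168, -1.2594, -2.3704, 1.1110),
    (224, -1.1906, -2.3201, 1.1295),
]


def margin_matches(chosen, rejected, reported, tol=1e-3):
    """Check that chosen - rejected equals the reported margin within tol."""
    return abs((chosen - rejected) - reported) <= tol


# True only if every reported margin is consistent with its two rewards.
margins_ok = all(margin_matches(c, r, m) for _, c, r, m in rows)

if __name__ == "__main__":
    print(total_train_batch_size, margins_ok)
```

With the values above, the product reproduces the reported total_train_batch_size of 224, and each reported margin equals the chosen reward minus the rejected reward to within rounding.
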