---
license: apache-2.0
datasets:
  - abacusai/MetaMathFewshot
  - shahules786/orca-chat
  - anon8231489123/ShareGPT_Vicuna_unfiltered
base_model: mistralai/Mistral-7B-v0.1
model-index:
  - name: Fewshot-Metamath-OrcaVicuna-Mistral-10B
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 56.4
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 78.12
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 59.52
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 50.98
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 76.48
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 13.27
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B
          name: Open LLM Leaderboard
---

The layer map used to expand the base model:

```json
{
  "layer_map": [
    [0, 16],
    [8, 24],
    [16, 32]
  ]
}
```


This model is a variation of abacusai/Fewshot-Metamath-OrcaVicuna-Mistral that builds on the idea of scaling models up by duplicating layers of the base model, in this case mistralai/Mistral-7B-v0.1. It relies on the functionality added in this PR https://github.com/huggingface/peft/pull/1368 to train a model with replicated layers without much extra GPU memory: although 48 layers have LoRA adapters attached, there are only 32 original layers, so memory usage is essentially the same as for the base 7B model.
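As a rough sketch of what this looks like in code (using the `layer_replication` option that the PR above added to PEFT's `LoraConfig`; the rank, alpha, and target modules below are illustrative assumptions, not the exact settings used for this model):

```python
# Minimal sketch: expand Mistral-7B to 48 decoder layers via PEFT's
# layer_replication and attach LoRA adapters to the replicated stack.
# Hyperparameters are illustrative assumptions, not this model's recipe.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

config = LoraConfig(
    r=8,                          # rank used for the 10B run in the ablation below
    lora_alpha=16,                # illustrative
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative
    # Each [start, end) slice reuses the base model's weights, so
    # [0,16) + [8,24) + [16,32) stacks 48 layers built from 32 originals.
    layer_replication=[(0, 16), (8, 24), (16, 32)],
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only adapters train; base weights are shared
```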

This is just a demonstration model intended to show how the approach can be used; the goal is to apply it to much larger models. For example, models like Goliath or MegaDolphin are effectively 120B models, but with this approach they would only require the memory of the underlying 70B base model, plus a little extra for the LoRA adapter layers.

In our training runs we did find a difference in the behavior of the eval loss:

*(figure: eval loss curve for the expanded 10B LoRA finetune)*

vs the loss curve for the original LoRA finetune of the 7B model

*(figure: eval loss curve for the original 7B LoRA finetune)*

The larger model reached a best eval loss of 0.3915, versus 0.3971 for the 7B finetune, and did so in far fewer steps.

Overall, we think this is a promising approach to accessing much larger models without significantly more resources.

## Performance on Metrics

To do a proper ablation, we compared the performance of four models trained for ~1 epoch on the combined datasets (MetaMath, Orca, ShareGPT). Here are the results:

| Model | Trainable Params | Train Loss | Eval Loss | GSM8K | TruthfulQA |
|---|---|---|---|---|---|
| Mistral 7B | 0 | - | - | 0.374 | 0.426 |
| Mistral 10B | 0 | - | - | 0.290 | 0.407 |
| Mistral 7B + LoRA r=12 | 31M | 0.412 | 0.366 | 0.514 | 0.499 |
| Mistral 10B + LoRA r=8 | 31M | 0.401 | 0.363 | 0.663 | 0.540 |

This ablation compares the base model (Mistral 7B), the expansion using the layer map described above, a LoRA r=12 finetune of the base model, and a LoRA r=8 finetune of the expanded model (the ranks chosen to match trainable parameters). It demonstrates quite clearly that finetuning the expanded model leads to a significant improvement in metrics, even with the same number of trainable parameters (and training steps).
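As a back-of-the-envelope check on why r=12 over 32 layers and r=8 over 48 layers give comparable budgets (assuming, purely for illustration, that LoRA is applied to all seven linear projections of each Mistral decoder block; the card does not state the exact target modules), trainable LoRA parameters scale with rank × number of adapted layers, and 12 × 32 = 8 × 48 = 384:

```python
# Rough estimate of trainable LoRA parameters for the two finetunes in the
# table above. Assumes (illustratively) that LoRA targets all seven linear
# projections in each Mistral-7B decoder block; hidden size 4096, 8 KV heads
# (k/v projections 4096x1024), MLP width 14336.
HIDDEN, KV, FFN = 4096, 1024, 14336

# (in_features, out_features) of each adapted projection per layer
SHAPES = [
    (HIDDEN, HIDDEN),  # q_proj
    (HIDDEN, KV),      # k_proj
    (HIDDEN, KV),      # v_proj
    (HIDDEN, HIDDEN),  # o_proj
    (HIDDEN, FFN),     # gate_proj
    (HIDDEN, FFN),     # up_proj
    (FFN, HIDDEN),     # down_proj
]

def lora_params(rank: int, n_layers: int) -> int:
    # Each adapted Linear(in, out) adds A (rank x in) plus B (out x rank).
    per_layer = sum(rank * (i + o) for i, o in SHAPES)
    return per_layer * n_layers

print(f"7B  + LoRA r=12: {lora_params(12, 32) / 1e6:.1f}M")  # ~31.5M
print(f"10B + LoRA r=8 : {lora_params(8, 48) / 1e6:.1f}M")   # ~31.5M
```

Under that assumption both configurations come out to roughly 31M trainable parameters, consistent with the table.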

## Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---|
| Avg. | 55.79 |
| AI2 Reasoning Challenge (25-Shot) | 56.40 |
| HellaSwag (10-Shot) | 78.12 |
| MMLU (5-Shot) | 59.52 |
| TruthfulQA (0-shot) | 50.98 |
| Winogrande (5-shot) | 76.48 |
| GSM8k (5-shot) | 13.27 |