license: cc
tags:
  - mergekit
  - merge
base_model:
  - macadeliccc/MBX-7B-v3-DPO
  - mlabonne/OmniBeagle-7B
model-index:
  - name: OmniCorso-7B
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 72.7
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/OmniCorso-7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 88.7
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/OmniCorso-7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 64.91
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/OmniCorso-7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 73.43
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/OmniCorso-7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 83.74
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/OmniCorso-7B
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 70.96
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=macadeliccc/OmniCorso-7B
          name: Open LLM Leaderboard

OmniCorso-7B

image/webp

Code Example

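from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and the merged model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("macadeliccc/OmniCorso-7B")
model = AutoModelForCausalLM.from_pretrained("macadeliccc/OmniCorso-7B")

messages = [
    {"role": "system", "content": "Respond to the user's request like a pirate"},
    {"role": "user", "content": "Can you write me a quicksort algorithm?"}
]
# Render the conversation with the tokenizer's chat template; add_generation_prompt
# appends the assistant header so the model replies instead of continuing the user turn
gen_input = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

# Generate a reply and decode it (max_new_tokens=256 is an arbitrary illustrative limit)
output = model.generate(gen_input, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))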

The following models were included in the merge:

- mlabonne/OmniBeagle-7B
- macadeliccc/MBX-7B-v3-DPO

Configuration

The following YAML configuration was used to produce this model:

slices:
  - sources:
      - model: mlabonne/OmniBeagle-7B
        layer_range: [0, 32]
      - model: macadeliccc/MBX-7B-v3-DPO
        layer_range: [0, 32]
merge_method: slerp
base_model: macadeliccc/MBX-7B-v3-DPO
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: bfloat16
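
As a rough illustration of what this configuration asks for, the sketch below shows spherical linear interpolation (SLERP) between two weight vectors and one plausible way a five-value `t` list could be spread over 32 layers. It is not mergekit's code: the helper names `layer_t` and `slerp` are hypothetical, the even spacing and linear interpolation of the anchor values are assumptions, and so is the convention that `t = 0` keeps the first vector and `t = 1` keeps the second.

```python
import math

def layer_t(anchors, layer, num_layers):
    """Linearly interpolate a list of anchor values across layer indices (assumed scheme)."""
    if len(anchors) == 1 or num_layers == 1:
        return float(anchors[0])
    # Map the layer index onto the anchor positions 0 .. len(anchors) - 1
    pos = layer / (num_layers - 1) * (len(anchors) - 1)
    lo = min(int(pos), len(anchors) - 2)
    frac = pos - lo
    return anchors[lo] * (1 - frac) + anchors[lo + 1] * frac

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two flat weight vectors."""
    n0 = math.sqrt(sum(x * x for x in v0))
    n1 = math.sqrt(sum(x * x for x in v1))
    # Cosine of the angle between the two vectors, clamped for acos
    dot = sum(a * b for a, b in zip(v0, v1)) / (n0 * n1 + eps)
    dot = max(-1.0, min(1.0, dot))
    theta = math.acos(dot)
    if abs(math.sin(theta)) < eps:
        # Nearly colinear vectors: fall back to plain linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]

# Anchor values copied from the configuration above
SELF_ATTN_T = [0, 0.5, 0.3, 0.7, 1]
MLP_T = [1, 0.5, 0.7, 0.3, 0]

for layer in (0, 8, 16, 24, 31):
    attn_t = layer_t(SELF_ATTN_T, layer, 32)
    mlp_t = layer_t(MLP_T, layer, 32)
    print(layer, round(attn_t, 3), round(mlp_t, 3))
```

Under these assumptions the two gradients mirror each other: the attention weights of the first layer come entirely from one model while its MLP weights come entirely from the other, and the roles are reversed by the last layer. Tensors matched by neither filter use the constant `0.5`.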

Quantizations

GGUF

ExLlamaV2

Quants are available thanks to user bartowski; check them out here.

| Branch | Bits | lm_head bits | VRAM (4k) | VRAM (16k) | VRAM (32k) | Description |
|---|---|---|---|---|---|---|
| 8_0 | 8.0 | 8.0 | 8.4 GB | 9.8 GB | 11.8 GB | Maximum quality that ExLlamaV2 can produce, near unquantized performance. |
| 6_5 | 6.5 | 8.0 | 7.2 GB | 8.6 GB | 10.6 GB | Very similar to 8.0, good tradeoff of size vs performance, recommended. |
| 5_0 | 5.0 | 6.0 | 6.0 GB | 7.4 GB | 9.4 GB | Slightly lower quality vs 6.5, but usable on 8GB cards. |
| 4_25 | 4.25 | 6.0 | 5.3 GB | 6.7 GB | 8.7 GB | GPTQ equivalent bits per weight, slightly higher quality. |
| 3_5 | 3.5 | 6.0 | 4.7 GB | 6.1 GB | 8.1 GB | Lower quality, only use if you have to. |
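
To make the table easier to act on, here is a small sketch that picks the highest-bit branch whose listed VRAM figure fits a given budget. The function name `pick_branch` is hypothetical, and the figures are simply the ones listed above; they do not account for whatever else is using GPU memory, so treat the result as a starting point.

```python
# (branch, bits per weight, listed VRAM in GB keyed by context length in thousands of tokens)
BRANCHES = [
    ("8_0", 8.0, {4: 8.4, 16: 9.8, 32: 11.8}),
    ("6_5", 6.5, {4: 7.2, 16: 8.6, 32: 10.6}),
    ("5_0", 5.0, {4: 6.0, 16: 7.4, 32: 9.4}),
    ("4_25", 4.25, {4: 5.3, 16: 6.7, 32: 8.7}),
    ("3_5", 3.5, {4: 4.7, 16: 6.1, 32: 8.1}),
]

def pick_branch(vram_gb, context_k):
    """Return the highest-bit branch whose listed VRAM fits, or None if none does."""
    for name, _bits, vram in BRANCHES:  # ordered from highest to lowest bits
        if vram[context_k] <= vram_gb:
            return name
    return None

print(pick_branch(8.0, 4))   # 6_5
print(pick_branch(8.0, 16))  # 5_0
print(pick_branch(8.0, 32))  # None
```

With an 8 GB budget, for example, the listed numbers leave room for `6_5` at 4k context and `5_0` at 16k, while no branch fits at 32k.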

Evaluations

----Benchmark Complete----
2024-02-11 15:34:40
Time taken: 178.3 mins
Prompt Format: ChatML
Model: macadeliccc/OmniCorso-7B
Score (v2): 73.75
Parseable: 167.0
---------------
Batch completed
Time taken: 178.3 mins
---------------
| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|---|---|---|---|---|---|
| OmniCorso-7B | 45.89 | 77.66 | 74.12 | 49.24 | 61.73 |

AGIEval

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 29.13 | ± 2.86 |
| | | acc_norm | 27.17 | ± 2.80 |
| agieval_logiqa_en | 0 | acc | 39.32 | ± 1.92 |
| | | acc_norm | 39.63 | ± 1.92 |
| agieval_lsat_ar | 0 | acc | 23.91 | ± 2.82 |
| | | acc_norm | 23.91 | ± 2.82 |
| agieval_lsat_lr | 0 | acc | 53.14 | ± 2.21 |
| | | acc_norm | 53.92 | ± 2.21 |
| agieval_lsat_rc | 0 | acc | 66.54 | ± 2.88 |
| | | acc_norm | 67.29 | ± 2.87 |
| agieval_sat_en | 0 | acc | 80.58 | ± 2.76 |
| | | acc_norm | 80.58 | ± 2.76 |
| agieval_sat_en_without_passage | 0 | acc | 45.63 | ± 3.48 |
| | | acc_norm | 43.69 | ± 3.46 |
| agieval_sat_math | 0 | acc | 33.18 | ± 3.18 |
| | | acc_norm | 30.91 | ± 3.12 |

Average: 45.89%

GPT4All

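| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| arc_challenge | 0 | acc | 67.32 | ± 1.37 |
| | | acc_norm | 68.43 | ± 1.36 |
| arc_easy | 0 | acc | 87.46 | ± 0.68 |
| | | acc_norm | 83.50 | ± 0.76 |
| boolq | 1 | acc | 88.13 | ± 0.57 |
| hellaswag | 0 | acc | 68.47 | ± 0.46 |
| | | acc_norm | 86.96 | ± 0.34 |
| openbookqa | 0 | acc | 38.80 | ± 2.18 |
| | | acc_norm | 50.00 | ± 2.24 |
| piqa | 0 | acc | 83.03 | ± 0.88 |
| | | acc_norm | 85.31 | ± 0.83 |
| winogrande | 0 | acc | 81.29 | ± 1.10 |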

Average: 77.66%
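
The suite average can be reproduced from the per-task numbers. The sketch below assumes the average takes `acc_norm` where a task reports it and falls back to `acc` otherwise; that rule is inferred from the figures rather than stated by the benchmark output, and `suite_average` is a hypothetical helper.

```python
# Per-task GPT4All results copied from the table above
GPT4ALL = {
    "arc_challenge": {"acc": 67.32, "acc_norm": 68.43},
    "arc_easy": {"acc": 87.46, "acc_norm": 83.50},
    "boolq": {"acc": 88.13},
    "hellaswag": {"acc": 68.47, "acc_norm": 86.96},
    "openbookqa": {"acc": 38.80, "acc_norm": 50.00},
    "piqa": {"acc": 83.03, "acc_norm": 85.31},
    "winogrande": {"acc": 81.29},
}

def suite_average(results):
    """Mean over tasks, preferring acc_norm and falling back to acc (assumed rule)."""
    scores = [metrics.get("acc_norm", metrics["acc"]) for metrics in results.values()]
    return sum(scores) / len(scores)

print(f"{suite_average(GPT4ALL):.2f}")  # 77.66
```

The same rule applied to the AGIEval `acc_norm` column gives 45.89, matching that suite's reported average.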

TruthfulQA

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| truthfulqa_mc | 1 | mc1 | 58.26 | ± 1.73 |
| | | mc2 | 74.12 | ± 1.43 |

Average: 74.12%

Bigbench

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 56.84 | ± 3.60 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 63.41 | ± 2.51 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 49.22 | ± 3.12 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 23.96 | ± 2.26 |
| | | exact_str_match | 1.39 | ± 0.62 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 34.20 | ± 2.12 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 23.71 | ± 1.61 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 60.33 | ± 2.83 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 49.00 | ± 2.24 |
| bigbench_navigate | 0 | multiple_choice_grade | 55.20 | ± 1.57 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 70.75 | ± 1.02 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 55.80 | ± 2.35 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 36.97 | ± 1.53 |
| bigbench_snarks | 0 | multiple_choice_grade | 72.38 | ± 3.33 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 76.27 | ± 1.36 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 54.50 | ± 1.58 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 23.12 | ± 1.19 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 20.34 | ± 0.96 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 60.33 | ± 2.83 |

Average: 49.24%

Average score: 61.73%

Elapsed time: 02:20:06

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---|
| Avg. | 75.74 |
| AI2 Reasoning Challenge (25-Shot) | 72.70 |
| HellaSwag (10-Shot) | 88.70 |
| MMLU (5-Shot) | 64.91 |
| TruthfulQA (0-shot) | 73.43 |
| Winogrande (5-shot) | 83.74 |
| GSM8k (5-shot) | 70.96 |
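
As a quick sanity check, the `Avg.` row appears to be the unweighted mean of the six benchmark scores. A minimal sketch, with the dictionary name `LEADERBOARD` chosen only for illustration:

```python
# Scores copied from the Open LLM Leaderboard table above
LEADERBOARD = {
    "AI2 Reasoning Challenge (25-Shot)": 72.70,
    "HellaSwag (10-Shot)": 88.70,
    "MMLU (5-Shot)": 64.91,
    "TruthfulQA (0-shot)": 73.43,
    "Winogrande (5-shot)": 83.74,
    "GSM8k (5-shot)": 70.96,
}

# Unweighted mean across the six benchmarks
average = sum(LEADERBOARD.values()) / len(LEADERBOARD)
print(f"{average:.2f}")  # 75.74
```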