metadata
license: llama2
tags:
  - merge
  - mergekit
model-index:
  - name: llama-2-26b-trenchcoat-stack
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 55.03
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=chargoddard/llama-2-26b-trenchcoat-stack
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 79.9
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=chargoddard/llama-2-26b-trenchcoat-stack
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 53.73
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=chargoddard/llama-2-26b-trenchcoat-stack
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 40.48
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=chargoddard/llama-2-26b-trenchcoat-stack
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 74.74
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=chargoddard/llama-2-26b-trenchcoat-stack
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 2.88
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=chargoddard/llama-2-26b-trenchcoat-stack
          name: Open LLM Leaderboard

Llama 2 13b is a pretty decent language model. You know what's probably better? Two Llama 2 13b models. In a trenchcoat.

Produced by bakllama.py with this config file:

layer_slices:
  - model: TheBloke/Llama-2-13B-fp16
    start: 0
    end: 40
  - model: TheBloke/Llama-2-13B-fp16
    start: 0
    end: 40

No fine-tuning was done on this model. Yes, it's still coherent somehow.
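To make the config above concrete, here is a minimal sketch of how the two slices could expand into a stacked layer plan. It is not taken from bakllama.py: the name `expand_layer_slices`, the `LayerRef` type, and the assumption that `end` is exclusive (so `0` to `40` covers the 40 decoder layers of Llama 2 13b) are all illustrative.

```python
from typing import NamedTuple


class LayerRef(NamedTuple):
    """Hypothetical record: which source layer feeds one output layer."""
    model: str
    source_layer: int


# The config from this card, written as a plain Python structure.
LAYER_SLICES = [
    {"model": "TheBloke/Llama-2-13B-fp16", "start": 0, "end": 40},
    {"model": "TheBloke/Llama-2-13B-fp16", "start": 0, "end": 40},
]


def expand_layer_slices(layer_slices):
    """Flatten the slices, in order, into one list of layer references."""
    plan = []
    for piece in layer_slices:
        # Assumption: `end` is exclusive, like Python's range().
        for index in range(piece["start"], piece["end"]):
            plan.append(LayerRef(piece["model"], index))
    return plan


plan = expand_layer_slices(LAYER_SLICES)
print(f"stacked model has {len(plan)} layers")
print("output layer 39 comes from source layer", plan[39].source_layer)
print("output layer 40 comes from source layer", plan[40].source_layer)
```

Under those assumptions the stacked model runs the full 13b layer stack twice in a row, so the hidden states leaving the last layer of the first copy go straight into the first layer of the second copy.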

Benchmark results:

| Benchmark | Llama2-13b | Llama2-26b-tcs | Percent Change |
|---|---|---|---|
| ARC | 59.3 | 55.03 | -7.2% |
| HellaSwag | 82.15 | 79.9 | -2.74% |
| MMLU | 55.67 | 53.73 | -3.48% |
| TruthfulQA | 37.39 | 40.48 | +8.26% |
| Average | 58.63 | 57.29 | -2.29% |
| Average Minus TQA | 65.71 | 62.89 | -4.29% |

This tells us two very important things:

  1. TruthfulQA is a perfect benchmark in every way.
  2. Llama models are amazingly robust to being fed their own output.
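For anyone who wants to check the arithmetic, the Percent Change column can be recomputed from the per-benchmark scores in the table above. This is a small sketch, assuming the column means relative change against the Llama2-13b score; the helper names `percent_change` and `mean` are made up for this example.

```python
# (Llama2-13b score, Llama2-26b-tcs score) as listed in the benchmark table.
SCORES = {
    "ARC": (59.3, 55.03),
    "HellaSwag": (82.15, 79.9),
    "MMLU": (55.67, 53.73),
    "TruthfulQA": (37.39, 40.48),
}


def percent_change(base, new):
    """Relative change from base to new, in percent."""
    return (new - base) / base * 100.0


def mean(values):
    values = list(values)
    return sum(values) / len(values)


for name, (base, stacked) in SCORES.items():
    print(f"{name:<18} {percent_change(base, stacked):+.2f}%")

# Averages over all four benchmarks.
base_avg = mean(base for base, _ in SCORES.values())
stacked_avg = mean(stacked for _, stacked in SCORES.values())
print(f"{'Average':<18} {percent_change(base_avg, stacked_avg):+.2f}%")

# Averages with TruthfulQA left out.
no_tqa = {name: pair for name, pair in SCORES.items() if name != "TruthfulQA"}
base_no_tqa = mean(base for base, _ in no_tqa.values())
stacked_no_tqa = mean(stacked for _, stacked in no_tqa.values())
print(f"{'Average Minus TQA':<18} {percent_change(base_no_tqa, stacked_no_tqa):+.2f}%")
```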

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---|
| Avg. | 51.13 |
| AI2 Reasoning Challenge (25-Shot) | 55.03 |
| HellaSwag (10-Shot) | 79.90 |
| MMLU (5-Shot) | 53.73 |
| TruthfulQA (0-shot) | 40.48 |
| Winogrande (5-shot) | 74.74 |
| GSM8k (5-shot) | 2.88 |