metadata

license: llama2
tags:
  - merge
  - mergekit
model-index:
  - name: llama-2-26b-trenchcoat-stack
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 55.03
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=chargoddard/llama-2-26b-trenchcoat-stack
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 79.9
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=chargoddard/llama-2-26b-trenchcoat-stack
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 53.73
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=chargoddard/llama-2-26b-trenchcoat-stack
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 40.48
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=chargoddard/llama-2-26b-trenchcoat-stack
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 74.74
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=chargoddard/llama-2-26b-trenchcoat-stack
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 2.88
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=chargoddard/llama-2-26b-trenchcoat-stack
          name: Open LLM Leaderboard

Llama 2 13b is a pretty decent language model. You know what's probably better? Two Llama 2 13b models. In a trenchcoat.

Produced by bakllama.py with this config file:

layer_slices:
  - model: TheBloke/Llama-2-13B-fp16
    start: 0
    end: 40
  - model: TheBloke/Llama-2-13B-fp16
    start: 0
    end: 40
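
To make the config concrete, here is a minimal sketch of the layer layout it describes. This is not bakllama.py's actual code: the `LayerSlice` class and `plan_stack` function are hypothetical names for illustration, and treating `end` as exclusive is an assumption (Llama 2 13B has 40 decoder layers, indexed 0 to 39).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSlice:
    """One entry of layer_slices (hypothetical helper, not part of bakllama.py)."""
    model: str
    start: int  # first source layer, inclusive
    end: int    # assumed exclusive, like a Python range


def plan_stack(slices):
    """Return (output_index, source_model, source_layer) for each output layer."""
    plan = []
    for s in slices:
        for src_layer in range(s.start, s.end):
            plan.append((len(plan), s.model, src_layer))
    return plan


# Mirrors the config above: all 40 layers of the same model, twice.
slices = [
    LayerSlice("TheBloke/Llama-2-13B-fp16", 0, 40),
    LayerSlice("TheBloke/Llama-2-13B-fp16", 0, 40),
]
plan = plan_stack(slices)

print(len(plan))  # total decoder layers in the stacked model
print(plan[39])   # last layer of the first copy
print(plan[40])   # first layer of the second copy
```

Under these assumptions the stacked model has 80 decoder layers, and output layer 40 is a copy of source layer 0, so the second copy receives the hidden states produced by the first copy's final layer.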

No fine-tuning was done on this model. Yes, it's still coherent somehow.

Benchmark results:

Benchmark	Llama2-13b	Llama2-26b-tcs	Percent Change
ARC	59.3	55.03	-7.2%
HellaSwag	82.15	79.9	-2.74%
MMLU	55.67	53.73	-3.48%
TruthfulQA	37.39	40.48	+8.26%
Average	58.63	57.29	-2.29%
Average Minus TQA	65.71	62.89	-4.29%
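
The derived columns can be recomputed from the per-benchmark scores with a few lines of Python. This sketch assumes "Percent Change" means relative change, (new - old) / old, and that the average rows are plain unweighted means; both are assumptions about how the table was built.

```python
def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# (Llama2-13b, Llama2-26b-tcs) scores copied from the table above.
scores = {
    "ARC": (59.3, 55.03),
    "HellaSwag": (82.15, 79.9),
    "MMLU": (55.67, 53.73),
    "TruthfulQA": (37.39, 40.48),
}

changes = {name: percent_change(old, new) for name, (old, new) in scores.items()}


def averages(names):
    """Unweighted mean of each model's scores over the given benchmarks."""
    base = sum(scores[n][0] for n in names) / len(names)
    stack = sum(scores[n][1] for n in names) / len(names)
    return base, stack


base_avg, stack_avg = averages(list(scores))
base_no_tqa, stack_no_tqa = averages([n for n in scores if n != "TruthfulQA"])

for name, change in changes.items():
    print(f"{name}: {change:+.2f}%")
print(f"Average: {base_avg:.2f} -> {stack_avg:.2f} "
      f"({percent_change(base_avg, stack_avg):+.2f}%)")
print(f"Average minus TQA: {base_no_tqa:.2f} -> {stack_no_tqa:.2f} "
      f"({percent_change(base_no_tqa, stack_no_tqa):+.2f}%)")
```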

This tells us two very important things:

1. TruthfulQA is a perfect benchmark in every way.
2. Llama models are amazingly robust to being fed their own output.

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	51.13
AI2 Reasoning Challenge (25-Shot)	55.03
HellaSwag (10-Shot)	79.90
MMLU (5-Shot)	53.73
TruthfulQA (0-shot)	40.48
Winogrande (5-shot)	74.74
GSM8k (5-shot)	2.88
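
As a quick check, the "Avg." row is consistent with an unweighted mean of the six scores below it. The sketch assumes that is how the leaderboard aggregates them.

```python
from statistics import mean

# Scores copied from the table above.
leaderboard = {
    "AI2 Reasoning Challenge (25-Shot)": 55.03,
    "HellaSwag (10-Shot)": 79.90,
    "MMLU (5-Shot)": 53.73,
    "TruthfulQA (0-shot)": 40.48,
    "Winogrande (5-shot)": 74.74,
    "GSM8k (5-shot)": 2.88,
}

avg = mean(list(leaderboard.values()))
print(f"Avg. {avg:.2f}")
```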