---
language:
  - en
  - de
license: mit
model-index:
  - name: LexGPT-V3
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 66.47
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lex-hue/LexGPT-V3
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 85.91
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lex-hue/LexGPT-V3
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 64.48
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lex-hue/LexGPT-V3
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 59.98
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lex-hue/LexGPT-V3
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 78.53
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lex-hue/LexGPT-V3
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 61.56
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lex-hue/LexGPT-V3
          name: Open LLM Leaderboard
---

This model was a test training run to see how our new training algorithm and data perform.

The model is based on Mistral v0.1.

As this was only a test run, we simply evaluated it; the results are below. The model has not improved.

| Model           | Turn 1 Score | Turn 2 Score | Average Score |
|-----------------|--------------|--------------|---------------|
| gpt-4           | 8.95625      | 9.025000     | 8.990625      |
| gpt-3.5-turbo   | 8.075000     | 7.943750     | 7.943750      |
| claude-v1       | 8.150000     | 7.900000     | 8.025000      |
| LexGPT-V3       | 8.14375      | 7.719355     | 7.926667      |
| vicuna-13b-v1.3 | 6.812500     | 5.962500     | 6.387500      |
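
Since the model is based on Mistral v0.1, it should load through the standard `transformers` causal-LM API. A minimal sketch, assuming the weights are hosted at `lex-hue/LexGPT-V3` (the repo id from the leaderboard entry below) and that a plain-text prompt works; the exact prompt template is not documented here:

```python
# Minimal sketch: loading and prompting LexGPT-V3 with transformers.
# Assumes a standard Mistral-v0.1-style checkpoint at lex-hue/LexGPT-V3;
# the prompt format below is an assumption, not the documented template.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lex-hue/LexGPT-V3"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision fits a 7B-class model on one 24 GB GPU
    device_map="auto",
)

prompt = "Explain the difference between a test run and a full training run."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```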

## Open LLM Leaderboard Evaluation Results

Detailed results can be found [here](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lex-hue/LexGPT-V3).

| Metric                            | Value |
|-----------------------------------|-------|
| Avg.                              | 69.49 |
| AI2 Reasoning Challenge (25-Shot) | 66.47 |
| HellaSwag (10-Shot)               | 85.91 |
| MMLU (5-Shot)                     | 64.48 |
| TruthfulQA (0-shot)               | 59.98 |
| Winogrande (5-shot)               | 78.53 |
| GSM8k (5-shot)                    | 61.56 |
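
The reported average is the arithmetic mean of the six task scores; a quick sanity check:

```python
# Sanity check: the leaderboard "Avg." is the mean of the six task scores.
scores = {
    "AI2 Reasoning Challenge (25-shot)": 66.47,
    "HellaSwag (10-shot)": 85.91,
    "MMLU (5-shot)": 64.48,
    "TruthfulQA (0-shot)": 59.98,
    "Winogrande (5-shot)": 78.53,
    "GSM8k (5-shot)": 61.56,
}
avg = sum(scores.values()) / len(scores)
print(f"{avg:.2f}")  # 69.49
```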