---
language:
  - en
model-index:
  - name: Dans-CreepingSenseOfDoom
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 53.33
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=PocketDoc/Dans-CreepingSenseOfDoom
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 78.9
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=PocketDoc/Dans-CreepingSenseOfDoom
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 48.09
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=PocketDoc/Dans-CreepingSenseOfDoom
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 37.84
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=PocketDoc/Dans-CreepingSenseOfDoom
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 73.32
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=PocketDoc/Dans-CreepingSenseOfDoom
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 0
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=PocketDoc/Dans-CreepingSenseOfDoom
          name: Open LLM Leaderboard
---

## What is the model for?

This model is proficient at crafting text-based adventure games. It can deliver both concise replies and more expansive, novel-like descriptions; the response style is switched by a distinct system message.

## What's in the sausage?

This model was trained on Holodeck-1 using a deduped version of the skein text adventure dataset augmented with system messages using the 'Metharme' prompting format.

## Prompt format

Consistent with the Pygmalion Metharme format, shown below:

```
<|system|>{system message here}<|user|>{user action here}<|model|>{model response}
<|system|>{system message here}<|model|>{model response}
<|system|>{system message here}<|user|>{user action here}<|model|>{model response}<|user|>{user action here}<|model|>{model response}
```

## Examples

For shorter responses:

```
<|system|>Mode: Adventure
Theme: Science Fiction, cats, money, aliens, space, stars, siblings, future, trade
Tense: Second person present
Extra: Short response length<|user|>you look around<|model|>{CURSOR HERE}
```

```
<|system|>You are a dungeon master of sorts, guiding the reader through a story based on the following themes: Lovecraftian, Horror, city, research. Do not be afraid to get creative with your responses or to tell them they can't do something when it doesn't make sense for the situation. Narrate their actions and observations as they occur and drive the story forward.<|user|>you look around<|model|>{CURSOR HERE}
```

For longer, novel-like responses:

```
<|system|>You're tasked with creating an interactive story around the genres of historical, RPG, serious. Guide the user through this tale, describing their actions and surroundings using second person present tense. Lengthy and descriptive responses will enhance the experience.<|user|>you look around<|model|>{CURSOR HERE}
```

With a model message first:

```
<|system|>Mode: Story
Theme: fantasy, female protagonist, grimdark
Perspective and Tense: Second person present
Directions: Write something to hook the user into the story then narrate their actions and observations as they occur while driving the story forward.<|model|>{CURSOR HERE}
```
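Prompts in this format are just concatenated strings with role tokens, so they are easy to build programmatically. The sketch below assembles one; `build_prompt` is a hypothetical helper name, not part of any library:

```python
def build_prompt(system, turns):
    """Assemble a Metharme-style prompt string (a minimal sketch).

    `turns` is a list of (role, text) pairs where role is "user" or "model";
    a trailing "model" turn with empty text leaves the prompt open so the
    model continues from <|model|>.
    """
    prompt = f"<|system|>{system}"
    for role, text in turns:
        prompt += f"<|{role}|>{text}"
    return prompt

prompt = build_prompt(
    "Mode: Adventure\nExtra: Short response length",
    [("user", "you look around"), ("model", "")],
)
```

The resulting string can be fed directly to the model as its input text; because the prompt ends at `<|model|>`, generation picks up exactly where the cursor markers sit in the examples above.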

## Some quick and dirty training details

- Built with Axolotl
- Sequence length: 4096
- Number of epochs: 3
- Training time: 8 hours
- Hardware: 1x RTX 3090
- Training type: QLoRA
- PEFT R/A: 32/32
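For reference, these settings roughly correspond to an Axolotl config fragment like the following. This is a hypothetical sketch, not the actual config used for this run; the base-model id and dataset path are placeholders:

```yaml
load_in_4bit: true        # QLoRA: LoRA adapters trained over a 4-bit quantized base
adapter: qlora
lora_r: 32                # PEFT R/A: 32/32
lora_alpha: 32
sequence_len: 4096
num_epochs: 3
datasets:
  - path: ./skein-deduped.jsonl   # placeholder for the deduped Skein data
    type: completion
```

The small rank/alpha and 4-bit base are what make a 3-epoch run fit in 8 hours on a single RTX 3090.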

## Credits

### Holodeck-1

Thank you to Mr. Seeker and the KoboldAI team for the wonderful model Holodeck-1.

Holodeck-1 Huggingface page

### Skein text adventure data

Thank you to the KoboldAI community for curating the Skein dataset, which is pivotal to this model's capabilities.

## Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric                            | Value |
|-----------------------------------|------:|
| Avg.                              | 48.58 |
| AI2 Reasoning Challenge (25-Shot) | 53.33 |
| HellaSwag (10-Shot)               | 78.90 |
| MMLU (5-Shot)                     | 48.09 |
| TruthfulQA (0-shot)               | 37.84 |
| Winogrande (5-shot)               | 73.32 |
| GSM8k (5-shot)                    |  0.00 |