metadata

language:
  - en
license: apache-2.0
tags:
  - not-for-all-audiences
datasets:
  - Intel/orca_dpo_pairs
  - athirdpath/DPO_Pairs-Roleplay-Alpaca-NSFW
  - Open-Orca/SlimOrca
  - MinervaAI/Aesir-Preview
  - allenai/ultrafeedback_binarized_cleaned
model-index:
  - name: NEBULA-23B-v1.0
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 66.72
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=TeeZee/NEBULA-23B-v1.0
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 86.98
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=TeeZee/NEBULA-23B-v1.0
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 65.4
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=TeeZee/NEBULA-23B-v1.0
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 57.6
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=TeeZee/NEBULA-23B-v1.0
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 82.95
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=TeeZee/NEBULA-23B-v1.0
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 0
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=TeeZee/NEBULA-23B-v1.0
          name: Open LLM Leaderboard

NEBULA-23.8B-v1.0

Technical notes

108 layers, DUS procedure: Mistral (32) -> SOLAR (48) -> GALAXY (72) -> NEBULA (108)
23.8B parameters
model created as an extension of the depth up-scaling (DUS) procedure used for SOLAR by Upstage
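
The layer counts in the chain above can be reproduced with a small sketch of one DUS step. It assumes each step duplicates the layer stack, drops the last quarter of the first copy and the first quarter of the second, and concatenates the two. That quarter rule matches the published SOLAR step (32 -> 48, 8 layers dropped per copy); applying it to the GALAXY and NEBULA steps is an assumption that happens to reproduce 72 and 108, not a confirmed recipe. The function names `dus_layer_indices` and `dus_chain` are hypothetical.

```python
def dus_layer_indices(n_layers: int, drop: int) -> list[int]:
    """Source-layer indices for one depth up-scaling step.

    The first copy keeps layers [0, n_layers - drop),
    the second copy keeps layers [drop, n_layers).
    """
    if not 0 <= drop < n_layers:
        raise ValueError("drop must be in [0, n_layers)")
    first_copy = list(range(0, n_layers - drop))
    second_copy = list(range(drop, n_layers))
    return first_copy + second_copy


def dus_chain(start_layers: int, steps: int) -> list[int]:
    """Layer counts after repeated DUS steps, dropping n // 4 layers per copy (assumed ratio)."""
    sizes = [start_layers]
    for _ in range(steps):
        n = sizes[-1]
        sizes.append(len(dus_layer_indices(n, n // 4)))
    return sizes


if __name__ == "__main__":
    # mistral -> SOLAR -> GALAXY -> NEBULA
    print(dus_chain(32, 3))
```

Under these assumptions each step multiplies the depth by 1.5, which is why three steps take a 32-layer base to 108 layers.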

Results

model can and will produce NSFW content
GSM8k evaluation often seems to be broken; HellaSwag, Winogrande and TruthfulQA show that it's a smart model
RP and ERP work surprisingly well, and I haven't encountered any GPTisms yet
lower memory footprint than 20B and 23B models
follows character card very well
NSFW output feels fresh compared to existing models

Finetuning for RP

SFT using MinervaAI/Aesir-Preview dataset, 10 epochs
DPO using athirdpath/DPO_Pairs-Roleplay-Alpaca-NSFW dataset, 1 epoch
SFT using 1xAda6000, 10h
DPO using 1x3090, 30h
Jupyter notebooks and mergekit configs are available for anyone wanting to reproduce or reuse the scripts - just drop me a message

Prompt template

Alpaca
chat template is embedded in the tokenizer config and should load automatically
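
For front ends that do not read the tokenizer config, the prompt can be built by hand. The sketch below uses the commonly circulated Alpaca layout; the exact wording embedded in this model's tokenizer may differ, so treat the tokenizer's own template as authoritative. The helper name `build_alpaca_prompt` is hypothetical.

```python
from typing import Optional


def build_alpaca_prompt(instruction: str, input_text: Optional[str] = None) -> str:
    """Format a single-turn prompt in the standard Alpaca layout (assumed, not read from the model)."""
    if input_text:
        header = (
            "Below is an instruction that describes a task, paired with an input "
            "that provides further context. Write a response that appropriately "
            "completes the request.\n\n"
        )
        return (
            f"{header}### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n### Response:\n"
        )
    header = (
        "Below is an instruction that describes a task. Write a response that "
        "appropriately completes the request.\n\n"
    )
    return f"{header}### Instruction:\n{instruction}\n\n### Response:\n"


if __name__ == "__main__":
    print(build_alpaca_prompt("Describe the tavern the party has just entered."))
```

The prompt ends right after the `### Response:` line so that generation continues from there.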

Context size

4096

All comments are greatly appreciated. Download, test, and if you appreciate my work, consider buying me my fuel:

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

Metric	Value
Avg.	59.94
AI2 Reasoning Challenge (25-Shot)	66.72
HellaSwag (10-Shot)	86.98
MMLU (5-Shot)	65.40
TruthfulQA (0-shot)	57.60
Winogrande (5-shot)	82.95
GSM8k (5-shot)	0.00
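
The reported average is the plain mean of the six scores, so the 0.00 on GSM8k pulls it down noticeably. The sketch below recomputes the average from the table and, purely as an illustration, shows the mean of the other five benchmarks; that second number is not a leaderboard metric.

```python
# Scores copied from the table above.
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 66.72,
    "HellaSwag (10-Shot)": 86.98,
    "MMLU (5-Shot)": 65.40,
    "TruthfulQA (0-shot)": 57.60,
    "Winogrande (5-shot)": 82.95,
    "GSM8k (5-shot)": 0.00,
}

# Leaderboard-style average over all six benchmarks.
average = round(sum(scores.values()) / len(scores), 2)

# Illustrative only: mean of the five benchmarks other than GSM8k.
others = [value for name, value in scores.items() if not name.startswith("GSM8k")]
average_without_gsm8k = round(sum(others) / len(others), 2)

if __name__ == "__main__":
    print(average)                # 59.94
    print(average_without_gsm8k)  # 71.93
```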