---
language:
  - en
license: apache-2.0
library_name: peft
tags:
  - NeurIPS
  - NeurIPS LLM Efficiency Challenge
  - NeurIPS LLM Efficiency Challenge Winner Model
  - Team Upaya
datasets:
  - upaya07/NeurIPS-LLM-data
base_model: mistralai/Mistral-7B-v0.1
model-index:
  - name: Birbal-7B-V1
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 62.88
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=upaya07/Birbal-7B-V1
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 84.88
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=upaya07/Birbal-7B-V1
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 63.71
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=upaya07/Birbal-7B-V1
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 45.46
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=upaya07/Birbal-7B-V1
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 78.53
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=upaya07/Birbal-7B-V1
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 41.47
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=upaya07/Birbal-7B-V1
          name: Open LLM Leaderboard
---

Model Card for Birbal-7B-V1

Model Details

Birbal-7B-V1 is fine-tuned on our curated dataset of about 200k examples for nearly 3 epochs. Our dataset preparation approach focuses on finding the most relevant examples from a large pool of tasks spanning NLP, math, commonsense reasoning, etc. Hence, we expect the model to perform well on a variety of tasks, including unseen ones.

Uses

Birbal-7B-V1 is trained with the following prompt format:

```
## Instruction:
<instruction>

## Input:
<input>

## Response:
<response>
```

If a record does not contain any instruction, here is the training format:

```
## Input:
<input>

## Response:
<response>
```

The model will perform best if queried in this same format.

Downstream Use

Birbal-7B-V1 is fine-tuned on our curated dataset, which contains examples from a large number of tasks spanning NLP, math, QA, etc. Hence, we expect the model to perform well in general across a wide variety of tasks.

How to Get Started with the Model

It is quite easy! Merge the Birbal-7B-V1 PEFT adapter with the Mistral-7B base model and start running inference.

Training Details

We used Mistral-7B as a base model and fine-tuned it on a single RTX 4090 GPU for 24 hours as per the competition rules. Fine-tuning was performed using 4-bit QLoRA.

Training Data

Here is a high-level diagram of our data preparation strategy: [data preparation diagram]

Please visit https://huggingface.co/datasets/upaya07/NeurIPS-LLM-data for more details.

Training Hyperparameters

Refer to https://github.com/Upaya07/NeurIPS-llm-efficiency-challenge/blob/main/training/axolotl/examples/mistral/nips/nips_02.yml for an example set of hyperparameters used.

Evaluation

Results

| Task | Score |
|---|---|
| MMLU - EM | 0.629 |
| MMLU - EM (Robustness) | 0.591 |
| MMLU - EM (Fairness) | 0.596 |
| MMLU Mean Win Rate | 0.417 |
| TruthfulQA - EM | 0.59 |
| TruthfulQA - EM (Robustness) | 0.541 |
| TruthfulQA - EM (Fairness) | 0.492 |
| TruthfulQA Mean Win Rate | 0.75 |
| BIG-bench - EM | 0.330 |
| BIG-bench Mean Win Rate | 0.75 |
| GSM8K - EM | 0.443 |
| GSM8K Mean Win Rate | 0.625 |
| BBQ - EM | 0.738 |
| BBQ Mean Win Rate | 0.25 |
| sam_sum - ROUGE-2 | 0.127 |
| sam_sum - Stereotypes (race) | 0.667 |
| sam_sum - Stereotypes (gender) | 0.447 |
| sam_sum - Representation (race) | 0.458 |
| sam_sum - Representation (gender) | 0.013 |
| sam_sum Mean Win Rate | 0.383 |
| corr2cause - EM | 0.615 |
| corr2cause Mean Win Rate | 0.875 |
| MATH (chain-of-thoughts) - Equivalent (chain of thought) | 0.121 |
| MATH Mean Win Rate | 0.75 |
| ethics_justice - EM | 0.68 |
| ethics_justice - EM (Robustness) | 0.645 |
| ethics_justice - EM (Fairness) | 0.62 |
| ethics_commonsense - EM | 0.41 |
| ethics_commonsense - EM (Robustness) | 0.33 |
| ethics_commonsense - EM (Fairness) | 0.345 |
| ethics_virtue - EM | 0.895 |
| ethics_virtue - EM (Robustness) | 0.865 |
| ethics_virtue - EM (Fairness) | 0.86 |
| ethics_deontology - EM | 0.63 |
| ethics_deontology - EM (Robustness) | 0.585 |
| ethics_deontology - EM (Fairness) | 0.595 |
| ethics_utilitarianism - EM | 0.72 |
| ethics_utilitarianism - EM (Robustness) | 0.6 |
| ethics_utilitarianism - EM (Fairness) | 0.645 |
| ethics Mean Win Rate | 0.55 |
| 🔥 Score_full | 0.579 |
| 🔥 Score_open | 0.516 |
| 🔥 Score_hidden | 0.61 |

Top-5 Teams

| Position | Score |
|---|---|
| 5th rank | 0.362 |
| 4th rank | 0.371 |
| 3rd rank | 0.381 |
| 2nd rank | 0.424 |
| 🔥 Ours (1st) | 0.579 |

Training procedure

The following bitsandbytes quantization config was used during training:

  • quant_method: bitsandbytes
  • load_in_8bit: False
  • load_in_4bit: True
  • llm_int8_threshold: 6.0
  • llm_int8_skip_modules: None
  • llm_int8_enable_fp32_cpu_offload: False
  • llm_int8_has_fp16_weight: False
  • bnb_4bit_quant_type: nf4
  • bnb_4bit_use_double_quant: True
  • bnb_4bit_compute_dtype: bfloat16

Framework versions

  • PEFT 0.6.1

Open LLM Leaderboard Evaluation Results

Detailed results can be found on the Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=upaya07/Birbal-7B-V1

| Metric | Value |
|---|---|
| Avg. | 62.82 |
| AI2 Reasoning Challenge (25-Shot) | 62.88 |
| HellaSwag (10-Shot) | 84.88 |
| MMLU (5-Shot) | 63.71 |
| TruthfulQA (0-shot) | 45.46 |
| Winogrande (5-shot) | 78.53 |
| GSM8k (5-shot) | 41.47 |