---
language:
  - en
license: apache-2.0
library_name: peft
tags:
  - NeurIPS
  - NeurIPS LLM Efficiency Challenge
  - NeurIPS LLM Efficiency Challenge Winner Model
  - Team Upaya
datasets:
  - upaya07/NeurIPS-LLM-data
base_model: mistralai/Mistral-7B-v0.1
model-index:
  - name: Birbal-7B-V1
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 62.88
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=upaya07/Birbal-7B-V1
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 84.88
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=upaya07/Birbal-7B-V1
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 63.71
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=upaya07/Birbal-7B-V1
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 45.46
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=upaya07/Birbal-7B-V1
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 78.53
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=upaya07/Birbal-7B-V1
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 41.47
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=upaya07/Birbal-7B-V1
          name: Open LLM Leaderboard
---

Model Card for Birbal-7B-V1

Model Details

Birbal-7B-V1 is fine-tuned on our curated dataset of about 200k examples for nearly 3 epochs. Our dataset preparation approach focuses on finding the most relevant examples from a large pool of tasks spanning NLP, math, commonsense reasoning, etc. Hence, we expect the model to perform well on a variety of tasks, including unseen ones.

Uses

Birbal-7B-V1 is trained with the following prompt format:

```
## Instruction:
<instruction>

## Input:
<input>

## Response:
<response>
```

If a record does not contain any instruction, here is the training format:

```
## Input:
<input>

## Response:
<response>
```

The model will perform best if queried in this same format.

Downstream Use

Birbal-7B-V1 is fine-tuned on our curated dataset, which contains examples from a large number of tasks spanning NLP, math, QA, etc. Hence, we expect the model to perform well in general across a wide variety of tasks.

How to Get Started with the Model

It is quite easy! Merge the Birbal-7B-V1 PEFT adapter with the Mistral-7B base model and start running inference.

Training Details

We used Mistral-7B as a base model and fine-tuned it on a single RTX 4090 GPU for 24 hours as per the competition rules. Fine-tuning was performed using 4-bit QLoRA.

Training Data

Here is a high-level diagram of our data preparation strategy: [data preparation diagram]

Please visit https://huggingface.co/datasets/upaya07/NeurIPS-LLM-data for more details.

Training Hyperparameters

Refer to https://github.com/Upaya07/NeurIPS-llm-efficiency-challenge/blob/main/training/axolotl/examples/mistral/nips/nips_02.yml for an example set of hyperparameters used.

Evaluation

Results

| Task | Score |
|---|---|
| MMLU - EM | 0.629 |
| MMLU - EM (Robustness) | 0.591 |
| MMLU - EM (Fairness) | 0.596 |
| MMLU Mean Win Rate | 0.417 |
| TruthfulQA - EM | 0.59 |
| TruthfulQA - EM (Robustness) | 0.541 |
| TruthfulQA - EM (Fairness) | 0.492 |
| TruthfulQA Mean Win Rate | 0.75 |
| BIG-bench - EM | 0.330 |
| BIG-bench Mean Win Rate | 0.75 |
| GSM8K - EM | 0.443 |
| GSM8K Mean Win Rate | 0.625 |
| BBQ - EM | 0.738 |
| BBQ Mean Win Rate | 0.25 |
| sam_sum - ROUGE-2 | 0.127 |
| sam_sum - Stereotypes (race) | 0.667 |
| sam_sum - Stereotypes (gender) | 0.447 |
| sam_sum - Representation (race) | 0.458 |
| sam_sum - Representation (gender) | 0.013 |
| sam_sum Mean Win Rate | 0.383 |
| corr2cause - EM | 0.615 |
| corr2cause Mean Win Rate | 0.875 |
| MATH (chain-of-thoughts) - Equivalent (chain of thought) | 0.121 |
| MATH Mean Win Rate | 0.75 |
| ethics_justice - EM | 0.68 |
| ethics_justice - EM (Robustness) | 0.645 |
| ethics_justice - EM (Fairness) | 0.62 |
| ethics_commonsense - EM | 0.41 |
| ethics_commonsense - EM (Robustness) | 0.33 |
| ethics_commonsense - EM (Fairness) | 0.345 |
| ethics_virtue - EM | 0.895 |
| ethics_virtue - EM (Robustness) | 0.865 |
| ethics_virtue - EM (Fairness) | 0.86 |
| ethics_deontology - EM | 0.63 |
| ethics_deontology - EM (Robustness) | 0.585 |
| ethics_deontology - EM (Fairness) | 0.595 |
| ethics_utilitarianism - EM | 0.72 |
| ethics_utilitarianism - EM (Robustness) | 0.6 |
| ethics_utilitarianism - EM (Fairness) | 0.645 |
| ethics Mean Win Rate | 0.55 |
| 🔥 Score_full | 0.579 |
| 🔥 Score_open | 0.516 |
| 🔥 Score_hidden | 0.61 |

Top-5 Teams

| Position | Score |
|---|---|
| 5th rank | 0.362 |
| 4th rank | 0.371 |
| 3rd rank | 0.381 |
| 2nd rank | 0.424 |
| 🔥 Ours (1st) | 0.579 |

Training procedure

The following bitsandbytes quantization config was used during training:

  • quant_method: bitsandbytes
  • load_in_8bit: False
  • load_in_4bit: True
  • llm_int8_threshold: 6.0
  • llm_int8_skip_modules: None
  • llm_int8_enable_fp32_cpu_offload: False
  • llm_int8_has_fp16_weight: False
  • bnb_4bit_quant_type: nf4
  • bnb_4bit_use_double_quant: True
  • bnb_4bit_compute_dtype: bfloat16

Framework versions

  • PEFT 0.6.1

Open LLM Leaderboard Evaluation Results

Detailed results can be found on the Open LLM Leaderboard: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=upaya07/Birbal-7B-V1

| Metric | Value |
|---|---|
| Avg. | 62.82 |
| AI2 Reasoning Challenge (25-Shot) | 62.88 |
| HellaSwag (10-Shot) | 84.88 |
| MMLU (5-Shot) | 63.71 |
| TruthfulQA (0-shot) | 45.46 |
| Winogrande (5-shot) | 78.53 |
| GSM8k (5-shot) | 41.47 |