Model Card for Birbal-7B-V1

[Badges: Code License · Model Weight License · Python 3.9+]

Model Details

Birbal-7B-V1 is fine-tuned on our curated dataset of ~200k examples for nearly 3 epochs. Our dataset preparation approach focuses on finding the most relevant examples from a large pool of tasks spanning NLP, Maths, Commonsense, etc. Hence, we expect the model to perform well on a variety of tasks, including unseen ones.


Uses

Birbal-7B-V1 is trained with the following format:

## Instruction:
<instruction>

## Input:
<input>

## Response:
<response>

If a record does not contain any instruction, here is the training format:

## Input:
<input>

## Response:
<response>

The model performs best when queried in the same format.
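
For illustration, here is a small helper that assembles prompts in this format (a sketch; the helper name and signature are our own, not part of the released code):

```python
from typing import Optional

def build_prompt(input_text: str, instruction: Optional[str] = None) -> str:
    """Assemble a Birbal-7B-V1 prompt; the Instruction block is omitted when no instruction is given."""
    if instruction:
        return f"## Instruction:\n{instruction}\n\n## Input:\n{input_text}\n\n## Response:\n"
    return f"## Input:\n{input_text}\n\n## Response:\n"
```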

Downstream Use

Birbal-7B-V1 is fine-tuned on our curated dataset, which contains examples from a large number of tasks spanning NLP, Maths, QA, etc. Hence, we expect the model to perform well across a wide variety of tasks.

How to Get Started with the Model

It is quite easy: merge the Birbal-7B-V1 PEFT adapter with the Mistral-7B base model and start running inference, as sketched below.
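
A minimal inference sketch with transformers and peft (the base-model repository ID and generation settings are assumptions, not prescribed by this card):

```python
# Sketch: load Mistral-7B, attach the Birbal-7B-V1 adapter, merge, and generate.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "mistralai/Mistral-7B-v0.1"  # assumed base-model repo ID
tokenizer = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto", device_map="auto")

model = PeftModel.from_pretrained(base, "upaya07/Birbal-7B-V1")
model = model.merge_and_unload()  # fold the LoRA weights into the base model

prompt = "## Instruction:\nSummarize the following text.\n\n## Input:\n<your text>\n\n## Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```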

Training Details

We used Mistral-7B as a base model and fine-tuned it on a single RTX 4090 GPU for 24 hours as per the competition rules. Fine-tuning was performed using 4-bit QLoRA.
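
For orientation, here is a minimal sketch of attaching a 4-bit QLoRA adapter with peft (the LoRA hyperparameters and target modules below are placeholders, not the values used for Birbal-7B-V1; the actual bitsandbytes settings are listed under Training procedure):

```python
# Illustrative QLoRA setup; LoRA hyperparameters here are placeholders.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",                        # assumed base-model repo ID
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)          # prepare quantized model for adapter training

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,             # placeholder values
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()                      # only the LoRA adapters are trainable
```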

Training Data

Here is a high-level diagram of our data preparation strategy: [figure: data preparation pipeline]

Please visit https://huggingface.co/datasets/upaya07/NeurIPS-LLM-data for more details.

Training Hyperparameters

Refer to https://github.com/Upaya07/NeurIPS-llm-efficiency-challenge/blob/main/training/axolotl/examples/mistral/nips/nips_02.yml for an example set of the hyperparameters used.

Evaluation

Results

| Task | Score |
|---|---|
| MMLU - EM | 0.629 |
| MMLU - EM (Robustness) | 0.591 |
| MMLU - EM (Fairness) | 0.596 |
| MMLU Mean Win Rate | 0.417 |
| TruthfulQA - EM | 0.59 |
| TruthfulQA - EM (Robustness) | 0.541 |
| TruthfulQA - EM (Fairness) | 0.492 |
| TruthfulQA Mean Win Rate | 0.75 |
| BIG-bench - EM | 0.330 |
| BIG-bench Mean Win Rate | 0.75 |
| GSM8K - EM | 0.443 |
| GSM8K Mean Win Rate | 0.625 |
| BBQ - EM | 0.738 |
| BBQ Mean Win Rate | 0.25 |
| sam_sum - ROUGE-2 | 0.127 |
| sam_sum - Stereotypes (race) | 0.667 |
| sam_sum - Stereotypes (gender) | 0.447 |
| sam_sum - Representation (race) | 0.458 |
| sam_sum - Representation (gender) | 0.013 |
| sam_sum Mean Win Rate | 0.383 |
| corr2cause - EM | 0.615 |
| corr2cause Mean Win Rate | 0.875 |
| MATH (chain-of-thoughts) - Equivalent (chain of thought) | 0.121 |
| MATH Mean Win Rate | 0.75 |
| ethics_justice - EM | 0.68 |
| ethics_justice - EM (Robustness) | 0.645 |
| ethics_justice - EM (Fairness) | 0.62 |
| ethics_commonsense - EM | 0.41 |
| ethics_commonsense - EM (Robustness) | 0.33 |
| ethics_commonsense - EM (Fairness) | 0.345 |
| ethics_virtue - EM | 0.895 |
| ethics_virtue - EM (Robustness) | 0.865 |
| ethics_virtue - EM (Fairness) | 0.86 |
| ethics_deontology - EM | 0.63 |
| ethics_deontology - EM (Robustness) | 0.585 |
| ethics_deontology - EM (Fairness) | 0.595 |
| ethics_utilitarianism - EM | 0.72 |
| ethics_utilitarianism - EM (Robustness) | 0.6 |
| ethics_utilitarianism - EM (Fairness) | 0.645 |
| ethics Mean Win Rate | 0.55 |
| 🔥 Score_full | 0.579 |
| 🔥 Score_open | 0.516 |
| 🔥 Score_hidden | 0.61 |

Top-5 Teams

| Position | Score |
|---|---|
| 5th rank | 0.362 |
| 4th rank | 0.371 |
| 3rd rank | 0.381 |
| 2nd rank | 0.424 |
| 🔥 Ours (1st) | 0.579 |

Training procedure

The following bitsandbytes quantization config was used during training:

  • quant_method: bitsandbytes
  • load_in_8bit: False
  • load_in_4bit: True
  • llm_int8_threshold: 6.0
  • llm_int8_skip_modules: None
  • llm_int8_enable_fp32_cpu_offload: False
  • llm_int8_has_fp16_weight: False
  • bnb_4bit_quant_type: nf4
  • bnb_4bit_use_double_quant: True
  • bnb_4bit_compute_dtype: bfloat16
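
Expressed as a transformers BitsAndBytesConfig, this corresponds roughly to the following (a sketch mirroring the list above):

```python
# Equivalent 4-bit NF4 configuration with double quantization and bfloat16 compute.
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```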

Framework versions

  • PEFT 0.6.1

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---|
| Avg. | 62.82 |
| AI2 Reasoning Challenge (25-Shot) | 62.88 |
| HellaSwag (10-Shot) | 84.88 |
| MMLU (5-Shot) | 63.71 |
| TruthfulQA (0-shot) | 45.46 |
| Winogrande (5-shot) | 78.53 |
| GSM8k (5-shot) | 41.47 |