
MBX-7B-v3-DPO

This model is a fine-tune of flemmingmiguel/MBX-7B-v3 trained on the jondurbin/truthy-dpo-v0.1 dataset.

MBX-v3-orca

Code Example

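from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("macadeliccc/MBX-7B-v3-DPO")
model = AutoModelForCausalLM.from_pretrained("macadeliccc/MBX-7B-v3-DPO")

messages = [
    {"role": "system", "content": "Respond to the user's request like a pirate"},
    {"role": "user", "content": "Can you write me a quicksort algorithm?"}
]
# Render the chat template to text, ending with the opening of the assistant turn
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
gen_input = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)

# Generate a reply and print only the newly generated tokens
output = model.generate(**gen_input, max_new_tokens=256)
print(tokenizer.decode(output[0][gen_input["input_ids"].shape[-1]:], skip_special_tokens=True))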

Example Output

[Image: example output]

GGUF

Available here

ExLlamaV2

Quants are available from bartowski; check them out here.

Download the size you want from the branches below. VRAM figures are estimates.

| Branch | Bits | lm_head bits | VRAM (4k) | VRAM (16k) | VRAM (32k) | Description |
|--------|------|--------------|-----------|------------|------------|-------------|
| 8_0 | 8.0 | 8.0 | 8.4 GB | 9.8 GB | 11.8 GB | Maximum quality that ExLlamaV2 can produce, near unquantized performance. |
| 6_5 | 6.5 | 8.0 | 7.2 GB | 8.6 GB | 10.6 GB | Very similar to 8.0, good tradeoff of size vs performance, recommended. |
| 5_0 | 5.0 | 6.0 | 6.0 GB | 7.4 GB | 9.4 GB | Slightly lower quality vs 6.5, but usable on 8GB cards. |
| 4_25 | 4.25 | 6.0 | 5.3 GB | 6.7 GB | 8.7 GB | GPTQ equivalent bits per weight, slightly higher quality. |
| 3_5 | 3.5 | 6.0 | 4.7 GB | 6.1 GB | 8.1 GB | Lower quality, only use if you have to. |
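
One way to read the table above is to pick the highest-bit branch whose estimated VRAM fits your card at the context length you need. The sketch below does only that lookup. The `VRAM_GB` dictionary and the `pick_branch` helper are illustrative names, not part of ExLlamaV2 or the quant repository, and the figures are the card's estimates, so real usage may differ.

```python
# VRAM estimates (GB) copied from the table above, keyed by branch name.
# Each entry is (bits per weight, {context column: estimated VRAM in GB}).
VRAM_GB = {
    "8_0": (8.0, {"4k": 8.4, "16k": 9.8, "32k": 11.8}),
    "6_5": (6.5, {"4k": 7.2, "16k": 8.6, "32k": 10.6}),
    "5_0": (5.0, {"4k": 6.0, "16k": 7.4, "32k": 9.4}),
    "4_25": (4.25, {"4k": 5.3, "16k": 6.7, "32k": 8.7}),
    "3_5": (3.5, {"4k": 4.7, "16k": 6.1, "32k": 8.1}),
}


def pick_branch(vram_budget_gb, context="4k"):
    """Return the highest-bit branch whose estimate fits the budget, or None."""
    fitting = [
        (bits, name)
        for name, (bits, estimates) in VRAM_GB.items()
        if estimates[context] <= vram_budget_gb
    ]
    # Tuples compare by bits first, so max() selects the highest-quality fit.
    return max(fitting)[1] if fitting else None


print(pick_branch(8.0, "4k"))
print(pick_branch(8.0, "32k"))
```

Because the table values are estimates, it may be sensible to pass a budget somewhat below the card's nominal VRAM to leave headroom for other processes.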

Evaluations

EQ-Bench Comparison

----Benchmark Complete----
2024-01-30 15:22:18
Time taken: 145.9 mins
Prompt Format: ChatML
Model: macadeliccc/MBX-7B-v3-DPO
Score (v2): 74.32
Parseable: 166.0
---------------
Batch completed
Time taken: 145.9 mins
---------------

Original Model

----Benchmark Complete----
2024-01-31 01:26:26
Time taken: 89.1 mins
Prompt Format: Mistral
Model: flemmingmiguel/MBX-7B-v3
Score (v2): 73.87
Parseable: 168.0
---------------
Batch completed
Time taken: 89.1 mins
---------------

| Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|-------|---------|---------|------------|----------|---------|
| MBX-7B-v3-DPO | 45.16 | 77.73 | 74.62 | 48.83 | 61.58 |

AGIEval

| Task | Version | Metric | Value | Stderr |
|------|---------|--------|-------|--------|
| agieval_aqua_rat | 0 | acc | 27.95 | ± 2.82 |
| | | acc_norm | 26.77 | ± 2.78 |
| agieval_logiqa_en | 0 | acc | 41.01 | ± 1.93 |
| | | acc_norm | 40.55 | ± 1.93 |
| agieval_lsat_ar | 0 | acc | 25.65 | ± 2.89 |
| | | acc_norm | 23.91 | ± 2.82 |
| agieval_lsat_lr | 0 | acc | 50.78 | ± 2.22 |
| | | acc_norm | 52.94 | ± 2.21 |
| agieval_lsat_rc | 0 | acc | 66.54 | ± 2.88 |
| | | acc_norm | 65.80 | ± 2.90 |
| agieval_sat_en | 0 | acc | 77.67 | ± 2.91 |
| | | acc_norm | 77.67 | ± 2.91 |
| agieval_sat_en_without_passage | 0 | acc | 43.20 | ± 3.46 |
| | | acc_norm | 43.20 | ± 3.46 |
| agieval_sat_math | 0 | acc | 32.27 | ± 3.16 |
| | | acc_norm | 30.45 | ± 3.11 |

Average: 45.16%

GPT4All

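| Task | Version | Metric | Value | Stderr |
|------|---------|--------|-------|--------|
| arc_challenge | 0 | acc | 68.43 | ± 1.36 |
| | | acc_norm | 68.34 | ± 1.36 |
| arc_easy | 0 | acc | 87.54 | ± 0.68 |
| | | acc_norm | 82.11 | ± 0.79 |
| boolq | 1 | acc | 88.20 | ± 0.56 |
| hellaswag | 0 | acc | 69.76 | ± 0.46 |
| | | acc_norm | 87.40 | ± 0.33 |
| openbookqa | 0 | acc | 40.20 | ± 2.19 |
| | | acc_norm | 49.60 | ± 2.24 |
| piqa | 0 | acc | 83.68 | ± 0.86 |
| | | acc_norm | 85.36 | ± 0.82 |
| winogrande | 0 | acc | 83.11 | ± 1.05 |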

Average: 77.73%
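
The card does not say how the suite average is computed. The reported 77.73% is consistent with taking `acc_norm` for each task where it is reported and falling back to `acc` otherwise (boolq and winogrande). The sketch below checks that reading against the GPT4All figures; `GPT4ALL` and `suite_average` are illustrative names, and the averaging rule is inferred from the numbers rather than confirmed by the source.

```python
# (acc, acc_norm) per task, copied from the GPT4All results above.
# None marks tasks for which no acc_norm is reported.
GPT4ALL = {
    "arc_challenge": (68.43, 68.34),
    "arc_easy": (87.54, 82.11),
    "boolq": (88.20, None),
    "hellaswag": (69.76, 87.40),
    "openbookqa": (40.20, 49.60),
    "piqa": (83.68, 85.36),
    "winogrande": (83.11, None),
}


def suite_average(results):
    """Mean over tasks, preferring acc_norm and falling back to acc."""
    scores = [acc if acc_norm is None else acc_norm
              for acc, acc_norm in results.values()]
    return sum(scores) / len(scores)


print(f"{suite_average(GPT4ALL):.2f}")  # prints 77.73
```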

TruthfulQA

| Task | Version | Metric | Value | Stderr |
|------|---------|--------|-------|--------|
| truthfulqa_mc | 1 | mc1 | 58.87 | ± 1.72 |
| | | mc2 | 74.62 | ± 1.44 |

Average: 74.62%

Bigbench

| Task | Version | Metric | Value | Stderr |
|------|---------|--------|-------|--------|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 60.00 | ± 3.56 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 63.14 | ± 2.51 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 47.67 | ± 3.12 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 22.56 | ± 2.21 |
| | | exact_str_match | 0.84 | ± 0.48 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 33.20 | ± 2.11 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 23.00 | ± 1.59 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 59.67 | ± 2.84 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 47.40 | ± 2.24 |
| bigbench_navigate | 0 | multiple_choice_grade | 56.10 | ± 1.57 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 71.25 | ± 1.01 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 56.47 | ± 2.35 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 35.27 | ± 1.51 |
| bigbench_snarks | 0 | multiple_choice_grade | 73.48 | ± 3.29 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 75.46 | ± 1.37 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 52.10 | ± 1.58 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 22.64 | ± 1.18 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 19.83 | ± 0.95 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 59.67 | ± 2.84 |

Average: 48.83%

Average score: 61.58%
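
The overall score appears to be the unweighted mean of the four suite averages (AGIEval, GPT4All, TruthfulQA, Bigbench). A quick check, using the rounded per-suite figures reported in this card; the `SUITES` name is illustrative:

```python
# Per-suite averages as reported in this card (percent).
SUITES = {
    "AGIEval": 45.16,
    "GPT4All": 77.73,
    "TruthfulQA": 74.62,
    "Bigbench": 48.83,
}

# Unweighted mean across suites; each suite counts equally
# regardless of how many tasks it contains.
overall = sum(SUITES.values()) / len(SUITES)
print(f"{overall:.3f}")
```

The result agrees with the reported 61.58% to within rounding of the per-suite inputs.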

Elapsed time: 02:37:39

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|--------|-------|
| Avg. | 76.13 |
| AI2 Reasoning Challenge (25-Shot) | 73.55 |
| HellaSwag (10-Shot) | 89.11 |
| MMLU (5-Shot) | 64.91 |
| TruthfulQA (0-shot) | 74.00 |
| Winogrande (5-shot) | 85.56 |
| GSM8k (5-shot) | 69.67 |

Model size: 7.24B params (Safetensors, tensor type FP16)