  • License: apache-2.0
  • Finetuned from model: unsloth/llama-3-8b-bnb-4bit
  • Training: a single epoch on the Alpaca dataset

import torch
from unsloth import FastLanguageModel

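max_seq_length = 2048  # maximum context length used when loading the model
dtype = None           # None lets Unsloth auto-detect (float16 or bfloat16)
load_in_4bit = True    # load 4-bit quantized weights to reduce memory use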

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

# Alpaca-style template; the three {} slots are instruction, input, and response.
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""
# Load the fine-tuned model and its tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "vincentoh/llama3-alpaca-dpo-instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

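FastLanguageModel.for_inference(model)  # switch Unsloth to inference mode
input_question = "Why is the sky blue?"
# Fill only the instruction slot; input and response stay empty for generation.
inputs = tokenizer([alpaca_prompt.format(input_question, "", "")], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
print(tokenizer.batch_decode(outputs, skip_special_tokens = True)[0])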
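
The prompt construction used above can be checked without a GPU or the model itself. The sketch below rebuilds an Alpaca-style template with three positional slots (instruction, input, response) and fills only the instruction. The constant `ALPACA_PROMPT` and the helper `build_prompt` are hypothetical names introduced for illustration; they are not part of Unsloth or of this model's code.

```python
# Alpaca-style template with three positional slots: instruction, input, response.
ALPACA_PROMPT = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{}\n\n"
    "### Input:\n{}\n\n"
    "### Response:\n{}"
)


def build_prompt(instruction: str, context: str = "") -> str:
    """Hypothetical helper: fill the template and leave the response slot empty."""
    return ALPACA_PROMPT.format(instruction, context, "")


prompt = build_prompt("Why is the sky blue?")
print(prompt)
```

Leaving the response slot empty means the text ends right after `### Response:`, so the model continues from that point when generating.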

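|                                          Model                                          |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|-----------------------------------------------------------------------------------------|------:|------:|---------:|-------:|------:|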
|[llama3-alpaca-dpo-instruct](https://huggingface.co/vincentoh/llama3-alpaca-dpo-instruct)|  30.67|  70.01|     46.56|    37.3|  46.14|

### AGIEval
|             Task             |Version| Metric |Value|   |Stderr|
|------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat              |      0|acc     |19.69|±  |  2.50|
|                              |       |acc_norm|22.83|±  |  2.64|
|agieval_logiqa_en             |      0|acc     |29.34|±  |  1.79|
|                              |       |acc_norm|32.72|±  |  1.84|
|agieval_lsat_ar               |      0|acc     |20.87|±  |  2.69|
|                              |       |acc_norm|20.43|±  |  2.66|
|agieval_lsat_lr               |      0|acc     |34.31|±  |  2.10|
|                              |       |acc_norm|31.57|±  |  2.06|
|agieval_lsat_rc               |      0|acc     |43.87|±  |  3.03|
|                              |       |acc_norm|35.32|±  |  2.92|
|agieval_sat_en                |      0|acc     |51.46|±  |  3.49|
|                              |       |acc_norm|39.81|±  |  3.42|
|agieval_sat_en_without_passage|      0|acc     |37.38|±  |  3.38|
|                              |       |acc_norm|28.16|±  |  3.14|
|agieval_sat_math              |      0|acc     |37.73|±  |  3.28|
|                              |       |acc_norm|34.55|±  |  3.21|

Average: 30.67%

### GPT4All
|    Task     |Version| Metric |Value|   |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge|      0|acc     |50.60|±  |  1.46|
|             |       |acc_norm|53.41|±  |  1.46|
|arc_easy     |      0|acc     |79.84|±  |  0.82|
|             |       |acc_norm|78.28|±  |  0.85|
|boolq        |      1|acc     |80.18|±  |  0.70|
|hellaswag    |      0|acc     |59.39|±  |  0.49|
|             |       |acc_norm|78.33|±  |  0.41|
|openbookqa   |      0|acc     |34.40|±  |  2.13|
|             |       |acc_norm|45.20|±  |  2.23|
|piqa         |      0|acc     |79.54|±  |  0.94|
|             |       |acc_norm|80.85|±  |  0.92|
|winogrande   |      0|acc     |73.80|±  |  1.24|

Average: 70.01%

### TruthfulQA
|    Task     |Version|Metric|Value|   |Stderr|
|-------------|------:|------|----:|---|-----:|
|truthfulqa_mc|      1|mc1   |30.23|±  |  1.61|
|             |       |mc2   |46.56|±  |  1.40|

Average: 46.56%

### Bigbench
|                      Task                      |Version|       Metric        |Value|   |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement                       |      0|multiple_choice_grade|54.74|±  |  3.62|
|bigbench_date_understanding                     |      0|multiple_choice_grade|67.75|±  |  2.44|
|bigbench_disambiguation_qa                      |      0|multiple_choice_grade|29.07|±  |  2.83|
|bigbench_geometric_shapes                       |      0|multiple_choice_grade|27.86|±  |  2.37|
|                                                |       |exact_str_match      | 0.00|±  |  0.00|
|bigbench_logical_deduction_five_objects         |      0|multiple_choice_grade|24.80|±  |  1.93|
|bigbench_logical_deduction_seven_objects        |      0|multiple_choice_grade|17.00|±  |  1.42|
|bigbench_logical_deduction_three_objects        |      0|multiple_choice_grade|42.33|±  |  2.86|
|bigbench_movie_recommendation                   |      0|multiple_choice_grade|30.80|±  |  2.07|
|bigbench_navigate                               |      0|multiple_choice_grade|55.60|±  |  1.57|
|bigbench_reasoning_about_colored_objects        |      0|multiple_choice_grade|54.65|±  |  1.11|
|bigbench_ruin_names                             |      0|multiple_choice_grade|32.37|±  |  2.21|
|bigbench_salient_translation_error_detection    |      0|multiple_choice_grade|28.66|±  |  1.43|
|bigbench_snarks                                 |      0|multiple_choice_grade|46.41|±  |  3.72|
|bigbench_sports_understanding                   |      0|multiple_choice_grade|55.48|±  |  1.58|
|bigbench_temporal_sequences                     |      0|multiple_choice_grade|25.30|±  |  1.38|
|bigbench_tracking_shuffled_objects_five_objects |      0|multiple_choice_grade|21.36|±  |  1.16|
|bigbench_tracking_shuffled_objects_seven_objects|      0|multiple_choice_grade|14.91|±  |  0.85|
|bigbench_tracking_shuffled_objects_three_objects|      0|multiple_choice_grade|42.33|±  |  2.86|

Average: 37.3%

Average score: 46.14%
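
The card does not state how the suite averages are computed. The sketch below recomputes them from the table values under an assumed convention, inferred from the numbers rather than documented here: `acc_norm` where it is reported and `acc` otherwise for AGIEval and GPT4All, `mc2` for TruthfulQA, and `multiple_choice_grade` for Bigbench. The overall score is then taken as the unweighted mean of the four suite averages.

```python
from statistics import mean

# Values copied from the tables above; the metric choice per suite is an assumption.
scores = {
    # acc_norm for every AGIEval task
    "AGIEval": [22.83, 32.72, 20.43, 31.57, 35.32, 39.81, 28.16, 34.55],
    # acc_norm where reported, otherwise acc (boolq, winogrande)
    "GPT4All": [53.41, 78.28, 80.18, 78.33, 45.20, 80.85, 73.80],
    # mc2 only
    "TruthfulQA": [46.56],
    # multiple_choice_grade for every Bigbench task
    "Bigbench": [54.74, 67.75, 29.07, 27.86, 24.80, 17.00, 42.33, 30.80, 55.60,
                 54.65, 32.37, 28.66, 46.41, 55.48, 25.30, 21.36, 14.91, 42.33],
}

# Unweighted mean per suite, then unweighted mean across suites.
suite_means = {name: mean(values) for name, values in scores.items()}
overall = mean(list(suite_means.values()))

for name, value in suite_means.items():
    print(f"{name}: {value:.2f}")
print(f"Average score: {overall:.2f}")
```

Under this convention the recomputed values round to the figures reported in the summary table.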
