asharsha30/LLAMA_Harsha_8_B_ORDP_10k
This model is a fine-tune of NousResearch/Meta-Llama-3-8B, trained for 12,000 steps on mlabonne/orpo-dpo-mix-40k.
💻 Usage
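# Use a pipeline as a high-level helper
from transformers import pipeline

# Chat-style input: a list of messages with "role" and "content" keys
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="asharsha30/LLAMA_Harsha_8_B_ORDP_10k")
# Print the result so it is visible when run as a script
print(pipe(messages))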
📈Training And Evaluation Report:
Reports from Wandb
Acknowledgment:
Huge thanks to Maxime Labonne for his brilliant blog post on fine-tuning Llama models with SFT and ORPO.
Evaluated Using:
The model was evaluated with llm-autoeval (https://github.com/mlabonne/llm-autoeval). The results below are summarized from the generated gist: https://gist.github.com/asharsha30-1996/4162fc98d9669aab3080645c54905bd0
Accuracy on the Nous benchmarks:
Model | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
---|---|---|---|---|---|
LLAMA_Harsha_8_B_ORDP_10k | 35.54 | 71.15 | 55.39 | 37.96 | 50.01 |
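As a quick sanity check, the Average column appears to be the plain arithmetic mean of the four suite scores. The sketch below assumes an unweighted mean; the variable names are illustrative and are not part of llm-autoeval.

```python
# Suite-level scores copied from the summary table above.
suite_scores = {
    "AGIEval": 35.54,
    "GPT4All": 71.15,
    "TruthfulQA": 55.39,
    "Bigbench": 37.96,
}

# Assumption: the overall score is an unweighted mean of the four suites.
average = sum(suite_scores.values()) / len(suite_scores)
print(f"Average: {average:.2f}")  # Average: 50.01
```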
AGIEval
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| agieval_aqua_rat | 0 | acc | 26.77 | ± | 2.78 |
| | | acc_norm | 27.17 | ± | 2.80 |
| agieval_logiqa_en | 0 | acc | 31.34 | ± | 1.82 |
| | | acc_norm | 33.03 | ± | 1.84 |
| agieval_lsat_ar | 0 | acc | 18.70 | ± | 2.58 |
| | | acc_norm | 19.57 | ± | 2.62 |
| agieval_lsat_lr | 0 | acc | 42.94 | ± | 2.19 |
| | | acc_norm | 35.10 | ± | 2.12 |
| agieval_lsat_rc | 0 | acc | 52.42 | ± | 3.05 |
| | | acc_norm | 43.87 | ± | 3.03 |
| agieval_sat_en | 0 | acc | 65.53 | ± | 3.32 |
| | | acc_norm | 54.37 | ± | 3.48 |
| agieval_sat_en_without_passage | 0 | acc | 41.75 | ± | 3.44 |
| | | acc_norm | 33.98 | ± | 3.31 |
| agieval_sat_math | 0 | acc | 42.27 | ± | 3.34 |
| | | acc_norm | 37.27 | ± | 3.27 |
Average: 35.54%
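The reported AGIEval average is consistent with averaging only the acc_norm values of the eight tasks (averaging acc instead would give roughly 40.2). The sketch below reproduces that under this assumption; it is inferred from the numbers, not taken from the llm-autoeval source.

```python
from statistics import mean

# acc_norm values copied from the AGIEval table above.
agieval_acc_norm = {
    "agieval_aqua_rat": 27.17,
    "agieval_logiqa_en": 33.03,
    "agieval_lsat_ar": 19.57,
    "agieval_lsat_lr": 35.10,
    "agieval_lsat_rc": 43.87,
    "agieval_sat_en": 54.37,
    "agieval_sat_en_without_passage": 33.98,
    "agieval_sat_math": 37.27,
}

# Assumption: the suite score is the unweighted mean of acc_norm.
agieval_average = mean(agieval_acc_norm.values())
print(f"AGIEval average: {agieval_average:.3f}")
```

The mean comes to about 35.545, which matches the reported 35.54% up to rounding.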
GPT4All
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| arc_challenge | 0 | acc | 49.91 | ± | 1.46 |
| | | acc_norm | 54.10 | ± | 1.46 |
| arc_easy | 0 | acc | 80.47 | ± | 0.81 |
| | | acc_norm | 80.05 | ± | 0.82 |
| boolq | 1 | acc | 82.08 | ± | 0.67 |
| hellaswag | 0 | acc | 61.08 | ± | 0.49 |
| | | acc_norm | 80.26 | ± | 0.40 |
| openbookqa | 0 | acc | 34.00 | ± | 2.12 |
| | | acc_norm | 45.00 | ± | 2.23 |
| piqa | 0 | acc | 79.71 | ± | 0.94 |
| | | acc_norm | 81.61 | ± | 0.90 |
| winogrande | 0 | acc | 74.98 | ± | 1.22 |
Average: 71.15%
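For GPT4All, two tasks (boolq and winogrande) report only acc. The reported average is consistent with taking acc_norm where it exists and falling back to acc otherwise. The sketch below assumes that rule; it is inferred from the numbers, not confirmed from the evaluation code.

```python
# Metrics copied from the GPT4All table above.
gpt4all = {
    "arc_challenge": {"acc": 49.91, "acc_norm": 54.10},
    "arc_easy": {"acc": 80.47, "acc_norm": 80.05},
    "boolq": {"acc": 82.08},
    "hellaswag": {"acc": 61.08, "acc_norm": 80.26},
    "openbookqa": {"acc": 34.00, "acc_norm": 45.00},
    "piqa": {"acc": 79.71, "acc_norm": 81.61},
    "winogrande": {"acc": 74.98},
}

# Assumption: prefer acc_norm, fall back to acc when it is missing.
per_task = [metrics.get("acc_norm", metrics["acc"]) for metrics in gpt4all.values()]
gpt4all_average = sum(per_task) / len(per_task)
print(f"GPT4All average: {gpt4all_average:.2f}")  # GPT4All average: 71.15
```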
TruthfulQA
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| truthfulqa_mc | 1 | mc1 | 37.45 | ± | 1.69 |
| | | mc2 | 55.39 | ± | 1.50 |
Average: 55.39%
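The Stderr column can be turned into an approximate confidence interval. The sketch below uses the mc2 score and assumes a normal approximation (z = 1.96 for 95%); the tables report only the standard error, so the interval is an illustration rather than a reported result, and `confidence_interval` is a hypothetical helper.

```python
def confidence_interval(value, stderr, z=1.96):
    """Return (low, high) under a normal approximation: value +/- z * stderr."""
    half_width = z * stderr
    return value - half_width, value + half_width

# mc2 value and standard error copied from the TruthfulQA table above.
low, high = confidence_interval(55.39, 1.50)
print(f"mc2 95% CI: [{low:.2f}, {high:.2f}]")  # mc2 95% CI: [52.45, 58.33]
```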
Bigbench
| Task | Version | Metric | Value | | Stderr |
|---|---|---|---|---|---|
| bigbench_causal_judgement | 0 | multiple_choice_grade | 57.37 | ± | 3.60 |
| bigbench_date_understanding | 0 | multiple_choice_grade | 68.02 | ± | 2.43 |
| bigbench_disambiguation_qa | 0 | multiple_choice_grade | 31.01 | ± | 2.89 |
| bigbench_geometric_shapes | 0 | multiple_choice_grade | 20.89 | ± | 2.15 |
| | | exact_str_match | 0.00 | ± | 0.00 |
| bigbench_logical_deduction_five_objects | 0 | multiple_choice_grade | 28.40 | ± | 2.02 |
| bigbench_logical_deduction_seven_objects | 0 | multiple_choice_grade | 20.71 | ± | 1.53 |
| bigbench_logical_deduction_three_objects | 0 | multiple_choice_grade | 48.67 | ± | 2.89 |
| bigbench_movie_recommendation | 0 | multiple_choice_grade | 31.60 | ± | 2.08 |
| bigbench_navigate | 0 | multiple_choice_grade | 50.60 | ± | 1.58 |
| bigbench_reasoning_about_colored_objects | 0 | multiple_choice_grade | 63.25 | ± | 1.08 |
| bigbench_ruin_names | 0 | multiple_choice_grade | 34.38 | ± | 2.25 |
| bigbench_salient_translation_error_detection | 0 | multiple_choice_grade | 21.84 | ± | 1.31 |
| bigbench_snarks | 0 | multiple_choice_grade | 44.20 | ± | 3.70 |
| bigbench_sports_understanding | 0 | multiple_choice_grade | 50.30 | ± | 1.59 |
| bigbench_temporal_sequences | 0 | multiple_choice_grade | 26.30 | ± | 1.39 |
| bigbench_tracking_shuffled_objects_five_objects | 0 | multiple_choice_grade | 21.36 | ± | 1.16 |
| bigbench_tracking_shuffled_objects_seven_objects | 0 | multiple_choice_grade | 15.77 | ± | 0.87 |
| bigbench_tracking_shuffled_objects_three_objects | 0 | multiple_choice_grade | 48.67 | ± | 2.89 |
Average: 37.96%
Average score: 50.01%
Elapsed time: 02:36:38
Evaluation results
- strict accuracy on IFEval (0-shot), Open LLM Leaderboard: 34.64
- normalized accuracy on BBH (3-shot), Open LLM Leaderboard: 25.73
- exact match on MATH Lvl 5 (4-shot), Open LLM Leaderboard: 5.21
- acc_norm on GPQA (0-shot), Open LLM Leaderboard: 3.13
- acc_norm on MuSR (0-shot), Open LLM Leaderboard: 7.07
- accuracy on MMLU-PRO (5-shot, test set), Open LLM Leaderboard: 20.11