---
license: apache-2.0
library_name: transformers
---

# Laser-Dolphin-Mixtral-2x7b-dpo

(Image: laser_dolphin_image)

A new version will be uploaded soon.

Credit to Fernando Fernandes and Eric Hartford for their project laserRMT

This model is a medium-sized MoE implementation based on cognitivecomputations/dolphin-2.6-mistral-7b-dpo-laser

A 2x7b configuration offers better performance than a standard 7b model, even when loaded in 4-bit (roughly 9 GB of VRAM).

When this 2x7b model is loaded in 4-bit, its HellaSwag score is 0.8270, which is higher than the base model achieves on its own in full precision.
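
For reference, here is a minimal sketch of 4-bit loading with an explicit `BitsAndBytesConfig`. This is generic transformers/bitsandbytes usage, not necessarily the exact setup behind the numbers above:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "macadeliccc/laser-dolphin-mixtral-2x7b-dpo"

# NF4 4-bit quantization; requires the bitsandbytes and accelerate packages
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
```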

The process is outlined in this notebook

These quants will result in unpredictable behavior; I am working on new quants since I have updated the model.

Quantizations provided by TheBloke

## Code Example

Switch between the commented and uncommented model definitions below to load in 4-bit. In 4-bit the model should work with roughly 9 GB of VRAM and still exceed the single 7B model by roughly 5-6 points.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate_response(prompt):
    """
    Generate a response from the model based on the input prompt.

    Args:
        prompt (str): Prompt for the model.

    Returns:
        str: The generated response from the model.
    """
    # Tokenize the input prompt
    inputs = tokenizer(prompt, return_tensors="pt")

    # Generate output tokens
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

    # Decode the generated tokens to a string
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return response

# Load the model and tokenizer
model_id = "macadeliccc/laser-dolphin-mixtral-2x7b-dpo"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# model = AutoModelForCausalLM.from_pretrained(model_id)  # full precision
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)  # 4-bit (requires bitsandbytes)

prompt = "Write a quicksort algorithm in python"

# Generate and print the response
print("Response:")
print(generate_response(prompt), "\n")
```
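
As a usage note, the dolphin-2.6 base model is prompted in ChatML format. The sketch below assumes the tokenizer ships a chat template; if it does not, build the ChatML string by hand:

```python
# Hypothetical usage: format a ChatML conversation, assuming the tokenizer has a chat template
messages = [
    {"role": "system", "content": "You are Dolphin, a helpful AI assistant."},
    {"role": "user", "content": "Write a quicksort algorithm in python"},
]
chat_prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(generate_response(chat_prompt))
```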

colab with usage example

## Eval

evaluation colab
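
The scores below come from the linked colab. As a rough sketch, a comparable single-task run could also be scripted with the lm-evaluation-harness Python API (assuming lm-eval >= 0.4 and its `simple_evaluate` entry point; task grouping and settings may differ from the colab, so numbers will not match exactly):

```python
import lm_eval

# Sketch: score the 4-bit model on one of the tasks reported below
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=macadeliccc/laser-dolphin-mixtral-2x7b-dpo,load_in_4bit=True",
    tasks=["hellaswag"],
)
print(results["results"]["hellaswag"])
```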

| Model                          | AGIEval | GPT4All | TruthfulQA | Bigbench | Average |
|--------------------------------|--------:|--------:|-----------:|---------:|--------:|
| laser-dolphin-mixtral-2x7b-dpo |   41.31 |   73.67 |      61.69 |    42.79 |   54.87 |

### AGIEval

| Task                           | Version | Metric   | Value |   | Stderr |
|--------------------------------|--------:|----------|------:|---|-------:|
| agieval_aqua_rat               |       0 | acc      | 22.44 | ± |   2.62 |
|                                |         | acc_norm | 21.26 | ± |   2.57 |
| agieval_logiqa_en              |       0 | acc      | 34.87 | ± |   1.87 |
|                                |         | acc_norm | 35.79 | ± |   1.88 |
| agieval_lsat_ar                |       0 | acc      | 22.17 | ± |   2.75 |
|                                |         | acc_norm | 23.04 | ± |   2.78 |
| agieval_lsat_lr                |       0 | acc      | 43.14 | ± |   2.20 |
|                                |         | acc_norm | 45.10 | ± |   2.21 |
| agieval_lsat_rc                |       0 | acc      | 57.25 | ± |   3.02 |
|                                |         | acc_norm | 55.76 | ± |   3.03 |
| agieval_sat_en                 |       0 | acc      | 71.84 | ± |   3.14 |
|                                |         | acc_norm | 71.84 | ± |   3.14 |
| agieval_sat_en_without_passage |       0 | acc      | 44.17 | ± |   3.47 |
|                                |         | acc_norm | 41.75 | ± |   3.44 |
| agieval_sat_math               |       0 | acc      | 40.91 | ± |   3.32 |
|                                |         | acc_norm | 35.91 | ± |   3.24 |

Average: 41.31%

### GPT4All

| Task          | Version | Metric   | Value |   | Stderr |
|---------------|--------:|----------|------:|---|-------:|
| arc_challenge |       0 | acc      | 58.02 | ± |   1.44 |
|               |         | acc_norm | 60.58 | ± |   1.43 |
| arc_easy      |       0 | acc      | 85.48 | ± |   0.72 |
|               |         | acc_norm | 82.62 | ± |   0.78 |
| boolq         |       1 | acc      | 87.16 | ± |   0.59 |
| hellaswag     |       0 | acc      | 65.04 | ± |   0.48 |
|               |         | acc_norm | 83.63 | ± |   0.37 |
| openbookqa    |       0 | acc      | 35.60 | ± |   2.14 |
|               |         | acc_norm | 45.00 | ± |   2.23 |
| piqa          |       0 | acc      | 81.99 | ± |   0.90 |
|               |         | acc_norm | 83.51 | ± |   0.87 |
| winogrande    |       0 | acc      | 73.16 | ± |   1.25 |

Average: 73.67%

### TruthfulQA

| Task          | Version | Metric | Value |   | Stderr |
|---------------|--------:|--------|------:|---|-------:|
| truthfulqa_mc |       1 | mc1    | 44.31 | ± |   1.74 |
|               |         | mc2    | 61.69 | ± |   1.50 |

Average: 61.69%

### Bigbench

| Task                                              | Version | Metric                | Value |   | Stderr |
|---------------------------------------------------|--------:|-----------------------|------:|---|-------:|
| bigbench_causal_judgement                         |       0 | multiple_choice_grade | 59.47 | ± |   3.57 |
| bigbench_date_understanding                       |       0 | multiple_choice_grade | 66.67 | ± |   2.46 |
| bigbench_disambiguation_qa                        |       0 | multiple_choice_grade | 36.05 | ± |   3.00 |
| bigbench_geometric_shapes                         |       0 | multiple_choice_grade | 20.33 | ± |   2.13 |
|                                                   |         | exact_str_match       |  7.52 | ± |   1.39 |
| bigbench_logical_deduction_five_objects           |       0 | multiple_choice_grade | 27.80 | ± |   2.01 |
| bigbench_logical_deduction_seven_objects          |       0 | multiple_choice_grade | 19.86 | ± |   1.51 |
| bigbench_logical_deduction_three_objects          |       0 | multiple_choice_grade | 48.67 | ± |   2.89 |
| bigbench_movie_recommendation                     |       0 | multiple_choice_grade | 49.60 | ± |   2.24 |
| bigbench_navigate                                 |       0 | multiple_choice_grade | 53.20 | ± |   1.58 |
| bigbench_reasoning_about_colored_objects          |       0 | multiple_choice_grade | 68.50 | ± |   1.04 |
| bigbench_ruin_names                               |       0 | multiple_choice_grade | 41.74 | ± |   2.33 |
| bigbench_salient_translation_error_detection      |       0 | multiple_choice_grade | 16.23 | ± |   1.17 |
| bigbench_snarks                                   |       0 | multiple_choice_grade | 64.09 | ± |   3.58 |
| bigbench_sports_understanding                     |       0 | multiple_choice_grade | 70.69 | ± |   1.45 |
| bigbench_temporal_sequences                       |       0 | multiple_choice_grade | 37.70 | ± |   1.53 |
| bigbench_tracking_shuffled_objects_five_objects   |       0 | multiple_choice_grade | 23.44 | ± |   1.20 |
| bigbench_tracking_shuffled_objects_seven_objects  |       0 | multiple_choice_grade | 17.60 | ± |   0.91 |
| bigbench_tracking_shuffled_objects_three_objects  |       0 | multiple_choice_grade | 48.67 | ± |   2.89 |

Average: 42.79%

Average score: 54.87%

Elapsed time: 02:53:28

## Citations

Fernando Fernandes Neto and Eric Hartford. "Optimizing Large Language Models Using Layer-Selective Rank Reduction and Random Matrix Theory." 2024.

```
@article{sharma2023truth,
  title={The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction},
  author={Sharma, Pratyusha and Ash, Jordan T and Misra, Dipendra},
  journal={arXiv preprint arXiv:2312.13558},
  year={2023}
}

@article{gao2021framework,
  title={A framework for few-shot language model evaluation},
  author={Gao, Leo and Tow, Jonathan and Biderman, Stella and Black, Sid and DiPofi, Anthony and Foster, Charles and Golding, Laurence and Hsu, Jeffrey and McDonell, Kyle and Muennighoff, Niklas and others},
  journal={Version v0. 0.1. Sept},
  year={2021}
}
```