llama3.1-8b-spaetzle-v51
This is only a quick experiment in merging Llama 3 and Llama 3.1 models, despite a number of differences in their tokenizer setups, among other things. It was also motivated by ongoing problems with Llama 3.1 (BOS handling, looping, etc.), especially with llama.cpp, which at the time still lacked full RoPE scaling support. Performance is, of course, not yet satisfactory, which may have a number of causes.
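One quick, hedged sanity check is to compare the special-token setup of the two source models directly; the snippet below is only a sketch, assuming both source repositories (the ones named in the configuration further down) are accessible via `transformers`:

```python
# Sketch: compare BOS/EOS tokens and vocabulary size of the Llama 3 and
# Llama 3.1 source models, since mismatches there are one suspected cause
# of the BOS/looping issues mentioned above.
from transformers import AutoTokenizer

for name in ["cstr/llama3-8b-spaetzle-v34", "sparsh35/Meta-Llama-3.1-8B-Instruct"]:
    tok = AutoTokenizer.from_pretrained(name)
    print(name, tok.bos_token, tok.eos_token, len(tok))
```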
Summary Table
AGIEval Results
| Task | llama3.1-8b-spaetzle-v51 | llama3-8b-spaetzle-v39 |
|---|---|---|
| agieval_aqua_rat | 27.95 | 24.41 |
| agieval_logiqa_en | 38.10 | 37.94 |
| agieval_lsat_ar | 24.78 | 22.17 |
| agieval_lsat_lr | 42.94 | 45.29 |
| agieval_lsat_rc | 59.11 | 62.08 |
| agieval_sat_en | 68.45 | 71.36 |
| agieval_sat_en_without_passage | 38.35 | 44.17 |
| agieval_sat_math | 38.18 | 40.00 |
| Average | 42.23 | 43.43 |
TruthfulQA Results
| Task | llama3.1-8b-spaetzle-v51 | llama3-8b-spaetzle-v39 |
|---|---|---|
| mc1 | 38.07 | 43.82 |
| mc2 | 57.29 | 60.00 |
| Average | 57.29 | 60.00 |
Bigbench Results
| Task | llama3.1-8b-spaetzle-v51 | llama3-8b-spaetzle-v39 |
|---|---|---|
| bigbench_causal_judgement | 56.32 | 59.47 |
| bigbench_date_understanding | 69.65 | 70.73 |
| bigbench_disambiguation_qa | 31.40 | 34.88 |
| bigbench_geometric_shapes | 29.81 | 24.23 |
| bigbench_logical_deduction_five_objects | 30.20 | 36.20 |
| bigbench_logical_deduction_seven_objects | 23.00 | 24.00 |
| bigbench_logical_deduction_three_objects | 55.67 | 65.00 |
| bigbench_movie_recommendation | 33.00 | 36.20 |
| bigbench_navigate | 55.10 | 51.70 |
| bigbench_reasoning_about_colored_objects | 66.55 | 68.60 |
| bigbench_ruin_names | 52.23 | 51.12 |
| bigbench_salient_translation_error_detection | 25.55 | 28.96 |
| bigbench_snarks | 61.88 | 62.43 |
| bigbench_sports_understanding | 51.42 | 53.96 |
| bigbench_temporal_sequences | 59.30 | 53.60 |
| bigbench_tracking_shuffled_objects_five_objects | 23.28 | 22.32 |
| bigbench_tracking_shuffled_objects_seven_objects | 17.31 | 17.66 |
| bigbench_tracking_shuffled_objects_three_objects | 55.67 | 65.00 |
| Average | 44.30 | 45.89 |
(The GPT4All benchmark run broke, so no results are reported for it.)
🧩 Configuration
```yaml
models:
  - model: cstr/llama3-8b-spaetzle-v34
  - model: sparsh35/Meta-Llama-3.1-8B-Instruct
    parameters:
      density: 0.65
      weight: 0.5
merge_method: dare_ties
base_model: cstr/llama3-8b-spaetzle-v34
parameters:
  int8_mask: true
dtype: bfloat16
random_seed: 0
tokenizer_source: base
```
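For reference, a merge with this configuration can be reproduced via mergekit's command-line entry point; this is only a sketch, assuming mergekit is installed and the YAML above is saved as `config.yaml` (the output path is illustrative):

```python
# Sketch: reproduce the merge with mergekit (run in a notebook cell; assumes
# the YAML above was written to config.yaml and a GPU is available for --cuda).
!pip install -qU mergekit
!mergekit-yaml config.yaml ./llama3.1-8b-spaetzle-v51 --cuda
```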
💻 Usage
```python
!pip install -qU transformers accelerate

from transformers import AutoTokenizer
import transformers
import torch

model = "cstr/llama3-8b-spaetzle-v51"
messages = [{"role": "user", "content": "What is a large language model?"}]

# Build the prompt from the model's chat template
tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```
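If generation does not stop cleanly at turn boundaries (related to the looping issues mentioned above), passing the Llama 3 end-of-turn token as an additional terminator can help; this is only a sketch, assuming the merged tokenizer keeps the `<|eot_id|>` token:

```python
# Sketch: add <|eot_id|> as an explicit stop token (assumes the token exists
# in the merged tokenizer's vocabulary).
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]
outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7,
                   top_k=50, top_p=0.95, eos_token_id=terminators)
print(outputs[0]["generated_text"])
```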