phi 3 4x4b

a continually pretrained phi3-mini sparse MoE upcycle (~11.1B total params, BF16)
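the card doesn't include the upcycling code itself, but the gist of a sparse MoE upcycle is: clone the dense MLP of each transformer block into several identical experts, bolt a freshly initialized router in front of them, and keep training. a toy pytorch sketch of that idea (module names, sizes, and the top-k setting here are made up for illustration, not the actual phi3-4x4b code):

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpcycledSparseMoE(nn.Module):
    """Toy sparse-MoE layer built by cloning a dense MLP into several experts.

    Everything here is illustrative; it is not the phi3-4x4b modeling code.
    """

    def __init__(self, dense_mlp: nn.Module, hidden_size: int,
                 num_experts: int = 4, top_k: int = 2):
        super().__init__()
        # upcycling: every expert starts as an exact copy of the dense MLP
        self.experts = nn.ModuleList(
            copy.deepcopy(dense_mlp) for _ in range(num_experts)
        )
        # the router/gate is new and randomly initialized
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden)
        logits = self.gate(x)                               # (b, s, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)  # route each token to its top-k experts
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[..., slot] == e              # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out

# usage with illustrative sizes:
# dense = nn.Sequential(nn.Linear(3072, 8192), nn.SiLU(), nn.Linear(8192, 3072))
# moe = UpcycledSparseMoE(dense, hidden_size=3072, num_experts=4, top_k=2)
```

because every expert starts as a copy of the original MLP and the routing weights sum to 1, the upcycled layer initially computes exactly what the dense layer did; continued training then lets the experts drift apart and specialize.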

benchmarks

ran locally

| benchmark | Microsoft/phi-3-4k-instruct | Fizzarolli/phi3-4x4b-v1 |
|---|---|---|
| MMLU acc. (0-shot) | 0.6799 | 0.6781 |
| Hellaswag acc. (0-shot) | 0.6053 | 0.5962 |
| ARC-E acc. (0-shot) | 0.8325 | 0.8367 |
| ARC-C acc. (0-shot) | 0.5546 | 0.5606 |

honestly i was expecting it to do worse :p, but those are all within a margin of error! so it didn't lose any performance, at least
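the card doesn't state which harness produced the local numbers; assuming EleutherAI's lm-evaluation-harness (where these tasks go by mmlu, hellaswag, arc_easy, and arc_challenge), a 0-shot run would look roughly like this:

```python
# pip install lm-eval   (EleutherAI lm-evaluation-harness, 0.4.x API)
import lm_eval

# sketch of a reproduction run; dtype/trust_remote_code/batch_size are assumptions
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Fizzarolli/phi3-4x4b-v1,dtype=bfloat16,trust_remote_code=True",
    tasks=["mmlu", "hellaswag", "arc_easy", "arc_challenge"],
    num_fewshot=0,
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics.get("acc,none"))
```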

open llm leaderboard

todo!

support me on ko-fi!

please i need money to stay alive and keep making models

notes

this was not trained on instruct data. if you prompt it with an instruct format, it will most likely behave no better than phi 3, and possibly worse, since the continued training may have caused some forgetting of the instruct format.
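since this is a base-style continued pretrain, plain text completion is the sensible way to poke at it. a minimal transformers sketch (the trust_remote_code flag and the sampling settings are assumptions, not something the card specifies):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Fizzarolli/phi3-4x4b-v1"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,   # weights are stored in BF16
    device_map="auto",
    trust_remote_code=True,       # assumption: may be needed for the MoE modeling code
)

# treat it like a base model: give it text to continue, not a chat prompt
prompt = "The mixture-of-experts architecture works by"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```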

future experiments

  • the datasets for this were literally chosen on a whim. perhaps experiment with a further filtered HuggingFaceFW/fineweb-edu?
  • actually freeze the gate layers next time (see Chen et al., 2023), oops; a sketch of what that would look like follows this list
  • MOAR TRAINING! this run only got through ~0.2 of an epoch because i ran out of money
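for the record, freezing the router/gate layers during continued training just means turning off their gradients. a hedged sketch, assuming the gate parameters can be located by name (check model.named_parameters() for the real names in the released modeling code):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Fizzarolli/phi3-4x4b-v1",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,   # assumption, as above
)

# freeze every parameter that belongs to a routing/gating module.
# "gate" is an assumed substring; make sure ordinary MLP projections
# (e.g. a gate_up_proj weight) are not caught by accident.
GATE_MARKER = "gate"  # hypothetical; adjust to the actual router module name
frozen = []
for name, param in model.named_parameters():
    if GATE_MARKER in name and "gate_up_proj" not in name:
        param.requires_grad = False
        frozen.append(name)
print(f"froze {len(frozen)} router parameters:", frozen[:4], "...")
```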