MiniBananaMind-v2-9M is a small causal language model trained from scratch on FineWeb-Edu.
The model has about 8.9M parameters and uses a custom 8k-token byte-level BPE tokenizer.
| Field | Value |
|---|---|
| Parameters | ~8.9M |
| Architecture | Custom Llama-style decoder |
| Layers | 9 |
| Hidden size | 256 |
| Intermediate size | 768 |
| Attention heads | 8 |
| KV heads | 2 |
| Vocabulary size | 8,192 |
| Context length | 2,048 |
| Weight format | safetensors |
| Training precision | BF16 |
| Checkpoint used | checkpoint-6755 |
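The ~8.9M figure can be cross-checked against the table. The sketch below is only an estimate under assumptions the card does not confirm: a standard Llama layout (bias-free `q/k/v/o` projections, a gated MLP with three matrices, two RMSNorm weights per layer plus a final norm) and input/output embeddings that share weights. The function name `llama_param_count` is hypothetical.

```python
def llama_param_count(vocab, hidden, inter, layers, heads, kv_heads,
                      tie_embeddings=True):
    """Estimate parameters for a Llama-style decoder (assumed layout, no biases)."""
    head_dim = hidden // heads
    embed = vocab * hidden                       # token embedding matrix
    attn = 2 * hidden * hidden                   # q_proj and o_proj
    attn += 2 * hidden * (kv_heads * head_dim)   # k_proj and v_proj (grouped-query)
    mlp = 3 * hidden * inter                     # gate, up, and down projections
    norms = 2 * hidden                           # two RMSNorm weights per layer
    total = embed + layers * (attn + mlp + norms) + hidden  # + final norm
    if not tie_embeddings:
        total += vocab * hidden                  # separate output head
    return total


# Values taken from the table above.
tied = llama_param_count(8192, 256, 768, 9, 8, 2, tie_embeddings=True)
untied = llama_param_count(8192, 256, 768, 9, 8, 2, tie_embeddings=False)
print(f"tied: {tied:,}  untied: {untied:,}")
```

Under these assumptions the tied variant comes to 8,884,992 parameters, which matches the stated ~8.9M; the untied variant would be closer to 11M.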
MiniBananaMind-v2-9M was trained on the sample-10BT subset of HuggingFaceFW/fineweb-edu.
The model was trained from scratch. No benchmark datasets were used for training.
The final checkpoint used for this model is checkpoint-6755.
It was also referred to by the name leaf-alpha.
| Benchmark | Metric | Score |
|---|---|---|
| HellaSwag | acc_norm | 27.04% |
| ARC-Easy | acc_norm | 33.92% |
| ARC-Challenge | acc_norm | 20.73% |
| PIQA | acc | 55.06% |
| ArithMark-2 | acc | 25.32% |
Average across the 5 listed tasks: 32.41%
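The average is an unweighted mean of the five scores in the table. A minimal sketch that reproduces it (the dictionary name `scores` is illustrative):

```python
# Scores copied from the benchmark table, in percent.
scores = {
    "HellaSwag": 27.04,
    "ARC-Easy": 33.92,
    "ARC-Challenge": 20.73,
    "PIQA": 55.06,
    "ArithMark-2": 25.32,
}

# Unweighted mean: every task counts equally, regardless of its size.
average = sum(scores.values()) / len(scores)
print(f"Average across {len(scores)} tasks: {average:.2f}%")
```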
This model uses custom architecture code, so it must be loaded with trust_remote_code=True.
Example:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "BananaMind/MiniBananaMind-v2-9M"

# trust_remote_code=True is required because the architecture code ships with the repo.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
).cuda().eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Sample up to 80 new tokens with temperature and nucleus (top-p) sampling.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
For deterministic output, disable sampling to use greedy decoding:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "BananaMind/MiniBananaMind-v2-9M"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
).cuda().eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# do_sample=False always picks the most likely next token.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=False,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
```
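With 2 KV heads against 8 attention heads, the model uses grouped-query attention, which shrinks the key/value cache kept during generation. The sketch below estimates that cache at the full 2,048-token context. It assumes a conventional per-layer KV cache, FP16 cache values (2 bytes each, matching the loading examples), and a batch size of 1; the helper name `kv_cache_bytes` is hypothetical and the custom architecture code may cache differently.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch=1):
    """Bytes needed to cache keys and values for every layer (assumed layout)."""
    # Factor of 2 covers one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value * batch


head_dim = 256 // 8  # hidden size / attention heads, from the config table

gqa = kv_cache_bytes(layers=9, kv_heads=2, head_dim=head_dim, seq_len=2048)
mha = kv_cache_bytes(layers=9, kv_heads=8, head_dim=head_dim, seq_len=2048)

print(f"2 KV heads: {gqa / 2**20:.1f} MiB")
print(f"8 KV heads: {mha / 2**20:.1f} MiB (hypothetical full multi-head)")
```

Under these assumptions the cache is about 4.5 MiB at full context, a quarter of what 8 KV heads would need.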
Training took 4 hours and 34 minutes on an NVIDIA GeForce RTX 5070 Ti.