MiniBananaMind-v2-9M

MiniBananaMind-v2-9M is a small causal language model trained from scratch on FineWeb-Edu.

The model has about 8.9M parameters and uses a custom 8k-token byte-level BPE tokenizer.

Model Details

Field	Value
Parameters	~8.9M
Architecture	Custom Llama-style decoder
Layers	9
Hidden size	256
Intermediate size	768
Attention heads	8
KV heads	2
Vocabulary size	8,192
Context length	2,048
Weight format	safetensors
Training precision	BF16
Checkpoint used	checkpoint-6755
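
The parameter count can be sanity-checked from the shapes in the table. The following is a minimal sketch, not the model's actual code: it assumes a standard Llama-style layout (grouped-query attention without biases, a three-matrix SwiGLU MLP, two RMSNorm weights per layer plus a final norm), and the function name llama_param_count and the tie_embeddings flag are hypothetical names introduced here for illustration.

```python
# Sketch only: assumes a standard Llama-style layout, which the model card
# does not spell out. All names here are illustrative.
def llama_param_count(vocab, hidden, inter, layers, heads, kv_heads,
                      tie_embeddings=True):
    head_dim = hidden // heads          # 256 / 8 = 32
    kv_dim = kv_heads * head_dim        # 2 * 32 = 64 (grouped-query attention)
    attn = 2 * hidden * hidden          # q_proj and o_proj
    attn += 2 * hidden * kv_dim         # k_proj and v_proj
    mlp = 3 * hidden * inter            # gate, up and down projections
    norms = 2 * hidden                  # two RMSNorm weight vectors per layer
    per_layer = attn + mlp + norms
    embed = vocab * hidden              # token embedding matrix
    final_norm = hidden
    # A separate output head is only counted if embeddings are not tied.
    lm_head = 0 if tie_embeddings else vocab * hidden
    return embed + layers * per_layer + final_norm + lm_head


tied = llama_param_count(8192, 256, 768, 9, 8, 2, tie_embeddings=True)
untied = llama_param_count(8192, 256, 768, 9, 8, 2, tie_embeddings=False)
print(f"tied: {tied:,}  untied: {untied:,}")
```

Under these assumptions the tied variant comes to 8,884,992 parameters, which is consistent with the ~8.9M in the table, while an untied output head would give about 11.0M. This suggests, but does not confirm, that the input and output embeddings share weights.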

Training Data

MiniBananaMind-v2-9M was trained on:

Dataset: HuggingFaceFW/fineweb-edu
Config: sample-10BT
Text domain: educational web text
Tokenizer: custom 8k byte-level BPE tokenizer
Training tokens seen: ~3.54B tokens after retokenization

The model was trained from scratch. No benchmark datasets were used for training.
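
For a sense of scale, the token budget can be related to the model size and the checkpoint number. This is back-of-the-envelope arithmetic only, and it rests on two assumptions that the card does not state: that checkpoint-6755 denotes optimizer step 6755, and that all ~3.54B tokens had been seen by that step.

```python
# Rough arithmetic from figures in this card; the interpretation of the
# checkpoint number as a step count is an assumption.
tokens_seen = 3.54e9       # "~3.54B tokens after retokenization"
params = 8.9e6             # "~8.9M" parameters
steps = 6755               # assumed: checkpoint-6755 is optimizer step 6755
context_length = 2048

tokens_per_param = tokens_seen / params
tokens_per_step = tokens_seen / steps
sequences_per_step = tokens_per_step / context_length

print(f"tokens per parameter: {tokens_per_param:.0f}")
print(f"tokens per step:      {tokens_per_step:,.0f}")
print(f"sequences per step:   {sequences_per_step:.1f}")
```

If the step-count assumption holds, each step covers close to 2^19 = 524,288 tokens, that is, about 256 full-length sequences of 2,048 tokens. The ratio of roughly 400 training tokens per parameter holds regardless of that assumption.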

Evaluation

The final checkpoint used for this model is checkpoint-6755, which was also referred to as leaf-alpha.

Benchmark	Metric	Score
HellaSwag	acc_norm	27.04%
ARC-Easy	acc_norm	33.92%
ARC-Challenge	acc_norm	20.73%
PIQA	acc	55.06%
ArithMark-2	acc	25.32%

Average across the 5 listed tasks: 32.41%
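
The average can be reproduced from the table, and the scores are easier to read next to random-guess baselines. The sketch below uses approximate chance levels: 25% for HellaSwag (four endings), about 25% for ARC (most questions have four options), and 50% for PIQA (two options). No baseline is assumed for ArithMark-2, since its answer format is not described here.

```python
# Scores copied from the table above (percent).
scores = {
    "HellaSwag": 27.04,
    "ARC-Easy": 33.92,
    "ARC-Challenge": 20.73,
    "PIQA": 55.06,
    "ArithMark-2": 25.32,
}

# Approximate random-guess accuracy; None means no baseline is assumed.
chance = {
    "HellaSwag": 25.0,
    "ARC-Easy": 25.0,
    "ARC-Challenge": 25.0,
    "PIQA": 50.0,
    "ArithMark-2": None,
}

average = sum(scores.values()) / len(scores)
margins = {
    name: round(scores[name] - chance[name], 2)
    for name in scores
    if chance[name] is not None
}

print(f"average: {average:.2f}%")
for name, margin in margins.items():
    print(f"{name}: {margin:+.2f} points vs. chance")
```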

Usage

This model uses custom architecture code, so it must be loaded with trust_remote_code=True.

Example:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "BananaMind/MiniBananaMind-v2-9M"

# trust_remote_code=True lets transformers run the custom code shipped with the repo.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Load the weights in FP16, move the model to a CUDA GPU, and switch to eval mode.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
).cuda().eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Sample up to 80 new tokens with temperature and nucleus (top-p) sampling.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=True,
        temperature=0.8,
        top_p=0.95,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))
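
To illustrate what temperature=0.8 and top_p=0.95 do at a single decoding step, here is a small pure-Python sketch on a toy logit vector. It is not the transformers implementation; the function names nucleus and sample_token and the example logits are made up for illustration. The idea is to divide the logits by the temperature, apply softmax, keep the smallest set of most-likely tokens whose cumulative probability reaches top_p, and draw from that set.

```python
import math
import random


def nucleus(logits, temperature=0.8, top_p=0.95):
    """Return (kept token ids, their probabilities) after temperature and top-p."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                              # subtract max for stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least likely until top_p mass is covered.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept, [probs[i] for i in kept]


def sample_token(logits, temperature=0.8, top_p=0.95, rng=random):
    kept, weights = nucleus(logits, temperature, top_p)
    # random.choices renormalizes the weights over the kept tokens.
    return rng.choices(kept, weights=weights, k=1)[0]


toy_logits = [4.0, 3.0, 1.0, -2.0]                  # hypothetical 4-token vocab
kept_ids, kept_probs = nucleus(toy_logits)
greedy_id = max(range(len(toy_logits)), key=lambda i: toy_logits[i])
sampled_id = sample_token(toy_logits, rng=random.Random(0))
print(kept_ids, greedy_id, sampled_id)
```

With these toy logits only the two most likely tokens survive the top-p cut, so sampling picks one of them, whereas greedy decoding (do_sample=False) always picks the single most likely token.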

Greedy Generation Example

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "BananaMind/MiniBananaMind-v2-9M"

# trust_remote_code=True lets transformers run the custom code shipped with the repo.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Load the weights in FP16, move the model to a CUDA GPU, and switch to eval mode.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.float16,
).cuda().eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Greedy decoding: do_sample=False picks the most likely token at every step.
with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=80,
        do_sample=False,
    )

print(tokenizer.decode(output[0], skip_special_tokens=True))

Training took 4 hours and 34 minutes on an NVIDIA GeForce RTX 5070 Ti.
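
Combined with the token count from the Training Data section, the training time gives a rough throughput figure. This is simple arithmetic that assumes the 4 hours and 34 minutes cover all ~3.54B tokens and nothing else (no tokenization, evaluation, or restarts), which the card does not confirm.

```python
# Rough throughput estimate from figures stated in this card.
tokens_seen = 3.54e9                      # "~3.54B tokens"
train_seconds = 4 * 3600 + 34 * 60        # 4 h 34 min = 16,440 s

tokens_per_second = tokens_seen / train_seconds
print(f"about {tokens_per_second:,.0f} tokens/s")
```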

Safetensors model size: 8.88M params, tensor type F32