BananaMind-1.5-Base

BananaMind-1.5-Base is a small English causal language model trained from scratch by BananaMind.

It is our first fully pretrained medium model.

Model Details

| Field | Value |
| --- | --- |
| Parameters | 75,054,720 |
| Architecture | Llama-style decoder-only Transformer |
| Layers | 12 |
| Hidden size | 640 |
| Intermediate size | 1728 |
| Attention heads | 10 |
| KV heads | 5 |
| Context length | 4096 tokens |
| Vocabulary size | 32,000 |
| Tokenizer | Custom byte-level BPE |
| Training tokens | ~27B tokens |
| Precision | BF16 training, safetensors release |
| Model type | Base causal LM |
| Training cost | $103.31 (please leave a like, this was expensive) |
| Training GPU | RTX Pro 6000 |

[Figure: Benchmarks]

An instruction-tuned version is coming very soon.

Usage

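import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

repo = "BananaMind/BananaMind-1.5-Base"

# Use the GPU with half precision when available, otherwise CPU with float32.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if torch.cuda.is_available() else torch.float32

tok = AutoTokenizer.from_pretrained(repo, trust_remote_code=False)

# `dtype` is the keyword in recent transformers releases;
# older releases call it `torch_dtype`.
model = AutoModelForCausalLM.from_pretrained(
    repo,
    trust_remote_code=False,
    dtype=dtype,
).to(device)

# Inference mode: disables dropout.
model.eval()

# This is a base model, so prompt it with text to continue, not an instruction.
prompt = "The color of the sky is blue. The color of a banana is"
inputs = tok(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=16,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        # Reuse EOS as the padding token during generation.
        pad_token_id=tok.eos_token_id,
        eos_token_id=tok.eos_token_id,
    )

# The decoded output includes the prompt followed by the continuation.
print(tok.decode(out[0], skip_special_tokens=True))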

Generation Settings

Recommended starting settings:

temperature = 0.7
top_p = 0.9
max_new_tokens = 64

For deterministic sanity tests:

do_sample = False
max_new_tokens = 8
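
To make the two sampling settings concrete, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering as they are commonly described. It is an illustration only, not the code path that `transformers` runs; the function name `top_p_filter` and the toy logits are hypothetical.

```python
import math


def top_p_filter(logits, temperature=0.7, top_p=0.9):
    """Return (kept_token_ids, renormalized_probs) after temperature + top-p."""
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk tokens from most to least likely and keep the smallest
    # prefix whose cumulative probability reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1 before sampling.
    return kept, [probs[i] / mass for i in kept]


# Toy logits for a four-token vocabulary (hypothetical values).
ids, probs = top_p_filter([2.0, 1.0, 0.0, -1.0], temperature=0.7, top_p=0.9)
print(ids, [round(p, 3) for p in probs])
```

Lowering `temperature` or `top_p` shrinks the set of candidate tokens, which is why the deterministic setting (`do_sample = False`) is the better choice for sanity tests.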

Training

BananaMind-1.5-Base was trained from scratch on approximately 27B tokens of FineWeb-Edu-style English web text.

The model uses a custom 32k byte-level BPE tokenizer and a compact Llama-style architecture with grouped-query attention.
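
As a rough illustration of what grouped-query attention buys at inference time, the sketch below estimates the key-value cache size at the full 4096-token context for the listed 5 KV heads and for a hypothetical variant with 10 KV heads (one per attention head). It assumes batch size 1, 2 bytes per value (BF16), and a head dimension of hidden size divided by attention heads; the helper `kv_cache_bytes` is not part of any library.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Factor 2: each layer caches one key tensor and one value tensor.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value


# Assumption: head_dim = hidden size / attention heads = 640 / 10.
HEAD_DIM = 640 // 10

gqa = kv_cache_bytes(layers=12, kv_heads=5, head_dim=HEAD_DIM, seq_len=4096)
mha = kv_cache_bytes(layers=12, kv_heads=10, head_dim=HEAD_DIM, seq_len=4096)

print(f"5 KV heads (as listed):      {gqa / 2**20:.0f} MiB")  # 60 MiB
print(f"10 KV heads (hypothetical):  {mha / 2**20:.0f} MiB")  # 120 MiB
```

Under these assumptions, halving the number of KV heads halves the cache, independent of the number of query heads.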

Architecture

BananaMind-1.5-Base uses a compact Llama-style decoder architecture:

  • 12 Transformer layers
  • 640 hidden size
  • 1728 intermediate size
  • 10 attention heads
  • 5 key-value heads
  • grouped-query attention
  • SiLU activation
  • RMSNorm
  • tied input/output embeddings
  • 4096 token context length
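
The listed dimensions can be cross-checked against the reported parameter count. The sketch below assumes a standard Llama-style layout: bias-free linear layers, a gated MLP with three projections, two RMSNorm weights per layer plus a final norm, and a head dimension of hidden size divided by attention heads. The function `llama_param_count` is a hypothetical helper, not a library call.

```python
def llama_param_count(vocab, hidden, intermediate, layers, heads, kv_heads,
                      tied_embeddings=True):
    head_dim = hidden // heads
    kv_dim = kv_heads * head_dim

    embed = vocab * hidden                 # token embedding matrix
    attn = 2 * hidden * hidden             # q_proj and o_proj
    attn += 2 * hidden * kv_dim            # k_proj and v_proj (grouped-query)
    mlp = 3 * hidden * intermediate        # gate, up, and down projections
    norms = 2 * hidden                     # two RMSNorm weights per layer

    total = embed + layers * (attn + mlp + norms) + hidden  # + final RMSNorm
    if not tied_embeddings:
        total += vocab * hidden            # separate output projection
    return total


total = llama_param_count(
    vocab=32_000, hidden=640, intermediate=1728,
    layers=12, heads=10, kv_heads=5,
)
print(f"{total:,}")  # 75,054,720
```

Under these assumptions the result matches the 75,054,720 parameters stated in the model details, with the tied embedding matrix accounting for 20,480,000 of them.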

Evaluation

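BananaMind-1.5-Base has the highest five-benchmark average among the models with complete results below, although GPT-2 124M and Gemma 3 IT 270M score higher on HellaSwag and PIQA: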

| Model | HellaSwag | ARC-Easy | ARC-Challenge | PIQA | ArithMark-2.0 | Average |
| --- | --- | --- | --- | --- | --- | --- |
| BananaMind-1.5-Base | 30.91% | 42.38% | 23.98% | 60.55% | 26.68% | 36.90% |
| Gemma 3 IT 270M | 37.70% | - | - | 66.20% | - | - |
| Zupra-1.6-Instruct-Ultra-Exp | 29.66% | 34.41% | 25.51% | 59.74% | 30.44% | 35.95% |
| KeyLM 75M | 29.66% | 35.73% | 23.98% | 60.50% | 25.80% | 35.13% |
| GPT-2 124M | 31.26% | 39.35% | 22.35% | 62.08% | 26.48% | 36.30% |
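
The Average column appears to be the unweighted mean of the five benchmark scores. The short check below recomputes it from the reported numbers under that assumption; Gemma 3 IT 270M is left out because its row is incomplete.

```python
# Scores in table order: HellaSwag, ARC-Easy, ARC-Challenge, PIQA, ArithMark-2.0.
scores = {
    "BananaMind-1.5-Base": [30.91, 42.38, 23.98, 60.55, 26.68],
    "Zupra-1.6-Instruct-Ultra-Exp": [29.66, 34.41, 25.51, 59.74, 30.44],
    "KeyLM 75M": [29.66, 35.73, 23.98, 60.50, 25.80],
    "GPT-2 124M": [31.26, 39.35, 22.35, 62.08, 26.48],
}

# Averages as reported in the table above.
reported = {
    "BananaMind-1.5-Base": 36.90,
    "Zupra-1.6-Instruct-Ultra-Exp": 35.95,
    "KeyLM 75M": 35.13,
    "GPT-2 124M": 36.30,
}

means = {name: sum(vals) / len(vals) for name, vals in scores.items()}

for name, mean in means.items():
    print(f"{name}: computed {mean:.2f}%, reported {reported[name]:.2f}%")
```

The recomputed means agree with the reported averages to two decimal places.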

[Figure: Benchmarks]

Parameter vs Size

[Figure: Parameter vs size chart]

Citation

@misc{bananamind15base,
  title = {BananaMind-1.5-Base},
  author = {BananaMind},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/BananaMind/BananaMind-1.5-Base}}
}