
jamba-900M-v0.13-KIx2

Open In Colab

The inference API widget is off because this architecture isn't supported by hf yet. Try the Colab instead.

This is a pretraining experiment on the jamba arch as a "smol MoE".

Details:

  • pretrained at a context length of 16384
  • has seen approx. 20B tokens
  • uses the Claude3 tokenizer (as an hf GPT2 tokenizer)
  • hidden size 1024, 12 layers, 8 experts
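
For readers who want to try the model outside the Colab, here is a minimal loading sketch. It assumes the repo id `pszemraj/jamba-900M-v0.13-KIx2` and reuses the `trust_remote_code=True` flag from the eval config further down; the `generate` helper and its defaults are hypothetical, and depending on your environment the Jamba implementation may need extra dependencies or settings that are not covered here.

```python
# Sketch only: the repo id and trust_remote_code come from this card,
# the helper name and its defaults are illustrative assumptions.
MODEL_ID = "pszemraj/jamba-900M-v0.13-KIx2"


def generate(prompt, max_new_tokens=64):
    """Load the checkpoint and return a plain-text continuation of `prompt`."""
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the weights on first use):
# print(generate("The capital of France is"))
```

This is a base model from a pretraining experiment, so expect raw text continuation rather than instruction following.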

It achieves the following results on the evaluation set (most recent dataset):

  • Loss: 3.0366
  • Accuracy: 0.4514
  • Num Input Tokens Seen: 1975517184
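
To put the loss in more familiar terms, it can be converted to a per-token perplexity. The sketch below assumes the reported loss is a mean cross-entropy in nats over tokens, which is the usual convention for causal LM training but is not stated explicitly here.

```python
import math

# Evaluation loss reported above.
EVAL_LOSS = 3.0366


def perplexity(loss):
    """Per-token perplexity, assuming `loss` is mean cross-entropy in nats."""
    return math.exp(loss)


print(f"perplexity ~ {perplexity(EVAL_LOSS):.1f}")
```

Under that assumption the loss corresponds to a perplexity of roughly 20.8 on the evaluation set.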

If I pretrain it further, later versions will go in new repos with an incremented version number (this is v0.13).

Quick eval

Quick eval for: pszemraj/jamba-H1024_L12-v0.13-KIx2

hf (pretrained=pszemraj/jamba-H1024_L12-v0.13-KIx2,trust_remote_code=True,dtype=float), gen_kwargs: (None), limit: 0.9999, num_fewshot: None, batch_size: 8

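| Tasks          | Version | Filter | n-shot | Metric     | Value    |   | Stderr |
|----------------|---------|--------|--------|------------|----------|---|--------|
| winogrande     | 1       | none   | 0      | acc        | 0.5067   | ± | 0.0141 |
| piqa           | 1       | none   | 0      | acc        | 0.5912   | ± | 0.0138 |
|                |         | none   | 0      | acc_norm   | 0.5951   | ± | 0.0138 |
| openbookqa     | 1       | none   | 0      | acc        | 0.1800   | ± | 0.0172 |
|                |         | none   | 0      | acc_norm   | 0.2920   | ± | 0.0204 |
| lambada_openai | 1       | none   | 0      | perplexity | 103.1241 | ± | 8.5843 |
|                |         | none   | 0      | acc        | 0.2502   | ± | 0.0122 |
| boolq          | 2       | none   | 0      | acc        | 0.6196   | ± | 0.0136 |
| arc_easy       | 1       | none   | 0      | acc        | 0.3836   | ± | 0.0137 |
|                |         | none   | 0      | acc_norm   | 0.3694   | ± | 0.0136 |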
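
The numbers above could be reproduced along these lines with the lm-evaluation-harness Python API. This is a sketch, not the exact command that was run: the model args, limit, and batch size are copied from the config line of this quick eval, the task list is read off the table, and `run_quick_eval` is a hypothetical helper name.

```python
# Settings copied from the "hf (...)" config line of the quick eval.
MODEL_ARGS = (
    "pretrained=pszemraj/jamba-H1024_L12-v0.13-KIx2,"
    "trust_remote_code=True,dtype=float"
)
# Task names taken from the results table.
TASKS = ["winogrande", "piqa", "openbookqa", "lambada_openai", "boolq", "arc_easy"]


def run_quick_eval():
    """Run the zero-shot quick eval and return the harness result dict."""
    # Imported lazily so this file can be imported without lm_eval installed.
    import lm_eval

    return lm_eval.simple_evaluate(
        model="hf",
        model_args=MODEL_ARGS,
        tasks=TASKS,
        num_fewshot=None,
        batch_size=8,
        limit=0.9999,
    )


# Example call (downloads the model and datasets):
# print(run_quick_eval()["results"])
```

The repo id in `MODEL_ARGS` is kept exactly as it appears in the config line, which differs from the name of this repo.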

Example outputs

[image: example model outputs]

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 80085
  • gradient_accumulation_steps: 32
  • total_train_batch_size: 128
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_ratio: 0.05
  • num_epochs: 2.0
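
To make the scheduler settings concrete, here is a small sketch of a cosine schedule with linear warmup using the values above. It mirrors the common warmup-then-half-cosine shape; the exact trainer implementation is not shown in this card, and the total of 942 optimizer steps is an assumption derived from the final Num Input Tokens Seen divided by tokens per step (128 sequences of 16384 tokens).

```python
import math

PEAK_LR = 5e-5        # learning_rate
WARMUP_RATIO = 0.05   # lr_scheduler_warmup_ratio
TOTAL_STEPS = 942     # assumption: 1975517184 tokens / (128 * 16384) tokens per step


def cosine_lr(step, total_steps=TOTAL_STEPS, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for s in (0, 48, 200, 400, 600, 800, 942):
    print(s, f"{cosine_lr(s):.2e}")
```

With these settings the warmup covers about the first 48 steps, after which the rate decays smoothly from 5e-05 toward zero at the end of training.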

Training results

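| Training Loss | Epoch  | Step | Validation Loss | Accuracy | Input Tokens Seen |
|---------------|--------|------|-----------------|----------|-------------------|
| 3.2013        | 0.4241 | 200  | 3.0653          | 0.4479   | 419430400         |
| 3.1976        | 0.8481 | 400  | 3.0434          | 0.4506   | 838860800         |
| 3.1485        | 1.2722 | 600  | 3.0375          | 0.4513   | 1258291200        |
| 3.1871        | 1.6963 | 800  | 3.0366          | 0.4514   | 1677721600        |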
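
The Input Tokens Seen column can be sanity-checked from the hyperparameters and the 16384 context length. The sketch below assumes a single device and that every sequence is packed to the full context length; both are assumptions, though the arithmetic matches the logged values exactly.

```python
SEQ_LEN = 16384        # pretraining context length
PER_DEVICE_BATCH = 4   # train_batch_size
GRAD_ACCUM = 32        # gradient_accumulation_steps
NUM_DEVICES = 1        # assumption: 4 * 32 * 1 = 128 = total_train_batch_size


def tokens_seen(step):
    """Cumulative input tokens after `step` optimizer steps."""
    sequences_per_step = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
    return step * sequences_per_step * SEQ_LEN


for step in (200, 400, 600, 800):
    print(step, tokens_seen(step))

# The final "Num Input Tokens Seen" implies the total number of optimizer steps.
FINAL_TOKENS = 1975517184
print("total steps:", FINAL_TOKENS // tokens_seen(1))
```

Each optimizer step consumes 128 × 16384 = 2,097,152 tokens, so the final count of 1,975,517,184 tokens corresponds to 942 steps over the 2 epochs.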

Framework versions

  • Transformers 4.40.1
  • Pytorch 2.2.0+cu121
  • Datasets 2.19.0
  • Tokenizers 0.19.1

Model size: 888M params (Safetensors; tensor types F32, BF16)