---
license: apache-2.0
tags:
- jamba
- smol MoE
- smol
metrics:
- accuracy
datasets:
- BEE-spoke-data/knowledge-inoc-concat-v1
- BEE-spoke-data/wikipedia-20230901.en-deduped
- BEE-spoke-data/fineweb-100k_en-med
- BEE-spoke-data/fineweb-1M_en-med
- BEE-spoke-data/fineweb-1M_longish
language:
- en
inference: false
---
# jamba-900M-v0.13-KIx2
<a href="https://colab.research.google.com/gist/pszemraj/62d037d0d93656ef2101d7e29e3b7220/jamba-test-sandbox.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
> The hosted inference widget is off because `jamba` isn't supported by the HF Inference API yet; try the Colab notebook instead.

This is a pretraining experiment with the `jamba` architecture as a "smol MoE".
Details:

- pretrained at a context length of 16384
- has seen approximately 20B tokens
- uses the Claude 3 tokenizer (loaded as an hf GPT-2 tokenizer)
- hidden size 1024, 12 layers, 8 experts (rough config sketch below)
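For reference, the stated shape maps roughly onto the stock `JambaConfig` in `transformers` >= 4.40. This is a hedged sketch only: the repo ships custom modeling code (`trust_remote_code=True`), so the actual config class and defaults may differ, and anything not stated above is left at its default.

```python
# Hedged config sketch using the stock JambaConfig; the repo's custom code may
# use different field names or defaults.
from transformers import JambaConfig

config = JambaConfig(
    hidden_size=1024,               # stated: hidden size 1024
    num_hidden_layers=12,           # stated: 12 layers
    num_experts=8,                  # stated: 8 experts
    max_position_embeddings=16384,  # stated: pretrained at context length 16384
)
print(config)
```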
It achieves the following results on the evaluation set (from the _most recent dataset_ in training):

- Loss: 3.0366
- Accuracy: 0.4514
- Num input tokens seen: 1,975,517,184

If it is pretrained further, new versions will go in separate repos with an incremented version number (this is v0.13).
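A minimal generation sketch, assuming the standard `transformers` loading path used in the Colab notebook; the prompt and sampling settings below are illustrative, not recommendations.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "pszemraj/jamba-H1024_L12-v0.13-KIx2"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    trust_remote_code=True,     # custom modeling code
    torch_dtype=torch.float32,  # the quick eval below was run with dtype=float
)

prompt = "The main advantage of a mixture-of-experts model is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```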
## Quick eval
Quick eval for `pszemraj/jamba-H1024_L12-v0.13-KIx2`:

`hf (pretrained=pszemraj/jamba-H1024_L12-v0.13-KIx2,trust_remote_code=True,dtype=float)`, gen_kwargs: (None), limit: 0.9999, num_fewshot: None, batch_size: 8
| Tasks |Version|Filter|n-shot| Metric | Value | |Stderr|
|--------------|------:|------|-----:|----------|-------:|---|-----:|
|winogrande | 1|none | 0|acc | 0.5067|± |0.0141|
|piqa | 1|none | 0|acc | 0.5912|± |0.0138|
| | |none | 0|acc_norm | 0.5951|± |0.0138|
|openbookqa | 1|none | 0|acc | 0.1800|± |0.0172|
| | |none | 0|acc_norm | 0.2920|± |0.0204|
|lambada_openai| 1|none | 0|perplexity|103.1241|± |8.5843|
| | |none | 0|acc | 0.2502|± |0.0122|
|boolq | 2|none | 0|acc | 0.6196|± |0.0136|
|arc_easy | 1|none | 0|acc | 0.3836|± |0.0137|
| | |none | 0|acc_norm | 0.3694|± |0.0136|
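The table comes from the `lm-evaluation-harness` run described in the header above. A hedged reproduction sketch via the harness's Python API (assuming `lm_eval` >= 0.4; the exact harness version used for the card is not stated):

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pszemraj/jamba-H1024_L12-v0.13-KIx2,trust_remote_code=True,dtype=float",
    tasks=["winogrande", "piqa", "openbookqa", "lambada_openai", "boolq", "arc_easy"],
    batch_size=8,
    limit=0.9999,
)
print(results["results"])
```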
## Example outputs
![image/png](https://cdn-uploads.huggingface.co/production/uploads/60bccec062080d33f875cd0c/wky-qjUtS0AJ6YtIsJh3T.png)
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training (a rough `TrainingArguments` equivalent is sketched after the list):
- learning_rate: 5e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 80085
- gradient_accumulation_steps: 32
- total_train_batch_size: 128
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 2.0
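For orientation, the list above corresponds roughly to the following `TrainingArguments`, assuming a single-GPU run with the HF `Trainer` (4 per-device x 32 accumulation steps = 128 total batch size); `output_dir` is an illustrative placeholder and logging/saving options are omitted.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="jamba-900M-v0.13-KIx2",  # placeholder
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=32,
    seed=80085,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    num_train_epochs=2.0,
    adam_beta1=0.9,     # Adam with betas=(0.9, 0.999)
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```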
### Training results
| Training Loss | Epoch | Step | Validation Loss | Accuracy | Input Tokens Seen |
|:-------------:|:------:|:----:|:---------------:|:--------:|:-----------------:|
| 3.2013 | 0.4241 | 200 | 3.0653 | 0.4479 | 419430400 |
| 3.1976 | 0.8481 | 400 | 3.0434 | 0.4506 | 838860800 |
| 3.1485 | 1.2722 | 600 | 3.0375 | 0.4513 | 1258291200 |
| 3.1871 | 1.6963 | 800 | 3.0366 | 0.4514 | 1677721600 |
### Framework versions
- Transformers 4.40.1
- Pytorch 2.2.0+cu121
- Datasets 2.19.0
- Tokenizers 0.19.1 |