---
license: apache-2.0
tags:
- jamba
- smol MoE
- smol
metrics:
- accuracy
datasets:
- BEE-spoke-data/knowledge-inoc-concat-v1
- BEE-spoke-data/wikipedia-20230901.en-deduped
- BEE-spoke-data/fineweb-100k_en-med
- BEE-spoke-data/fineweb-1M_en-med
- BEE-spoke-data/fineweb-1M_longish
language:
- en
inference: false
---

# jamba-900M-v0.13-KIx2

<a href="https://colab.research.google.com/gist/pszemraj/62d037d0d93656ef2101d7e29e3b7220/jamba-test-sandbox.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

> The hosted inference widget is turned off because the `jamba` architecture isn't supported by the HF inference API yet - try the Colab notebook instead

This is a pretraining experiment on the `jamba` architecture as a "smol MoE".

Details:

- pretrained at context length 16384
- has seen approx. 20B tokens
- uses the Claude 3 tokenizer (packaged as an hf GPT-2 tokenizer)
- hidden size 1024, 12 layers, 8 experts

It achieves the following results on the evaluation set (_most recent dataset_):
- Loss: 3.0366
- Accuracy: 0.4514
- Num Input Tokens Seen: 1975517184

If I pretrain it further, later versions will be published in new repos with an incremented version number (this is v0.13).
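
For a quick smoke test outside the Colab, a minimal loading/generation sketch along these lines should work (the repo id below is taken from the eval log in the next section and may differ from the current repo name; sampling settings are illustrative only):

```python
# Minimal usage sketch - assumes the repo id from the quick-eval log below
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pszemraj/jamba-H1024_L12-v0.13-KIx2"  # assumption: repo id as in the eval log

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,     # custom jamba modeling code
    torch_dtype=torch.float32,  # the quick eval below was run in float
)

prompt = "The meaning of life is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```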

## Quick eval

Quick eval for: `pszemraj/jamba-H1024_L12-v0.13-KIx2`

`hf (pretrained=pszemraj/jamba-H1024_L12-v0.13-KIx2,trust_remote_code=True,dtype=float), gen_kwargs: (None), limit: 0.9999, num_fewshot: None, batch_size: 8`

|    Tasks     |Version|Filter|n-shot|  Metric  | Value  |   |Stderr|
|--------------|------:|------|-----:|----------|-------:|---|-----:|
|winogrande    |      1|none  |     0|acc       |  0.5067|±  |0.0141|
|piqa          |      1|none  |     0|acc       |  0.5912|±  |0.0138|
|              |       |none  |     0|acc_norm  |  0.5951|±  |0.0138|
|openbookqa    |      1|none  |     0|acc       |  0.1800|±  |0.0172|
|              |       |none  |     0|acc_norm  |  0.2920|±  |0.0204|
|lambada_openai|      1|none  |     0|perplexity|103.1241|±  |8.5843|
|              |       |none  |     0|acc       |  0.2502|±  |0.0122|
|boolq         |      2|none  |     0|acc       |  0.6196|±  |0.0136|
|arc_easy      |      1|none  |     0|acc       |  0.3836|±  |0.0137|
|              |       |none  |     0|acc_norm  |  0.3694|±  |0.0136|
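
These numbers come from EleutherAI's `lm-evaluation-harness`; a rough reproduction sketch using its Python API might look like the following (the harness version isn't stated here, so the exact API surface is an assumption inferred from the settings line above):

```python
# Hedged sketch: re-run the quick eval with lm-evaluation-harness (v0.4.x-style API assumed)
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=pszemraj/jamba-H1024_L12-v0.13-KIx2,trust_remote_code=True,dtype=float",
    tasks=["winogrande", "piqa", "openbookqa", "lambada_openai", "boolq", "arc_easy"],
    batch_size=8,
    limit=0.9999,  # matches the limit reported above
)
print(results["results"])
```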

## Example outputs

![image/png](https://cdn-uploads.huggingface.co/production/uploads/60bccec062080d33f875cd0c/wky-qjUtS0AJ6YtIsJh3T.png)

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training (a rough `TrainingArguments` sketch follows the list):
- learning_rate: 5e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 80085
- gradient_accumulation_steps: 32
- total_train_batch_size: 128
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.05
- num_epochs: 2.0
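
For orientation, these map roughly onto a 🤗 `transformers` `TrainingArguments` configuration like the one below. This is a sketch only, not the original training script (which isn't included here); the output path is hypothetical.

```python
# Rough TrainingArguments equivalent of the hyperparameters above (sketch, not the original script)
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./jamba-pretrain",   # hypothetical output path
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    seed=80085,
    gradient_accumulation_steps=32,  # 4 * 32 = 128 total; assumes a single device
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    num_train_epochs=2.0,
    optim="adamw_torch",             # Adam betas=(0.9, 0.999), eps=1e-08 are the defaults
)
```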

### Training results

| Training Loss | Epoch  | Step | Validation Loss | Accuracy | Input Tokens Seen |
|:-------------:|:------:|:----:|:---------------:|:--------:|:-----------------:|
| 3.2013        | 0.4241 | 200  | 3.0653          | 0.4479   | 419430400         |
| 3.1976        | 0.8481 | 400  | 3.0434          | 0.4506   | 838860800         |
| 3.1485        | 1.2722 | 600  | 3.0375          | 0.4513   | 1258291200        |
| 3.1871        | 1.6963 | 800  | 3.0366          | 0.4514   | 1677721600        |


### Framework versions

- Transformers 4.40.1
- Pytorch 2.2.0+cu121
- Datasets 2.19.0
- Tokenizers 0.19.1