leo-pekelis-gradient committed on
Commit
9411de7
1 Parent(s): ab6d0a2

Update README.md

Files changed (1)
  1. README.md +8 -6
README.md CHANGED
@@ -7,9 +7,10 @@ tags:
- llama-3
---

- **[NIAH eval figure here]**

- This model extends LLama-3's context length from 8k to > 130K, developed by Gradient, sponsored by compute from Crusoe Energy. It demonstrates that SOTA LLMs can learn to operate on long context with minimal training (< 200M tokens) by appropriately adjusting RoPE theta.
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/6585dc9be92bc5f258156bd6/F2WLF8_jOx_gttxbPtLK1.png)
+
+ This model extends LLama-3 8B's context length from 8k to > 130K, developed by Gradient, sponsored by compute from Crusoe Energy. It demonstrates that SOTA LLMs can learn to operate on long context with minimal training (< 200M tokens) by appropriately adjusting RoPE theta.

**Approach:**

@@ -25,17 +26,18 @@ We build on top of the EasyContext Blockwise RingAttention library [3] to scalab

For training data, we generate long contexts by augmenting [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B).

+ **Progressive Training Details:**

| Parameter | 65K | 262K |
|-----------------------------|------------|------------|
- | Initialize From | LLaMA-3 7B | 65K |
+ | Initialize From | LLaMA-3 8B | 65K |
| Sequence Length | 2^16 | 2^18 |
| RoPE theta | 15.3 M | 207.1 M |
- | batch_size | 1 | 1 |
- | gradient_accumulation_steps | 32 | 16 |
+ | Batch Size | 1 | 1 |
+ | Gradient Accumulation Steps | 32 | 16 |
| Steps | 30 | 24 |
| Total Tokens | 63 M | 101 M |
- | learning_rate | 2.00E-05 | 2.00E-05 |
+ | Learning Rate | 2.00E-05 | 2.00E-05 |
| # GPUs | 8 | 8 |
| GPU Type | NVIDIA L40S| NVIDIA L40S|

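To make the RoPE-theta adjustment described in the updated README concrete, below is a minimal sketch of loading a Llama-style checkpoint with `rope_theta` and the context window overridden to the 262K-stage values from the table. The model id is a placeholder, and this is illustrative only, not the training code used for this commit.

```python
# Minimal sketch (not Gradient's code): load a Llama-style checkpoint with
# RoPE theta and context window set to the 262K-stage values from the table.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "org/llama-3-8b-long-context"   # hypothetical checkpoint id

config = AutoConfig.from_pretrained(model_id)
config.rope_theta = 207.1e6                # 207.1 M, per the 262K column
config.max_position_embeddings = 2 ** 18   # 262,144-token context

model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```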
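The README says long contexts are generated by augmenting SlimPajama but does not spell out the procedure. One common approach is to pack consecutive documents into fixed-length token sequences; the sketch below assumes that strategy, the Meta-Llama-3-8B tokenizer, and the 65K stage's sequence length, none of which are stated in the diff above.

```python
# Minimal sketch under assumptions (packing strategy and tokenizer are not
# specified in the README): build 2^16-token sequences by packing SlimPajama
# documents back to back.
from datasets import load_dataset
from transformers import AutoTokenizer

TARGET_LEN = 2 ** 16  # 65K-stage sequence length from the table above

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")  # assumed tokenizer
stream = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)

def pack_long_contexts(examples, target_len=TARGET_LEN):
    """Yield lists of token ids of length target_len by concatenating documents."""
    buffer = []
    for ex in examples:
        buffer.extend(tokenizer(ex["text"])["input_ids"])
        while len(buffer) >= target_len:
            yield buffer[:target_len]
            buffer = buffer[target_len:]

# Example: materialize one packed 65K-token training sequence
first_sequence = next(pack_long_contexts(stream))
```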
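As a rough consistency check on the table (assuming the 8 GPUs shard a single long sequence via Ring Attention rather than processing separate batches): tokens per optimizer step are about sequence length times gradient accumulation steps, i.e. 2^16 × 32 ≈ 2.1 M for the 65K stage and 2^18 × 16 ≈ 4.2 M for the 262K stage. Over 30 and 24 steps that gives roughly 63 M and 101 M tokens, matching the Total Tokens row and summing to well under the < 200 M token budget claimed in the README text.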