leo-pekelis-gradient committed on
Commit
7976e88
1 Parent(s): 1ab8322

Update README.md

Files changed (1)
  1. README.md +10 -5
README.md CHANGED
@@ -7,18 +7,23 @@ tags:
   - llama-3
 ---

-**COPY THIS REPO BEFORE MAKING PUBLIC**
-
-[NIAH eval figure here]
+**[NIAH eval figure here]**

 This model extends Llama-3's context length from 8k to > 130K, developed by Gradient, sponsored by compute from Crusoe Energy. It demonstrates that SOTA LLMs can learn to operate on long context with minimal training (< 200M tokens) by appropriately adjusting RoPE theta.

-We used:
+Approach:
+
 - [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the base
 - NTK-aware interpolation [1] to initialize an optimal schedule for RoPE theta, followed by a new data-driven RoPE theta optimization technique
 - progressive training on increasing context lengths similar to the [Large World Model](https://huggingface.co/LargeWorldModel) [2]

-We build on top of the EasyContext Blockwise RingAttention library to scalably and efficiently train on contexts up to 256k in length on Crusoe Energy's high performance L40S cluster. For training data, we generate long contexts from the slimpajama dataset.
+Infra:
+
+We build on top of the EasyContext Blockwise RingAttention library [3] to scalably and efficiently train on contexts up to 256k in length on Crusoe Energy's high-performance L40S cluster.
+
+Data:
+
+For training data, we generate long contexts by augmenting [SlimPajama](https://huggingface.co/datasets/cerebras/SlimPajama-627B).


 | Parameter | 65K | 262K |
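
A few of the techniques referenced in the updated README can be sketched for readers. NTK-aware interpolation [1] stretches Llama-3's rotary frequencies to a longer window by scaling the RoPE base (theta). The function below is an illustrative sketch of that standard scaling rule, not code from this repo; the Llama-3-8B values used in the example (head_dim = 128, rope_theta = 500000) are the published base-model settings.

```python
# Illustrative sketch of NTK-aware RoPE theta scaling (not code from this repo).

def ntk_aware_rope_theta(base_theta: float, head_dim: int, scale: float) -> float:
    """Scale the RoPE base so rotary frequencies interpolate smoothly when the
    context window is extended by a factor of `scale` (NTK-aware interpolation [1])."""
    return base_theta * scale ** (head_dim / (head_dim - 2))

# Llama-3-8B ships with head_dim=128 and rope_theta=500000; going from an 8K
# to a 65K window is roughly an 8x extension.
print(ntk_aware_rope_theta(500_000.0, 128, 65_536 / 8_192))  # ~4.1e6
```

Per the README, this only provides the initialization; the final theta comes from a data-driven optimization on top of this schedule, which is not shown here.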
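
Progressive training on increasing context lengths can be emulated in Hugging Face `transformers` by overriding `rope_theta` and `max_position_embeddings` per stage. The stage lengths and theta values below are hypothetical placeholders, not the schedule used for the 65K and 262K checkpoints.

```python
# Assumed progressive-context setup (hypothetical stages, not the repo's training code).
from transformers import AutoConfig, AutoModelForCausalLM

checkpoint = "meta-llama/Meta-Llama-3-8B-Instruct"

for ctx_len, theta in [(65_536, 4.1e6), (262_144, 1.7e7)]:  # hypothetical (length, theta) stages
    cfg = AutoConfig.from_pretrained(checkpoint)
    cfg.max_position_embeddings = ctx_len  # widen the position range for this stage
    cfg.rope_theta = theta                 # RoPE base, e.g. from the NTK-aware sketch above
    model = AutoModelForCausalLM.from_pretrained(checkpoint, config=cfg, torch_dtype="auto")
    # ... fine-tune on sequences of length ctx_len, save, and point `checkpoint`
    # at the saved weights before the next stage ...
```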
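
For data, the README only says long contexts are generated by augmenting SlimPajama. One common way to do that is to tokenize documents and pack them into fixed-length long sequences; the sketch below assumes that approach and the `text` field of `cerebras/SlimPajama-627B`, and is not the repo's actual pipeline.

```python
# Assumed document-packing approach for long-context samples (not the repo's pipeline).
from datasets import load_dataset
from transformers import AutoTokenizer

def packed_slimpajama(target_len: int = 65_536):
    tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
    ds = load_dataset("cerebras/SlimPajama-627B", split="train", streaming=True)
    buffer: list[int] = []
    for example in ds:
        buffer.extend(tok(example["text"], add_special_tokens=False)["input_ids"])
        buffer.append(tok.eos_token_id)  # mark document boundaries
        while len(buffer) >= target_len:
            yield buffer[:target_len]    # one long-context training sample
            buffer = buffer[target_len:]

# Example: pull a single 65K-token packed sample.
print(len(next(packed_slimpajama())))
```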