license: mit
library_name: transformers
---

![SpiralAI Spiral-RetNet-3b-base](logo.png)

# SpiralAI Spiral-RetNet-3b-base

We pre-trained a 3b-parameter model based on the RetNet architecture (https://arxiv.org/abs/2307.08621) from scratch on a mixed dataset of Japanese and English.
This model is released primarily for basic research on the retention mechanism.
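
As background for that research goal: in its recurrent form, retention (as described in the RetNet paper) keeps a fixed-size state matrix that is decayed by a factor gamma and updated with each token's key-value outer product, then read out with the query. Below is a minimal single-head sketch of that recurrence; the shapes, the hand-picked `gamma`, and all names are our own illustrative choices (the real model uses multi-scale decays and learned projections), not code from this repository.

```python
import numpy as np

def recurrent_retention(Q, K, V, gamma=0.9):
    """Single-head retention in its recurrent form, per the RetNet paper:
    S_n = gamma * S_{n-1} + k_n^T v_n ;  o_n = q_n S_n.
    Q, K: (seq_len, d_k); V: (seq_len, d_v). Shapes and gamma are illustrative."""
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))                 # recurrent state: size independent of seq_len
    outputs = []
    for q_n, k_n, v_n in zip(Q, K, V):
        S = gamma * S + np.outer(k_n, v_n)   # decay the old state, write the new key-value pair
        outputs.append(q_n @ S)              # read out with the current query
    return np.stack(outputs)

# Toy usage: a 5-token sequence with d = 8 (self-attention-style Q = K = V).
x = np.random.randn(5, 8)
print(recurrent_retention(x, x, x).shape)    # (5, 8)
```

The per-token state update gives constant memory and compute per generated token, which is the property that motivates studying retention as an alternative to attention.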

# Model Description

- **Developed by:** [SpiralAI](https://go-spiral.ai/)
- **Model type:** `SpiralAI Spiral-RetNet-3b-base` is a language model equipped with a retention mechanism. It uses the `cyberagent/calm2-7b-chat` tokenizer (see the loading sketch after this list).
- **Languages:** Japanese, English.
- **License:** MIT
- **Training:** Trained on 80b tokens.
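
Since the card lists `transformers` as the library, loading should look roughly like the sketch below. Assumptions on our part: `trust_remote_code=True` (RetNet is not a built-in `transformers` architecture, so the repository presumably ships custom modeling code), and the prompt and generation length are arbitrary.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The tokenizer is the one named above; trust_remote_code=True is our
# assumption, since RetNet is not a built-in transformers architecture.
tokenizer = AutoTokenizer.from_pretrained("cyberagent/calm2-7b-chat")
model = AutoModelForCausalLM.from_pretrained(
    "Spiral-AI/Spiral-RetNet-3b-base",
    trust_remote_code=True,
)

inputs = tokenizer("こんにちは、", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)  # length is illustrative
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```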

## Test loss comparison

We compared the test loss of `Spiral-AI/Spiral-RetNet-3b-base` and `cyberagent/open-calm-3b` at different token lengths (a sketch of the measurement procedure follows the findings below).
The test dataset consists of the first 100 examples extracted from `wikipedia-ja`.

![test_loss](loss_comparison.png)

Key findings are:

- The test loss of `Spiral-AI/Spiral-RetNet-3b-base` goes as low as that of `cyberagent/open-calm-3b`, showing the effectiveness of the retention mechanism.
- The explosion of test loss is suppressed in `Spiral-AI/Spiral-RetNet-3b-base` when the context length exceeds 2,048 tokens, the maximum context length of the training data (note that `cyberagent/open-calm-3b` was trained with the same context length).
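
A comparison like this can be reproduced by measuring average next-token cross-entropy at several truncation lengths. The sketch below is a minimal version under our own assumptions: `texts` stands in for the first 100 `wikipedia-ja` examples, the length grid is arbitrary, and we assume the model accepts the standard `labels` argument; the card does not publish its exact evaluation script.

```python
import torch

def loss_at_length(model, tokenizer, texts, ctx_len):
    """Average next-token cross-entropy over `texts`, each truncated to ctx_len tokens."""
    model.eval()
    losses = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt",
                        truncation=True, max_length=ctx_len).input_ids
        if ids.shape[1] < 2:                        # need at least one next-token target
            continue
        with torch.no_grad():
            out = model(input_ids=ids, labels=ids)  # labels are shifted internally
        losses.append(out.loss.item())
    return sum(losses) / len(losses)

# Hypothetical driver: `texts` = first 100 wikipedia-ja examples.
# for ctx_len in (512, 1024, 2048, 4096, 8192):
#     print(ctx_len, loss_at_length(model, tokenizer, texts, ctx_len))
```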

# Training Datasets