---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- solar
- mistral
- pytorch
- solar-ko
library_name: transformers
license: apache-2.0
---

**Update Log**

- 2024.05.16: Released Solar-Ko-Recovery

# **Solar-Ko-Recovery** 🌟❤️‍🩹

Solar-Ko-Recovery aims to recover Solar's capability in Korean by re-arranging the embeddings and LM head, featuring an expanded vocabulary and a Korean+English training corpus for enhanced representation.
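
As a quick reference, here is a minimal loading-and-generation sketch using 🤗 Transformers. The repository id below is an assumption for illustration, not a confirmed Hub path.

```python
# Minimal usage sketch; "beomi/Solar-Ko-Recovery" is a hypothetical repo id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/Solar-Ko-Recovery"  # hypothetical; substitute the actual Hub path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half-precision weights for a 10.8B model
    device_map="auto",
)

prompt = "안녕하세요, 오늘은"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```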

## Model Details

**Model Developers:** Junbum Lee (Beomi)

**Variations:** Solar-Ko-Recovery is available in a single parameter size: 10.8B.

**Input:** The model accepts only text input.

**Output:** The model produces text output exclusively.

**Model Architecture:** 

Solar-Ko-Recovery is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.

| |Training Data|Parameters|Content Length|GQA|Tokens|Learning Rate|
|---|---|---|---|---|---|---|
|Solar-Ko-Recovery|*A curated mix of Korean+English Corpora*|10.8B|4k|Yes|>30B*|5e-5|

> NOTE: Only the embedding layer and the LM head are trained; all other weights are kept frozen.
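
For illustration, this kind of partial fine-tune can be set up along the following lines, reusing `model` from the sketch above. This is a hedged sketch assuming the Llama-style module names (`embed_tokens`, `lm_head`) used by Solar checkpoints, not the author's actual training script.

```python
# Illustrative sketch, not the author's training script: freeze everything
# except the embedding layer and the LM head. Module names assume the
# Llama-style layout ("embed_tokens", "lm_head") used by Solar checkpoints.
for name, param in model.named_parameters():
    param.requires_grad = ("embed_tokens" in name) or ("lm_head" in name)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} parameters")
```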

**Vocab Expansion**

Vocab expansion was conducted on an edited [upstage/solar-1-mini-tokenizer](https://huggingface.co/upstage/solar-1-mini-tokenizer), which is a superset of the Solar tokenizer.

| Model Name | Vocabulary Size | Description | 
| --- | --- | --- |
| Original Solar | 32000 | Sentencepiece BPE |
| **solar-1-mini-tokenizer** | 64000 | Sentencepiece BPE. Added Ko/JP vocabs |
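
Pairing the base checkpoint with the expanded 64k tokenizer requires resizing the embedding and LM-head matrices to match. A hedged sketch of that step with the standard Transformers API follows; note the card says an *edited* tokenizer was used, so the exact vocabulary may differ from the public repo.

```python
# Hedged sketch: align a 32k-vocab Solar checkpoint with the 64k tokenizer.
# The card notes an *edited* solar-1-mini-tokenizer was used, so the exact
# vocabulary may differ from the public repo loaded here.
from transformers import AutoTokenizer

expanded = AutoTokenizer.from_pretrained("upstage/solar-1-mini-tokenizer")

# Resizes both the input embeddings and the LM head to the new vocab size;
# the newly added rows are what the recovery training then has to learn.
model.resize_token_embeddings(len(expanded))
```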

**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."**

- SOLAR-10.7B: 26 tokens
- Solar-Ko-Recovery: 7 tokens

| Model | Tokens |
| --- | --- |
| SOLAR-10.7B | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']` |
| Solar-Ko-Recovery | `['▁안녕하세요', ',', '▁오늘은', '▁날씨가', '▁좋', '네요', '.']` |

**Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"**

- SOLAR-10.7B: 22 tokens
- Solar-Ko-Recovery: 22 tokens

| Model | Tokens |
| --- | --- |
| SOLAR-10.7B | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
| Solar-Ko-Recovery | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
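
English segmentation is unchanged because the expanded vocabulary is a superset of the original Solar tokenizer. Both comparisons above can be reproduced along these lines; the first repo id is the upstream base model, while the second is an assumption.

```python
from transformers import AutoTokenizer

texts = [
    "안녕하세요, 오늘은 날씨가 좋네요.",
    "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!",
]

# The first repo id is the upstream base model; the second is hypothetical.
for repo in ("upstage/SOLAR-10.7B-v1.0", "beomi/Solar-Ko-Recovery"):
    tok = AutoTokenizer.from_pretrained(repo)
    for text in texts:
        tokens = tok.tokenize(text)
        print(f"{repo}: {len(tokens)} tokens -> {tokens}")
```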

# LICENSE

Apache 2.0

# **Model Benchmark**

## LM Eval Harness - Korean

- Used EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
- 5-shot scores

TBD
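
Once published, the scores should be reproducible through the harness's Python entry point. A hypothetical invocation sketch, in which the task selection (KoBEST) and repo id are assumptions and the API may vary across harness versions:

```python
# Hypothetical invocation sketch; the task list (KoBEST) and repo id are
# assumptions, and the exact API may vary across harness versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=beomi/Solar-Ko-Recovery",
    tasks=["kobest_boolq", "kobest_copa", "kobest_hellaswag",
           "kobest_sentineg", "kobest_wic"],
    num_fewshot=5,
)
print(results["results"])
```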

## Citation

TBD

## Acknowledgements

- Training support was provided by the [TPU Research Cloud](https://sites.research.google/trc/) program.