---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- solar
- mistral
- pytorch
- solar-ko
library_name: transformers
license: apache-2.0
---
**Update Log**
- 2024.05.16: Released Solar-Ko-Recovery
# **Solar-Ko-Recovery** ⭐🇰🇷🇺🇸
Solar-Ko-Recovery aims to recover Solar's capabilities in Korean by re-arranging the embeddings and LM head, featuring an expanded vocabulary and a Korean+English corpus for enhanced representation.
## Model Details
**Model Developers:** Junbum Lee (Beomi)
**Variations:** Solar-Ko-Recovery is available in a single parameter size: 10.8B.
**Input:** The model accepts only text input.
**Output:** The model produces text output exclusively.
**Model Architecture:**
Solar-Ko-Recovery is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.
| |Training Data|Parameters|Content Length|GQA|Tokens|Learning Rate|
|---|---|---|---|---|---|---|
|Solar-Ko-Recovery|*A curated mix of Korean+English Corpora*|10.8B|4k|O|>30B*|5e-5|
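
As a rough sketch of how the model could be loaded for inference with 🤗 Transformers (the Hub ID below is an assumption for illustration, as are the bf16/device-map settings):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/Solar-Ko-Recovery"  # hypothetical Hub ID, for illustration only

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 10.8B params; bf16 roughly halves memory vs fp32
    device_map="auto",           # requires the `accelerate` package
)

# Base (non-instruct) model, so use a plain continuation prompt
inputs = tokenizer("대한민국의 수도는", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```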
**Vocab Expansion**
Vocab expansion was conducted on an edited [upstage/solar-1-mini-tokenizer](https://huggingface.co/upstage/solar-1-mini-tokenizer), which is a superset of the Solar tokenizer.
| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Solar | 32000 | SentencePiece BPE |
| **solar-1-mini-tokenizer** | 64000 | SentencePiece BPE, with added Korean/Japanese vocab |
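
The vocabulary sizes above can be checked directly; a minimal sketch (requires the `transformers` and `sentencepiece` packages):

```python
from transformers import AutoTokenizer

# solar-1-mini-tokenizer is the expanded tokenizer linked above
tok = AutoTokenizer.from_pretrained("upstage/solar-1-mini-tokenizer")
print(tok.vocab_size)  # expected: 64000, per the table above
```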
**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요." ("Hello, the weather is nice today.")**
- SOLAR-10.7B: 26 tokens
- Solar-Ko-Recovery: 7 tokens
| Model | Tokens |
| --- | --- |
| SOLAR-10.7B | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']` |
| Solar-Ko-Recovery | `['▁안녕하세요', ',', '▁오늘은', '▁날씨가', '▁좋', '네요', '.']` |
**Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"**
- SOLAR-10.7B: 22 tokens
- Solar-Ko-Recovery: 22 tokens
| Model | Tokens |
| --- | --- |
| SOLAR-10.7B | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
| Solar-Ko-Recovery | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
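
The token counts above can be reproduced with a short script along these lines (a sketch: `upstage/SOLAR-10.7B-v1.0` is assumed as the baseline Solar checkpoint, and the Solar-Ko-Recovery Hub ID is hypothetical):

```python
from transformers import AutoTokenizer

# Baseline Solar tokenizer; assumed to live at upstage/SOLAR-10.7B-v1.0
base = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0")
# Hypothetical Hub ID for this model, for illustration only
recovery = AutoTokenizer.from_pretrained("beomi/Solar-Ko-Recovery")

for text in [
    "안녕하세요, 오늘은 날씨가 좋네요.",
    "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!",
]:
    # Korean text should shrink sharply; English counts should stay identical
    print(f"{len(base.tokenize(text)):>3} vs {len(recovery.tokenize(text)):>3} tokens: {text}")
```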
# LICENSE
Apache 2.0
# **Model Benchmark**
## LM Eval Harness - Korean
- Used EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
- 5-shot scores
TBD
## Citation
TBD
## Acknowledgements
- Training support was provided by the [TPU Research Cloud](https://sites.research.google/trc/) program. |