---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- solar
- mistral
- pytorch
- solar-ko
library_name: transformers
license: apache-2.0
---

**Update Log**

- 2024.05.16: Released Solar-Ko-Recovery

# **Solar-Ko-Recovery** ⭐🇰🇷🇺🇸

Solar-Ko-Recovery aims to recover Solar's Korean capability by rearranging the embeddings and LM head, featuring an expanded vocabulary and a Korean+English corpus for enhanced representation.

## Model Details

**Model Developers:** Junbum Lee (Beomi)

**Variations:** Solar-Ko-Recovery is available in a single parameter size: 10.8B.

**Input:** The model accepts text input only.

**Output:** The model produces text output only.

**Model Architecture:** Solar-Ko-Recovery is an auto-regressive language model built on an optimized transformer architecture derived from Llama-2.

| |Training Data|Parameters|Content Length|GQA|Tokens|Learning Rate|
|---|---|---|---|---|---|---|
|Solar-Ko-Recovery|*A curated mix of Korean+English corpora*|10.8B|4k|O|>30B*|5e-5|

**Vocab Expansion**

Vocab expansion was conducted on an edited [upstage/solar-1-mini-tokenizer](https://huggingface.co/upstage/solar-1-mini-tokenizer), which is a superset of the Solar tokenizer.

| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Solar | 32000 | SentencePiece BPE |
| **solar-1-mini-tokenizer** | 64000 | SentencePiece BPE; Korean/Japanese vocab added |

**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."** ("Hello, the weather is nice today.")

- SOLAR-10.7B: 26 tokens
- Solar-Ko-Recovery: 7 tokens

| Model | Tokens |
| --- | --- |
| SOLAR-10.7B | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']` |
| Solar-Ko-Recovery | `['▁안녕하세요', ',', '▁오늘은', '▁날씨가', '▁좋', '네요', '.']` |

**Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"**

- SOLAR-10.7B: 22 tokens
- Solar-Ko-Recovery: 22 tokens

| Model | Tokens |
| --- | --- |
| SOLAR-10.7B | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
| Solar-Ko-Recovery | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |

# LICENSE

Apache 2.0

# **Model Benchmark**

## LM Eval Harness - Korean

- Used EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
- 5-shot scores

TBD

## Citation

TBD

## Acknowledgements

- Training support was provided by the [TPU Research Cloud](https://sites.research.google/trc/) program.
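## Usage

A minimal loading and generation sketch with `transformers`. The repo ID `beomi/Solar-Ko-Recovery-11B` is an assumption for illustration, as this card does not state the published ID; substitute the actual one. The Korean prompt means "The capital of Korea is".

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/Solar-Ko-Recovery-11B"  # assumed repo ID; replace with the actual one

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 10.8B parameters; bf16 roughly halves memory vs fp32
    device_map="auto",
)

prompt = "한국의 수도는"  # "The capital of Korea is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```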
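The tokenization comparison shown above can be reproduced with a sketch like the following; it assumes the base model is available as `upstage/SOLAR-10.7B-v1.0` and reuses the assumed `beomi/Solar-Ko-Recovery-11B` ID from the usage example.

```python
from transformers import AutoTokenizer

# Both repo IDs are assumptions for illustration.
base = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0")
recovery = AutoTokenizer.from_pretrained("beomi/Solar-Ko-Recovery-11B")

text = "안녕하세요, 오늘은 날씨가 좋네요."
print(len(base.tokenize(text)), base.tokenize(text))          # expected: 26 tokens, with byte-fallback pieces
print(len(recovery.tokenize(text)), recovery.tokenize(text))  # expected: 7 tokens, whole Korean words
```

Fewer tokens per Korean sentence means more effective context within the same 4k window and fewer decoding steps per generated word.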