
Update Log

  • 2024.05.16: Released Solar-Ko-Recovery

Solar-Ko-Recovery-11B 🌟❤️‍🩹

Solar-Ko-Recovery-11B aims to recover Solar's Korean capability by rearranging the embeddings and LM head, featuring an expanded vocabulary and a Korean+English training corpus for enhanced representation.

Model Details

Model Developers: Junbum Lee (Beomi)

Variations: Solar-Ko-Recovery is available in a single parameter size: 11B (10.99B🤣).

Input: The model accepts only text input.

Output: The model produces text output exclusively.

Model Architecture:

Solar-Ko-Recovery is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.

| Model | Training Data | Parameters | Content Length | GQA | Tokens | Learning Rate |
|---|---|---|---|---|---|---|
| Solar-Ko-Recovery | A curated mix of Korean+English corpora | 10.8B | 4k | O | >30B* | 5e-5 |

NOTE: Only the embedding layer and the LM head layer are trained; all other weights are frozen.
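The embedding-and-head-only training described above can be sketched in PyTorch by freezing every parameter and then unfreezing only the token embedding and LM head. The tiny stand-in model and its layer names below are illustrative assumptions, not the card's actual training code:

```python
import torch.nn as nn

# Minimal stand-in for a decoder LM: token embedding, body, LM head.
# (Hypothetical toy sizes; the real model has ~10.8B parameters.)
class TinyLM(nn.Module):
    def __init__(self, vocab=64000, d=64):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab, d)
        self.layers = nn.ModuleList([nn.Linear(d, d) for _ in range(2)])
        self.lm_head = nn.Linear(d, vocab, bias=False)

model = TinyLM()

# Freeze everything, then unfreeze only the embedding and LM head.
for p in model.parameters():
    p.requires_grad = False
for p in model.embed_tokens.parameters():
    p.requires_grad = True
for p in model.lm_head.parameters():
    p.requires_grad = True

trainable = sorted(n for n, p in model.named_parameters() if p.requires_grad)
print(trainable)  # ['embed_tokens.weight', 'lm_head.weight']
```

An optimizer built from only the trainable parameters then updates just those two matrices while the transformer body keeps Solar's original weights.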

Vocab Expansion

Vocab expansion was conducted on an edited upstage/solar-1-mini-tokenizer, which is a superset of the Solar tokenizer.

| Model Name | Vocabulary Size | Description |
|---|---|---|
| Original Solar | 32000 | Sentencepiece BPE |
| solar-1-mini-tokenizer | 64000 | Sentencepiece BPE; added Ko/JP vocab |
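Growing the vocabulary from 32,000 to 64,000 entries implies resizing the embedding and LM head matrices while keeping the original rows intact. A minimal sketch, assuming new rows are initialized to the mean of the old embeddings (a common heuristic; the card does not state the actual initialization) and using a toy hidden size:

```python
import torch

# Illustrative sizes: 32k -> 64k vocab, hidden size 64 stands in for the real one.
old_embed = torch.randn(32000, 64)

new_embed = torch.empty(64000, 64)
new_embed[:32000] = old_embed          # original Solar rows are preserved
new_embed[32000:] = old_embed.mean(0)  # new Ko/JP rows start at the mean (assumption)

print(new_embed.shape)  # torch.Size([64000, 64])
```

The same expansion is applied to the LM head so that every new vocabulary entry can be both read and predicted.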

Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."

  • SOLAR-10.7B: 26 tokens
  • Solar-Ko-Recovery: 7 tokens
| Model | Tokens |
|---|---|
| SOLAR-10.7B | ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.'] |
| Solar-Ko-Recovery | ['▁안녕하세요', ',', '▁오늘은', '▁날씨가', '▁좋', '네요', '.'] |

Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"

  • SOLAR-10.7B: 22 tokens
  • Solar-Ko-Recovery: 22 tokens
| Model | Tokens |
|---|---|
| SOLAR-10.7B | ['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!'] |
| Solar-Ko-Recovery | ['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!'] |
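The two examples above show why the Korean sentence shrinks from 26 to 7 tokens while the English sentence is unchanged: the original Solar vocabulary lacks many Hangul syllables, so SentencePiece's byte-fallback splits them into raw UTF-8 byte tokens such as `<0xEB> <0x85> <0x95>`. A small sketch reproducing that byte-level rendering (illustrative helper, not the actual tokenizer code):

```python
def byte_fallback_tokens(ch):
    """Render a character the way SentencePiece byte-fallback would:
    one <0xNN> token per UTF-8 byte."""
    return ["<0x%02X>" % b for b in ch.encode("utf-8")]

# '녕' is absent from the original 32k vocab, so it costs three byte tokens.
print(byte_fallback_tokens("녕"))  # ['<0xEB>', '<0x85>', '<0x95>']
print(byte_fallback_tokens("늘"))  # ['<0xEB>', '<0x8A>', '<0x98>']
```

With the expanded 64k vocabulary, whole Korean words like '▁안녕하세요' become single tokens, roughly tripling the effective Korean context length at no cost to English tokenization.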

LICENSE

Apache 2.0

Model Benchmark

LM Eval Harness - Korean

TBD

Citation

TBD

Acknowledgements
