---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- solar
- mistral
- pytorch
- solar-ko
library_name: transformers
license: apache-2.0
---
**Update Log**
- 2024.05.16: Released Solar-Ko-Recovery
# **Solar-Ko-Recovery** 🌟❤️‍🩹
Solar-Ko-Recovery aims to recover Solar's Korean capability by retraining the embedding and LM head layers, featuring an expanded vocabulary and the inclusion of a Korean+English corpus for enhanced representation.
## Model Details
**Model Developers:** Junbum Lee (Beomi)
**Variations:** Solar-Ko-Recovery is available in one parameter size: 10.8B.
**Input:** The model accepts only text input.
**Output:** The model produces text output exclusively.
**Model Architecture:**
Solar-Ko-Recovery is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.
| |Training Data|Parameters|Context Length|GQA|Tokens|Learning Rate|
|---|---|---|---|---|---|---|
|Solar-Ko-Recovery|*A curated mix of Korean+English Corpora*|10.8B|4k|Yes|>30B*|5e-5|
> NOTE: Only the embedding layer and the LM head layer are trained; the rest of the model is kept frozen.
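The model can be loaded with the standard `transformers` text-generation API. Below is a minimal usage sketch; the Hub id `beomi/Solar-Ko-Recovery` is an assumption (substitute the actual id of this repository), and `device_map="auto"` requires the `accelerate` package.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/Solar-Ko-Recovery"  # assumed repo id; replace with the actual one

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # ~22 GB of weights for 10.8B parameters in bf16
    device_map="auto",
)

# Greedy generation from a short Korean prompt
prompt = "안녕하세요, 오늘은"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```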
**Vocab Expansion**
Vocabulary expansion was conducted on an edited version of [upstage/solar-1-mini-tokenizer](https://huggingface.co/upstage/solar-1-mini-tokenizer), which is a superset of the original Solar tokenizer.
| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Solar | 32000 | SentencePiece BPE |
| **solar-1-mini-tokenizer** | 64000 | SentencePiece BPE; adds Korean/Japanese vocabulary |
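As a quick sanity check, the vocabulary sizes above can be verified by loading both tokenizers. This sketch assumes `upstage/SOLAR-10.7B-v1.0` as the repository carrying the original Solar tokenizer; `upstage/solar-1-mini-tokenizer` is the tokenizer linked above.

```python
from transformers import AutoTokenizer

# Repository ids are assumptions, as noted in the lead-in.
solar_tok = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0")
mini_tok = AutoTokenizer.from_pretrained("upstage/solar-1-mini-tokenizer")

print(len(solar_tok))  # expected: 32000
print(len(mini_tok))   # expected: 64000
```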
**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."**
- SOLAR-10.7B: 26 tokens
- Solar-Ko-Recovery: 7 tokens
| Model | Tokens |
| --- | --- |
| SOLAR-10.7B | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']` |
| Solar-Ko-Recovery | `['▁안녕하세요', ',', '▁오늘은', '▁날씨가', '▁좋', '네요', '.']` |
**Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"**
- SOLAR-10.7B: 22 tokens
- Solar-Ko-Recovery: 22 tokens
| Model | Tokens |
| --- | --- |
| SOLAR-10.7B | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
| Solar-Ko-Recovery | `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` |
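Note that the English tokenization is identical for both models, as expected from a superset vocabulary; only Korean text benefits from the shorter segmentation. Both comparisons can be reproduced with a short script; the repository ids are the same assumptions as in the vocabulary check above, plus `beomi/Solar-Ko-Recovery` for this model's tokenizer.

```python
from transformers import AutoTokenizer

# Repository ids are assumptions, as noted above.
tokenizers = {
    "SOLAR-10.7B": AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0"),
    "Solar-Ko-Recovery": AutoTokenizer.from_pretrained("beomi/Solar-Ko-Recovery"),
}

sentences = [
    "안녕하세요, 오늘은 날씨가 좋네요.",
    "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!",
]

# Print token counts and the token sequences for each model
for text in sentences:
    print(text)
    for name, tok in tokenizers.items():
        tokens = tok.tokenize(text)
        print(f"  {name}: {len(tokens)} tokens -> {tokens}")
```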
# LICENSE
Apache 2.0
# **Model Benchmark**
## LM Eval Harness - Korean
- Used EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
- 5-shot scores
TBD
## Citation
TBD
## Acknowledgements
- Training support was provided by the [TPU Research Cloud](https://sites.research.google/trc/) program.