	---
	language:
	- ko
	- en
	pipeline_tag: text-generation
	inference: false
	tags:
	- solar
	- mistral
	- pytorch
	- solar-ko
	library_name: transformers
	license: apache-2.0
	base_model: upstage/SOLAR-10.7B-v1.0
	---

	<img src="https://cdn-uploads.huggingface.co/production/uploads/5e56829137cb5b49818287ea/WuiaS45EAWDurGTOtjR_d.png" style="max-width:250px;margin:0 auto;" />

	Update Log

	- 2024.07.01: Released Solar-Ko-Recovery and uploaded benchmark scores
	- 2024.05.16: Released a preview of Solar-Ko-Recovery

	# Solar-Ko-Recovery-11B 🌟❤️‍🩹

	Solar-Ko-Recovery-11B aims to recover Solar's Korean capability by rearranging the embeddings and LM head. It features an expanded vocabulary and is trained on a Korean+English corpus for better representation.

	## Model Details

	Model Developers: Junbum Lee (Beomi)

	Variations: Solar-Ko-Recovery is available in one parameter size: 11B (10.99B🤣).

	Input: The model accepts only text input.

	Output: The model produces text output exclusively.

	Model Architecture:

	Solar-Ko-Recovery is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.

	\| \|Training Data\|Parameters\|Context Length\|GQA\|Tokens\|Learning Rate\|
	\|---\|---\|---\|---\|---\|---\|---\|
	\|Solar-Ko-Recovery\|A curated mix of Korean+English Corpora\|11B(10.99B)\|4k\|O\|>100B*\|5e<sup>-5</sup>\|

	> NOTE: Training was done in 2 steps:
	>
	> 1) Only the embedding layer and the LM head are trained
	> 2) All parameters are trained
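
The two-step schedule in the note above can be sketched as a rule for which parameters receive gradients in each step. This is a minimal illustration, not the training code used for this model: the helper names are hypothetical, and the parameter names `embed_tokens` and `lm_head` are assumed from the Llama/Mistral-style naming used in Hugging Face Transformers.

```python
# Hypothetical helpers illustrating the 2-step schedule.
# ASSUMPTION: parameter names follow the Llama/Mistral convention in
# Hugging Face Transformers ("model.embed_tokens.weight", "lm_head.weight").
EMBEDDING_AND_HEAD = ("embed_tokens", "lm_head")


def is_trainable(param_name: str, stage: int) -> bool:
    """Return True if the parameter should receive gradients in this stage."""
    if stage == 1:
        # Step 1: only the embedding layer and the LM head are trained.
        return any(key in param_name for key in EMBEDDING_AND_HEAD)
    if stage == 2:
        # Step 2: every parameter is trained.
        return True
    raise ValueError(f"unknown stage: {stage}")


def trainable_names(param_names, stage):
    """Filter a list of parameter names down to the trainable ones."""
    return [name for name in param_names if is_trainable(name, stage)]


# A few representative parameter names (illustrative, not a full model).
example_names = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "model.norm.weight",
    "lm_head.weight",
]

print(trainable_names(example_names, stage=1))
print(len(trainable_names(example_names, stage=2)))
```

With a PyTorch model, the same rule could be applied by looping over `model.named_parameters()` and setting `param.requires_grad = is_trainable(name, stage)` before building the optimizer for each step.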

	Vocab Expansion

	Vocabulary expansion was performed on an edited version of [upstage/solar-1-mini-tokenizer](https://huggingface.co/upstage/solar-1-mini-tokenizer), which is a superset of the Solar tokenizer.

	\| Model Name \| Vocabulary Size \| Description \|
	\| --- \| --- \| --- \|
	\| Original Solar \| 32000 \| Sentencepiece BPE \|
	\| solar-1-mini-tokenizer \| 64000 \| Sentencepiece BPE. Added Ko/JP vocabs \|
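
As a rough check on how vocabulary size relates to parameter count, the sketch below estimates the parameters added by growing the vocabulary from 32000 to 64000 entries. The hidden size of 4096, the untied embedding and LM head matrices, and the exact final vocabulary size are all assumptions. This README does not state them, and the tokenizer actually used was an edited version.

```python
# ASSUMPTIONS (not stated in this README): hidden size 4096, separate
# (untied) input embedding and LM head matrices, and an expanded
# vocabulary of exactly 64,000 entries.
HIDDEN_SIZE = 4096
OLD_VOCAB = 32_000
NEW_VOCAB = 64_000


def added_parameters(old_vocab: int, new_vocab: int, hidden_size: int, tied: bool = False) -> int:
    """Parameters added when the vocabulary grows from old_vocab to new_vocab."""
    # Each new token adds one row of size hidden_size to the embedding
    # matrix, and one more to the LM head when the two are not tied.
    matrices = 1 if tied else 2
    return (new_vocab - old_vocab) * hidden_size * matrices


extra = added_parameters(OLD_VOCAB, NEW_VOCAB, HIDDEN_SIZE)
print(f"{extra:,} extra parameters (~{extra / 1e9:.2f}B)")
```

Under these assumptions the added rows come to about 0.26B parameters, which is roughly in line with the gap between the base model's nominal 10.7B and the 10.99B reported here.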

	Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."

	- SOLAR-10.7B: 26 tokens
	- Solar-Ko-Recovery: 7 tokens

	\| Model \| Tokens \|
	\| --- \| --- \|
	\| SOLAR-10.7B \| `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']` \|
	\| Solar-Ko-Recovery \| `['▁안녕하세요', ',', '▁오늘은', '▁날씨가', '▁좋', '네요', '.']` \|
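
The `<0x..>` entries in the SOLAR-10.7B row are byte-fallback tokens: a character missing from the vocabulary is emitted as its raw UTF-8 bytes, one token per byte. The sketch below uses the two token lists from the table and a simplified, hypothetical `detokenize` helper (not the SentencePiece implementation) to show that both lists decode to the same sentence.

```python
import re

# Matches byte-fallback tokens such as "<0xEB>".
BYTE_TOKEN = re.compile(r"^<0x([0-9A-Fa-f]{2})>$")


def detokenize(tokens):
    """Simplified detokenizer: merges byte-fallback tokens and maps '▁' to a space."""
    buffer = bytearray()
    for token in tokens:
        match = BYTE_TOKEN.match(token)
        if match:
            # Byte-fallback token: append the raw byte value.
            buffer.append(int(match.group(1), 16))
        else:
            buffer.extend(token.replace("▁", " ").encode("utf-8"))
    text = buffer.decode("utf-8")
    # Drop the single leading space introduced by the first '▁'.
    return text[1:] if text.startswith(" ") else text


# Token lists copied from the table above.
solar_tokens = ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',',
                '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날',
                '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.']
recovery_tokens = ['▁안녕하세요', ',', '▁오늘은', '▁날씨가', '▁좋', '네요', '.']

print(len(solar_tokens), detokenize(solar_tokens))
print(len(recovery_tokens), detokenize(recovery_tokens))
```

Three of the Korean syllables (녕, 늘, 씨) each cost three byte tokens in the original vocabulary, which accounts for part of the gap between 26 and 7 tokens.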

	Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"

	- SOLAR-10.7B: 22 tokens
	- Solar-Ko-Recovery: 22 tokens

	\| Model \| Tokens \|
	\| --- \| --- \|
	\| SOLAR-10.7B \| `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` \|
	\| Solar-Ko-Recovery \| `['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!']` \|

	# LICENSE

	Apache 2.0

	# Model Benchmark

	## LM Eval Harness - Korean

	- Used EleutherAI's [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness)
	- 5-shot scores

	\| Tasks \| Metric \| Value \| \| Stderr \|
	\|----------------------------------------------------------\|-----------\|--------:\|---\|--------:\|
	\|haerae \|acc_norm \| 0.7874 \|± \| 0.0118 \|
	\| - haerae_general_knowledge \|acc \| 0.5000 \|± \| 0.0378 \|
	\| - haerae_history \|acc \| 0.8723 \|± \| 0.0244 \|
	\| - haerae_loan_word \|acc \| 0.8402 \|± \| 0.0283 \|
	\| - haerae_rare_word \|acc \| 0.8346 \|± \| 0.0185 \|
	\| - haerae_standard_nomenclature \|acc \| 0.8301 \|± \| 0.0305 \|
	\|kmmlu_direct \|exact_match\| 0.4205 \|± \| 0.0026 \|
	\| - kmmlu_direct_accounting \|exact_match\| 0.3700 \|± \| 0.0485 \|
	\| - kmmlu_direct_agricultural_sciences \|exact_match\| 0.3140 \|± \| 0.0147 \|
	\| - kmmlu_direct_aviation_engineering_and_maintenance \|exact_match\| 0.3870 \|± \| 0.0154 \|
	\| - kmmlu_direct_biology \|exact_match\| 0.3510 \|± \| 0.0151 \|
	\| - kmmlu_direct_chemical_engineering \|exact_match\| 0.3910 \|± \| 0.0154 \|
	\| - kmmlu_direct_chemistry \|exact_match\| 0.4000 \|± \| 0.0200 \|
	\| - kmmlu_direct_civil_engineering \|exact_match\| 0.4010 \|± \| 0.0155 \|
	\| - kmmlu_direct_computer_science \|exact_match\| 0.6520 \|± \| 0.0151 \|
	\| - kmmlu_direct_construction \|exact_match\| 0.3080 \|± \| 0.0146 \|
	\| - kmmlu_direct_criminal_law \|exact_match\| 0.3100 \|± \| 0.0328 \|
	\| - kmmlu_direct_ecology \|exact_match\| 0.4660 \|± \| 0.0158 \|
	\| - kmmlu_direct_economics \|exact_match\| 0.5385 \|± \| 0.0439 \|
	\| - kmmlu_direct_education \|exact_match\| 0.6200 \|± \| 0.0488 \|
	\| - kmmlu_direct_electrical_engineering \|exact_match\| 0.3000 \|± \| 0.0145 \|
	\| - kmmlu_direct_electronics_engineering \|exact_match\| 0.4740 \|± \| 0.0158 \|
	\| - kmmlu_direct_energy_management \|exact_match\| 0.3560 \|± \| 0.0151 \|
	\| - kmmlu_direct_environmental_science \|exact_match\| 0.2980 \|± \| 0.0145 \|
	\| - kmmlu_direct_fashion \|exact_match\| 0.4470 \|± \| 0.0157 \|
	\| - kmmlu_direct_food_processing \|exact_match\| 0.3690 \|± \| 0.0153 \|
	\| - kmmlu_direct_gas_technology_and_engineering \|exact_match\| 0.3000 \|± \| 0.0145 \|
	\| - kmmlu_direct_geomatics \|exact_match\| 0.3820 \|± \| 0.0154 \|
	\| - kmmlu_direct_health \|exact_match\| 0.5700 \|± \| 0.0498 \|
	\| - kmmlu_direct_industrial_engineer \|exact_match\| 0.3830 \|± \| 0.0154 \|
	\| - kmmlu_direct_information_technology \|exact_match\| 0.6090 \|± \| 0.0154 \|
	\| - kmmlu_direct_interior_architecture_and_design \|exact_match\| 0.5440 \|± \| 0.0158 \|
	\| - kmmlu_direct_korean_history \|exact_match\| 0.3800 \|± \| 0.0488 \|
	\| - kmmlu_direct_law \|exact_match\| 0.4670 \|± \| 0.0158 \|
	\| - kmmlu_direct_machine_design_and_manufacturing \|exact_match\| 0.3960 \|± \| 0.0155 \|
	\| - kmmlu_direct_management \|exact_match\| 0.5030 \|± \| 0.0158 \|
	\| - kmmlu_direct_maritime_engineering \|exact_match\| 0.4283 \|± \| 0.0202 \|
	\| - kmmlu_direct_marketing \|exact_match\| 0.7460 \|± \| 0.0138 \|
	\| - kmmlu_direct_materials_engineering \|exact_match\| 0.4020 \|± \| 0.0155 \|
	\| - kmmlu_direct_math \|exact_match\| 0.2867 \|± \| 0.0262 \|
	\| - kmmlu_direct_mechanical_engineering \|exact_match\| 0.3490 \|± \| 0.0151 \|
	\| - kmmlu_direct_nondestructive_testing \|exact_match\| 0.3760 \|± \| 0.0153 \|
	\| - kmmlu_direct_patent \|exact_match\| 0.3700 \|± \| 0.0485 \|
	\| - kmmlu_direct_political_science_and_sociology \|exact_match\| 0.5300 \|± \| 0.0289 \|
	\| - kmmlu_direct_psychology \|exact_match\| 0.4470 \|± \| 0.0157 \|
	\| - kmmlu_direct_public_safety \|exact_match\| 0.3520 \|± \| 0.0151 \|
	\| - kmmlu_direct_railway_and_automotive_engineering \|exact_match\| 0.3220 \|± \| 0.0148 \|
	\| - kmmlu_direct_real_estate \|exact_match\| 0.4350 \|± \| 0.0351 \|
	\| - kmmlu_direct_refrigerating_machinery \|exact_match\| 0.3240 \|± \| 0.0148 \|
	\| - kmmlu_direct_social_welfare \|exact_match\| 0.4970 \|± \| 0.0158 \|
	\| - kmmlu_direct_taxation \|exact_match\| 0.3800 \|± \| 0.0344 \|
	\| - kmmlu_direct_telecommunications_and_wireless_technology\|exact_match\| 0.5480 \|± \| 0.0157 \|
	\|kobest_boolq \|acc \| 0.9202 \|± \| 0.0072 \|
	\| \|f1 \| 0.9202 \|± \|N/A \|
	\|kobest_copa \|acc \| 0.8680 \|± \| 0.0107 \|
	\| \|f1 \| 0.8678 \|± \|N/A \|
	\|kobest_hellaswag \|acc \| 0.5560 \|± \| 0.0222 \|
	\| \|f1 \| 0.5520 \|± \|N/A \|
	\| \|acc_norm \| 0.6540 \|± \| 0.0213 \|
	\|kobest_sentineg \|acc \| 0.9824 \|± \| 0.0066 \|
	\| \|f1 \| 0.9824 \|± \|N/A \|
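
One way to read the `Value ± Stderr` columns is to turn them into an approximate 95% confidence interval. The sketch below assumes a normal approximation (value ± 1.96 × stderr); the helper name is hypothetical, and the three scores are copied from the table above.

```python
def confidence_interval(value: float, stderr: float, z: float = 1.96):
    """Approximate confidence interval, clipped to the valid [0, 1] score range.

    ASSUMPTION: the metric is approximately normally distributed, so
    z = 1.96 corresponds to a 95% interval.
    """
    half_width = z * stderr
    return max(0.0, value - half_width), min(1.0, value + half_width)


# (value, stderr) pairs copied from the benchmark table.
scores = {
    "haerae": (0.7874, 0.0118),
    "kmmlu_direct": (0.4205, 0.0026),
    "kobest_boolq": (0.9202, 0.0072),
}

for task, (value, stderr) in scores.items():
    low, high = confidence_interval(value, stderr)
    print(f"{task}: {value:.4f} (95% CI {low:.4f} to {high:.4f})")
```

Subtasks with larger standard errors in the table, such as `kmmlu_direct_health` at 0.0498, give much wider intervals, so small differences on them should be read with caution.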



	## Citation

	TBD

	## Acknowledgements

	- Training support was provided by the [TPU Research Cloud](https://sites.research.google/trc/) program.
