metadata
language:
  - ko
  - en
pipeline_tag: text-generation
inference: false
tags:
  - solar
  - mistral
  - pytorch
  - solar-ko
library_name: transformers
license: apache-2.0

Update Log

  • 2024.05.16: Released Solar-Ko-Recovery

Solar-Ko-Recovery ⭐🇰🇷🇺🇸

Solar-Ko-Recovery aims to recover Solar's Korean capability by rearranging the embeddings and LM head. It features an expanded vocabulary and is trained on a Korean+English corpus for enhanced representation.

Model Details

Model Developers: Junbum Lee (Beomi)

Variations: Solar-Ko-Recovery is available in one parameter size: 10.8B.

Input: The model accepts only text input.

Output: The model produces text output exclusively.

Model Architecture:

Solar-Ko-Recovery is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.

| Model | Training Data | Parameters | Content Length | GQA | Tokens | Learning Rate |
|---|---|---|---|---|---|---|
| Solar-Ko-Recovery | A curated mix of Korean+English Corpora | 10.8B | 4k | O | >30B* | 5e-5 |

Vocab Expansion

Vocab expansion is conducted on an edited version of upstage/solar-1-mini-tokenizer, which is a superset of the Solar tokenizer.

| Model Name | Vocabulary Size | Description |
|---|---|---|
| Original Solar | 32000 | Sentencepiece BPE |
| solar-1-mini-tokenizer | 64000 | Sentencepiece BPE. Added Ko/JP vocabs |
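
As a rough illustration of what "superset" means here, the sketch below checks that every token of a base vocabulary also appears in an expanded one, and lists the tokens that were added. The two dictionaries are tiny hypothetical stand-ins, not the real 32000- and 64000-entry vocabularies; with Transformers, the real mappings could be obtained from each tokenizer's `get_vocab()`.

```python
def is_superset(expanded_vocab: dict[str, int], base_vocab: dict[str, int]) -> bool:
    """Return True if every token in base_vocab also exists in expanded_vocab."""
    return all(token in expanded_vocab for token in base_vocab)


def added_tokens(expanded_vocab: dict[str, int], base_vocab: dict[str, int]) -> list[str]:
    """Return tokens present only in expanded_vocab, ordered by their ids."""
    new_tokens = set(expanded_vocab) - set(base_vocab)
    return sorted(new_tokens, key=lambda token: expanded_vocab[token])


# Hypothetical toy vocabularies (assumption: real ones map token -> id the same way).
base_vocab = {"<unk>": 0, "▁": 1, "▁Meet": 2, "안": 3}
expanded_vocab = {**base_vocab, "▁안녕하세요": 4, "▁오늘은": 5}

print(is_superset(expanded_vocab, base_vocab))
print(added_tokens(expanded_vocab, base_vocab))
```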

Tokenizing "안녕하세요, 오늘은 날씨가 좋네요." ("Hello, the weather is nice today.")

  • SOLAR-10.7B: 26 tokens
  • Solar-Ko-Recovery: 7 tokens
| Model | Tokens |
|---|---|
| SOLAR-10.7B | ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '좋', '네', '요', '.'] |
| Solar-Ko-Recovery | ['▁안녕하세요', ',', '▁오늘은', '▁날씨가', '▁좋', '네요', '.'] |
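
The `<0x..>` entries in the SOLAR-10.7B row are byte-fallback tokens: a character missing from the vocabulary is spelled out as its UTF-8 bytes. The sketch below reproduces that effect with a toy character-level tokenizer. It is not SentencePiece BPE, and `toy_vocab` is a hypothetical four-character vocabulary chosen only to mirror the first word of the example.

```python
def byte_fallback(piece: str) -> list[str]:
    """Spell a string as SentencePiece-style byte tokens such as '<0xEB>'."""
    return [f"<0x{byte:02X}>" for byte in piece.encode("utf-8")]


def toy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Emit known characters as-is and unknown ones as UTF-8 byte tokens."""
    tokens: list[str] = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            tokens.extend(byte_fallback(ch))
    return tokens


# Hypothetical vocabulary: '녕' is deliberately absent, as in the original Solar tokenizer.
toy_vocab = {"안", "하", "세", "요"}
print(toy_tokenize("안녕하세요", toy_vocab))
# → ['안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요']
```

Each Korean syllable occupies three bytes in UTF-8, so every out-of-vocabulary syllable costs three tokens, which is why the original tokenizer needs far more tokens for Korean text.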

Tokenizing "Meet 10.7B Solar: Elevating Performance with Upstage Depth UP Scaling!"

  • SOLAR-10.7B: 22 tokens
  • Solar-Ko-Recovery: 22 tokens
| Model | Tokens |
|---|---|
| SOLAR-10.7B | ['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!'] |
| Solar-Ko-Recovery | ['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev', 'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th', '▁UP', '▁Scal', 'ing', '!'] |
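
The two examples above can be summarized as a token-count ratio. The snippet below hard-codes the token lists reported in this card and computes how many original-tokenizer tokens correspond to one Solar-Ko-Recovery token; the helper name `token_ratio` is introduced here only for illustration.

```python
def token_ratio(before: list[str], after: list[str]) -> float:
    """Number of tokens in `before` per token in `after`."""
    return len(before) / len(after)


# Token lists copied from the Korean example in this card.
solar_ko = ['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',',
            '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은',
            '▁', '날', '<0xEC>', '<0x94>', '<0xA8>', '가',
            '▁', '좋', '네', '요', '.']
recovery_ko = ['▁안녕하세요', ',', '▁오늘은', '▁날씨가', '▁좋', '네요', '.']

# Both tokenizers produce this same list for the English example.
english = ['▁Meet', '▁', '1', '0', '.', '7', 'B', '▁Solar', ':', '▁E', 'lev',
           'ating', '▁Performance', '▁with', '▁Up', 'stage', '▁Dep', 'th',
           '▁UP', '▁Scal', 'ing', '!']

print(f"Korean:  {len(solar_ko)} -> {len(recovery_ko)} tokens "
      f"({token_ratio(solar_ko, recovery_ko):.2f}x)")
print(f"English: {len(english)} -> {len(english)} tokens "
      f"({token_ratio(english, english):.2f}x)")
```

For this one Korean sentence the expanded vocabulary uses roughly 3.7 times fewer tokens, while the English sentence is unchanged. A single sentence is only an illustration, not a benchmark.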

LICENSE

Apache 2.0

Model Benchmark

LM Eval Harness - Korean

TBD

Citation

TBD

Acknowledgements