---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- facebook
- meta
- pytorch
- llama
- llama-2
- kollama
- llama-2-ko
license: mit
library_name: transformers
---
## Update Log

- 2023.12.14: First Release of Open-Llama-2-Ko
# Open-Llama-2-Ko 🦙🇰🇷
Open-Llama-2-Ko is an advanced iteration of Llama 2, benefiting from an expanded vocabulary and further pretraining on a Korean corpus. Like its predecessor, Llama-2-Ko belongs to the family of generative text models that range from 7 billion to 70 billion parameters. This repository focuses on the 7B pretrained version, formatted for the Hugging Face Transformers library.

The main difference between the Llama-2-Ko series and Open-Llama-2-Ko is the dataset: the Open-Llama-2-Ko series uses only publicly accessible Korean corpora, including AI Hub, Modu Corpus (모두의 말뭉치), and Korean Wikipedia.
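For reference, below is a minimal generation sketch using Transformers. The checkpoint id `beomi/open-llama-2-ko-7b` is an assumption; substitute this repository's actual id.

```python
# Minimal sketch, assuming the checkpoint id "beomi/open-llama-2-ko-7b".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/open-llama-2-ko-7b"  # assumed repo id

# use_fast=True is required: the model ships a HF `tokenizers` FastTokenizer,
# not a sentencepiece model (see the note near the end of this card).
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # BF16 needs an NVIDIA GPU; see the note below
    device_map="auto",           # requires the `accelerate` package
)

inputs = tokenizer("안녕하세요, 오늘은", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```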
## Model Details

**Model Developers** Junbum Lee (Beomi)

**Variations** Open-Llama-2-Ko will come in a range of parameter sizes (7B and 13B) as well as pretrained variations.

**Input** Models input text only.

**Output** Models generate text only.

**Model Architecture**

Open-Llama-2-Ko is an auto-regressive language model that uses an optimized transformer architecture based on Llama-2.
| | Training Data | Params | Content Length | GQA | Tokens | LR |
|---|---|---|---|---|---|---|
| Llama 2 | *A new mix of publicly accessible Korean corpora* | 7B | 4k | ✗ | >15B* | 5e-5 |
**Train Corpus**

TBD
## Vocab Expansion

| Model Name | Vocabulary Size | Description |
|---|---|---|
| Original Llama-2 | 32000 | Sentencepiece BPE |
| **Expanded Llama-2-Ko** | 46336 | Sentencepiece BPE. Added Korean vocab and merges |
Tokenizing "์๋ ํ์ธ์, ์ค๋์ ๋ ์จ๊ฐ ์ข๋ค์."
Model | Tokens |
---|---|
Llama-2 | ['โ', '์', '<0xEB>', '<0x85>', '<0x95>', 'ํ', '์ธ', '์', ',', 'โ', '์ค', '<0xEB>', '<0x8A>', '<0x98>', '์', 'โ', '<0xEB>', '<0x82>', '<0xA0>', '์จ', '๊ฐ', 'โ', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '์'] |
Llama-2-Ko | ['โ์๋
', 'ํ์ธ์', ',', 'โ์ค๋์', 'โ๋ ', '์จ๊ฐ', 'โ์ข๋ค์'] |
Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"
Model | Tokens |
---|---|
Llama-2 | ['โL', 'l', 'ama', 'โ', '2', ':', 'โOpen', 'โFoundation', 'โand', 'โFine', '-', 'T', 'un', 'ed', 'โCh', 'at', 'โMod', 'els'] |
Llama-2-Ko | ['โL', 'l', 'ama', 'โ', '2', ':', 'โOpen', 'โFoundation', 'โand', 'โFine', '-', 'T', 'un', 'ed', 'โCh', 'at', 'โMod', 'els'] |
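The comparison above can be reproduced with a short script such as the one below. This is a sketch: the repo ids are assumptions (`meta-llama/Llama-2-7b-hf` is a gated checkpoint, and the Korean id should be replaced with this repository's actual id).

```python
# Sketch for reproducing the tokenization comparison above.
from transformers import AutoTokenizer

text_ko = "안녕하세요, 오늘은 날씨가 좋네요."
text_en = "Llama 2: Open Foundation and Fine-Tuned Chat Models"

# Repo ids are assumptions; substitute as needed.
for repo in ("meta-llama/Llama-2-7b-hf", "beomi/open-llama-2-ko-7b"):
    tok = AutoTokenizer.from_pretrained(repo, use_fast=True)
    for text in (text_ko, text_en):
        print(repo, tok.tokenize(text))
```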
## Model Benchmark

### LM Eval Harness - Korean (polyglot branch)

- Used EleutherAI's lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot

TBD
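Until results are posted, here is a hedged sketch of how one might run Korean benchmarks with that branch of the harness. The `hf-causal` adapter name, the `kobest_*` task names, and the repo id are assumptions based on the harness's conventions, not confirmed by this card.

```python
# Hedged sketch: evaluate with the polyglot branch of lm-evaluation-harness.
# Model adapter, task names, and repo id below are assumptions.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=beomi/open-llama-2-ko-7b",  # assumed repo id
    tasks=["kobest_boolq", "kobest_copa", "kobest_hellaswag"],
    num_fewshot=0,
)
print(results["results"])
```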
## Note for oobabooga/text-generation-webui

Remove the `ValueError` catch in the `load_tokenizer` function (around line 109) of `modules/models.py`:
```diff
diff --git a/modules/models.py b/modules/models.py
index 232d5fa..de5b7a0 100644
--- a/modules/models.py
+++ b/modules/models.py
@@ -106,7 +106,7 @@ def load_tokenizer(model_name, model):
             trust_remote_code=shared.args.trust_remote_code,
             use_fast=False
         )
-    except ValueError:
+    except:
         tokenizer = AutoTokenizer.from_pretrained(
             path_to_model,
             trust_remote_code=shared.args.trust_remote_code,
```
Since Llama-2-Ko uses the FastTokenizer provided by the HF `tokenizers` library, NOT the `sentencepiece` package, you must pass the `use_fast=True` option when initializing the tokenizer.
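For example (a minimal sketch; the repo id is an assumption):

```python
# Sketch: initialize the tokenizer with use_fast=True.
# The repo id "beomi/open-llama-2-ko-7b" is an assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "beomi/open-llama-2-ko-7b",
    use_fast=True,  # required: FastTokenizer, not a sentencepiece model
)
```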
Apple Silicon does not support BF16 computing; use CPU instead. (BF16 is supported when using an NVIDIA GPU.)
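A hedged sketch of picking a safe dtype per device (this fallback logic is an illustration, not part of this card):

```python
# Sketch: choose a dtype the current device can handle.
# BF16 works on recent NVIDIA GPUs; on Apple Silicon (MPS) or CPU,
# fall back to FP32.
import torch

if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = torch.bfloat16
else:
    dtype = torch.float32  # safe fallback for CPU / Apple Silicon
```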
## Citation

TBD
## Acknowledgement

- Training was supported by the TPU Research Cloud program.
- The training corpus comes from AI Hub, Modu Corpus, and Korean Wikipedia.