---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- facebook
- meta
- pytorch
- llama
- llama-2
- kollama
- llama-2-ko
license: mit
library_name: transformers
---

**Update Log**

- 2023.12.14: First release of Open-Llama-2-Ko

# **Open-Llama-2-Ko** 🦙🇰🇷

Open-Llama-2-Ko is an advanced iteration of Llama 2, benefiting from an expanded vocabulary and the inclusion of a Korean corpus in its further pretraining. Like its predecessor, it belongs to the family of generative text models ranging from 7 billion to 70 billion parameters. This repository focuses on the 7B pretrained version, provided in the Hugging Face Transformers format.

The main difference between the Llama-2-Ko series and Open-Llama-2-Ko is the dataset: the Open-Llama-2-Ko series uses only publicly accessible Korean corpora, including [AI Hub](https://www.aihub.or.kr), [Modu Corpus, 모두의 말뭉치](https://corpus.korean.go.kr/), and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).

Because training used only publicly available corpora, this model is open to everyone without restriction. (*This model is released under the MIT License.)

## Model Details

**Model Developers** Junbum Lee (Beomi)

**Variations** Open-Llama-2-Ko will come in a range of parameter sizes, 7B and 13B, as pretrained variations.

**Input** Models input text only.

**Output** Models generate text only.

**Model Architecture** Open-Llama-2-Ko is an auto-regressive language model that uses an optimized transformer architecture based on Llama-2.

||Training Data|Params|Content Length|GQA|Tokens|LR|
|---|---|---|---|---|---|---|
|Open-Llama-2-Ko|*A new mix of publicly accessible Korean corpora*|7B|2k|✗|>15B*|5e-5|

**Train Corpus** Trained on a selection of corpora from AI Hub and Modu Corpus. The detailed list of datasets used to train this model is available below:

- AI Hub: [corpus/AI_HUB](./corpus/AI_HUB)
- Modu Corpus: [corpus/MODU_CORPUS](./corpus/MODU_CORPUS)

The final JSONL dataset used to train this model is 61GB in size.

Total number of tokens: approx. 15B (*counted with the expanded tokenizer; with the original Llama tokenizer, >60B tokens).

**Vocab Expansion**

| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Llama-2 | 32000 | Sentencepiece BPE |
| **Expanded Llama-2-Ko** | 46336 | Sentencepiece BPE. Added Korean vocab and merges |

**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."**

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']` |
| Llama-2-Ko | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']` |

**Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"**

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
| Llama-2-Ko | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
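The token counts above can be checked with the Hugging Face `transformers` tokenizer classes. Below is a minimal sketch, assuming the Hub ids `meta-llama/Llama-2-7b-hf` (original tokenizer, gated) and `beomi/open-llama-2-ko-7b` (this repository); adjust the ids as needed, and note the `use_fast=True` requirement described further below.

```python
# Minimal sketch: compare original Llama-2 tokenization with the expanded Korean tokenizer.
# The Hub ids below are assumptions; substitute this repository's id for EXPANDED_ID,
# and note that the meta-llama repository is gated and may require authentication.
from transformers import AutoTokenizer

ORIGINAL_ID = "meta-llama/Llama-2-7b-hf"   # assumed id of the original Llama-2 tokenizer
EXPANDED_ID = "beomi/open-llama-2-ko-7b"   # assumed id of this repository

sentence = "안녕하세요, 오늘은 날씨가 좋네요."

for name, repo_id in [("Llama-2", ORIGINAL_ID), ("Llama-2-Ko", EXPANDED_ID)]:
    tok = AutoTokenizer.from_pretrained(repo_id, use_fast=True)  # FastTokenizer, per the note below
    tokens = tok.tokenize(sentence)
    print(f"{name}: {len(tokens)} tokens -> {tokens}")
```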
# **Model Benchmark**

## LM Eval Harness - Korean (polyglot branch)

- Used EleutherAI's lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot

TBD

## Note for oobabooga/text-generation-webui

Remove the `ValueError` catch in the `load_tokenizer` function (around line 109) of `modules/models.py`:

```python
diff --git a/modules/models.py b/modules/models.py
index 232d5fa..de5b7a0 100644
--- a/modules/models.py
+++ b/modules/models.py
@@ -106,7 +106,7 @@ def load_tokenizer(model_name, model):
             trust_remote_code=shared.args.trust_remote_code,
             use_fast=False
         )
-    except ValueError:
+    except:
         tokenizer = AutoTokenizer.from_pretrained(
             path_to_model,
             trust_remote_code=shared.args.trust_remote_code,
```

Since Llama-2-Ko uses the FastTokenizer provided by the HF `tokenizers` package rather than the `sentencepiece` package, the `use_fast=True` option is required when initializing the tokenizer.

Apple Silicon does not support BF16 computation; use the CPU instead. (BF16 is supported when using an NVIDIA GPU.)

## Citation

TBD

## Acknowledgement

- The training is supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.
- The training corpus is from [AI Hub](https://www.aihub.or.kr/), [Modu Corpus](https://corpus.korean.go.kr/), and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).
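Putting the notes above together (a FastTokenizer loaded with `use_fast=True`, and BF16 only on NVIDIA GPUs with a CPU fallback elsewhere), a minimal loading sketch might look like the following. The Hub id is an assumption for this repository, and the prompt is just the example sentence from the tokenizer section.

```python
# Minimal loading sketch based on the notes above; not an official usage example.
# The Hub id below is an assumption for this repository.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "beomi/open-llama-2-ko-7b"  # assumed repository id

# use_fast=True: the model ships an HF tokenizers FastTokenizer, not a sentencepiece model.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)

# BF16 only on NVIDIA GPUs; fall back to float32 on CPU / Apple Silicon.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=dtype).to(device)

prompt = "안녕하세요, 오늘은"  # truncated example sentence from the tokenizer section
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```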