---
language:
  - ko
  - en
pipeline_tag: text-generation
inference: false
tags:
  - facebook
  - meta
  - pytorch
  - llama
  - llama-2
  - kollama
  - llama-2-ko
license: mit
library_name: transformers
---

## Update Log

- 2023.12.14: First Release of Open-Llama-2-Ko

# Open-Llama-2-Ko 🦙🇰🇷

Open-Llama-2-Ko is an advanced iteration of Llama 2, benefiting from an expanded vocabulary and the inclusion of a Korean corpus in its further pretraining. Like its predecessor Llama-2-Ko, it belongs to the family of generative text models that range from 7 billion to 70 billion parameters. This repository focuses on the 7B pretrained version, provided in the Hugging Face Transformers format.

The main difference between the Llama-2-Ko series and Open-Llama-2-Ko is the dataset: the Open-Llama-2-Ko series uses only publicly accessible Korean corpora, including AI Hub, Modu Corpus (모두의 말뭉치), and Korean Wikipedia.

Since training was done using only publicly available corpora, this model is open to everyone without any restrictions. (This model is released under the MIT License.)
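For reference, below is a minimal loading sketch with Hugging Face Transformers; the repo id `beomi/open-llama-2-ko-7b`, the dtype, and the sampling settings are assumptions to adapt to your environment.

```python
# Minimal loading sketch (repo id, dtype, and sampling settings are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/open-llama-2-ko-7b"  # assumed Hub repo id; adjust if needed

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # BF16 works on NVIDIA GPUs; use float32 on CPU
    device_map="auto",
)

prompt = "안녕하세요, 오늘은"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.95)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```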

## Model Details

**Model Developers** Junbum Lee (Beomi)

**Variations** Open-Llama-2-Ko will come in a range of parameter sizes (7B and 13B) as well as pretrained variations.

**Input** Models input text only.

**Output** Models generate text only.

**Model Architecture** Open-Llama-2-Ko is an auto-regressive language model that uses an optimized transformer architecture based on Llama-2.

| | Training Data | Params | Content Length | GQA | Tokens | LR |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 2 | A new mix of publicly accessible Korean corpora | 7B | 2k | ✗ | >15B* | 5e-5 |

## Train Corpus

Trained with a selected corpus from AI Hub and Modu Corpus. The detailed list of datasets used to train this model is available below:

The final JSONL dataset used to train this model is 61GB.

Total number of tokens: approx. 15B tokens (*counted using the expanded tokenizer; with the original Llama tokenizer, >60B tokens.)
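As a rough illustration (not the actual preprocessing pipeline), token totals over a JSONL corpus can be tallied as in the sketch below; the file path and the `text` field name are assumptions.

```python
# Rough sketch: count tokens in a JSONL corpus with the expanded tokenizer.
# The file path and the "text" field are assumptions, not the actual pipeline.
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko-7b")

total_tokens = 0
with open("corpus.jsonl", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        total_tokens += len(tokenizer(doc["text"]).input_ids)

print(f"Approx. total tokens: {total_tokens:,}")
```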

## Vocab Expansion

| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Llama-2 | 32000 | SentencePiece BPE |
| Expanded Llama-2-Ko | 46336 | SentencePiece BPE. Added Korean vocab and merges |

Tokenizing "์•ˆ๋…•ํ•˜์„ธ์š”, ์˜ค๋Š˜์€ ๋‚ ์”จ๊ฐ€ ์ข‹๋„ค์š”."

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']` |
| Llama-2-Ko | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']` |

Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
| Llama-2-Ko | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
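The comparison above can be reproduced with a short sketch like the following; the gated `meta-llama/Llama-2-7b-hf` repo is named only as an assumption for the original Llama-2 tokenizer.

```python
# Sketch for reproducing the tokenization comparison above.
# "meta-llama/Llama-2-7b-hf" (gated) is assumed for the original Llama-2 tokenizer.
from transformers import AutoTokenizer

ko_tok = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko-7b")
l2_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

for text in [
    "안녕하세요, 오늘은 날씨가 좋네요.",
    "Llama 2: Open Foundation and Fine-Tuned Chat Models",
]:
    print("Llama-2   :", l2_tok.tokenize(text))
    print("Llama-2-Ko:", ko_tok.tokenize(text))
```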

## Model Benchmark

### LM Eval Harness - Korean (polyglot branch)

TBD

## Note for oobabooga/text-generation-webui

Remove the `ValueError` filter in the `load_tokenizer` function (around line 109) in `modules/models.py`:

```diff
diff --git a/modules/models.py b/modules/models.py
index 232d5fa..de5b7a0 100644
--- a/modules/models.py
+++ b/modules/models.py
@@ -106,7 +106,7 @@ def load_tokenizer(model_name, model):
                 trust_remote_code=shared.args.trust_remote_code,
                 use_fast=False
             )
-        except ValueError:
+        except:
             tokenizer = AutoTokenizer.from_pretrained(
                 path_to_model,
                 trust_remote_code=shared.args.trust_remote_code,
```

Since Llama-2-Ko uses the fast tokenizer provided by the HF `tokenizers` library, NOT the `sentencepiece` package, you must pass the `use_fast=True` option when initializing the tokenizer.
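A minimal sketch of loading the tokenizer outside text-generation-webui, assuming the repo id used above:

```python
# Load the fast (HF tokenizers) tokenizer explicitly; the repo id is an assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko-7b", use_fast=True)
print(tokenizer.is_fast)  # expected: True
```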

Apple Silicon does not support BF16 computation; use CPU instead. (BF16 is supported when using an NVIDIA GPU.)
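A small sketch of picking a dtype accordingly; the fallback choices below are assumptions, not settings shipped with the model.

```python
# Pick BF16 only when an NVIDIA GPU is available; otherwise fall back to CPU/float32.
import torch
from transformers import AutoModelForCausalLM

if torch.cuda.is_available():
    dtype, device_map = torch.bfloat16, "auto"
else:
    dtype, device_map = torch.float32, {"": "cpu"}

model = AutoModelForCausalLM.from_pretrained(
    "beomi/open-llama-2-ko-7b",  # assumed repo id
    torch_dtype=dtype,
    device_map=device_map,
)
```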

## Citation

TBD

## Acknowledgement