---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- facebook
- meta
- pytorch
- llama
- llama-2
- kollama
- llama-2-ko
license: mit
library_name: transformers
---

**Update Log**

- 2023.12.14: First release of Open-Llama-2-Ko

# **Open-Llama-2-Ko** 🦙🇰🇷

Open-Llama-2-Ko is an advanced iteration of Llama 2, benefiting from an expanded vocabulary and the inclusion of a Korean corpus in its further pretraining. Like its predecessor, it belongs to the family of generative text models ranging from 7 billion to 70 billion parameters. This repository focuses on the 7B pretrained version, provided in the Hugging Face Transformers format.

The main difference between the Llama-2-Ko series and Open-Llama-2-Ko is the dataset: the Open-Llama-2-Ko series uses only publicly accessible Korean corpora, including [AI Hub](https://www.aihub.or.kr), [Modu Corpus, 모두의 말뭉치](https://corpus.korean.go.kr/), and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).

Because training used only publicly available corpora, this model is open to everyone without restriction. (*This model is released under the MIT License.)

## Model Details

**Model Developers** Junbum Lee (Beomi)

**Variations** Open-Llama-2-Ko will come in a range of parameter sizes, 7B and 13B, as pretrained variations.

**Input** Models input text only.

**Output** Models generate text only.

**Model Architecture** Open-Llama-2-Ko is an auto-regressive language model that uses an optimized transformer architecture based on Llama-2.

||Training Data|Params|Content Length|GQA|Tokens|LR|
|---|---|---|---|---|---|---|
|Open-Llama-2-Ko|*A new mix of publicly accessible Korean corpora*|7B|2k|✗|>15B*|5e-5|

**Train Corpus** Trained on a selection of corpora from AI Hub and Modu Corpus. The detailed list of datasets used to train this model is available below:

- AI Hub: [corpus/AI_HUB](./corpus/AI_HUB)
- Modu Corpus: [corpus/MODU_CORPUS](./corpus/MODU_CORPUS)

The final JSONL dataset used to train this model is 61GB in size.

Total number of tokens: approx. 15B (*counted with the expanded tokenizer; with the original Llama tokenizer, >60B tokens).

**Vocab Expansion**

| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Llama-2 | 32000 | Sentencepiece BPE |
| **Expanded Llama-2-Ko** | 46336 | Sentencepiece BPE. Added Korean vocab and merges |

**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."**

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']` |
| Llama-2-Ko | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']` |

**Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"**

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
| Llama-2-Ko | `['▁L', 'l', 'ama', '▁', '2', ':', '▁Open', '▁Foundation', '▁and', '▁Fine', '-', 'T', 'un', 'ed', '▁Ch', 'at', '▁Mod', 'els']` |
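The token counts above can be checked with the Hugging Face `transformers` tokenizer classes. Below is a minimal sketch, assuming the Hub ids `meta-llama/Llama-2-7b-hf` (original tokenizer, gated) and `beomi/open-llama-2-ko-7b` (this repository); adjust the ids as needed, and note the `use_fast=True` requirement described further below.

```python
# Minimal sketch: compare original Llama-2 tokenization with the expanded Korean tokenizer.
# The Hub ids below are assumptions; substitute this repository's id for EXPANDED_ID,
# and note that the meta-llama repository is gated and may require authentication.
from transformers import AutoTokenizer

ORIGINAL_ID = "meta-llama/Llama-2-7b-hf"   # assumed id of the original Llama-2 tokenizer
EXPANDED_ID = "beomi/open-llama-2-ko-7b"   # assumed id of this repository

sentence = "안녕하세요, 오늘은 날씨가 좋네요."

for name, repo_id in [("Llama-2", ORIGINAL_ID), ("Llama-2-Ko", EXPANDED_ID)]:
    tok = AutoTokenizer.from_pretrained(repo_id, use_fast=True)  # FastTokenizer, per the note below
    tokens = tok.tokenize(sentence)
    print(f"{name}: {len(tokens)} tokens -> {tokens}")
```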
# **Model Benchmark**

## LM Eval Harness - Korean (polyglot branch)

- Used EleutherAI's lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot

TBD

## Note for oobabooga/text-generation-webui

Remove the `ValueError` catch in the `load_tokenizer` function (around line 109) of `modules/models.py`:

```python
diff --git a/modules/models.py b/modules/models.py
index 232d5fa..de5b7a0 100644
--- a/modules/models.py
+++ b/modules/models.py
@@ -106,7 +106,7 @@ def load_tokenizer(model_name, model):
             trust_remote_code=shared.args.trust_remote_code,
             use_fast=False
         )
-    except ValueError:
+    except:
         tokenizer = AutoTokenizer.from_pretrained(
             path_to_model,
             trust_remote_code=shared.args.trust_remote_code,
```

Since Llama-2-Ko uses the FastTokenizer provided by the HF `tokenizers` package rather than the `sentencepiece` package, the `use_fast=True` option is required when initializing the tokenizer.

Apple Silicon does not support BF16 computation; use the CPU instead. (BF16 is supported when using an NVIDIA GPU.)

## Citation

TBD

## Acknowledgement

- The training is supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.
- The training corpus is from [AI Hub](https://www.aihub.or.kr/), [Modu Corpus](https://corpus.korean.go.kr/), and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).
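Putting the notes above together (a FastTokenizer loaded with `use_fast=True`, and BF16 only on NVIDIA GPUs with a CPU fallback elsewhere), a minimal loading sketch might look like the following. The Hub id is an assumption for this repository, and the prompt is just the example sentence from the tokenizer section.

```python
# Minimal loading sketch based on the notes above; not an official usage example.
# The Hub id below is an assumption for this repository.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "beomi/open-llama-2-ko-7b"  # assumed repository id

# use_fast=True: the model ships an HF tokenizers FastTokenizer, not a sentencepiece model.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)

# BF16 only on NVIDIA GPUs; fall back to float32 on CPU / Apple Silicon.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=dtype).to(device)

prompt = "안녕하세요, 오늘은"  # truncated example sentence from the tokenizer section
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```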