---
tags:
- generated_from_trainer
model-index:
- name: upstage/SOLAR-10.7B-v1.0
  results: []
---
[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)
# KoSOLAR-10.7B-v0.2

## Join Our Community on Discord!

If you're passionate about the field of Large Language Models and wish to exchange knowledge and insights, we warmly invite you to join our Discord server. Please note that Korean is the primary language used on this server. The landscape of LLMs is evolving rapidly, and without active sharing, our collective knowledge risks becoming outdated quickly. Let's collaborate and drive greater impact together! Join us here: [Discord Link](https://discord.gg/b27bAHg95m).

## About the Model

This model is a Korean vocabulary-extended version of [upstage/SOLAR-10.7B-v1.0](https://huggingface.co/upstage/SOLAR-10.7B-v1.0), specifically fine-tuned on various Korean web-crawled datasets available on HuggingFace. Our approach was to expand the model's understanding of Korean by pre-training the embeddings for the new tokens and partially fine-tuning the `lm_head` embeddings for the existing tokens, while preserving the original parameters of the base model.
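
For illustration only, here is a minimal sketch of what such a vocabulary extension looks like with the Hugging Face `transformers` API; the token list below is a placeholder, whereas the released model adds 8,960 frequency-selected Korean tokens (see Training Details below):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "upstage/SOLAR-10.7B-v1.0"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Placeholder tokens; the actual extension uses 8,960 carefully selected Korean tokens.
new_korean_tokens = ["안녕하세요", "감사합니다"]
num_added = tokenizer.add_tokens(new_korean_tokens)

# Grow the input embeddings (and output head) to cover the newly added tokens.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; new vocabulary size: {len(tokenizer)}")
```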

### Our Dedicated Team

#### Research
- Myeongho Jeong
- Seungtaek Choi
- Seungduk Kim

#### Engineering
- Sanghoon Han
- Suhyun Kang
- Geon Kim
- Rifqi Alfi

#### Product Management
- Bokyung Huh

### Technical Deep Dive

Here’s a glimpse into our technical approach:

```python
def freeze_partial_embedding_hook(grad):
    # Zero the gradient for the original 32,000 token rows so that only the
    # newly added Korean token embeddings receive updates.
    grad[:32000] = 0
    return grad

for name, param in model.named_parameters():
    if ("lm_head" in name or "embed_tokens" in name) and "original" not in name:
        # Train the output head and the input embeddings...
        param.requires_grad = True
        if "embed_tokens" in name:
            # ...but keep the embedding rows of the existing tokens frozen.
            param.register_hook(freeze_partial_embedding_hook)
    else:
        # Freeze every other parameter of the base model.
        param.requires_grad = False
```

Our strategy involved a selective freeze of model parameters. Specifically, we kept most parameters of the base model unchanged while focusing on enhancing the Korean language capabilities. Through our experiments, we discovered:

1. Freezing the `embed_tokens` layer for existing tokens is crucial to maintain overall performance.
2. Unfreezing the `lm_head` layer for existing tokens actually boosts performance.

As a result, we froze the internal layers and the first 32,000 `embed_tokens`, and focused our training efforts on a rich mix of Korean and multi-lingual corpora. This balanced approach has notably improved the model’s proficiency in Korean without compromising its original language capabilities.
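
As a quick sanity check, one might verify which parameters remain trainable after the selective freeze above; a minimal sketch, assuming the Llama-style parameter names (`model.embed_tokens.weight`, `lm_head.weight`) that SOLAR inherits:

```python
trainable_params = 0
frozen_params = 0
for name, param in model.named_parameters():
    if param.requires_grad:
        trainable_params += param.numel()
    else:
        frozen_params += param.numel()

print(f"Trainable parameters: {trainable_params:,}")
print(f"Frozen parameters:    {frozen_params:,}")
# Expected: only `model.embed_tokens.weight` and `lm_head.weight` are trainable,
# and the gradient hook additionally keeps the first 32,000 embedding rows fixed.
```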

### Usage and Limitations

Keep in mind that this model has not undergone instruction-based fine-tuning. While it excels at Korean language tasks, we advise careful consideration and further training for specific applications.
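
As a rough usage sketch for plain text completion (the repository id below is a placeholder; point it at wherever the checkpoint is hosted):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/KoSOLAR-10.7B-v0.2"  # placeholder repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = "대한민국의 수도는"  # "The capital of South Korea is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the model is a plain language model, prompts should be phrased as text to be continued rather than as chat-style instructions.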

### Training Details

Our model’s training was comprehensive and diverse:

- **Data Sources:**
  - English to Korean paragraph pairs: 5.86%
  - Multi-lingual corpus (primarily English): 10.69%
  - Korean web content: 83.46%

- **Vocabulary Expansion:**
  We meticulously selected 8,960 Korean tokens based on their frequency in our Korean web corpus. This process involved multiple rounds of tokenizer training, manual curation, and token frequency analysis, ensuring a rich and relevant vocabulary for our model.

    1. **Initial Tokenizer Training:** We trained an intermediate tokenizer on a Korean web corpus, with a vocabulary of 40,000 tokens.
    
    2. **Extraction of New Korean Tokens:** From the intermediate tokenizer, we identified all Korean tokens not present in the original SOLAR's tokenizer.

    3. **Manual Tokenizer Construction:** We then built the target tokenizer, focusing on these new Korean tokens.

    4. **Frequency Analysis:** Using the target tokenizer, we processed a 100 GB Korean corpus to count each token's frequency.

    5. **Refinement of Token List:** We removed tokens appearing fewer than 6,000 times, so that every retained token would occur often enough to be trained effectively later (see the sketch after this list).

    6. **Inclusion of Single-Letter Characters:** We identified Korean single-character tokens missing from the target tokenizer and added those appearing more than 6,000 times.

    7. **Iterative Refinement:** We repeated steps 2 to 6 until there were no more tokens to drop or add.

    8. **Training Bias Towards New Tokens:** We biased our training data toward texts containing the new tokens so that they could be learned effectively.
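
The frequency analysis and refinement steps (4 and 5) can be pictured with a small sketch like the one below; the tokenizer path, the corpus iterator, and the helper names are illustrative assumptions rather than the exact scripts used for this model.

```python
from collections import Counter

from transformers import AutoTokenizer

MIN_FREQ = 6_000  # frequency threshold described in step 5

# Placeholder path to the manually constructed target tokenizer from step 3.
target_tokenizer = AutoTokenizer.from_pretrained("path/to/target-tokenizer")

def count_token_frequencies(corpus_lines):
    """Count how often each token appears when tokenizing the Korean corpus."""
    counts = Counter()
    for line in corpus_lines:
        counts.update(target_tokenizer.tokenize(line))
    return counts

def filter_rare_tokens(candidate_tokens, counts):
    """Drop candidate tokens that appear fewer than MIN_FREQ times."""
    return [tok for tok in candidate_tokens if counts[tok] >= MIN_FREQ]
```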

This rigorous approach ensured a comprehensive and contextually rich Korean vocabulary for the model.