---
license: apache-2.0
base_model: upstage/SOLAR-10.7B-v1.0
tags:
- generated_from_trainer
model-index:
- name: yanolja/KoSOLAR-10.7B-v0.1
  results: []
---

[<img src="https://raw.githubusercontent.com/OpenAccess-AI-Collective/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/OpenAccess-AI-Collective/axolotl)
## Discord

If you're passionate about Large Language Models and want to exchange knowledge and insights, we warmly invite you to join our Discord server. Note that Korean is the primary language used on this server. The LLM landscape is evolving rapidly, and without active sharing, our collective knowledge risks becoming outdated quickly. Let's collaborate and drive greater impact together! Join us here: https://discord.gg/b27bAHg95m.

# Caution

This model is **DEPRECATED** due to an issue with the tokenizer. A new, corrected version will be uploaded shortly. We strongly advise against fine-tuning this model until the updated version is available. Details for the new version will be provided in a separate model card.

# yanolja/KoSOLAR-10.7B-v0.1

This model is a Korean vocabulary-extended version of [upstage/SOLAR-10.7B-v1.0](https://huggingface.co/upstage/SOLAR-10.7B-v1.0), specifically pre-trained on various Korean web-crawled datasets available on HuggingFace. Our approach was to expand the model's understanding of Korean by pre-training the embeddings for new tokens while preserving the original parameters of the base model.

## Model Description

Most parameters of [upstage/SOLAR-10.7B-v1.0](https://huggingface.co/upstage/SOLAR-10.7B-v1.0) were kept frozen during our training process. Only the embeddings for the newly added Korean tokens in the `embed_tokens` layer and the `lm_head` layer were pre-trained. This approach allowed us to enhance the model's performance in Korean while maintaining its original English capabilities.
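
The idea of training only the rows that belong to newly added tokens can be illustrated with a toy example. The following is a minimal sketch in plain Python, not the training code used for this model: the function name `masked_sgd_step`, the tiny matrix, and the plain SGD update are all assumptions made for illustration. The card does not state which mechanism was used to keep the original rows fixed.

```python
def masked_sgd_step(embeddings, grads, num_original_tokens, lr):
    """Return a copy of `embeddings` where only rows for new tokens are updated.

    Rows with index < num_original_tokens belong to the base vocabulary and
    are copied unchanged; rows at or above that index get a plain SGD update.
    (Hypothetical helper for illustration only.)
    """
    updated = []
    for token_id, (row, grad_row) in enumerate(zip(embeddings, grads)):
        if token_id < num_original_tokens:
            # Original vocabulary: keep the base model's weights as they are.
            updated.append(list(row))
        else:
            # Newly added token: apply the gradient step.
            updated.append([w - lr * g for w, g in zip(row, grad_row)])
    return updated


# Toy embedding matrix: 3 original tokens plus 1 newly added token, dim 2.
toy_embeddings = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.9], [0.0, 0.0]]
toy_grads = [[1.0, 1.0]] * 4  # pretend every row received a gradient

toy_result = masked_sgd_step(toy_embeddings, toy_grads,
                             num_original_tokens=3, lr=0.1)
```

In this sketch the first three rows of `toy_result` are identical to the input, and only the last row moves. The same row-wise masking would apply to both `embed_tokens` and `lm_head`, since each has one row per vocabulary entry.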

## Intended Uses & Limitations

No instruction tuning has been performed on this model. If you train it further for a specific purpose, proceed with caution, as the model was primarily enhanced for Korean language understanding.

## Training and Evaluation Data

The model was pre-trained on various Korean web-crawled datasets openly available on HuggingFace.

## Training Procedure

### Clarification on "Pre-trained"

It is important to be precise about what "pre-trained" means for this model. The base model was already pre-trained on a broad, non-task-specific corpus. We further pre-trained only the embeddings for the expanded Korean vocabulary and left all other parameters of the base model unchanged. The goal is to add Korean vocabulary coverage while preserving the base model's English capabilities.

### Training Hyperparameters

The following hyperparameters were used during training:
- learning_rate: 0.0003
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 4
- total_train_batch_size: 256
- total_eval_batch_size: 64
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 1
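
The totals in the list above follow from the per-device values, and the learning-rate schedule can be sketched in a few lines. This is a rough illustration in plain Python, assuming the usual linear warmup followed by a half-cosine decay to zero. The total number of optimizer steps is not given in this card, so `hypothetical_total_steps` below is a made-up value, and `cosine_lr` is an illustrative helper rather than a library function.

```python
import math

# Values taken from the hyperparameter list above.
train_batch_size = 8            # per device
eval_batch_size = 8             # per device
num_devices = 8
gradient_accumulation_steps = 4
learning_rate = 3e-4
warmup_steps = 10

# Effective batch sizes: per-device size times devices (times accumulation
# steps for training, since gradients are summed before each optimizer step).
total_train_batch_size = (
    train_batch_size * num_devices * gradient_accumulation_steps
)
total_eval_batch_size = eval_batch_size * num_devices


def cosine_lr(step, total_steps, base_lr=learning_rate, warmup=warmup_steps):
    """Linear warmup for `warmup` steps, then half-cosine decay to zero."""
    if step < warmup:
        return base_lr * step / max(1, warmup)
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Assumption: the real step count is not stated in this card.
hypothetical_total_steps = 1000
peak_lr = cosine_lr(warmup_steps, hypothetical_total_steps)
final_lr = cosine_lr(hypothetical_total_steps, hypothetical_total_steps)
```

With these inputs, `total_train_batch_size` is 256 and `total_eval_batch_size` is 64, matching the reported values. The schedule reaches the full learning rate at the end of warmup and decays toward zero by the last step.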

### Training Results

#### upstage/SOLAR-10.7B-v1.0

| Groups      | Version | Filter    | n-shot | Metric      | Value  |     | Stderr |
|-------------|---------|-----------|--------|-------------|--------|-----|--------|
| kmmlu       | N/A     | none      | 0      | acc         | 0.3004 | ±   | 0.0528 |
| gsm8k       | Yaml    | get-answer| 5      | exact_match | 0.5625 | ±   | 0.0137 |
| hellaswag   | Yaml    | none      | 0      | acc         | 0.6393 | ±   | 0.0048 |
| mmlu        | N/A     | none      | 0      | acc         | 0.6305 | ±   | 0.1452 |
| truthfulqa  | N/A     | none      | 0      | acc         | 0.4096 | ±   | 0.0467 |
| winogrande  | Yaml    | none      | 0      | acc         | 0.7443 | ±   | 0.0123 |

#### yanolja/KoSOLAR-10.7B-v0.1

| Groups      | Version | Filter    | n-shot | Metric      | Value  |     | Stderr |
|-------------|---------|-----------|--------|-------------|--------|-----|--------|
| kmmlu       | N/A     | none      | 0      | acc         | 0.2948 | ±   | 0.0537 |
| gsm8k       | Yaml    | get-answer| 5      | exact_match | 0.5527 | ±   | 0.0137 |
| hellaswag   | Yaml    | none      | 0      | acc         | 0.6392 | ±   | 0.0048 |
| mmlu        | N/A     | none      | 0      | acc         | 0.6303 | ±   | 0.1411 |
| truthfulqa  | N/A     | none      | 0      | acc         | 0.3618 | ±   | 0.0472 |
| winogrande  | Yaml    | none      | 0      | acc         | 0.7459 | ±   | 0.0122 |
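
To compare the two tables side by side, the per-benchmark differences can be computed directly. The snippet below is a small sketch in plain Python; the numbers are copied from the tables above, and the variable names are chosen for illustration. It only reports raw differences and makes no claim about statistical significance.

```python
# Scores copied from the two result tables above.
base_scores = {
    "kmmlu": 0.3004,
    "gsm8k": 0.5625,
    "hellaswag": 0.6393,
    "mmlu": 0.6305,
    "truthfulqa": 0.4096,
    "winogrande": 0.7443,
}
kosolar_scores = {
    "kmmlu": 0.2948,
    "gsm8k": 0.5527,
    "hellaswag": 0.6392,
    "mmlu": 0.6303,
    "truthfulqa": 0.3618,
    "winogrande": 0.7459,
}

# Difference (KoSOLAR minus base), rounded to the tables' four decimals.
score_deltas = {
    task: round(kosolar_scores[task] - base_scores[task], 4)
    for task in base_scores
}

for task, delta in score_deltas.items():
    print(f"{task:<12} {delta:+.4f}")
```

A negative value means the vocabulary-extended model scored lower than the base model on that benchmark. The reported standard errors in the tables should be kept in mind when reading these differences.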

### Framework Versions

- Transformers 4.37.0.dev0
- PyTorch 2.1.2+cu121
- Datasets 2.16.0
- Tokenizers 0.15.0