myeongho-jeong committed on
Commit 498bf08
1 Parent(s): 4b30efe

Update README.md

Files changed (1): README.md +58 -1
README.md CHANGED
@@ -28,7 +28,64 @@ This model is a Korean vocabulary-extended version of [upstage/SOLAR-10.7B-v1.0]
 
  ### Technical Deep Dive
 
- TBU
+ Here’s a glimpse into our technical approach:
+
+ ```python
+ # number_of_old_tokens is the size of the tokenizer before vocabulary extension.
+ # For example, for EEVE-Korean-10.8B-v1.0, number_of_old_tokens is 32000.
+ def freeze_partial_embedding_hook(grad):
+     # Zero the gradient rows of the original tokens so only the new-token embeddings are updated.
+     grad[:number_of_old_tokens] = 0
+     return grad
+
+ for name, param in model.named_parameters():
+     if ("lm_head" in name or "embed_tokens" in name) and "original" not in name:
+         # Train the full lm_head and the token embeddings, partially frozen via the hook.
+         param.requires_grad = True
+         if "embed_tokens" in name:
+             param.register_hook(freeze_partial_embedding_hook)
+     else:
+         # All other (internal) layers stay frozen.
+         param.requires_grad = False
+ ```
+
+ Our strategy involved a selective freeze of model parameters. Specifically, we kept most parameters of the base model unchanged while focusing on enhancing its Korean language capabilities. Through our experiments, we discovered:
+
+ 1. Freezing the `embed_tokens` layer for existing tokens is crucial for maintaining overall performance.
+ 2. Unfreezing the `lm_head` layer for existing tokens actually boosts performance.
+
+ As a result, we froze the internal layers and the first 32,000 rows of `embed_tokens`, focusing our training on a rich mix of Korean and multi-lingual corpora. This balanced approach has notably improved the model’s proficiency in Korean without compromising its original language capabilities.
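+
+ For context, here is a minimal sketch of how such a vocabulary extension can be wired up with the Hugging Face `transformers` API before applying the freeze above. It is illustrative rather than our exact training script; `new_korean_tokens` is a placeholder for the curated token list described under Training Details.
+
+ ```python
+ # Illustrative only: extend the tokenizer, resize the embeddings, then apply the freeze above.
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ base_id = "upstage/SOLAR-10.7B-v1.0"
+ tokenizer = AutoTokenizer.from_pretrained(base_id)
+ model = AutoModelForCausalLM.from_pretrained(base_id)
+
+ number_of_old_tokens = len(tokenizer)          # 32000 for the original SOLAR tokenizer
+ new_korean_tokens = ["토큰", "예시"]           # placeholder for the 8,960 curated Korean tokens
+ tokenizer.add_tokens(new_korean_tokens)        # append the new tokens after the old vocabulary
+ model.resize_token_embeddings(len(tokenizer))  # new embedding rows are added after the old ones
+ ```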
+
+ ### Usage and Limitations
+
+ Keep in mind that this model hasn't been fine-tuned with instruction-based training. While it excels in Korean language tasks, we advise careful consideration and further training for specific applications.
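+
+ For plain text completion (not instruction following), a minimal example with the `transformers` library looks like this. The repository id below is an assumption for illustration; adjust it and the dtype/device settings to your setup.
+
+ ```python
+ # Minimal completion example; the repository id is assumed for illustration.
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "yanolja/EEVE-Korean-10.8B-v1.0"  # assumed repository id
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
+
+ prompt = "대한민국의 수도는"  # "The capital of South Korea is"
+ inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+ outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```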
+
+ ### Training Details
+
+ Our model’s training was comprehensive and diverse:
+
+ - **Data Sources:**
+   - English to Korean paragraph pairs: 5.86%
+   - Multi-lingual corpus (primarily English): 10.69%
+   - Korean web content: 83.46%
+
+ - **Vocabulary Expansion:**
+   We meticulously selected 8,960 Korean tokens based on their frequency in our Korean web corpus. This process involved multiple rounds of tokenizer training, manual curation, and token frequency analysis, ensuring a rich and relevant vocabulary for our model. A condensed code sketch of this pipeline follows the numbered steps below.
+
+ 1. **Initial Tokenizer Training:** We trained an intermediate tokenizer on a Korean web corpus with a vocabulary of 40,000 tokens.
+
+ 2. **Extraction of New Korean Tokens:** From the intermediate tokenizer, we identified all Korean tokens not present in the original SOLAR tokenizer.
+
+ 3. **Manual Tokenizer Construction:** We then built the target tokenizer, focusing on these new Korean tokens.
+
+ 4. **Frequency Analysis:** Using the target tokenizer, we processed a 100GB Korean corpus to count each token's frequency.
+
+ 5. **Refinement of Token List:** We removed tokens appearing fewer than 6,000 times, so that every remaining token occurs often enough to be learned during training.
+
+ 6. **Inclusion of Single-Letter Characters:** We added any Korean single-letter characters that were missing from the target tokenizer and appeared more than 6,000 times in the corpus.
+
+ 7. **Iterative Refinement:** We repeated steps 2 to 6 until there were no tokens left to drop or add.
+
+ 8. **Training Bias Towards New Tokens:** We biased the training data toward texts containing the new tokens, for more effective learning.
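+
+ The sketch below compresses steps 1, 2, 4, and 5 into one script. It is illustrative only: it uses `sentencepiece` for the intermediate tokenizer, approximates the frequency count with that tokenizer instead of the manually built target tokenizer, and the file paths are hypothetical.
+
+ ```python
+ # Illustrative sketch of the vocabulary-selection pipeline (steps 1, 2, 4, 5).
+ # File paths are hypothetical; counting here reuses the intermediate tokenizer for brevity.
+ from collections import Counter
+
+ import sentencepiece as spm
+ from transformers import AutoTokenizer
+
+ # Step 1: train an intermediate tokenizer (40,000 tokens) on the Korean web corpus.
+ spm.SentencePieceTrainer.train(
+     input="korean_web_corpus.txt", model_prefix="intermediate", vocab_size=40000
+ )
+ sp = spm.SentencePieceProcessor(model_file="intermediate.model")
+
+ # Step 2: keep Korean tokens that the original SOLAR tokenizer does not already have.
+ solar_vocab = set(AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-v1.0").get_vocab())
+
+ def is_korean(token):
+     # Hangul syllables (U+AC00-U+D7A3) or compatibility jamo (U+3131-U+318E).
+     return any("\uac00" <= ch <= "\ud7a3" or "\u3131" <= ch <= "\u318e" for ch in token)
+
+ candidates = {
+     sp.id_to_piece(i)
+     for i in range(sp.get_piece_size())
+     if is_korean(sp.id_to_piece(i)) and sp.id_to_piece(i) not in solar_vocab
+ }
+
+ # Step 4: count how often the candidate tokens occur in a large Korean corpus.
+ counts = Counter()
+ with open("korean_corpus.txt", encoding="utf-8") as f:
+     for line in f:
+         counts.update(tok for tok in sp.encode(line, out_type=str) if tok in candidates)
+
+ # Step 5: drop candidates appearing fewer than 6,000 times.
+ selected = sorted(tok for tok, count in counts.items() if count >= 6000)
+ print(f"{len(selected)} Korean tokens kept")
+ ```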
+
+ This rigorous approach ensured a comprehensive and contextually rich Korean vocabulary for the model.
+
 
  ### Usage and Limitations