seungduk committed on
Commit
ca61485
1 Parent(s): 617e58a

Update README.md


Correct the training process explanation (reversed)

Files changed (1)
  1. README.md +6 -6
README.md CHANGED
@@ -46,14 +46,14 @@ for name, param in model.named_parameters():
 
 Our strategy involved a selective freeze of model parameters. Specifically, we kept most parameters of the base model unchanged while focusing on enhancing the Korean language capabilities. Through our experiments, we discovered:
 
- 1. Freezing the `lm_head` layer for existing tokens is crucial to maintain overall performance.
- 2. Unfreezing the `embed_tokens` layer for existing tokens actually boosts performance.
+ 1. Freezing the `embed_tokens` layer for existing tokens is crucial to maintain overall performance.
+ 2. Unfreezing the `lm_head` layer for existing tokens actually boosts performance.
 
 As a result, we froze the internal layers and the first 32,000 `embed_tokens`, directing our training efforts on a rich mix of Korean and multi-lingual corpora. This balanced approach has notably improved the model’s proficiency in Korean, without compromising its original language capabilities.
 
 ### Usage and Limitations
 
- Keep in mind, this model hasn't been fine-tuned with instruction-based training. While it excels in Korean language tasks, we advise careful consideration and further training for specific applications.
+ Keep in mind that this model hasn't been fine-tuned with instruction-based training. While it excels in Korean language tasks, we advise careful consideration and further training for specific applications.
 
 ### Training Details
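
A minimal sketch of the selective-freeze setup described in the hunk above: internal layers frozen, `lm_head` fully trainable, and only the embedding rows beyond the original 32,000-token vocabulary receiving gradients. The placeholder checkpoint name, the hook-based partial freeze, and the helper function are illustrative assumptions, not the repository's actual training code; only the `for name, param in model.named_parameters():` loop visible in the hunk header comes from the README itself.

```python
import torch
from transformers import AutoModelForCausalLM

NUM_ORIGINAL_TOKENS = 32_000  # size of the original vocabulary, per the README

# "base-model-name" is a placeholder; substitute the actual checkpoint.
model = AutoModelForCausalLM.from_pretrained("base-model-name")

for name, param in model.named_parameters():
    # Keep the embeddings and lm_head trainable; freeze all internal layers.
    param.requires_grad = ("embed_tokens" in name) or ("lm_head" in name)

def _zero_original_embedding_grads(grad: torch.Tensor) -> torch.Tensor:
    # Emulate freezing the first 32,000 `embed_tokens` rows by zeroing their
    # gradients, so only the newly added Korean token embeddings are updated.
    grad = grad.clone()
    grad[:NUM_ORIGINAL_TOKENS] = 0
    return grad

model.get_input_embeddings().weight.register_hook(_zero_original_embedding_grads)
```

With this setup, an optimizer built over `filter(lambda p: p.requires_grad, model.parameters())` touches only the embedding table and `lm_head`, which matches the freeze described above.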
 
@@ -73,13 +73,13 @@ Our model’s training was comprehensive and diverse:
 
 3. **Manual Tokenizer Construction:** We then built the target tokenizer, focusing on these new Korean tokens.
 
- 4. **Frequency Analysis:** Using target tokenizer, we processed a 100GB Korean corpus to count each token's frequency.
+ 4. **Frequency Analysis:** Using the target tokenizer, we processed a 100GB Korean corpus to count each token's frequency.
 
 5. **Refinement of Token List:** We removed tokens appearing less than 6,000 times, ensuring to secure enough tokens to train models later.
 
- 6. **Inclusion of Single-Letter Characters:** Counted missing Korean single-letter characters and added them to the target tokenizer that appearing more than 6,000 times.
+ 6. **Inclusion of Single-Letter Characters:** Counted missing Korean single-letter characters and added them to the target tokenizer that appeared more than 6,000 times.
 
- 7. **Iterative Refinement:** We repeated steps 2 to 6 until there are no tokens to drop or add.
+ 7. **Iterative Refinement:** We repeated steps 2 to 6 until there were no tokens to drop or add.
 
 8. **Training Bias Towards New Tokens:** Our training data was biased to include more texts with new tokens, for effective learning.
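
Steps 4 to 6 above amount to a frequency filter over the candidate vocabulary. The sketch below is one way to implement it, assuming the candidate tokenizer is saved locally and the corpus is available as plain text; the file paths, the streaming loop, and the Hangul-syllable range used for step 6 are illustrative assumptions, not the actual pipeline.

```python
from collections import Counter
from transformers import AutoTokenizer

MIN_FREQ = 6_000  # threshold from step 5

# Candidate tokenizer from step 3 (path is a placeholder).
tokenizer = AutoTokenizer.from_pretrained("./target-tokenizer")

token_counts = Counter()
char_counts = Counter()

# Stream the large Korean corpus line by line (path is a placeholder).
with open("korean_corpus.txt", encoding="utf-8") as corpus:
    for line in corpus:
        token_counts.update(tokenizer.tokenize(line))
        # Also track raw Hangul syllables, so single-letter characters missing
        # from the vocabulary (step 6) can still be counted.
        char_counts.update(ch for ch in line if "가" <= ch <= "힣")

# Step 5: drop candidate tokens seen fewer than 6,000 times.
kept_tokens = {tok for tok, n in token_counts.items() if n >= MIN_FREQ}

# Step 6: add frequent single-letter characters that are not yet in the vocabulary.
vocab = tokenizer.get_vocab()
missing_chars = [ch for ch, n in char_counts.items() if n >= MIN_FREQ and ch not in vocab]
tokenizer.add_tokens(missing_chars)
```

Step 7 then repeats this filter, together with the earlier candidate extraction, until `kept_tokens` and `missing_chars` stop changing.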
 
 
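Step 8 describes biasing the training mix toward documents that contain the newly added tokens. One possible weighting scheme is sketched below; the function names, the weight formula, and the sampling call are assumptions for illustration, not the repository's actual data pipeline.

```python
import random

def new_token_ratio(text, tokenizer, new_token_ids):
    """Fraction of a document's tokens that belong to the newly added vocabulary."""
    ids = tokenizer.encode(text, add_special_tokens=False)
    if not ids:
        return 0.0
    return sum(i in new_token_ids for i in ids) / len(ids)

def sample_biased(documents, tokenizer, new_token_ids, k, base_weight=1.0, bonus=4.0):
    # Documents rich in new tokens get proportionally higher sampling weight,
    # so they appear more often in the training mix.
    weights = [base_weight + bonus * new_token_ratio(doc, tokenizer, new_token_ids)
               for doc in documents]
    return random.choices(documents, weights=weights, k=k)
```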