# KoUL2
- A UL2 (Unifying Language Learning Paradigm) model pre-trained on Modu Corpus (모두의말뭉치) plus other Korean text data publicly released on AI Hub.
- It was trained using the [lassl](https://github.com/lassl/lassl) open-source project.
- Since the model has only been pre-trained, you can try out UL2-style denoising as shown below.
```py
from transformers import AutoTokenizer, T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("DaehanKim/KoUL2")
tokenizer = AutoTokenizer.from_pretrained("DaehanKim/KoUL2")

for prefix_token in ("[NLU]", "[NLG]", "[S2S]"):
    input_string = f"{prefix_token}어떤 아파트는 호가가 [new_id_27]되는 등 경기 침체로 인한 [new_id_26]를 확인할 수 있었습니다.</s>"
    inputs = tokenizer(input_string, return_tensors="pt", add_special_tokens=False)
    decoder_inputs = tokenizer("<pad>[new_id_27]", return_tensors="pt", add_special_tokens=False)
    outputs = model.generate(
        input_ids=inputs.input_ids,
        decoder_input_ids=decoder_inputs.input_ids,
        num_beams=10,
        num_return_sequences=5,
    )
    print(tokenizer.batch_decode(outputs))
```
```
# output
['<pad>[new_id_27] 고공행진을[new_id_26] 아파트의 호가가 고공행진을', '<pad>[new_id_27] 고공 행진을[new_id_26] 아파트 호가가 고공 행진', '<pad>[new_id_27] 고공 행진을[new_id_26] 아파트 값이 고공 행진', '<pad>[new_id_27] 고공 행진을[new_id_26] 아파트의 호가가 고공 행', '<pad>[new_id_27] 고공 행진을[new_id_26] 아파트 호가가 고공행진을']
['<pad>[new_id_27] 천만 원 이상 오르고 어떤 아파트는 호가가 천만 ', '<pad>[new_id_27] 천만 원 이상 오르고 어떤 아파트는 호가가 천만[new_id_26]', '<pad>[new_id_27] 천만 원 이상 오르고 어떤 아파트는 호가가 천 만', '<pad>[new_id_27] 천만 원에서 천만 원까지 오르는[new_id_26] 아파트 가격 하락', '<pad>[new_id_27] 천만 원 이상 오르고 어떤 아파트는 호가가 천 원']
['<pad>[new_id_27] 천만 원 이상 오르는[new_id_26] 아파트 값이 천만 원', '<pad>[new_id_27] 천만 원 이상 오르는[new_id_26] 아파트 값이 천만 원에', '<pad>[new_id_27] 천만 원 이상 오르는[new_id_26] 아파트 값이 오르는 등 부동산', '<pad>[new_id_27] 고공 행진을 이어가고[new_id_26] 아파트 값이 하락하는 등', '<pad>[new_id_27] 고공 행진을 하고[new_id_26] 아파트 값이 하락하는 등']
```
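Each decoded string interleaves sentinel tokens with the spans the model predicts for them. As a minimal sketch of how to read such output, the helper below (illustrative only, not part of KoUL2 or `transformers`) splits a decoded string into a mapping from sentinel token to predicted span, using plain Python with no model required:

```py
import re

def split_sentinel_fills(decoded: str) -> dict:
    """Split a decoded string like '<pad>[new_id_27] A[new_id_26] B'
    into {sentinel: predicted span}. Hypothetical helper for illustration."""
    decoded = decoded.replace("<pad>", "").replace("</s>", "")
    # Split on sentinel tokens, keeping them via the capturing group.
    parts = re.split(r"(\[new_id_\d+\])", decoded)
    fills, current = {}, None
    for piece in parts:
        if re.fullmatch(r"\[new_id_\d+\]", piece):
            current = piece  # next non-empty piece belongs to this sentinel
        elif current is not None and piece.strip():
            fills[current] = piece.strip()
    return fills

print(split_sentinel_fills("<pad>[new_id_27] span A[new_id_26] span B"))
# {'[new_id_27]': 'span A', '[new_id_26]': 'span B'}
```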
- To stay compatible with existing T5 checkpoints, the sentinel tokens used during pre-training are ordered as [new_id_27]...[new_id_1]<extra_token_0>...<extra_token_99>. For details on the training setup, please see [this post](https://daehankim.blogspot.com/2022/08/lassl-feat-t5-ul2.html).
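For illustration, the sentinel ordering described above (the `[new_id_*]` tokens counting down from 27 to 1, followed by the 100 T5-style extra tokens) can be sketched as a plain list; this is a reconstruction from the description, not code from the training repo:

```py
# Sentinel vocabulary order as described: [new_id_27] ... [new_id_1],
# then <extra_token_0> ... <extra_token_99> for T5 compatibility.
sentinels = [f"[new_id_{i}]" for i in range(27, 0, -1)] \
          + [f"<extra_token_{i}>" for i in range(100)]

print(sentinels[0], sentinels[26], sentinels[27], sentinels[-1])
# [new_id_27] [new_id_1] <extra_token_0> <extra_token_99>
```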
- The license is MIT.
- Training logs are available [here](https://wandb.ai/lucas01/huggingface?workspace=user-lucas01).
- If you have any questions about the model or the dataset, please contact `kdh5852 [at] gmail [dot] com`.
## acknowledgement
- This project was carried out with TPU support from the TFRC program.