license: mit
language:
- ja
- ko
pipeline_tag: translation
Japanese to Korean translator
A Japanese-to-Korean translation model based on EncoderDecoderModel, with bert-japanese as the encoder and kogpt2 as the decoder.
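For readers unfamiliar with EncoderDecoderModel, the following is a minimal sketch of how an encoder and a decoder checkpoint can be paired in transformers. It assumes the two base checkpoints named in the Inference section; the function name build_ja_ko_model is hypothetical, and this is not necessarily how the published model was assembled or trained.

```python
def build_ja_ko_model(
    encoder_name: str = "cl-tohoku/bert-base-japanese-v2",
    decoder_name: str = "skt/kogpt2-base-v2",
):
    """Pair a pretrained BERT encoder with a pretrained GPT-2 decoder.

    Hypothetical helper for illustration; not the author's training code.
    """
    # Imported lazily so that defining this function does not require
    # transformers to be installed.
    from transformers import EncoderDecoderModel

    # Loads both checkpoints and adds randomly initialised cross-attention
    # layers to the decoder, so the result still needs fine-tuning on
    # parallel data before it can translate.
    return EncoderDecoderModel.from_encoder_decoder_pretrained(
        encoder_name, decoder_name
    )
```

Calling build_ja_ko_model() downloads both base checkpoints, so it is only worth running when you intend to fine-tune from scratch; for translation, load the published weights as shown under Inference.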
Usage
Demo
Please visit https://huggingface.co/spaces/sappho192/aihub-ja-ko-translator-demo
Dependencies (PyPI)
- torch
- transformers
- fugashi
- unidic-lite
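After installing the packages above (typically with pip), a quick check like the one below can confirm that they are importable. This is a small sketch, not part of the model's code: the helper name missing_modules is hypothetical, and it assumes the usual import names, where the unidic-lite distribution is imported as unidic_lite.

```python
from importlib.util import find_spec
from typing import Iterable, List

# Import names for the PyPI packages listed above; the unidic-lite
# distribution is imported as "unidic_lite".
REQUIRED_MODULES = ["torch", "transformers", "fugashi", "unidic_lite"]


def missing_modules(names: Iterable[str] = REQUIRED_MODULES) -> List[str]:
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    print("Missing:", ", ".join(missing) if missing else "none")
```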
Inference
from transformers import (
    EncoderDecoderModel,
    PreTrainedTokenizerFast,
    BertJapaneseTokenizer,
)
import torch

encoder_model_name = "cl-tohoku/bert-base-japanese-v2"
decoder_model_name = "skt/kogpt2-base-v2"

# The source (Japanese) and target (Korean) tokenizers come from the base models.
src_tokenizer = BertJapaneseTokenizer.from_pretrained(encoder_model_name)
trg_tokenizer = PreTrainedTokenizerFast.from_pretrained(decoder_model_name)
model = EncoderDecoderModel.from_pretrained("sappho192/aihub-ja-ko-translator")

text = "初めまして。よろしくお願いします。"

def translate(text_src):
    # Tokenize the Japanese input into PyTorch tensors.
    embeddings = src_tokenizer(text_src, return_attention_mask=False, return_token_type_ids=False, return_tensors='pt')
    embeddings = {k: v for k, v in embeddings.items()}
    # Generate Korean token IDs; [0, 1:-1] takes the first sequence and drops its first and last tokens.
    output = model.generate(**embeddings, max_length=500)[0, 1:-1]
    text_trg = trg_tokenizer.decode(output.cpu())
    return text_trg

print(translate(text))
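The [0, 1:-1] slice in translate assumes that the generated sequence begins with a start token and ends with an end-of-sequence token. If generation stops at max_length instead, the last position holds a real token and the slice would discard it. The sketch below shows one more defensive way to trim the IDs. The helper trim_generated and the token IDs in the example are hypothetical; the real IDs would have to be read from the model or tokenizer configuration.

```python
from typing import List


def trim_generated(ids: List[int], start_id: int, eos_id: int) -> List[int]:
    """Drop a leading start token and everything from the first EOS onward."""
    if ids and ids[0] == start_id:
        ids = ids[1:]
    if eos_id in ids:
        ids = ids[: ids.index(eos_id)]
    return list(ids)


# Hypothetical IDs for illustration only: 0 = start token, 1 = end-of-sequence.
print(trim_generated([0, 11, 12, 13, 1], start_id=0, eos_id=1))  # [11, 12, 13]
# Output cut off at max_length, so there is no EOS to remove.
print(trim_generated([0, 11, 12, 13], start_id=0, eos_id=1))     # [11, 12, 13]
```

With the model above, the input to such a helper could be the first generated sequence converted with .tolist(), and the trimmed list can then be passed to trg_tokenizer.decode.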
Dataset
This model was trained on datasets from 'The Open AI Dataset Project (AI-Hub, South Korea)'.
All data information can be accessed through 'AI-Hub (aihub.or.kr)'.
(In order for a corporation, organization, or individual located outside of Korea to use AI data, etc., a separate agreement is required with the performing organization and the National Information Society Agency (NIA) of Korea. In order to export AI data, etc. outside the country, a separate agreement is likewise required with the performing organization and the NIA. Link)
This model is the result of research conducted using datasets built with funding from the Ministry of Science and ICT and support from the National Information Society Agency (NIA).
The data used for this model can be downloaded from AI-Hub (aihub.or.kr).
(A corporation, organization, or individual located outside of Korea needs a separate agreement with the performing organization and the NIA in order to use the AI data, etc. Taking the AI data, etc. out of the country also requires a separate agreement with the performing organization and the NIA. [Source])
Dataset list
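The training dataset was built by merging the following sub-datasets:
- 일상생활 및 구어체 한-중, 한-일 번역 병렬 말뭉치 데이터 (Daily-life and colloquial Korean-Chinese and Korean-Japanese parallel translation corpus) [Link]
- 한국어-다국어(영어 제외) 번역 말뭉치(기술과학) (Korean-multilingual translation corpus, excluding English: technical science) [Link]
- 한국어-다국어 번역 말뭉치(기초과학) (Korean-multilingual translation corpus: basic science) [Link]
- 한국어-다국어 번역 말뭉치 (인문학) (Korean-multilingual translation corpus: humanities) [Link]
- 한국어-일본어 번역 말뭉치 (Korean-Japanese translation corpus) [Link]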
To reproduce the merged dataset, use the code at the link below:
https://github.com/sappho192/aihub-translation-dataset
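The repository above contains the actual merge procedure. Purely as an illustration of what merging parallel corpora can involve, here is a minimal sketch that assumes each sub-dataset has already been loaded as a list of (Japanese, Korean) sentence pairs. The function merge_corpora and the toy sentences are hypothetical and are not taken from the repository or from the AI-Hub data.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (Japanese sentence, Korean sentence)


def merge_corpora(corpora: Iterable[Iterable[Pair]]) -> List[Pair]:
    """Concatenate parallel corpora, skipping empty and duplicate pairs."""
    seen = set()
    merged: List[Pair] = []
    for corpus in corpora:
        for ja, ko in corpus:
            pair = (ja.strip(), ko.strip())
            # Skip pairs with an empty side and pairs already collected.
            if not pair[0] or not pair[1] or pair in seen:
                continue
            seen.add(pair)
            merged.append(pair)
    return merged


# Toy data for illustration only; not taken from the AI-Hub datasets.
daily = [("こんにちは", "안녕하세요"), ("ありがとう", "감사합니다")]
science = [("こんにちは", "안녕하세요"), ("水は液体です", "물은 액체입니다")]
print(len(merge_corpora([daily, science])))  # 3
```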