metadata

license: cc-by-nc-sa-4.0
datasets:
  - squarelike/sharegpt_deepl_ko_translation
language:
  - ko
pipeline_tag: translation
tags:
  - translate

Seagull-13b-translation 📇

Seagull-13b-translation is yet another translation model, but one trained with careful attention to the following issues found in existing translation models:

- Newlines or spaces that do not match the original text
- Training on a translated dataset in which the first letter has been removed
- Code
- Markdown format
- LaTeX format
- etc.

These issues were checked thoroughly before training, but when using the model it is still recommended to examine the output closely for them (for example, in text that contains code).
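One way to spot-check the layout issues listed above is to compare a few simple features of the source text and its translation. The following is a minimal sketch, not part of the model or its tooling: the function name `whitespace_report` and the choice of checks are assumptions made for illustration.

```python
FENCE = "`" * 3  # Markdown code-fence marker


def whitespace_report(source: str, translation: str) -> dict:
    """Return, per layout feature, whether the translation matches the source."""

    def leading_ws(s: str) -> str:
        return s[: len(s) - len(s.lstrip())]

    def trailing_ws(s: str) -> str:
        return s[len(s.rstrip()):]

    return {
        # Same number of line breaks in both texts
        "newline_count": source.count("\n") == translation.count("\n"),
        # Identical whitespace before the first visible character
        "leading_whitespace": leading_ws(source) == leading_ws(translation),
        # Identical whitespace after the last visible character
        "trailing_whitespace": trailing_ws(source) == trailing_ws(translation),
        # Same number of Markdown code fences
        "code_fences": source.count(FENCE) == translation.count(FENCE),
    }
```

A `False` entry only flags a sample for manual review; it does not say anything about the quality of the translation itself.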

If you're interested in building large-scale language models to solve a wide variety of problems across many domains, you should consider joining Allganize. If you'd like a coffee chat or have any questions, please do not hesitate to contact me as well: kuotient.dev@gmail.com

This model was created as a personal experiment, unrelated to the organization I work for.

License

From original model author:

Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License, under LLAMA 2 COMMUNITY LICENSE AGREEMENT
Full License available at: https://huggingface.co/beomi/llama-2-koen-13b/blob/main/LICENSE

Model Details

Developed by

Jisoo Kim (kuotient)

Base Model

beomi/llama-2-koen-13b

Datasets

sharegpt_deepl_ko_translation
AIHUB
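- Korean-English parallel translation corpus for technical and scientific fields (기술과학 분야 한-영 번역 병렬 말뭉치 데이터)
- Korean-English parallel translation corpus for everyday life and colloquial speech (일상생활 및 구어체 한-영 번역 병렬 말뭉치 데이터)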

Usage

Format

The model accepts only the ChatML format. The first template below translates into Korean, the second into English.

<|im_start|>system
주어진 문장을 한국어로 번역하세요.<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
# Don't miss newline here

<|im_start|>system
주어진 문장을 영어로 번역하세요.<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
# Don't miss newline here
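If you build the prompt by hand instead of relying on the tokenizer's chat template, the two templates above can be assembled with a small helper. This is a minimal sketch: `build_prompt`, `SYSTEM_PROMPTS`, and the `"ko"`/`"en"` keys are hypothetical names chosen for illustration, while the system prompts and special tokens are copied from the templates above.

```python
# System prompts copied verbatim from the two templates above.
SYSTEM_PROMPTS = {
    "ko": "주어진 문장을 한국어로 번역하세요.",  # translate into Korean
    "en": "주어진 문장을 영어로 번역하세요.",  # translate into English
}


def build_prompt(text: str, target: str) -> str:
    """Wrap `text` in the ChatML layout for the given target language."""
    system = SYSTEM_PROMPTS[target]  # raises KeyError for unsupported targets
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{text}<|im_end|>\n"
        # The prompt must end with a newline after "assistant".
        "<|im_start|>assistant\n"
    )
```

The text to translate is inserted unchanged, so any leading or trailing whitespace in `text` is passed to the model as-is.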

Example code

Since the tokenizer's chat_template already contains the instruction format above, you can use the code below.

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto

model = AutoModelForCausalLM.from_pretrained("kuotient/Seagull-13B-translation")
tokenizer = AutoTokenizer.from_pretrained("kuotient/Seagull-13B-translation")

messages = [
    {"role": "user", "content": "바나나는 원래 하얀색이야?"},
]
# apply_chat_template wraps the messages in the ChatML format and returns token IDs
encodeds = tokenizer.apply_chat_template(messages, return_tensors="pt")

model_inputs = encodeds.to(device)
model.to(device)

# Sample up to 1000 new tokens; the returned sequence also contains the prompt tokens
generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])
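The decoded string printed above contains the prompt as well as the model's answer. A minimal sketch for keeping only the translation follows; it assumes the special tokens `<|im_start|>` and `<|im_end|>` appear literally in the decoded text, and `extract_translation` is a hypothetical helper name, not part of transformers.

```python
ASSISTANT_HEADER = "<|im_start|>assistant\n"
END_TOKEN = "<|im_end|>"


def extract_translation(decoded: str) -> str:
    """Return the text of the last assistant turn in a decoded ChatML string."""
    # Keep only what follows the last assistant header.
    _, found, answer = decoded.rpartition(ASSISTANT_HEADER)
    if not found:
        return decoded  # header missing: return the text untouched
    # Drop the end-of-turn token and anything generated after it.
    return answer.split(END_TOKEN, 1)[0]
```

With the example above this would be called as `extract_translation(decoded[0])`. The helper deliberately does not strip whitespace, so newlines and spaces in the translation stay comparable with the source text.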