license: cc-by-nc-sa-4.0
datasets:
  - squarelike/sharegpt_deepl_ko_translation
language:
  - ko
pipeline_tag: translation
tags:
  - translate

Seagull-13b-translation 📇

Seagull-13b-translation is yet another translation model, but it was trained with careful attention to the following issues found in existing translation models:

  • Newlines or spaces that do not match the original text
  • Training on translated datasets in which the first letter has been removed
  • Code
  • Markdown format
  • LaTeX format
  • etc.

These issues were checked thoroughly before and during training, but when you use the model it is still recommended to inspect the output closely for them (for example, in text that contains code).
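One way to follow the recommendation above is to compare a few formatting features of the source text and the translated output automatically. The sketch below is only an illustration: the function name `formatting_report` and the specific checks (newline count, leading and trailing whitespace, code-fence count, `$` count for LaTeX) are assumptions of this example and are not part of the model or its training pipeline.

```python
def formatting_report(source: str, translation: str) -> dict:
    """Compare simple formatting features of a source text and its translation.

    Hypothetical helper for illustration; each value is True when the
    feature matches between the two texts.
    """
    def leading_ws(s: str) -> str:
        return s[: len(s) - len(s.lstrip())]

    def trailing_ws(s: str) -> str:
        return s[len(s.rstrip()):]

    fence = "`" * 3  # Markdown code-fence marker
    return {
        "newline_count": source.count("\n") == translation.count("\n"),
        "leading_whitespace": leading_ws(source) == leading_ws(translation),
        "trailing_whitespace": trailing_ws(source) == trailing_ws(translation),
        "code_fences": source.count(fence) == translation.count(fence),
        "inline_math": source.count("$") == translation.count("$"),
    }


source = "# Title\n\nRun `pip install foo`.\n"
good = "# 제목\n\n`pip install foo`를 실행하세요.\n"
bad = "제목\n`pip install foo`를 실행하세요."

report_good = formatting_report(source, good)
report_bad = formatting_report(source, bad)
failed = [name for name, ok in report_bad.items() if not ok]
print(report_good)
print(failed)
```

A failed check does not prove the translation is wrong; it only flags outputs that deserve a manual look.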

If you're interested in building large-scale language models to solve a wide variety of problems in a wide variety of domains, you should consider joining Allganize. For a coffee chat or if you have any questions, please do not hesitate to contact me as well! - kuotient.dev@gmail.com

This model was created as a personal experiment, unrelated to the organization I work for.

License

From original model author:

Model Details

Developed by

Jisoo Kim (kuotient)

Base Model

beomi/llama-2-koen-13b

Datasets

  • sharegpt_deepl_ko_translation
  • AIHUB
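    • 기술과학 분야 한-영 번역 병렬 말뭉치 데이터 (Korean-English parallel translation corpus for science and technology)
    • 일상생활 및 구어체 한-영 번역 병렬 말뭉치 데이터 (Korean-English parallel translation corpus for daily life and colloquial speech)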

Usage

Format

The model follows the ChatML format only. The system prompt selects the target language.

Translate into Korean:

<|im_start|>system
주어진 문장을 한국어로 번역하세요.<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
# Don't miss newline here

Translate into English:

<|im_start|>system
주어진 문장을 영어로 번역하세요.<|im_end|>
<|im_start|>user
{instruction}<|im_end|>
<|im_start|>assistant
# Don't miss newline here
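If you need to build the prompt string yourself rather than rely on the tokenizer, the format above can be sketched as a small helper. The two system prompts are taken from the format above; the names `SYSTEM_PROMPTS` and `build_prompt` and the `"ko"`/`"en"` keys are hypothetical and exist only for this illustration.

```python
# System prompts as given in the format above; the dict keys are an
# assumption of this sketch, not an interface of the model.
SYSTEM_PROMPTS = {
    "ko": "주어진 문장을 한국어로 번역하세요.",  # "Translate the given sentence into Korean."
    "en": "주어진 문장을 영어로 번역하세요.",  # "Translate the given sentence into English."
}


def build_prompt(text: str, target: str = "ko") -> str:
    """Return a ChatML prompt asking for a translation of `text` into `target`."""
    system = SYSTEM_PROMPTS[target]
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{text}<|im_end|>\n"
        "<|im_start|>assistant\n"  # the trailing newline must not be dropped
    )


prompt = build_prompt("Are bananas originally white?", target="ko")
print(prompt)
```

The user text is inserted unchanged, so any newlines, Markdown, or code inside it reach the model exactly as written.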

Example code

Since the tokenizer's chat_template already contains the instruction format above, you can use the code below.

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to load the model onto
model = AutoModelForCausalLM.from_pretrained("kuotient/Seagull-13B-translation")
tokenizer = AutoTokenizer.from_pretrained("kuotient/Seagull-13B-translation")
messages = [
    {"role": "user", "content": "바나나는 원래 하얀색이야?"},  # "Are bananas originally white?"
]
# Render the ChatML prompt with the tokenizer's chat_template and tokenize it.
# add_generation_prompt=True asks the template to append the assistant turn header.
encodeds = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")

model_inputs = encodeds.to(device)
model.to(device)
generated_ids = model.generate(model_inputs, max_new_tokens=1000, do_sample=True)
decoded = tokenizer.batch_decode(generated_ids)
print(decoded[0])  # the prompt followed by the generated translation
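Because `decoded[0]` contains the whole ChatML conversation, you may want only the assistant's reply. The helper below is a minimal sketch under the assumption that the decoded string uses the `<|im_start|>` and `<|im_end|>` markers exactly as in the format section; `extract_translation` is a hypothetical name, and any other special tokens the tokenizer may emit are not handled.

```python
def extract_translation(decoded: str) -> str:
    """Return the text of the last assistant turn in a decoded ChatML string.

    Whitespace inside the reply is left untouched so that newlines and
    spacing of the translation can still be compared with the source.
    """
    marker = "<|im_start|>assistant\n"
    _, found, reply = decoded.rpartition(marker)
    if not found:
        # No assistant turn present: return the input unchanged.
        return decoded
    # Cut the reply at the end-of-turn marker, if the model produced one.
    return reply.split("<|im_end|>", 1)[0]


example = (
    "<|im_start|>system\n주어진 문장을 영어로 번역하세요.<|im_end|>\n"
    "<|im_start|>user\n바나나는 원래 하얀색이야?<|im_end|>\n"
    "<|im_start|>assistant\nAre bananas originally white?<|im_end|>"
)
translation = extract_translation(example)
print(translation)
```

With the example code above, this would be called as `extract_translation(decoded[0])`.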