XaaS Gemma 2 2B - Stage 1: Continual Pre-Training (CPT)

Stage 1 of 4 in the XaaS fine-tuning pipeline for Korean international trade.

This model adapts google/gemma-2-2b-it to the Korean trade domain through continual pre-training on a curated corpus of Korean customs, HS code classification, Incoterms, and international trade regulatory text. It serves as the foundation for all downstream XaaS task-specific fine-tunes.

Pipeline Position

google/gemma-2-2b-it
    ↓  [this model]
lablup/gemma-2-2b-it-xaas-cpt  ← you are here
    ↓
lablup/gemma-2-2b-it-xaas-qa   (trade domain QA)
    ↓
lablup/gemma-2-2b-it-xaas-kie  (KIE from B2B emails)
lablup/gemma-2-2b-it-xaas-sum-tag  (email summarization + tagging)

Training Details

| Parameter | Value |
|---|---|
| Base model | google/gemma-2-2b-it |
| Method | Continual pre-training with LoRA |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Epochs | 1 |
| Learning rate | 4e-4 |
| Max sequence length | 2,500 tokens |
| Optimizer | AdamW |
| Precision | float32 |
| Framework | HuggingFace Transformers + PEFT + Accelerate |
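
As a rough sanity check on the adapter size implied by these settings, the sketch below counts LoRA parameters for rank 16 across the seven target modules. The layer dimensions are assumptions taken from the publicly documented Gemma 2 2B configuration (hidden size 2304, MLP size 9216, 26 layers, 8 query heads and 4 key/value heads of dimension 256); they are not stated in this card, and the helper names are illustrative only.

```python
# Assumed Gemma 2 2B dimensions (from the public model config, not from this card).
HIDDEN = 2304
INTERMEDIATE = 9216
NUM_LAYERS = 26
Q_OUT = 8 * 256    # query heads * head_dim
KV_OUT = 4 * 256   # key/value heads * head_dim

# LoRA hyperparameters from the Training Details table.
LORA_R = 16
LORA_ALPHA = 32

# (in_features, out_features) of each targeted linear layer.
TARGETS = {
    "q_proj": (HIDDEN, Q_OUT),
    "k_proj": (HIDDEN, KV_OUT),
    "v_proj": (HIDDEN, KV_OUT),
    "o_proj": (Q_OUT, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params_per_layer(r: int = LORA_R) -> int:
    """Each adapted layer adds A (r x in) and B (out x r): r * (in + out) weights."""
    return sum(r * (n_in + n_out) for n_in, n_out in TARGETS.values())


def total_lora_params(r: int = LORA_R, layers: int = NUM_LAYERS) -> int:
    """Total trainable LoRA weights across all transformer layers."""
    return layers * lora_params_per_layer(r)


# LoRA scales the low-rank update by alpha / r.
scaling = LORA_ALPHA / LORA_R

print(f"per layer: {lora_params_per_layer():,}")
print(f"total:     {total_lora_params():,}")
print(f"scaling:   {scaling}")
```

Under these assumptions the adapter holds about 20.8M trainable parameters, and the low-rank update is scaled by alpha / r = 2.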

Training Data

Internal Korean trade-domain text corpus (XaaS/train_dataset/cpt_dataset/concatenated_dataset) covering:

  • Korean Customs Act (관세법) and trade regulations
  • HS code classification explanatory notes (관세율표 해설서)
  • Incoterms and international trade terminology
  • Trade finance and letter-of-credit documentation

How to Use

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "lablup/gemma-2-2b-it-xaas-cpt"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Gemma 2 chat format
# Korean prompt: "Please explain the procedure for opening a letter of credit (L/C)."
messages = [{"role": "user", "content": "신용장(L/C)의 개설 절차를 설명해주세요."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
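# The rendered chat template already starts with <bos>, so do not add special tokens again
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)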

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)

print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
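
For reference, the string that `apply_chat_template` renders for a single user turn can be approximated by hand. This sketch assumes the standard Gemma 2 template (`<bos>` followed by `<start_of_turn>` / `<end_of_turn>` markers) and uses a hypothetical helper name; the tokenizer's own template remains the authoritative source.

```python
def render_gemma_prompt(user_message: str) -> str:
    """Approximate a Gemma 2 single-turn prompt with the generation prompt appended."""
    return (
        "<bos>"
        f"<start_of_turn>user\n{user_message.strip()}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )


prompt = render_gemma_prompt("What is an L/C?")
print(repr(prompt))
```

Because the rendered text already begins with `<bos>`, passing it back through the tokenizer with default settings may prepend a second one; tokenizing with `add_special_tokens=False` avoids that.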

Downstream Models

| Model | Task |
|---|---|
| lablup/gemma-2-2b-it-xaas-qa | Korean trade QA (21,399 QA pairs) |
| lablup/gemma-2-2b-it-xaas-kie | B2B email key-information extraction |
| lablup/gemma-2-2b-it-xaas-sum-tag | Email summarization + tagging |

Limitations

  • Adapted to the Korean trade domain; general-purpose performance may be degraded compared to base Gemma 2.
  • Knowledge cutoff is inherited from google/gemma-2-2b-it; recent regulatory changes are not covered.
  • The CPT corpus is domain-specific and does not cover all Korean language use cases.

License

This model is built on Google Gemma 2 and is subject to the Gemma Terms of Use. Fine-tuned weights are released under the same terms.
