Gemma 4 LoRA Fine-Tuning for Kal'tsit-Style Dialogue Generation

Project Overview

This project fine-tunes google/gemma-4-E4B-it with LoRA to generate Chinese dialogue in the style of Kal'tsit from Arknights. The goal is not to train a full model from scratch, but to adapt a large instruction-tuned base model with a lightweight PEFT adapter so that it better follows a specific character voice: calm, restrained, analytical, and context-aware.

The project covers the full workflow from story data collection, text cleaning, character-specific dataset construction, prompt design, SFT formatting, LoRA training, validation monitoring, test generation, and optional adapter merging. The main training workflow is documented in gemma4_emotion_lora_arknights.ipynb.

Data Collection and Processing

The raw data was collected from Arknights story pages through ASTR story reader URLs. The URL list is stored in urls.txt and contains 263 story links. The data collection script is scripts/download_arknights_story.py.

The collection pipeline works as follows:

  1. Parse the language code and story file path from each ASTR page URL.
  2. Convert the page route into a raw JSON URL from the ArknightsStoryJson repository, using the pattern zh_CN/gamedata/story/{story_path}.json.
  3. Download each story JSON with requests.
  4. Read storyList and extract attributes.name as the speaker and attributes.content as the dialogue text.
  5. Mark lines without a speaker as narration, and preserve Sticker text as on-screen text.
  6. Clean color tags, HTML-like tags, escaped newlines, and redundant whitespace.
  7. Save each story as both readable .txt and structured .jsonl files.

Each structured line uses the following format:

{"speaker": "凯尔希", "text": "dialogue text"}

The character dataset is then built with scripts/build_character_dataset.py:

  1. Load all .jsonl files from the result folder in sorted order.
  2. Add source file names and line indices to every record for traceability.
  3. Select only records whose speaker exactly matches 凯尔希.
  4. Filter very short or very long responses. The default range is 2 to 300 Chinese characters.
  5. Use the previous 3 story lines as the dialogue context for each target response.
  6. Convert each sample into an instruction/input/output SFT record.
  7. Shuffle with seed 42 and split the dataset into train, validation, and test sets with an 80/10/10 ratio.

Final dataset size:

Split Samples
Train 2680
Validation 335
Test 335
Total 3350

Prompt Design

Each training sample is converted into a chat-style prompt and assistant completion. The notebook uses the official tokenizer chat template to format the data for Gemma.

System prompt:

你正在扮演《明日方舟》中的凯尔希。
请根据用户给出的上下文进行回复。

要求:
1. 只输出凯尔希的回复内容。
2. 不要解释你为什么这样回复。
3. 不要输出“凯尔希:”这个角色名前缀。
4. 语气应冷静、克制、理性,句子可以偏长。
5. 回复应尽量贴合上下文,而不是机械复述已有台词。

User prompt template:

请根据上下文,以凯尔希的说话风格进行回复。

上下文:
[Character A]:previous line
[Character B]:previous line
[凯尔希]:previous line

The assistant completion is the target Kal'tsit response. In other words, the model is trained to generate the next character-style reply from context, rather than to classify text into labels.

Fine-Tuning Setup

The base model is google/gemma-4-E4B-it, downloaded from ModelScope and loaded from a local directory. Training uses single-GPU BF16 LoRA fine-tuning with PEFT and TRL SFTTrainer. The notebook loads the model with AutoModelForCausalLM and saves the final adapter and tokenizer.

Core training configuration:

Item Value
Base model google/gemma-4-E4B-it
Fine-tuning method LoRA / PEFT
Trainer TRL SFTTrainer
LoRA rank 8
LoRA alpha 16
LoRA dropout 0.05
Target modules all-linear
Trainable parameters 25,249,792
Total parameters during PEFT training 7,966,350,624
Trainable ratio 0.3170%
Epochs 2
Global steps 336
Per-device batch size 2
Gradient accumulation 8
Learning rate 5e-5
Scheduler cosine
Warmup steps 10
Max sequence length 512
Loss mode completion-only loss
Precision BF16
Evaluation interval every 50 steps
Checkpoint selection best eval_loss

Training workflow:

  1. Load local JSONL files into a DatasetDict.
  2. Convert instruction/input/output records into chat-style prompt/completion examples.
  3. Load the Gemma 4 tokenizer and base model.
  4. Configure LoRA and verify that trainable parameters are correctly attached.
  5. Run supervised fine-tuning for 2 epochs with SFTTrainer.
  6. Evaluate on the validation set every 50 steps and save checkpoints.
  7. Generate responses for all 335 test examples.
  8. Save the LoRA adapter, tokenizer, training metrics, and test generations.

Results

Training completed successfully at global_step=336, and the best checkpoint was checkpoint-336, which is also the final step. Total training time was about 1993 seconds, or 33.2 minutes.

Validation metrics:

Step Eval loss Eval token accuracy
50 3.1132 0.4541
100 2.9867 0.4716
150 2.9440 0.4739
200 2.9127 0.4758
250 2.8917 0.4769
300 2.8853 0.4801
336 2.8843 0.4788

The final test generation file is kaltsit_test_generations.csv, with 335 generated responses. There were no empty outputs and no generated responses with the unwanted 凯尔希: role prefix. The average target response length was 23.42 Chinese characters, while the average generated response length was 16.53 characters.

Qualitatively, the model learned part of the target style, especially restrained phrasing, concise responses, and role-prefix control. However, the test set also shows limitations. The response ...... appears 38 times, and 20 generated responses are 4 characters or shorter. This suggests that the LoRA adapter is valid and learned useful stylistic behavior, but it is not yet a high-quality story continuation model.

My interpretation of the result:

  • The training run completed successfully, and the adapter files are valid.
  • The language model LoRA weights were updated.
  • The model shows measurable style-control behavior.
  • Contextual reasoning and narrative continuation still need improvement.
  • Since the training data is text-only, this experiment should be viewed as Chinese character-style text fine-tuning, not multimodal capability fine-tuning.

Repository Artifacts

Main artifacts:

File Description
adapter_model.safetensors LoRA adapter weights
adapter_config.json PEFT/LoRA configuration
tokenizer.json / tokenizer_config.json Tokenizer files
chat_template.jinja Gemma 4 chat template
train_metrics.json Training summary metrics
kaltsit_test_generations.csv Test-set generations
checkpoint-300 / checkpoint-336 Training checkpoints

This repository currently contains the LoRA adapter, not a fully merged model. To deploy a merged model, the matching google/gemma-4-E4B-it base model must be loaded, the adapter must be attached with PeftModel.from_pretrained, and the weights can then be merged with merge_and_unload(). The full processor should also be saved with the merged model.

Data pipeline files:

File Description
scripts/download_arknights_story.py Downloads and parses raw Arknights story JSON files from ASTR URLs
scripts/build_character_dataset.py Builds the Kal'tsit SFT dataset and creates train/validation/test splits
scripts/count_speakers.py Counts speaker frequencies in the parsed story JSONL files
requirements.txt Minimal dependency list for the data collection scripts
urls.txt Source ASTR story URL list used for data collection

Minimal data pipeline reproduction:

pip install -r requirements.txt
python scripts/download_arknights_story.py --url-file urls.txt --jsonl
python scripts/build_character_dataset.py --input-dir result --character 凯尔希 --output-dir dataset --context-size 3
python scripts/count_speakers.py

Future Improvements

  1. Apply LoRA only to the language model modules instead of all linear modules, since the dataset is text-only.
  2. Filter or downweight very short target responses such as ...... and ——.
  3. Add a small manually curated evaluation set for character consistency, contextual relevance, and naturalness.
  4. Use longer context windows or scene-level samples to improve narrative continuity.
  5. Compare multiple LoRA configurations, including different ranks, target modules, and data filtering strategies.
Downloads last month
-
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for asphyxiation112/gemma4-it-kaltsit-lora

Adapter
(122)
this model