Phi 3 Model with Extended Vocabulary and Fine-Tuning for Japanese

Overview

This project is a proof of concept that extends the base vocabulary of the Phi 3 model and then applies supervised fine-tuning to teach it a new language (Japanese). Despite using a very small custom dataset, the improvement in Japanese language understanding is substantial.

Model Details

  • Base Model: Phi 3
  • Objective: Extend the base vocabulary and fine-tune for Japanese language understanding.
  • Dataset: Custom dataset of 1,000 entries generated using ChatGPT-4.
  • Language: Japanese

Dataset

The dataset used for this project was generated with the assistance of ChatGPT-4. It comprises 1,000 entries, carefully curated to cover a diverse range of topics and linguistic structures.
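The card does not document the dataset's schema. As an illustration only, a typical instruction-tuning entry stored as one JSONL line might look like this (field names and content are assumptions, not the project's actual format):

```python
import json

# Hypothetical entry shape for one of the 1,000 examples; the real
# schema used in this project is not documented in this card.
entry = {
    "instruction": "次の文を丁寧語に書き換えてください。",  # "Rewrite the sentence in polite form."
    "input": "明日行く。",        # "I'll go tomorrow." (plain)
    "output": "明日参ります。",    # "I'll go tomorrow." (humble/polite)
}

# ensure_ascii=False keeps the Japanese characters readable in the file.
line = json.dumps(entry, ensure_ascii=False)
```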

Training

Vocabulary Extension

The base vocabulary of the Phi 3 model was extended to include new Japanese tokens. This was a crucial step to enable the model to comprehend and generate Japanese text more effectively.
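In practice this step is typically done with `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(...)` from Hugging Face transformers. The toy sketch below shows the underlying mechanics with made-up sizes and hypothetical tokens; the real vocabulary, dimensions, and token list are not those of this project:

```python
import numpy as np

# Toy illustration of vocabulary extension: append new token ids after
# the existing vocabulary, then grow the embedding matrix to match.
vocab = {"<s>": 0, "</s>": 1, "hello": 2}          # stand-in for the base vocab
new_tokens = ["こんにちは", "ありがとう", "日本語"]   # hypothetical Japanese additions

for tok in new_tokens:
    if tok not in vocab:
        vocab[tok] = len(vocab)  # fresh ids follow the old ones

hidden = 8
rng = np.random.default_rng(0)
old_emb = rng.standard_normal((3, hidden))  # pretrained embedding rows

new_emb = np.empty((len(vocab), hidden))
new_emb[:3] = old_emb                 # keep the pretrained rows unchanged
new_emb[3:] = old_emb.mean(axis=0)    # init new rows from the mean, a common heuristic
```

Initializing new rows from the mean of the existing embeddings (rather than at random) is a common heuristic that tends to make early fine-tuning steps more stable.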

Fine-Tuning

Supervised fine-tuning was performed on the extended model using the custom dataset. Despite the small dataset size, the model showed significant improvement in understanding and generating Japanese text.
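For supervised fine-tuning, each entry must be rendered into a single training string. The sketch below uses the `<|user|>`/`<|assistant|>`/`<|end|>` markers from Phi 3's published chat format, but the exact template and field names used in this project are assumptions:

```python
def format_example(instruction: str, response: str) -> str:
    """Render one dataset entry into a Phi-3-style chat training string.

    The marker tokens follow Phi 3's chat format; whether this project
    used that exact template is not stated in the card.
    """
    return f"<|user|>\n{instruction}<|end|>\n<|assistant|>\n{response}<|end|>"

sample = format_example("日本の首都はどこですか？", "日本の首都は東京です。")
```

During training, the loss is usually masked so that only the assistant's response tokens contribute, which keeps the model from merely learning to echo prompts.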

Results

Even with the limited dataset and the modest number of added tokens, the fine-tuned model demonstrated substantial improvements over the base model in Japanese language understanding and generation.

Future Work

  1. Dataset Expansion: Increase the size and diversity of the dataset to further enhance model performance.
  2. Evaluation: Conduct comprehensive evaluation and benchmarking against standard Japanese language tasks.
  3. Optimization: Optimize the model for better performance and efficiency.