---
license: llama3
datasets:
- parinzee/seed-free-synthetic-instruct-thai-v1
language:
- th
- en
library_name: transformers
---

# LLaMA 3 8B - Seed-Free Synthetic Instruct (F+C+D+)

This model is the result of fine-tuning LLaMA 3 8B on our novel seed-free synthetic instruction dataset for Thai. It represents the outcome of our research "Seed-Free Synthetic Data Generation Framework for Instruction-Tuning LLMs: A Case Study in Thai", submitted to ACL SRW 2024.

## Model Details

- **Base Model**: LLaMA 3 8B
- **Fine-tuning Dataset**: Seed-Free Synthetic Instruct Thai v1 (F+C+D+)
- **Training Corpus Size**: 5,000 instructions
- **Languages**: Primarily Thai, with potential for English and other languages supported by LLaMA 3

## Intended Use

This model is designed for:

- Thai language understanding and generation
- Instruction following in Thai
- General language tasks with a focus on Thai cultural context

It can be used for research purposes, to power Thai language applications, or as a baseline for further fine-tuning on specific Thai language tasks.

## Performance and Limitations

### Strengths

- Comparable performance to state-of-the-art Thai LLMs (e.g., WangchanX, OpenThaiGPT)
- Second-highest BERTScore on both the Thai Culture and General Test Sets
- Efficient performance achieved with only 5,000 instructions
- Strong understanding of Thai cultural context

### Limitations

- Performance on tasks outside the scope of the fine-tuning dataset may vary
- May exhibit biases present in the synthetic dataset or the base LLaMA 3 model
- Not extensively tested for factual accuracy or potential harmful outputs

## Training Procedure

- **Fine-tuning Framework**: [axolotl](https://github.com/OpenAccess-AI-Collective/axolotl)
- **Training Data**: Seed-Free Synthetic Instruct Thai v1 (F+C+D+), generated using our novel framework

## Evaluation Results

The model was evaluated on various Thai language tasks, including:

- Thai Culture Test Set
- General Test Set
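
## Usage

A minimal sketch of loading the model with Hugging Face `transformers`. The repository id below is a placeholder assumption (this card does not state the exact id); substitute the actual id of this repository. The prompt helper reproduces the standard LLaMA 3 chat template by hand, which `tokenizer.apply_chat_template` would also produce.

```python
MODEL_ID = "parinzee/llama3-8b-seed-free-synthetic-instruct-thai"  # placeholder; use this repo's actual id


def build_llama3_prompt(user_message: str) -> str:
    """Build a single-turn LLaMA 3 chat prompt string by hand."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )


def generate(user_message: str, max_new_tokens: int = 256) -> str:
    """Load the model and generate a Thai (or English) completion."""
    # Imported lazily so the prompt helper above works even
    # without transformers/torch installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")
    inputs = tokenizer(build_llama3_prompt(user_message), return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)


# Example call (downloads ~16 GB of weights on first use):
# generate("ช่วยแนะนำอาหารไทยหน่อย")  # "Please recommend some Thai food"
```

Using the raw chat template (or `apply_chat_template`) matters here: the model was instruction-tuned, so plain-text prompts without the header/EOT markers may degrade instruction following.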