
Multilingual Question-Answering Model (Vietnamese and Japanese)

Overview

This repository contains a fine-tuned multilingual question-answering model that supports both Vietnamese and Japanese. It is built on the Qwen/Qwen2.5-1.5B-Instruct base model and provides high-quality answers in both languages.

The model has been fine-tuned on the following datasets:

  • bkai-foundation-models/vi-alpaca-input-output-format: A Vietnamese dataset designed for instruction-based input-output tasks.
  • CausalLM/GPT-4-Self-Instruct-Japanese: A Japanese dataset created with self-instruct techniques to improve language understanding and generation.

This model is ideal for applications requiring cross-lingual support between Vietnamese and Japanese.


License

This project is released under the MIT License, ensuring flexibility for both academic and commercial use. Please refer to the LICENSE file for more details.


Model Details

Base Model

  • Qwen/Qwen2.5-1.5B-Instruct: A powerful 1.5B parameter instruction-tuned model developed by Alibaba Cloud. It excels in understanding and generating natural language across various domains.

Supported Languages

  • Vietnamese (vi)
  • Japanese (ja)

Pipeline Tag

  • Question-Answering: The model is optimized for answering questions in both supported languages.

Library

  • Transformers: This model is built using the Hugging Face transformers library, making it easy to integrate into existing pipelines.

Installation

To use this model, ensure you have the transformers library installed:

pip install transformers torch

You can then load the model directly from the Hugging Face Hub:

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("haiFrHust/VNJPTranslate_base")
model = AutoModelForCausalLM.from_pretrained("haiFrHust/VNJPTranslate_base")

# Example usage
input_text = "質問: ベトナムの首都はどこですか?"  # Japanese: What is the capital of Vietnam?
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)  # cap output length to avoid truncation at the default limit
answer = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(answer)
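Because the base model is instruction-tuned, questions usually perform better when wrapped in a chat-style prompt rather than passed raw. In real code, `tokenizer.apply_chat_template` builds this string for you; the sketch below renders the ChatML layout used by Qwen2.5 models by hand, purely to show the structure (the helper name and system-prompt wording are illustrative assumptions, not part of this model card):

```python
# Sketch of Qwen-style chat formatting (ChatML). In practice, prefer
# tokenizer.apply_chat_template, which renders this template for you.
# The system prompt wording here is an assumption.
def build_chat_prompt(question: str,
                      system: str = "You are a helpful assistant.") -> str:
    """Render a single-turn conversation in the ChatML format used by Qwen2.5."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

# Vietnamese: "What is the capital of Japan?"
prompt = build_chat_prompt("Thủ đô của Nhật Bản là gì?")
print(prompt)
```

The rendered string can then be tokenized and passed to `model.generate` exactly as in the example above.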

Dataset Information

Vietnamese Dataset

  • Name: bkai-foundation-models/vi-alpaca-input-output-format
  • Description: This dataset contains instruction-based input-output pairs in Vietnamese, enabling the model to understand and respond to structured queries effectively.

Japanese Dataset

  • Name: CausalLM/GPT-4-Self-Instruct-Japanese
  • Description: A self-instruct dataset in Japanese, designed to enhance the model's ability to generate accurate and contextually relevant responses.
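The Vietnamese dataset stores each example as an instruction/input/output triple, following the Alpaca convention. A hypothetical helper showing how such a row might be flattened into a single training string (the exact prompt template used during fine-tuning is not documented here, so this formatting is an assumption):

```python
# Hypothetical formatter for Alpaca-style rows (instruction / input / output).
# The exact prompt template used during fine-tuning is an assumption.
def format_alpaca_row(row: dict) -> str:
    """Flatten one instruction-input-output example into a training string."""
    if row.get("input"):
        prompt = f"Instruction: {row['instruction']}\nInput: {row['input']}\n"
    else:
        prompt = f"Instruction: {row['instruction']}\n"
    return prompt + f"Output: {row['output']}"

example = {
    "instruction": "Dịch câu sau sang tiếng Nhật.",  # "Translate the following sentence into Japanese."
    "input": "Xin chào",                              # "Hello"
    "output": "こんにちは",
}
print(format_alpaca_row(example))
```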

Use Cases

This model is suitable for a variety of applications, including but not limited to:

  • Cross-Lingual Customer Support: Answering user queries in both Vietnamese and Japanese.
  • Educational Tools: Assisting students in learning and understanding concepts in their native language.
  • Multilingual Chatbots: Building conversational agents capable of handling multiple languages seamlessly.

Performance

The model demonstrates strong performance in both Vietnamese and Japanese, thanks to the high-quality datasets and the robust base model. However, performance may vary depending on the complexity of the questions and the domain-specific knowledge required.

For optimal results:

  • Ensure your input questions are clear and concise.
  • Fine-tune the model further on domain-specific data if necessary.

Contributions

Contributions to this project are welcome! If you have ideas for improvements, encounter issues, or wish to contribute additional datasets, please open an issue or submit a pull request.


Acknowledgments

We would like to thank the following organizations and contributors:

  • Alibaba Cloud for providing the Qwen base model.
  • The creators of the bkai-foundation-models/vi-alpaca-input-output-format and CausalLM/GPT-4-Self-Instruct-Japanese datasets.
  • The Hugging Face community for their excellent transformers library and support.

Contact

For any inquiries or feedback, please open an issue in this repository.

Thank you for using our multilingual question-answering model! We hope it serves your needs effectively.
