---
license: apache-2.0
language:
  - vi
  - en
---

# T-Llama


## Model Details

- Developed by: Tuan Pham (FPTU HCM student)
- Model type: Llama2-7B, decoder-only
- Fine-tuned from:
  - meta-llama/Llama-2-7b
  - bkai-foundation-models/vietnamese-llama2-7b-120GB
  - yeen214/llama2_7b_merge_orcafamily
- Bilingual support: English and Vietnamese

## Model Description

This model is proof that a single person can fine-tune their own model to reach SOTA.


## Uses

### Prompt template

```
[SYSTEM_PROMPT]

 ####### Instruction:
[INPUT]

 %%%%%%% Response:
[RESPONSE]
```

We recommend keeping the system prompt in English.
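
To make the template concrete, here is an example with the placeholders filled in (the system prompt is shortened, and the instruction/response pair is invented for illustration):

```
You are a helpful bilingual assistant. Think step by step before answering, and answer in the language the user prefers.

 ####### Instruction:
Xin chào

 %%%%%%% Response:
Xin chào! Tôi có thể giúp gì cho bạn?
```

("Xin chào" means "Hello"; the response reads "Hello! How can I help you?")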

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from torch.cuda.amp import autocast
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, pipeline


def prompt_format(system_prompt, instruction):
    # Build a prompt following the template above; the response section is
    # left empty for the model to complete.
    prompt = f"""{system_prompt}

 ####### Instruction:
{instruction}

 %%%%%%% Response:

"""
    return prompt

# The system prompt is kept verbatim as released with the model.
system_prompt = """
You're an AI Large Language Model developed(created) by an AI developer named Tuấn, the architecture of you is decoder-based LM, your task are to think loudly step by step before give a good and relevant response
to the user request, answer in the language the user preferred.

The AI has been trained to answer questions, provide recommendations, and help with decision making. The AI thinks outside the box and follows the user requests
"""
instruction = "Xin chào"

formatted_prompt = prompt_format(system_prompt, instruction)
print(formatted_prompt)


model_name = "1TuanPham/T-Llama"
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             torch_dtype=torch.bfloat16,
                                             use_cache=True,
                                             device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
# Stream tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_special_tokens=True)
# Note: the original snippet passed an undefined `base_model` here.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, streamer=streamer)

with autocast():
    # Use the Llama tokenizer's EOS as the pad token (the original snippet
    # used 50256, which is GPT-2's EOS id).
    output_default = pipe(formatted_prompt, pad_token_id=tokenizer.eos_token_id, max_new_tokens=128)
```
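
The call above uses the pipeline's default greedy decoding. To sample instead, you can pass standard `transformers` generation arguments through the pipeline; the values below are illustrative, not tuned settings from the author:

```python
with autocast():
    output_sampled = pipe(
        formatted_prompt,
        do_sample=True,            # sample instead of greedy decoding
        temperature=0.7,           # illustrative value, not an official recommendation
        top_p=0.9,
        max_new_tokens=256,
        pad_token_id=tokenizer.eos_token_id,
    )
```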

Example output:

```
Xin chào! Tôi là một AI được phát triển bởi một AI nhà phát triển tên là Tuấn. Tôi được thiết kế để giúp đỡ người dùng bằng cách trả lời các câu hỏi, đưa ra đề xuất và hỗ trợ trong quá trình ra quyết định.
Tôi có thể hỗ trợ bạn bằng cách nghĩ ra các câu trả lời hay và phù hợp cho các câu hỏi của bạn.
```

(English: "Hello! I am an AI developed by an AI developer named Tuấn. I am designed to help users by answering questions, making suggestions, and supporting decision making. I can help you by coming up with good, relevant answers to your questions.")

Note: 120 GB of Vietnamese pre-training data may not be enough for general questions about Vietnamese events.

Here is a Kaggle script to quickly test the model:

## Training Details

### Hardware

- GPU: NVIDIA Tesla P100 16 GB
- System RAM: 29 GB

Training time: ~47.5 days (approximate)

### Training Data

- BactrianX
- OpenOrca_translated
- WizardLM_70k_translated
- TigerLabMathInstruct_translated_vi
- GradeSchoolMathInstruct_translated
- vilm_lima-vi
- MTEngVietnamese
- databricks_dolly15k_translated
- AlpacaCleaned_translated
- databricks_dolly15k
- OpenOrca
- GradeSchoolMathInstruct
- AlpacaCleaned
- WebglmQA
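
These corpora are combined into bilingual mixes (the ratios are given under Training Procedure below). As a sketch of how such a mix can be built with the `datasets` library (the file names here are hypothetical placeholders, not the actual sources):

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical file names: the card lists dataset names, not concrete paths.
ds_vi = load_dataset("json", data_files="instructions_vi.jsonl", split="train")
ds_en = load_dataset("json", data_files="instructions_en.jsonl", split="train")

# 70% Vietnamese / 30% English, matching the first training stage below.
mixed = interleave_datasets([ds_vi, ds_en], probabilities=[0.7, 0.3], seed=42)
```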

### Training Procedure

- Learning rate: 2e-5, cosine schedule
- Optimizer: PagedLion8bit
- QLoRA: rank 64, 4-bit quantization (a configuration sketch follows this list)
  - 250k examples (70% Vietnamese / 30% English) for 3.37 epochs
  - 350k examples (60% Vietnamese / 40% English) for 1.4 epochs
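
As referenced above, here is a minimal sketch of this QLoRA setup using `peft`, `bitsandbytes`, and `transformers`. Only the values stated in this card (2e-5 cosine learning rate, PagedLion8bit, rank 64, 4-bit quantization, epoch counts) come from the source; the LoRA alpha, target modules, batch sizes, and dataset handling are assumptions for illustration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization; nf4 and the fp16 compute dtype are assumed defaults
# (the Tesla P100 used for training does not support bfloat16).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "bkai-foundation-models/vietnamese-llama2-7b-120GB",  # one of the listed base models
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Rank 64 as stated in the card; alpha, dropout, and target modules are assumptions.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="t-llama-qlora",
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    optim="paged_lion_8bit",          # PagedLion8bit via bitsandbytes
    fp16=True,                        # P100: no bf16 support
    per_device_train_batch_size=1,    # assumed for a 16 GB GPU
    gradient_accumulation_steps=16,   # assumed
    num_train_epochs=3.37,            # first stage; the second stage ran 1.4 epochs
)
# A Trainer over the mixed bilingual dataset (see Training Data) would follow here.
```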

### Training loss

[Training loss curve (image); each line is 12 hours.]

## Evaluation

[VMLU benchmark leaderboard (image)]

Our model currently sits in the top 5 on the VMLU benchmark.

Citation

@online{t-llama,
  author = {Pham Minh Tuan},
  title = {T-Llama: A New Language Model for Vietnamese},
  year = 2024,
  url = {https://github.com/vTuanpham/Vietnamese_QA_System}
}