---
license: apache-2.0
language:
- vi
- en
---
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/63905e87df447b438817b2cd/QFhLKQlWeyO9XumtyghVo.jpeg" alt="Image" style="width: 400px; height: auto; border-radius: 10px;" />
</p>
## Model Details
- **Developed by:** Tuan Pham (FPTU HCM Student)
- Contact me at: weekend.2810@gmail.com or tuanpmse160561@fpt.edu.vn
- Looking for internship opportunities :D
- **Model type:** Llama2-7B Decoder-only
- **Finetuned from models:**
* meta-llama/Llama-2-7b
* bkai-foundation-models/vietnamese-llama2-7b-120GB
* yeen214/llama2_7b_merge_orcafamily
- **Bilingual support:** English and Vietnamese
### Model Description
<!-- Provide a longer summary of what this model is. -->
This model is a proof of effort: it shows that a single person can fine-tune their own model to reach SOTA-level results.
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:**
* Training: https://github.com/vTuanpham/Vietnamese_QA_System
* Data: https://github.com/vTuanpham/Large_dataset_translator
- **Paper:** ...
- **Demo:** ...
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
### Prompt template
```
[SYSTEM_PROMPT]
####### Instruction:
[INPUT]
%%%%%%% Response:
[RESPONSE]
```
We recommend keeping the system prompt in English.
## How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from torch.cuda.amp import autocast
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, pipeline


def prompt_format(system_prompt, instruction):
    # Fill the prompt template shown above; the marker lines must stay unindented.
    prompt = f"""{system_prompt}
####### Instruction:
{instruction}
%%%%%%% Response:
"""
    return prompt


system_prompt = """
You're an AI Large Language Model developed(created) by an AI developer named Tuấn, the architecture of you is decoder-based LM, your task are to think loudly step by step before give a good and relevant response
to the user request, answer in the language the user preferred.
The AI has been trained to answer questions, provide recommendations, and help with decision making. The AI thinks outside the box and follows the user requests
"""
instruction = "Xin chào"
formatted_prompt = prompt_format(system_prompt, instruction)
print(formatted_prompt)

model_name = "1TuanPham/T-Llama"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    use_cache=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)

# Stream tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_special_tokens=True)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, streamer=streamer)

with autocast():
    # Reuse the tokenizer's EOS token as the padding token during generation.
    output_default = pipe(formatted_prompt, pad_token_id=tokenizer.eos_token_id, max_new_tokens=128)
```
Example output:
```
Xin chào! Tôi là một AI được phát triển bởi một AI nhà phát triển tên là Tuấn. Tôi được thiết kế để giúp đỡ người dùng bằng cách trả lời các câu hỏi, đưa ra đề xuất và hỗ trợ trong quá trình ra quyết định.
Tôi có thể hỗ trợ bạn bằng cách nghĩ ra các câu trả lời hay và phù hợp cho các câu hỏi của bạn.
```
Here is a Kaggle script to quickly test the model:
* https://www.kaggle.com/code/tuanphamm/t-llama-test
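
By default, the `text-generation` pipeline returns the prompt together with the completion, so the raw output repeats the system prompt and instruction. The sketch below shows one way to keep only the model's answer by splitting on the response marker from the prompt template. The helper name `extract_response` and the sample text are illustrative assumptions, not part of the original code.

```python
# Marker taken from the prompt template; everything after it is the model's answer.
RESPONSE_MARKER = "%%%%%%% Response:"


def extract_response(generated_text: str) -> str:
    """Return the text after the first response marker.

    If the marker is missing, the whole text is returned, stripped.
    (Hypothetical helper for illustration.)
    """
    _, found, tail = generated_text.partition(RESPONSE_MARKER)
    return tail.strip() if found else generated_text.strip()


# Made-up sample in the shape of prompt + completion.
full_text = (
    "You are a helpful assistant.\n"
    "####### Instruction:\n"
    "Xin chào\n"
    "%%%%%%% Response:\n"
    "Xin chào! Tôi có thể giúp gì cho bạn?"
)
print(extract_response(full_text))
```

With the pipeline call from the getting-started code, this would typically be applied as `extract_response(output_default[0]["generated_text"])`.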
## Training Details
**Hardware Type:**
* GPU: NVIDIA Tesla P100 16GB
* SYSTEM RAM: 29GB

**Hours used:** approximately 1,140 (~47.5 days)
### Training Data
* BactrianX
* OpenOrca_translated
* WizardLM_70k_translated
* TigerLabMathInstruct_translated_vi
* GradeSchoolMathInstruct_translated
* vilm_lima-vi
* MTEngVietnamese
* databricks_dolly15k_translated
* AlpacaCleaned_translated
* databricks_dolly15k
* OpenOrca
* GradeSchoolMathInstruct
* AlpacaCleaned
* WebglmQA
### Training Procedure
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
* Learning rate: 2e-5 with a cosine schedule
* Optimizer: PagedLion8bit
* QLoRA: rank 64, 4-bit quantization
- 250k examples (70% Vietnamese, 30% English) for 3.37 epochs
- 350k examples (60% Vietnamese, 40% English) for 1.4 epochs
### Training loss
![image/png](https://cdn-uploads.huggingface.co/production/uploads/63905e87df447b438817b2cd/rV8Go_YFZv7QcR_FhFxp-.png)
Each line in the plot marks 12 hours of training.
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
![image/png](https://cdn-uploads.huggingface.co/production/uploads/63905e87df447b438817b2cd/z1ZTm7Tab4tQbVPgQW1hU.png)
Our model currently ranks in the top 5 on the VMLU benchmark.
## Citation
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
```bibtex
@online{t-llama,
author = {Pham Minh Tuan},
title = {T-Llama: A New Language Model for Vietnamese},
year = 2024,
url = {https://github.com/vTuanpham/Vietnamese_QA_System}
}
```