---
license: apache-2.0
language:
- vi
- en
---

<p align="center">
  <img src="https://cdn-uploads.huggingface.co/production/uploads/63905e87df447b438817b2cd/QFhLKQlWeyO9XumtyghVo.jpeg" alt="Image" style="width: 400px; height: auto; border-radius: 10px;" />
</p>

## Model Details

- **Developed by:** Tuan Pham (FPTU HCM Student)
- **Model type:** Llama2-7B decoder-only
- **Fine-tuned from models:**
  * meta-llama/Llama-2-7b
  * bkai-foundation-models/vietnamese-llama2-7b-120GB
  * yeen214/llama2_7b_merge_orcafamily
- **Bilingual support:** English and Vietnamese

### Model Description

<!-- Provide a longer summary of what this model is. -->

This model is proof that a single person can fine-tune their own model and reach state-of-the-art performance.

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:**
  * Training: https://github.com/vTuanpham/Vietnamese_QA_System
  * Data: https://github.com/vTuanpham/Large_dataset_translator
- **Paper:** ...
- **Demo:** ...

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

### Prompt template

```
[SYSTEM_PROMPT]

####### Instruction:
[INPUT]

%%%%%%% Response:
[RESPONSE]
```
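
A small helper along these lines can be used to assemble prompts in this format before passing them to the model; the system prompt text below is only a placeholder, not the exact system prompt used during training:

```python
# Minimal sketch of a prompt builder for the template above.
# SYSTEM_PROMPT is a placeholder; substitute the system prompt you actually use.
SYSTEM_PROMPT = "You are a helpful bilingual (Vietnamese/English) assistant."

def build_prompt(instruction: str, response: str = "") -> str:
    # At inference time the response slot is left empty and the model completes it.
    return (
        f"{SYSTEM_PROMPT}\n\n"
        f"####### Instruction:\n{instruction}\n\n"
        f"%%%%%%% Response:\n{response}"
    )
```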
## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from torch.cuda.amp import autocast
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer, pipeline

model_name = "1TuanPham/T-Llama"
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             torch_dtype=torch.bfloat16,
                                             use_cache=True,
                                             device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, streamer=streamer)

with autocast():
    output_default = pipe("Phạm Nhật Vượng là ", pad_token_id=tokenizer.eos_token_id, max_new_tokens=128)
```
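
For instruction-style queries, wrapping the input in the prompt template from the Uses section should generally give better results, for example via the `build_prompt` helper sketched earlier (the sampling settings below are illustrative, not the author's):

```python
prompt = build_prompt("Việt Nam có bao nhiêu tỉnh thành?")
with autocast():
    output = pipe(prompt, pad_token_id=tokenizer.eos_token_id, max_new_tokens=256)
```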
## Training Details

**Hardware Type:**
* GPU: NVIDIA Tesla P100 16GB
* System RAM: 29GB

**Hours used:** ~47.5 hours (approx.)

### Training Data

* BactrianX
* OpenOrca_translated
* WizardLM_70k_translated
* TigerLabMathInstruct_translated_vi
* GradeSchoolMathInstruct_translated
* vilm_lima-vi
* MTEngVietnamese
* databricks_dolly15k_translated
* AlpacaCleaned_translated
* databricks_dolly15k
* OpenOrca
* GradeSchoolMathInstruct
* AlpacaCleaned
* WebglmQA

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

* Learning rate: 2e-5 with a cosine schedule
* Optimizer: PagedLion8bit
* QLoRA: rank 64, 4-bit quantization

Data mixture and epochs (a configuration sketch follows this list):

- 250k examples, 70% Vietnamese / 30% English, for 3.37 epochs
- 350k examples, 60% Vietnamese / 40% English, for 1.4 epochs
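
A minimal sketch of this setup with `transformers`, `peft`, and `bitsandbytes` might look like the following; the base checkpoint, `lora_alpha`, and dropout are assumptions for illustration, not the exact values used for this model:

```python
import torch
from transformers import (AutoModelForCausalLM, BitsAndBytesConfig,
                          TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_id = "bkai-foundation-models/vietnamese-llama2-7b-120GB"  # one of the source checkpoints listed above

# 4-bit quantized base weights (the "4-bit" part of the QLoRA setting)
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_quant_type="nf4",
                                bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(base_id,
                                             quantization_config=bnb_config,
                                             device_map="auto")
model = prepare_model_for_kbit_training(model)

# Rank-64 LoRA adapters; alpha/dropout here are illustrative defaults
lora_config = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# 2e-5 cosine schedule with the paged 8-bit Lion optimizer
# (requires a transformers/bitsandbytes version that exposes "paged_lion_8bit")
training_args = TrainingArguments(output_dir="t-llama-qlora",
                                  learning_rate=2e-5,
                                  lr_scheduler_type="cosine",
                                  optim="paged_lion_8bit",
                                  bf16=True)
```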
### Training loss

![image/png](https://cdn-uploads.huggingface.co/production/uploads/63905e87df447b438817b2cd/rV8Go_YFZv7QcR_FhFxp-.png)

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Results

[More Information Needed]

## Technical Specifications

### Model Architecture and Objective

[More Information Needed]

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

## Model Card Authors

## Model Card Contact

[More Information Needed]