URA-LLaMa 7B
Model Details
Model Description
URA-LLaMa is a collaborative effort by Vietnamese researchers from Ho Chi Minh University of Technology (HCMUT) - Vietnam National University HCMC and Stanford University to improve the quality of large language models for the Vietnamese language. We fine-tuned Meta LLaMa-2 models on Vietnamese articles sourced from Wikipedia and online news websites. To support community progress, we offer our models free of charge for research purposes. Further details about the research are provided below.
- Developed by:
- Duc Q. Nguyen
- Sang T. Truong
- Toan D. V. Nguyen
- Dong D. Le
- Nhi N. Truong
- Tho Quan
- Sanmi Koyejo
- Funded by:
- Microsoft Accelerating Foundation Models Research program
- Stanford University
- Ho Chi Minh University of Technology (HCMUT) - VNU-HCM
- DSciLab (Faculty of Computer Science & Engineering, HCMUT - VNU-HCM)
- Model type: Text generation
- Languages: Vietnamese, English
- License:
- Custom license available at LICENSE
- Finetuned from model: Meta LLaMa-2 7B
Model Sources
We publicly provide starter source code and access to a playground for URA-LLaMa 7B.
- Repository: URA-LLaMa Github
- Framework: ViLLM
- Paper: Our paper was accepted at NAACL 2024. Link
Uses
This model is primarily designed for text generation. However, like other language models, it can also serve as an encoder for various downstream tasks. Its use cases are described below.
Direct Use
You can use our models to perform various tasks, including:
- Question answering (with context)
- Summarization
- Language modelling
- Text classification
- Translation
Downstream Use
This model can serve as an encoder for a wide range of downstream tasks, spanning from pure natural language processing to combinations of natural language processing with computer vision or speech processing.
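One common way to use a decoder-only model as an encoder is to average the final-layer hidden states over the non-padding tokens, which gives one fixed-size vector per input. The sketch below shows only that pooling step, on plain Python lists so that it runs without a GPU. In practice the hidden states would come from the model itself (for example by passing `output_hidden_states=True` in transformers); the function name `mean_pool` and the toy values are hypothetical.

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors where attention_mask is 1.

    hidden_states: list of token vectors (each a list of floats).
    attention_mask: list of 0/1 flags, one per token (0 marks padding).
    """
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: three tokens with 2-dimensional states; the last token is padding.
states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
embedding = mean_pool(states, mask)
print(embedding)
```

The pooled vector can then be fed to a small task-specific head, such as a classifier.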
Out-of-Scope Use
Although our models were fine-tuned on extensive Vietnamese datasets, they may not perform well in specialized domains that require deep domain expertise, such as medicine, politics, and chemistry. We kindly ask that you refrain from using our models for political purposes or for any activity that may harm individuals or compromise the sovereignty and territorial integrity of Vietnam.
Bias, Risks, and Limitations
Unless required by applicable law, the URA-LLaMa materials and any output and results therefrom are provided on an "as is" basis, without warranties of any kind, either express or implied, including, without limitation, any warranties of title, non-infringement, merchantability, or fitness for a particular purpose. You are solely responsible for determining the appropriateness of using or redistributing the URA-LLaMa materials and assume any risks associated with your use of the URA-LLaMa materials and any output and results.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. For the model to work well, you may need to perform prompt engineering to create appropriate prompts before inference.
How to Get Started with the Model
Use the code below to get started with the model.
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

pipeline_kwargs = {
    "temperature": 1.0,
    "max_new_tokens": 250,
    "top_k": 1,
    "repetition_penalty": 1.1
}

if __name__ == "__main__":
    # Load model
    model = AutoModelForCausalLM.from_pretrained(
        "ura-hcmut/ura-llama-7b",
        device_map="auto"
    )
    model.config.pretraining_tp = 1
    model.eval()

    # Load tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        "ura-hcmut/ura-llama-7b",
        trust_remote_code=True
    )
    tokenizer.pad_token = tokenizer.eos_token

    pipeline = transformers.pipeline(
        model=model,
        tokenizer=tokenizer,
        return_full_text=False,
        task='text-generation',
        **pipeline_kwargs
    )

    query_template = "[INST] <<SYS>>\nBạn là một trợ lý thông minh.\n<</SYS>>\n\nHãy trả lời câu hỏi sau.\nCâu hỏi: {query}\nTrả lời: [/INST]"

    while True:
        query = input("Query: ")
        if query == "exit":
            break

        query = query_template.format(query=query)
        answer = pipeline(query)[0]["generated_text"]
        print(answer)
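The query template in the snippet above follows the LLaMa-2 chat layout: a system message wrapped in `<<SYS>>` tags inside an `[INST] ... [/INST]` block. As a minimal sketch, the helper below builds the same prompt and optionally adds a context passage for question answering with context. The function name `build_prompt`, the `context` parameter, and the "Ngữ cảnh" label are illustrative assumptions, not part of the released code.

```python
# Hypothetical helper; mirrors the [INST] / <<SYS>> layout of the query template.
DEFAULT_SYSTEM = "Bạn là một trợ lý thông minh."  # "You are an intelligent assistant."


def build_prompt(query, context=None, system=DEFAULT_SYSTEM):
    """Return a single-turn prompt in the LLaMa-2 chat layout."""
    parts = ["Hãy trả lời câu hỏi sau."]  # "Please answer the following question."
    if context:
        # Assumed label for the optional context passage ("Ngữ cảnh" = "Context").
        parts.append(f"Ngữ cảnh: {context}")
    parts.append(f"Câu hỏi: {query}")  # "Question: ..."
    parts.append("Trả lời:")  # "Answer:"
    user_block = "\n".join(parts)
    return f"[INST] <<SYS>>\n{system}\n<</SYS>>\n\n{user_block} [/INST]"


if __name__ == "__main__":
    print(build_prompt("Thủ đô của Việt Nam là gì?"))
```

Without a context, `build_prompt(query)` produces the same string as `query_template.format(query=query)`, so it can be passed to the pipeline unchanged.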
Finetuning Details
Finetuning Data
List of datasets used for finetuning:
Finetuning Procedure
We finetune our models with the causal language modelling (next-token prediction) objective. A tutorial is available at https://huggingface.co/docs/transformers/tasks/language_modeling.
Finetuning Hyperparameters
- Training regime: BFloat16 mixed precision
- Quantization: 4-bit NormalFloat (NF4)
- LoRA rank: 128
- Batch size: 120
- Optimizer: Paged AdamW 32bit
- Learning rate: 1e-5
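A batch size of 120 on a single GPU is often reached through gradient accumulation rather than one large forward pass. The sketch below shows that arithmetic alongside the listed hyperparameters. The micro-batch size of 4 is an assumed value for illustration, since the card does not state how the batch was split, and the dictionary keys are illustrative names rather than the authors' configuration.

```python
def accumulation_steps(effective_batch, per_device_batch, num_devices=1):
    """Number of micro-batches to accumulate before each optimizer step."""
    per_step = per_device_batch * num_devices
    if effective_batch % per_step != 0:
        raise ValueError(
            "effective batch must be a multiple of per_device_batch * num_devices"
        )
    return effective_batch // per_step


# Values from the list above; key names are illustrative, not tied to a library.
hyperparameters = {
    "precision": "bf16",
    "quantization": "nf4",
    "lora_rank": 128,
    "batch_size": 120,
    "optimizer": "paged_adamw_32bit",
    "learning_rate": 1e-5,
}

# Assumption: a micro-batch of 4 sequences on one GPU.
steps = accumulation_steps(hyperparameters["batch_size"], per_device_batch=4)
print(steps)
```

With the Hugging Face `TrainingArguments` class, the two numbers would map to `per_device_train_batch_size` and `gradient_accumulation_steps`.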
Evaluation
Our models are evaluated on a variety of tasks. Details of the evaluation process can be found on our Leaderboard.
Environmental Impact
Carbon emissions are estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: 1 x A100 80GB
- Hours used: ~520h
- Carbon Emitted: ~90 kg CO2 eq.
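Estimates of this kind multiply the energy consumed by the carbon intensity of the electricity used. As a rough sketch, the figures above are consistent with the arithmetic below. The 400 W power draw and the 0.432 kg CO2 eq. per kWh carbon intensity are assumed values chosen for illustration, not numbers reported by the authors.

```python
def estimate_emissions_kg(hours, power_kw, intensity_kg_per_kwh):
    """Emissions in kg CO2 eq. = hours * power (kW) * carbon intensity (kg/kWh)."""
    energy_kwh = hours * power_kw
    return energy_kwh * intensity_kg_per_kwh


# 520 h is taken from the list above; power and intensity are assumptions.
estimate = estimate_emissions_kg(hours=520, power_kw=0.4, intensity_kg_per_kwh=0.432)
print(f"{estimate:.1f} kg CO2 eq.")
```

Under these assumptions the estimate comes to roughly 90 kg CO2 eq., in line with the reported value; a different hardware power draw or grid mix would change the result proportionally.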
Citation
If you use URA-LLaMa materials in your research, please cite our model(s) as below.
BibTeX:
@inproceedings{crossing2024,
title = "Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models",
author = "Truong, Sang T. and Nguyen, Duc Q. and Nguyen, Toan D. V. and Le, Dong D. and Truong, Nhi N. and Quan, Tho and Koyejo, Sanmi",
booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
month = jun,
year = "2024",
address = "Seattle, Washington",
publisher = "Association for Computational Linguistics",
url = "",
pages = "",
}
Model Card Authors
Contact
- Mr. Duc Q. Nguyen: nqduc@hcmut.edu.vn
- Mr. Sang T. Truong: sttruong@cs.stanford.edu
- Assoc. Prof. Tho Quan: qttho@hcmut.edu.vn