Model Card: vi-gemma-2b-RAG
Usage with a Vietnamese prompt:
The snippet below loads the model and answers a Vietnamese question from a Vietnamese context passage. First make sure to pip install -U transformers.
We recommend torch.bfloat16 as the default dtype.
# pip install transformers torch accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Initialize the tokenizer and model from the saved checkpoint.
# device_map="auto" (requires accelerate) places the model on the GPU when one is available,
# so no separate model.to("cuda") call is needed.
tokenizer = AutoTokenizer.from_pretrained("himmeow/vi-gemma-2b-RAG")
model = AutoModelForCausalLM.from_pretrained(
    "himmeow/vi-gemma-2b-RAG",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Prompt format for the model: context, question, and an empty response slot
prompt = """
### Instruction and Input:
Dựa vào ngữ cảnh/tài liệu sau:
{}
Hãy trả lời câu hỏi: {}
### Response:
{}
"""

# Prepare the input data (Vietnamese context passage)
input_data = """
Short Tandem Repeats (STRs) là các trình tự DNA lặp lại ngắn (2- 6 nucleotides) xuất hiện phổ biến trong hệ gen của con người. Các trình tự này có tính đa hình rất cao trong tự nhiên, điều này khiến các STRs trở thành những markers di truyền rất quan trọng trong nghiên cứu bản đồ gen người và chuẩn đoán bệnh lý di truyền cũng như xác định danh tính trong lĩnh vực pháp y.
Các STRs trở nên phổ biến tại các phòng xét nghiệm pháp y bởi vì việc nhân bản và phân tích STRs chỉ cần lượng DNA rất thấp ngay cả khi ở dạng bị phân hủy việc đinh danh vẫn có thể được thực hiện thành công. Hơn nữa việc phát hiện và đánh giá sự nhiễm DNA mẫu trong các mẫu vật có thể được giải quyết nhanh với kết quả phân tích STRs. Ở Hoa Kỳ hiện nay, từ bộ 13 markers nay đã tăng lên 20 markers chính đang được sử dụng để tạo ra một cơ sở dữ liệu DNA trên toàn đất nước được gọi là The FBI Combined DNA Index System (Expaned CODIS).
CODIS và các cơ sử dữ liệu DNA tương tự đang được sử dụng thực sự thành công trong việc liên kết các hồ sơ DNA từ các tội phạm và các bằng chứng hiện trường vụ án. Kết quả định danh STRs cũng được sử dụng để hỗ trợ hàng trăm nghìn trường hợp xét nghiệm huyết thống cha con mỗi năm'
"""
query = "Hãy cho tôi biết một số tính chất của STRs được dùng để làm gì?"

# Fill the prompt template; the response slot is left blank for the model to complete
input_text = prompt.format(input_data, query, " ")

# Encode the input text and move the tensors to the model's device
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Generate text with the model
outputs = model.generate(
    **inputs,
    max_new_tokens=500,  # Limit the number of generated tokens
    no_repeat_ngram_size=5,  # Prevent any 5-gram from being repeated
    # do_sample=True,  # Sample the next token from the predicted probability distribution instead of always taking the most likely one
    # temperature=0.7,  # Lower values reduce randomness (only applies when do_sample=True)
    # early_stopping=True,  # Stopping criterion for beam search (only applies when num_beams > 1)
)

# Decode and print the result (the decoded text includes the prompt and special tokens)
print(tokenizer.decode(outputs[0]))
'''
<bos>
### Instruction and Input:
Dựa vào ngữ cảnh/tài liệu sau:
Short Tandem Repeats (STRs) là các trình tự DNA lặp lại ngắn (2- 6 nucleotides) xuất hiện phổ biến trong hệ gen của con người. Các trình tự này có tính đa hình rất cao trong tự nhiên, điều này khiến các STRs trở thành những markers di truyền rất quan trọng trong nghiên cứu bản đồ gen người và chuẩn đoán bệnh lý di truyền cũng như xác định danh tính trong lĩnh vực pháp y.
Các STRs trở nên phổ biến tại các phòng xét nghiệm pháp y bởi vì việc nhân bản và phân tích STRs chỉ cần lượng DNA rất thấp ngay cả khi ở dạng bị phân hủy việc đinh danh vẫn có thể được thực hiện thành công. Hơn nữa việc phát hiện và đánh giá sự nhiễm DNA mẫu trong các mẫu vật có thể được giải quyết nhanh với kết quả phân tích STRs. Ở Hoa Kỳ hiện nay, từ bộ 13 markers nay đã tăng lên 20 markers chính đang được sử dụng để tạo ra một cơ sở dữ liệu DNA trên toàn đất nước được gọi là The FBI Combined DNA Index System (Expaned CODIS).
CODIS và các cơ sử dữ liệu DNA tương tự đang được sử dụng thực sự thành công trong việc liên kết các hồ sơ DNA từ các tội phạm và các bằng chứng hiện trường vụ án. Kết quả định danh STRs cũng được sử dụng để hỗ trợ hàng trăm nghìn trường hợp xét nghiệm huyết thống cha con mỗi năm'
Hãy trả lời câu hỏi: Hãy cho tôi biết một số tính chất của STRs được dùng để làm gì?
### Response:
STRs được sử dụng để xác định danh tính, chuẩn đoán bệnh lý và xác định bệnh lý di truyền.
<eos>
'''
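The decoded sample above contains the full prompt and the special tokens as well as the answer. If only the answer is needed, it can be cut out after decoding. The following is a minimal sketch under that assumption; `extract_answer` is a hypothetical helper, not part of the model's repository, and the sample string is a shortened stand-in for a real decoded output.

```python
def extract_answer(decoded: str, marker: str = "### Response:") -> str:
    """Return the text after the last response marker, without special tokens."""
    # Keep only what follows the last occurrence of the marker.
    # If the marker is missing, rpartition leaves the whole string in `answer`.
    _, _, answer = decoded.rpartition(marker)
    # Drop the special tokens that show up in the decoded sample.
    for token in ("<bos>", "<eos>"):
        answer = answer.replace(token, "")
    return answer.strip()


# Shortened, hypothetical stand-in for a decoded model output.
sample = (
    "<bos>\n### Instruction and Input:\n(context and question)\n"
    "### Response:\nSTRs are used for identification.\n<eos>"
)
print(extract_answer(sample))  # prints: STRs are used for identification.
```

Passing `skip_special_tokens=True` to `tokenizer.decode` removes the special tokens at decode time, in which case the helper only has to split on the marker.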
Model Description:
vi-gemma-2b-RAG is a large language model fine-tuned from the base model google/gemma-1.1-2b-it using LoRA. The model is trained on a Vietnamese dataset to improve its Vietnamese language processing capabilities and enhance its performance for Retrieval Augmented Generation (RAG) tasks.
Intended Use:
The vi-gemma-2b-RAG model is suitable for tasks such as:
- Answering questions from a supplied Vietnamese context.
- Vietnamese text summarization.
- Vietnamese machine translation.
- Other Vietnamese text generation tasks.
Limitations:
While fine-tuned for Vietnamese, vi-gemma-2b-RAG may still have some limitations:
- It may generate incorrect or misleading information.
- It may exhibit biases or inappropriate opinions.
- Its performance may be affected by the quality of the input data.
How to Use:
The snippets below show how to get started with the model. First install the dependencies with pip install -U transformers torch accelerate (accelerate is required for device_map="auto"), then copy the snippet that matches your use case.
We recommend torch.bfloat16 as the default dtype.
# pip install transformers torch accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Initialize the tokenizer and model from the saved checkpoint.
# device_map="auto" (requires accelerate) places the model on the GPU when one is available,
# so no separate model.to("cuda") call is needed.
tokenizer = AutoTokenizer.from_pretrained("himmeow/vi-gemma-2b-RAG")
model = AutoModelForCausalLM.from_pretrained(
    "himmeow/vi-gemma-2b-RAG",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Define the prompt format for the model: context, question, and an empty response slot
prompt = """
### Instruction and Input:
Based on the following context/document:
{}
Please answer the question: {}
### Response:
{}
"""

# Prepare the input data
input_data = """
Short Tandem Repeats (STRs) are short (2-6 nucleotides) repeating DNA sequences that are widespread in the human genome. These sequences are highly polymorphic in nature, which makes STRs very important genetic markers in human gene mapping and diagnosis of hereditary diseases as well as identification in the field of forensics.
STRs have become popular in forensic laboratories because amplifying and analyzing them requires only very small amounts of DNA; even when the DNA is degraded, identification can still be performed successfully. Furthermore, the detection and assessment of sample DNA contamination in specimens can be quickly resolved with STR analysis results. In the United States today, the set of 13 markers has been increased to 20 main markers, which are used to create a nationwide DNA database called The FBI Combined DNA Index System (Expanded CODIS).
CODIS and similar DNA databases are being used very successfully in linking DNA records from criminals and crime scene evidence. STR identification results are also used to support hundreds of thousands of paternity test cases each year.
"""
query = "What are STRs used for?"

# Fill the prompt template; the response slot is left blank for the model to complete
input_text = prompt.format(input_data, query, " ")

# Encode the input text and move the tensors to the model's device
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# Generate text using the model
outputs = model.generate(
    **inputs,
    max_new_tokens=500,  # Limit the number of tokens generated
    no_repeat_ngram_size=5,  # Prevent any 5-gram from being repeated
    # do_sample=True,  # Sample the next token instead of always taking the most likely one
    # temperature=0.7,  # Lower values reduce randomness (only applies when do_sample=True)
    # early_stopping=True,  # Stopping criterion for beam search (only applies when num_beams > 1)
)

# Decode and print the results (the decoded text includes the prompt and special tokens)
print(tokenizer.decode(outputs[0]))
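In a RAG pipeline the context usually comes from several retrieved passages rather than one hand-written string. The sketch below shows one way to merge them into the prompt format used above. `build_prompt`, the blank-line separator, and the `max_chars` budget are illustrative assumptions, not something this card specifies.

```python
from typing import Sequence

# Same template as in the snippet above.
PROMPT_TEMPLATE = """
### Instruction and Input:
Based on the following context/document:
{}
Please answer the question: {}
### Response:
{}
"""


def build_prompt(passages: Sequence[str], question: str, max_chars: int = 4000) -> str:
    """Join retrieved passages into one context block and fill the template."""
    kept = []
    used = 0
    for passage in passages:
        text = passage.strip()
        if not text:
            continue  # skip empty passages
        # Stop once adding another passage would exceed the character budget
        # (hypothetical limit); the first non-empty passage is always kept.
        if kept and used + len(text) > max_chars:
            break
        kept.append(text)
        used += len(text)
    context = "\n\n".join(kept)
    # The response slot is left blank for the model to complete.
    return PROMPT_TEMPLATE.format(context, question, " ")


example = build_prompt(
    ["STRs are short repeating DNA sequences.", "CODIS is a national DNA database."],
    "What are STRs used for?",
)
print(example)
```

The returned string can be passed to the tokenizer in place of `input_text`. A character budget is only a rough proxy; counting tokens with the tokenizer would be more precise.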
Training:
- Base Model: google/gemma-1.1-2b-it
- Dataset: lamhieu/mabrycodes_dialogue_vi
- Fine-tuning Method: LoRA (PEFT) with Unsloth
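For readers unfamiliar with the method named above: LoRA keeps each adapted pretrained weight matrix frozen and learns a low-rank correction to it. The standard formulation is sketched below. The rank $r$, the scaling factor $\alpha$, and the set of adapted layers are not stated in this card, so the symbols are generic and do not describe this model's actual settings.

```latex
W = W_0 + \frac{\alpha}{r}\, B A, \qquad
W_0 \in \mathbb{R}^{d \times k},\quad
B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```

Only $A$ and $B$ are trained, which adds $r(d + k)$ parameters per adapted matrix instead of $dk$.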
Example repository: https://github.com/Martincrux/Vietnamese-RAG-system-building-with-vi-gemma-2b-RAG-and-halong_embedding
Uploaded model
- Developed by: hiieu, himmeow the coder, cuctrinh
- License: apache-2.0
- Finetuned from model: unsloth/gemma-1.1-2b-it-bnb-4bit
This Gemma model was trained 2x faster with Unsloth and Hugging Face's TRL library.