
parseny/TinyLlama1.1B-Nvidia-QA

This repository contains the parseny/TinyLlama1.1B-Nvidia-QA model, a fine-tuned version of the TinyLlama language model designed to answer questions about NVIDIA documentation. The model was fine-tuned on a dataset of question-answer pairs and evaluated with ROUGE and METEOR metrics.

Model Details

  • Model ID: parseny/TinyLlama1.1B-Nvidia-QA
  • Model Type: Causal Language Model
  • Base Model: TinyLlama-1.1B
  • Quantization: 4-bit quantization using BitsAndBytes (see the loading sketch after this list)
  • Fine-Tuning Framework: Hugging Face Transformers and PEFT
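
The exact quantization settings are not recorded in this card; the following is a minimal sketch of loading the base model in 4-bit with BitsAndBytes, assuming common NF4 defaults and TinyLlama/TinyLlama-1.1B-Chat-v1.0 as the base checkpoint.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Assumed 4-bit settings (NF4 with bfloat16 compute); the actual values used
# for this model are not documented here.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed base checkpoint
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)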

Training Configuration

The model was fine-tuned with the following training arguments:

from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir="./logs",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,   # effective batch size of 64 per device
    optim="paged_adamw_32bit",       # paged AdamW, suited to quantized fine-tuning
    fp16=True,                       # mixed-precision training
    evaluation_strategy="epoch",     # evaluate at the end of every epoch
    save_strategy="epoch",
    num_train_epochs=5,
    load_best_model_at_end=True,     # keep the best checkpoint by eval loss
    learning_rate=5e-4
)
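
PEFT is listed as the fine-tuning framework, but the adapter configuration is not given in this card. A minimal LoRA sketch with hypothetical hyperparameters:

from peft import LoraConfig, get_peft_model

# Hypothetical LoRA hyperparameters; the actual adapter settings used for
# this model are not documented in the card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()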

Evaluation Metrics

The performance of the fine-tuned model was evaluated using the following metrics:

  • ROUGE Scores:

    • ROUGE-1: 0.3122
    • ROUGE-2: 0.1228
    • ROUGE-L: 0.2599
    • ROUGE-Lsum: 0.2600
  • METEOR Score: 0.27

These scores indicate that the model performs reasonably well in generating responses that are lexically and semantically similar to the reference answers.
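
The evaluation script itself is not included here; below is a minimal sketch of computing the same metrics with the Hugging Face evaluate library (the predictions and references are placeholders, not actual model outputs):

import evaluate

# Placeholder predictions and references; the held-out evaluation set used
# for this model is not published in the card.
predictions = ["The RAID array caches training data locally on the DGX node."]
references = ["The DGX RAID memory was set up to cache training data locally."]

rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

print(rouge.compute(predictions=predictions, references=references))   # rouge1/rouge2/rougeL/rougeLsum
print(meteor.compute(predictions=predictions, references=references))  # meteor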

Model Usage

You can use this model to generate responses for chat-based applications. Below is an example of how to load and use the model for generating responses:

from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch

# Load the model and tokenizer
model_id = "parseny/TinyLlama1.1B-Nvidia-QA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model.to('cuda')

# Decoding settings for response generation
generation_config = GenerationConfig(
    penalty_alpha=0.6, do_sample=True,
    top_k=5, temperature=0.5, repetition_penalty=1.2,
    max_new_tokens=47, pad_token_id=tokenizer.eos_token_id
)

def generate_response(prompt):
    try:
        inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
        outputs = model.generate(**inputs, generation_config=generation_config)
        generated_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        # Keep only the assistant turn from the ChatML-style output
        start_marker = '<|im_start|>assistant\n'
        start_idx = generated_response.find(start_marker)
        if start_idx != -1:
            generated_response = generated_response[start_idx + len(start_marker):]
        end_idx = generated_response.find('<|im_end|>')
        if end_idx != -1:
            generated_response = generated_response[:end_idx]
        return generated_response
    except Exception:
        return ""

# Example usage
prompt = "What was the purpose of setting up the DGX RAID memory in version 2 of the pipeline?"
response = generate_response(prompt)
print(response)
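
Note that generate_response trims the output using ChatML-style <|im_start|>/<|im_end|> markers, so the input is expected to follow the same chat format. If the tokenizer ships a chat template, formatting the prompt with it is one option; a sketch, assuming such a template is defined for this checkpoint:

# Assumes the tokenizer defines a chat template; if not, add the ChatML
# markers to the prompt manually.
messages = [{"role": "user", "content": prompt}]
chat_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(generate_response(chat_prompt))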

Training Procedure

The model was fine-tuned using a dataset of question-answer pairs. The fine-tuning process involved:

  1. Loading the pre-trained TinyLlama-1.1B model.
  2. Quantizing the base model to 4-bit precision with BitsAndBytes to reduce memory usage during fine-tuning.
  3. Fine-tuning the model using the SFTTrainer with the specified training arguments (see the sketch after this list).
  4. Evaluating the model at the end of each epoch and saving the best-performing model.
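
The trainer wiring is not reproduced in this card; below is a minimal sketch using trl's SFTTrainer with the training arguments above. The dataset variables and LoRA config are assumptions, and exact argument names vary slightly across trl versions.

from trl import SFTTrainer

# Hypothetical setup: train_dataset and eval_dataset are assumed to be
# question-answer pairs already formatted for chat fine-tuning, and
# lora_config refers to the sketch earlier in this card.
trainer = SFTTrainer(
    model=base_model,
    args=training_arguments,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=lora_config,
)
trainer.train()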

How to Cite

If you use this model in your research or applications, please cite it as follows:

@misc{parseny-tinyllama-nvidia-qa,
  author = {Your Name},
  title = {TinyLlama1.1B-Nvidia-QA: NVIDIA documentation helper},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/parseny/TinyLlama1.1B-Nvidia-QA},
}

Contact

For any questions or issues, please open an issue on the Hugging Face model repository.
