
parseny/TinyLlama1.1B-Nvidia-QA

This repository contains the parseny/TinyLlama1.1B-Nvidia-QA model, a fine-tuned version of the TinyLlama language model designed to answer questions about NVIDIA documentation. The model was fine-tuned on a dataset of question-answer pairs and evaluated with ROUGE and METEOR metrics.

Model Details

  • Model ID: parseny/TinyLlama1.1B-Nvidia-QA
  • Model Type: Causal Language Model
  • Base Model: TinyLlama-1.1B
  • Quantization: 4-bit quantization using BitsAndBytes (see the loading sketch after this list)
  • Fine-Tuning Framework: Hugging Face Transformers and PEFT
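
The exact quantization settings are not recorded in this card; the following is a minimal sketch of loading the base model in 4-bit with BitsAndBytes, assuming common NF4 defaults and TinyLlama/TinyLlama-1.1B-Chat-v1.0 as the base checkpoint.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Assumed 4-bit settings (NF4 with bfloat16 compute); the actual values used
# for this model are not documented here.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed base checkpoint
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base_model_id)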

Training Configuration

The model was fine-tuned with the following training arguments:

from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir="./logs",
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,   # effective batch size of 64 per device
    optim="paged_adamw_32bit",       # paged AdamW, suited to quantized fine-tuning
    fp16=True,                       # mixed-precision training
    evaluation_strategy="epoch",     # evaluate at the end of every epoch
    save_strategy="epoch",
    num_train_epochs=5,
    load_best_model_at_end=True,     # keep the best checkpoint by eval loss
    learning_rate=5e-4
)
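
PEFT is listed as the fine-tuning framework, but the adapter configuration is not given in this card. A minimal LoRA sketch with hypothetical hyperparameters:

from peft import LoraConfig, get_peft_model

# Hypothetical LoRA hyperparameters; the actual adapter settings used for
# this model are not documented in the card.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()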

Evaluation Metrics

The performance of the fine-tuned model was evaluated using the following metrics:

  • ROUGE Scores:

    • ROUGE-1: 0.3122
    • ROUGE-2: 0.1228
    • ROUGE-L: 0.2599
    • ROUGE-Lsum: 0.2600
  • METEOR Score: 0.27

These scores indicate that the model performs reasonably well in generating responses that are lexically and semantically similar to the reference answers.
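
The evaluation script itself is not included here; below is a minimal sketch of computing the same metrics with the Hugging Face evaluate library (the predictions and references are placeholders, not actual model outputs):

import evaluate

# Placeholder predictions and references; the held-out evaluation set used
# for this model is not published in the card.
predictions = ["The RAID array caches training data locally on the DGX node."]
references = ["The DGX RAID memory was set up to cache training data locally."]

rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

print(rouge.compute(predictions=predictions, references=references))   # rouge1/rouge2/rougeL/rougeLsum
print(meteor.compute(predictions=predictions, references=references))  # meteor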

Model Usage

You can use this model to generate responses for chat-based applications. Below is an example of how to load and use the model for generating responses:

from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch

# Load the model and tokenizer
model_id = "parseny/TinyLlama1.1B-Nvidia-QA"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model.to('cuda')

# Decoding settings for response generation
generation_config = GenerationConfig(
    penalty_alpha=0.6, do_sample=True,
    top_k=5, temperature=0.5, repetition_penalty=1.2,
    max_new_tokens=47, pad_token_id=tokenizer.eos_token_id
)

def generate_response(prompt):
    try:
        inputs = tokenizer(prompt, return_tensors="pt").to('cuda')
        outputs = model.generate(**inputs, generation_config=generation_config)
        generated_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        # Keep only the assistant turn from the ChatML-style output
        start_marker = '<|im_start|>assistant\n'
        start_idx = generated_response.find(start_marker)
        if start_idx != -1:
            generated_response = generated_response[start_idx + len(start_marker):]
        end_idx = generated_response.find('<|im_end|>')
        if end_idx != -1:
            generated_response = generated_response[:end_idx]
        return generated_response
    except Exception:
        return ""

# Example usage
prompt = "What was the purpose of setting up the DGX RAID memory in version 2 of the pipeline?"
response = generate_response(prompt)
print(response)
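
Note that generate_response trims the output using ChatML-style <|im_start|>/<|im_end|> markers, so the input is expected to follow the same chat format. If the tokenizer ships a chat template, formatting the prompt with it is one option; a sketch, assuming such a template is defined for this checkpoint:

# Assumes the tokenizer defines a chat template; if not, add the ChatML
# markers to the prompt manually.
messages = [{"role": "user", "content": prompt}]
chat_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(generate_response(chat_prompt))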

Training Procedure

The model was fine-tuned using a dataset of question-answer pairs. The fine-tuning process involved:

  1. Loading the pre-trained TinyLlama-1.1B model.
  2. Quantizing the base model to 4-bit precision with BitsAndBytes to reduce memory usage during fine-tuning.
  3. Fine-tuning the model using the SFTTrainer with the specified training arguments (see the sketch after this list).
  4. Evaluating the model at the end of each epoch and saving the best-performing model.
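
The trainer wiring is not reproduced in this card; below is a minimal sketch using trl's SFTTrainer with the training arguments above. The dataset variables and LoRA config are assumptions, and exact argument names vary slightly across trl versions.

from trl import SFTTrainer

# Hypothetical setup: train_dataset and eval_dataset are assumed to be
# question-answer pairs already formatted for chat fine-tuning, and
# lora_config refers to the sketch earlier in this card.
trainer = SFTTrainer(
    model=base_model,
    args=training_arguments,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    peft_config=lora_config,
)
trainer.train()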

How to Cite

If you use this model in your research or applications, please cite it as follows:

@misc{parseny-tinyllama-nvidia-qa,
  author = {Your Name},
  title = {TinyLlama1.1B-Nvidia-QA: NVIDIA documentation helper},
  year = {2024},
  publisher = {Hugging Face},
  url = {https://huggingface.co/parseny/TinyLlama1.1B-Nvidia-QA},
}

Contact

For any questions or issues, please open an issue on the Hugging Face model repository.
