---
language: en
license: apache-2.0
base_model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
tags:
  - data-engineering
  - fine-tuned
  - qlora
  - llm
---

# TinyLlama Data Engineering Assistant

A TinyLlama-1.1B model fine-tuned on data engineering Q&A pairs using QLoRA. It is intended to answer questions about data engineering concepts more accurately than the base model on the topics it was trained on.

## Base model

TinyLlama/TinyLlama-1.1B-Chat-v1.0

## Training

  • Method: QLoRA (4-bit quantization + LoRA)
  • Dataset: 15 custom data engineering Q&A pairs
  • Epochs: 10
  • LoRA rank: 16
  • Hardware: NVIDIA T4 (Google Colab free tier)
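The setup above can be sketched with `peft` and `bitsandbytes`. Only the base model, the LoRA rank (16), and the 4-bit quantization come from this card; every other hyperparameter here (`lora_alpha`, target modules, dropout) is an illustrative assumption, not the actual training configuration:

```python
# Hypothetical QLoRA setup sketch. Only the base model, 4-bit quantization,
# and LoRA rank 16 are stated in this card; the rest are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # 4-bit quantization (the "Q" in QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                 # LoRA rank, as stated above
    lora_alpha=32,                        # assumed scaling factor
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the small adapter weights train
```

Freezing the 4-bit base weights and training only the low-rank adapters is what makes this feasible on a single T4.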

## Topics covered

ETL, data warehouses, data lakes, Apache Spark, dbt, Apache Airflow, DAGs, batch vs stream processing, data pipelines, partitioning, data lineage, medallion architecture, idempotency, BigQuery, dimensional modeling, RAG

## How to use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "cyb3rr31a/tinyllama-data-engineering"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Prompt in the "### Question:" / "### Answer:" format used during fine-tuning
prompt = "### Question:\nWhat is dbt?\n\n### Answer:\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
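Since the model expects the `### Question:` / `### Answer:` template shown above, a small helper keeps prompts consistently formatted (`format_prompt` is a hypothetical convenience function, not part of the released model code):

```python
def format_prompt(question: str) -> str:
    """Wrap a question in the instruction template the model was fine-tuned on.

    Hypothetical helper: the template string matches the usage example in this
    card, but the function itself is not part of the released code.
    """
    return f"### Question:\n{question}\n\n### Answer:\n"

prompt = format_prompt("What is dbt?")
print(prompt)
```

Deviating from this template (for example, using the base model's chat template instead) may noticeably degrade answer quality.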

## Limitations

This model was fine-tuned on a small dataset of only 15 examples for demonstration purposes. It performs best on the topics listed above; answers on other subjects inherit the base model's behavior and may be unreliable.