Instructions for using A-Kishore/llama-3.2-3b-text2sql with libraries and local apps.

Libraries

How to use A-Kishore/llama-3.2-3b-text2sql with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="A-Kishore/llama-3.2-3b-text2sql")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("A-Kishore/llama-3.2-3b-text2sql")
model = AutoModelForCausalLM.from_pretrained("A-Kishore/llama-3.2-3b-text2sql")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use A-Kishore/llama-3.2-3b-text2sql with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "A-Kishore/llama-3.2-3b-text2sql"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "A-Kishore/llama-3.2-3b-text2sql",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
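The `curl` calls above go through `/v1/chat/completions`, which wraps the message in the model's chat template. The inference examples under "How to Use" instead feed a raw prompt template to the tokenizer, so a plain-completion request may match the fine-tuning format more closely. A minimal standard-library sketch, assuming the vLLM server started above is listening on `localhost:8000` and exposes the OpenAI-compatible `/v1/completions` route; the helper names and the `"###"` stop sequence are hypothetical choices, not part of this card:

```python
import json
import urllib.request

# Prompt template copied from the "How to Use" examples in this card.
PROMPT_TEMPLATE = """###TASK
Generate the SQL query to answer the following question

### Database Schema
{sql_context}

### Question
{sql_prompt}

### SQL Query
"""


def build_completion_request(sql_context, sql_prompt,
                             model="A-Kishore/llama-3.2-3b-text2sql",
                             max_tokens=150):
    """Build an OpenAI-style /v1/completions payload using the raw prompt format."""
    return {
        "model": model,
        "prompt": PROMPT_TEMPLATE.format(sql_context=sql_context, sql_prompt=sql_prompt),
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding for repeatable SQL output
        "stop": ["###"],     # assumption: stop if the model starts a new prompt section
    }


def complete_sql(payload, base_url="http://localhost:8000"):
    """POST the payload to a running OpenAI-compatible server and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"].strip()


payload = build_completion_request(
    "CREATE TABLE Members (MemberID INT, Age INT, Gender VARCHAR(10), MembershipType VARCHAR(20));",
    "How many members are female?",
)
# print(complete_sql(payload))  # requires the vLLM server from the previous step
```

The same payload should also work against the SGLang server below by passing `base_url="http://localhost:30000"`, assuming its OpenAI-compatible `/v1/completions` route is enabled.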

Use Docker

docker model run hf.co/A-Kishore/llama-3.2-3b-text2sql

SGLang

How to use A-Kishore/llama-3.2-3b-text2sql with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "A-Kishore/llama-3.2-3b-text2sql" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "A-Kishore/llama-3.2-3b-text2sql",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "A-Kishore/llama-3.2-3b-text2sql" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "A-Kishore/llama-3.2-3b-text2sql",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Unsloth Studio

How to use A-Kishore/llama-3.2-3b-text2sql with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for A-Kishore/llama-3.2-3b-text2sql to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for A-Kishore/llama-3.2-3b-text2sql to start chatting

Using Hugging Face Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for A-Kishore/llama-3.2-3b-text2sql to start chatting

Load model with FastModel

pip install unsloth

from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name="A-Kishore/llama-3.2-3b-text2sql",
    max_seq_length=2048,
)

Docker Model Runner
How to use A-Kishore/llama-3.2-3b-text2sql with Docker Model Runner:
```
docker model run hf.co/A-Kishore/llama-3.2-3b-text2sql
```

🔮 Llama-3.2-3B-Instruct Text-to-SQL

A fine-tuned version of `unsloth/Llama-3.2-3B-Instruct-bnb-4bit` that generates SQL queries from a database schema and a natural-language question.

📋 Model Summary

| Attribute | Value |
|---|---|
| Base Model | `unsloth/Llama-3.2-3B-Instruct-bnb-4bit` |
| Task | Text-to-SQL (natural-language question to SQL query) |
| Fine-Tuning Method | LoRA (Low-Rank Adaptation), a Parameter-Efficient Fine-Tuning (PEFT) method |
| Frameworks | `unsloth`, Hugging Face `trl`, and `transformers` |
| License | Apache-2.0 (base model subject to the Llama 3.2 Community License Agreement) |
| Developer | A-Kishore |

📖 Model Description

This model is a LoRA fine-tune of `unsloth/Llama-3.2-3B-Instruct-bnb-4bit`, with the adapter merged back into the model weights. It is trained to generate SQL statements from a natural-language question and the database's DDL schema (its `CREATE TABLE` statements).

The model was trained with Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning (PEFT) technique. LoRA freezes the pre-trained weights of the base model and injects trainable low-rank decomposition matrices into the self-attention and feed-forward (MLP) modules, specifically the target modules `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, and `down_proj`. This limits the trainable parameters to about 0.75% of the model's total parameter count, which sharply reduces VRAM usage and lowers the risk of catastrophic forgetting.
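The 0.75% figure can be sanity-checked by hand: a rank-r LoRA pair on a linear layer with `in_features` inputs and `out_features` outputs adds r × (in_features + out_features) trainable parameters. The sketch below assumes the public Llama-3.2-3B architecture (hidden size 3072, MLP size 8192, 28 layers, 8 key/value heads of dimension 128, vocabulary 128256 with tied embeddings); these dimensions come from the base model's published configuration, not from this card. The rank of 16 is the one listed under Training Details.

```python
# Assumed Llama-3.2-3B dimensions (from the public base-model config, not this card).
HIDDEN = 3072          # hidden_size
INTERMEDIATE = 8192    # intermediate_size of the MLP
LAYERS = 28            # num_hidden_layers
KV_DIM = 8 * 128       # num_key_value_heads * head_dim (grouped-query attention)
VOCAB = 128256         # vocab_size; input and output embeddings are tied

# (in_features, out_features) of each LoRA target module in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank):
    """Trainable LoRA parameters: A is (rank x in_features), B is (out_features x rank)."""
    per_layer = sum(rank * (fan_in + fan_out) for fan_in, fan_out in TARGET_SHAPES.values())
    return per_layer * LAYERS


def base_params():
    """Frozen parameters: tied embedding, bias-free linear weights, and RMSNorm weights."""
    linear = sum(fan_in * fan_out for fan_in, fan_out in TARGET_SHAPES.values())
    norms = 2 * HIDDEN  # input and post-attention RMSNorm in each layer
    return VOCAB * HIDDEN + LAYERS * (linear + norms) + HIDDEN  # + final RMSNorm


trainable = lora_params(rank=16)
total = base_params()
print(f"trainable: {trainable:,}")
print(f"frozen base: {total:,}")
print(f"vs. base only: {100 * trainable / total:.3f}%")
print(f"vs. base + adapter (PEFT-style): {100 * trainable / (total + trainable):.3f}%")
```

Under these assumptions the adapter has 24,313,856 trainable parameters on a 3,212,749,824-parameter base: about 0.757% of the base alone, or about 0.751% when the adapter is included in the denominator, as PEFT's `print_trainable_parameters` does.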

The fine-tuning was accelerated with the unsloth library, which provides specialized GPU kernels for training on 4-bit quantized models. This setup trained roughly 2x faster than a standard configuration (the total training duration is not reported).

🚀 How to Use

The final model weights have been fully merged and exported in 16-bit precision (merged_16bit), allowing for standard deployment with the transformers library or high-speed execution with unsloth.
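Merging folds each trained low-rank pair back into the frozen weight it adapts, so the exported checkpoint loads without any PEFT code. Under the standard LoRA formulation (assumed here, since the merge step itself is not detailed in this card), each adapted weight matrix becomes the expression below. With r = 16 and α = 16 from Training Details, the scaling factor α/r equals 1.

```latex
W_{\text{merged}} = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad
A \in \mathbb{R}^{r \times d_{\text{in}}},
\qquad \frac{\alpha}{r} = \frac{16}{16} = 1
```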

(a) Standard transformers Loading

Use the code below to run inference using standard Hugging Face transformers modules:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "A-Kishore/llama-3.2-3b-text2sql"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Format prompt to match the training format
prompt_template = """###TASK
Generate the SQL query to answer the following question

### Database Schema
{sql_context}

### Question
{sql_prompt}

### SQL Query
"""

sql_context = "CREATE TABLE Members (MemberID INT, Age INT, Gender VARCHAR(10), MembershipType VARCHAR(20));"
sql_prompt = "How many members are female?"

formatted_prompt = prompt_template.format(sql_context=sql_context, sql_prompt=sql_prompt)
inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)  # follow device_map placement

# Generate SQL query
outputs = model.generate(
    **inputs,
    max_new_tokens=150,
    use_cache=True,
    pad_token_id=tokenizer.eos_token_id
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
sql_query = response.split("### SQL Query")[-1].strip()
print(f"Generated SQL Query:\n{sql_query}")
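The example above decodes the whole sequence and keeps everything after the `### SQL Query` marker. If generation continues past the query (for example by starting a new `### Question` block), that extra text ends up in `sql_query` too. A hypothetical post-processing helper, assuming each answer is a single SQL statement, could trim the output as follows. It keeps text after the first marker (the one in the prompt), cuts at the next `###` header, and stops at the first `;`. A semicolon inside a string literal would cut the query early, so treat this as a sketch.

```python
def extract_sql(generated_text, marker="### SQL Query"):
    """Return the first SQL statement after the prompt marker (hypothetical helper).

    Works on the full decoded text (prompt + answer) or on the answer alone.
    Anything after the next '###' section header or the first ';' is discarded.
    """
    answer = generated_text.split(marker, 1)[-1]  # text after the prompt's marker
    answer = answer.split("###")[0]               # drop any new section the model started
    statement, semicolon, _ = answer.partition(";")
    statement = statement.strip()
    return statement + semicolon if statement else ""


sample = (
    "### Question\nHow many members are female?\n\n### SQL Query\n"
    "SELECT COUNT(*) FROM Members WHERE Gender = 'Female';\n\n"
    "### Question\nHow many members are male?"
)
print(extract_sql(sample))  # prints: SELECT COUNT(*) FROM Members WHERE Gender = 'Female';
```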

(b) Unsloth Fast Inference

Use the code below to load the model and perform native accelerated inference using unsloth:

import torch
from unsloth import FastLanguageModel

max_seq_length = 768
dtype = torch.float16
load_in_4bit = True

# Load optimized model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="A-Kishore/llama-3.2-3b-text2sql",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit
)

# Enable native 2x faster inference
FastLanguageModel.for_inference(model)

# Format prompt to match the training format
prompt_template = """###TASK
Generate the SQL query to answer the following question

### Database Schema
{sql_context}

### Question
{sql_prompt}

### SQL Query
"""

sql_context = "CREATE TABLE Members (MemberID INT, Age INT, Gender VARCHAR(10), MembershipType VARCHAR(20));"
sql_prompt = "How many members are female?"

formatted_prompt = prompt_template.format(sql_context=sql_context, sql_prompt=sql_prompt)
inputs = tokenizer(formatted_prompt, return_tensors="pt").to("cuda")

# Generate SQL query
outputs = model.generate(
    **inputs,
    max_new_tokens=150,
    use_cache=True,
    pad_token_id=tokenizer.eos_token_id
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
sql_query = response.split("### SQL Query")[-1].strip()
print(f"Generated SQL Query:\n{sql_query}")

📊 Evaluation Results

Both the base model and the fine-tuned model were evaluated on a test split by comparing their generated SQL against gold-standard reference queries for lexical overlap.

We compute ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metrics to quantify textual overlap:

ROUGE-1: Measures unigram (single-token) overlap, rewarding correct SQL keywords and schema identifiers such as table and column names.
ROUGE-2: Measures bigram overlap, capturing whether consecutive tokens (for example `GROUP BY` or `WHERE Gender`) appear together as in the reference.
ROUGE-L: Uses the Longest Common Subsequence (LCS), rewarding tokens that appear in the same order as in the reference, a rough proxy for overall query structure.

| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| Base model (`unsloth/Llama-3.2-3B-Instruct-bnb-4bit`) | 0.2908 | 0.2016 | 0.2651 |
| Fine-tuned model (`A-Kishore/llama-3.2-3b-text2sql`) | 0.8486 | 0.7232 | 0.8151 |
| Relative improvement | +191.82% | +258.73% | +207.47% |
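To make the three metrics concrete, the sketch below computes ROUGE-1, ROUGE-2, and ROUGE-L for one prediction/reference pair in plain Python. It is an illustration, not the code in `evaluate_model.ipynb`: it assumes F1 scores (the card does not state which ROUGE variant the table reports), and its tokenizer (lowercase, split on non-alphanumeric characters) only approximates the default tokenizer of the `rouge_score` package. Table values are averages over the whole test split, so single-pair numbers are not comparable to them.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs (approximates rouge_score's default tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def f1(overlap, pred_len, ref_len):
    """Harmonic mean of precision (overlap/pred_len) and recall (overlap/ref_len)."""
    if overlap == 0:
        return 0.0
    return 2 * overlap / (pred_len + ref_len)


def rouge_n(prediction, reference, n):
    """ROUGE-N F1 from clipped n-gram overlap."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    pred, ref = ngrams(tokenize(prediction)), ngrams(tokenize(reference))
    overlap = sum((pred & ref).values())  # '&' keeps the minimum count of each n-gram
    return f1(overlap, sum(pred.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, using one rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(prediction, reference):
    """ROUGE-L F1 based on the LCS of the two token sequences."""
    pred, ref = tokenize(prediction), tokenize(reference)
    return f1(lcs_length(pred, ref), len(pred), len(ref))


reference = "SELECT COUNT(*) FROM Members WHERE Gender = 'Female';"
prediction = "SELECT COUNT(MemberID) FROM Members WHERE Gender = 'Female';"
print(f"ROUGE-1 {rouge_n(prediction, reference, 1):.3f}")  # 0.933
print(f"ROUGE-2 {rouge_n(prediction, reference, 2):.3f}")  # 0.769
print(f"ROUGE-L {rouge_l(prediction, reference):.3f}")     # 0.933
```

Here the prediction is functionally equivalent to the reference, yet the extra `MemberID` token costs it 2 of 7 reference bigrams, which is why ROUGE-2 drops furthest.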

⚙️ Training Details

The following table summarizes the training configurations and hyperparameters used for fine-tuning:

| Parameter / Metric | Configuration |
|---|---|
| Training Dataset | `gretelai/synthetic_text_to_sql` (train split) |
| Training Subset Size | `50000` samples (shuffled) |
| Base Model | `unsloth/Llama-3.2-3B-Instruct-bnb-4bit` |
| Fine-Tuning Framework | `trl` (`SFTTrainer`) |
| Optimizer | `paged_adamw_8bit` |
| Learning Rate | `2e-4` |
| Learning Rate Scheduler | `linear` |
| Warmup Steps | `5` |
| Number of Epochs | `1` |
| Per-Device Train Batch Size | `8` |
| Gradient Accumulation Steps | `1` |
| Sequence Length (`max_seq_length`) | `768` |
| Sequence Packing (`packing`) | Enabled (`True`) |
| LoRA Rank (`r`) | `16` |
| LoRA Alpha (`lora_alpha`) | `16` |
| LoRA Target Modules | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj` |
| LoRA Bias | `none` |
| Mixed Precision (`fp16`) | Enabled (`True`) |
| Hardware Platform | Not reported |
| Total Training Steps | Not reported |
| Total Training Duration | Not reported |
| Final Training Loss | Not reported |

📂 Repository Structure

The project repository is structured as follows:

Text_to_SQL_Finetuning.ipynb: Jupyter notebook covering the fine-tuning workflow, including dataset download, mapping examples into the prompt format, LoRA configuration, training, and 16-bit weight export.
evaluate_model.ipynb: Jupyter notebook that runs predictions on the test set with both the base and fine-tuned models, computes ROUGE metrics, and writes the comparison CSVs.
base_model_evaluation_result.csv: Output CSV file with the predictions generated by the base model.
finetuned_model_evaluation_result.csv: Output CSV file with the predictions generated by the fine-tuned model.
README.md: This model card, with model attributes, description, usage guides, and evaluation tables.

⚠️ Limitations

SQL Executability: The evaluation uses ROUGE as a proxy for structural and lexical correctness. ROUGE is sensitive to SQL keyword order and to table and column references, but it does not check whether a query executes or is logically equivalent to the reference. A generated query can be semantically identical to the reference yet score lower because of trivial style choices, such as a different join order or different table aliases. Conversely, a query with high ROUGE overlap can contain a small syntax error that prevents it from executing, or a small logical error that returns the wrong result.
Out-of-Distribution Schemas: Accuracy depends on the complexity of the input schema and on how closely it resembles the synthetic training data. Large, high-cardinality schemas, deeply nested subqueries, and non-standard query structures that differ significantly from the training corpus may lead to incorrect SQL.
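One way to close the executability gap is execution-based evaluation: run the generated and reference queries against the same database and compare their result sets. A minimal sketch with Python's built-in `sqlite3` module follows. The helper name, the sample rows, and the order-insensitive comparison are assumptions for illustration, and SQLite's dialect differs from MySQL or PostgreSQL, so some valid generations may fail here for dialect reasons alone. Note that the "wrong literal case" candidate would get a perfect ROUGE score under a lowercasing tokenizer, yet it returns the wrong count.

```python
import sqlite3
from collections import Counter


def execution_match(schema_sql, seed_sql, predicted_sql, reference_sql):
    """Return True if both queries yield the same multiset of rows (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_sql)  # CREATE TABLE statements from the prompt context
        conn.executescript(seed_sql)    # sample rows so the queries have data to return
        expected = conn.execute(reference_sql).fetchall()
        try:
            predicted = conn.execute(predicted_sql).fetchall()
        except sqlite3.Error:
            return False  # syntax errors, unknown columns, etc. count as failures
        # Order-insensitive comparison; queries where ORDER BY matters need a stricter check.
        return Counter(predicted) == Counter(expected)
    finally:
        conn.close()


schema = "CREATE TABLE Members (MemberID INT, Age INT, Gender VARCHAR(10), MembershipType VARCHAR(20));"
seed = ("INSERT INTO Members VALUES (1, 30, 'Female', 'Gold'), "
        "(2, 45, 'Male', 'Basic'), (3, 27, 'Female', 'Basic');")
reference = "SELECT COUNT(*) FROM Members WHERE Gender = 'Female';"

candidates = {
    "alias + different COUNT argument":
        "SELECT COUNT(m.MemberID) FROM Members AS m WHERE m.Gender = 'Female';",
    "wrong literal case": "SELECT COUNT(*) FROM Members WHERE Gender = 'female';",
    "unterminated string": "SELECT COUNT(*) FROM Members WHERE Gender = 'Female",
}
for label, sql in candidates.items():
    print(f"{label}: {execution_match(schema, seed, sql, reference)}")
```

With SQLite's default case-sensitive (`BINARY`) collation, only the first candidate matches the reference result.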

📄 License

The fine-tuned model is licensed under the Apache 2.0 license. The underlying base model is subject to the Llama 3.2 Community License Agreement, and users must comply with both licenses.

🤝 Acknowledgements

Unsloth: For providing specialized kernels that optimize 4-bit loading, sequence packing, and memory offloading, speeding up the training pipeline.
Hugging Face: For the trl library used in executing supervised fine-tuning configurations.
Meta AI: For releasing the Llama 3.2 family of open-weight models.
Gretel AI: For compiling and distributing the synthetic Text-to-SQL training dataset.

👤 Author

Developed by A-Kishore

GitHub: @A-Kishore
Hugging Face: @A-Kishore


Safetensors · Model size: 3B params · Tensor type: BF16

Model tree for A-Kishore/llama-3.2-3b-text2sql: meta-llama/Llama-3.2-3B-Instruct (base model) → unsloth/Llama-3.2-3B-Instruct-bnb-4bit (quantized) → A-Kishore/llama-3.2-3b-text2sql (this model, listed as an adapter)