# Model Card for ThotaBhanu/t5_sql_askdb

## Model Details

### Model Description

This model is a T5-based Natural Language to SQL converter, fine-tuned on the WikiSQL dataset. It is designed to convert English natural language queries into SQL queries that can be executed on relational databases.

- **Developed by:** Bhanu Prasad Thota
- **Shared by:** Bhanu Prasad Thota
- **Model type:** T5-based Sequence-to-Sequence Model
- **Language(s):** English
- **License:** MIT
- **Finetuned from model:** t5-large

This model is particularly useful for text-to-SQL applications, allowing users to query databases using plain English instead of writing SQL.


## Model Sources

[More Information Needed]


## Uses

### Direct Use

- Convert natural language questions into SQL queries
- Assist in database query automation
- Can be used in chatbots, data analytics tools, and enterprise database search systems

### Downstream Use

- Can be fine-tuned further on custom datasets to improve domain-specific SQL generation
- Can be integrated into business intelligence tools for more natural user interaction

### Out-of-Scope Use

- The model does not infer database schemas automatically
- It may generate incorrect SQL for complex nested queries or multi-table joins
- It is not suitable for non-relational (NoSQL) databases

## Bias, Risks, and Limitations

- The model may not always generate valid SQL for custom database schemas
- It assumes consistent column naming, which may not hold in enterprise databases
- Performance depends on how well the input query matches the training data format

### Recommendations

- Always validate generated SQL before executing it against a live database (see the sketch below)
- Use schema-aware validation methods in production environments
- Consider fine-tuning the model on domain-specific SQL queries
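
As a minimal sketch of pre-execution validation: the snippet below uses SQLite's `EXPLAIN` to check that a generated query compiles against the target schema without ever running it. The function name and database path are illustrative, not part of this model; production systems should layer schema-aware checks on top.

```python
import sqlite3

def is_valid_sql(sql: str, db_path: str) -> bool:
    """Return True if `sql` compiles against the schema at `db_path`.

    EXPLAIN asks SQLite to prepare the statement and emit its bytecode
    plan, so syntax errors and unknown tables or columns are caught
    without the query being executed.
    """
    conn = sqlite3.connect(db_path)
    try:
        conn.execute(f"EXPLAIN {sql}")
        return True
    except sqlite3.Error:
        return False
    finally:
        conn.close()

# Example: reject hallucinated column names before they reach the database
# (assumes an existing SQLite file at employees.db)
print(is_valid_sql("SELECT name FROM employees WHERE joined = 2020", "employees.db"))
```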

## How to Get Started with the Model

Use the code below to generate SQL queries from natural language:

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Load model and tokenizer
model_name = "ThotaBhanu/t5_sql_askdb"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)

# Convert a natural language question into a SQL query
def generate_sql(query):
    input_text = f"Convert to SQL: {query}"
    inputs = tokenizer(input_text, return_tensors="pt")
    # Raise the generation budget above the generate() default so longer
    # SQL queries are not truncated
    output = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output[0], skip_special_tokens=True)

# Example usage
query = "Find all employees who joined in 2020"
sql_query = generate_sql(query)

print(f"Query: {query}")
print(f"Generated SQL: {sql_query}")
```


## Training Details

### Training Data

- **Dataset:** WikiSQL
- **Size:** 80,654 pairs of natural language questions and SQL queries
- **Preprocessing:** Tokenization with T5Tokenizer, maximum sequence length 128
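
A minimal sketch of this preprocessing, assuming the public WikiSQL dataset on the Hugging Face Hub and the `Convert to SQL:` prompt prefix from the usage example above; the exact training script is not published:

```python
from datasets import load_dataset
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")
# Assumption: the standard WikiSQL dataset on the Hugging Face Hub
wikisql = load_dataset("wikisql")

def tokenize(example):
    # The prefix mirrors the inference example above; 128 is the stated max length
    inputs = tokenizer("Convert to SQL: " + example["question"],
                       max_length=128, truncation=True)
    targets = tokenizer(text_target=example["sql"]["human_readable"],
                        max_length=128, truncation=True)
    inputs["labels"] = targets["input_ids"]
    return inputs

train = wikisql["train"].map(tokenize)
```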


### Training Procedure

- **Training framework:** Hugging Face Transformers + PyTorch
- **Hardware:** NVIDIA V100 GPU
- **Optimizer:** AdamW
- **Learning rate:** 5e-5
- **Batch size:** 8
- **Epochs:** 5

#### Training Hyperparameters

- **Training precision:** Mixed precision (fp16)
- **Gradient accumulation:** Yes (to simulate a larger effective batch size)
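
These settings map onto Hugging Face's `Seq2SeqTrainer` roughly as below. This is a sketch, not the author's training script: the gradient-accumulation step count is an assumption (it is not documented), and `train` refers to the tokenized WikiSQL split from the preprocessing sketch above.

```python
from transformers import (DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments, T5ForConditionalGeneration,
                          T5Tokenizer)

tokenizer = T5Tokenizer.from_pretrained("t5-large")
model = T5ForConditionalGeneration.from_pretrained("t5-large")

args = Seq2SeqTrainingArguments(
    output_dir="t5_sql_askdb",
    per_device_train_batch_size=8,  # stated batch size
    gradient_accumulation_steps=4,  # assumption: exact value not documented
    learning_rate=5e-5,             # stated rate; AdamW is the Trainer default
    num_train_epochs=5,             # stated epochs
    fp16=True,                      # stated mixed precision
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train,  # tokenized WikiSQL split from the sketch above
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```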

#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

[More Information Needed]

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

[More Information Needed]

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

[More Information Needed]

### Results

[More Information Needed]

#### Summary



## Model Examination [optional]

<!-- Relevant interpretability work for the model goes here -->

[More Information Needed]

## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation [optional]

**BibTeX:**

```bibtex
@misc{t5_sql_askdb,
  author = {Bhanu Prasad Thota},
  title = {T5-SQL AskDB Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ThotaBhanu/t5_sql_askdb}}
}
```

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]