embedding-0.6b-spider2.0

Bi-encoder column retriever for text-to-SQL schema linking (Stage-I candidate retrieval). Fine-tuned from Qwen3-Embedding-0.6B with an InfoNCE loss (LoRA r=8, α=32, adapter merged into the base weights), max_length=1024, 1 epoch.
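For readers unfamiliar with InfoNCE, the sketch below shows the general shape of the loss for a single (question, gold column) pair scored against its hard negatives. It is an illustration only, not the training code: the `temperature` value, the function name, and the choice to treat each gold column as its own positive are assumptions that this card does not specify.

```python
import math


def info_nce_loss(pos_score, neg_scores, temperature=0.05):
    """InfoNCE loss for one positive similarity against a list of negatives.

    pos_score: similarity between the question and a gold column.
    neg_scores: similarities between the question and negative columns.
    temperature: ASSUMED value; the model card does not state it.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Subtract the max logit for a numerically stable log-sum-exp.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # -log( exp(pos) / sum(exp(all)) )
    return log_denominator - logits[0]
```

The loss approaches zero when the gold column scores well above every negative and grows as negatives approach or exceed it.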

Training data: thanhdath/embedding-0.6b-spider2.0-data — 39,238 (question, gold-columns, hard-negatives) groups from BIRD train + Spider train + Spider 2.0 synthetic (BigQuery/Snowflake + SQL-Gen). No SynSQL.

| Source | Rows |
| --- | --- |
| BIRD train | 9,356 |
| Spider train | 8,386 |
| Spider 2.0 synth (BQ/SF) | 17,693 |
| Spider 2.0 synth (SQL-Gen) | 3,803 |

Results (column recall@K vs the previous embedding ckpt-3000)

BIRD dev (n=1521), flat: R@50 0.959 (old 0.875), R@100 0.995 (old 0.976), R@200 1.000.

Spider 2.0-233q, two-stage (top-50 tables → top-K cols): R@300 0.904 (old 0.876), R@500 0.930 (old 0.903), R@800 0.956 (old 0.937).

Spider 2.0-233q, flat (shard-collapsed): R@500 0.954 (old 0.934), R@800 0.974 (old 0.959).

Beats the previous checkpoint on every operating point.
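As a rough guide to how a column recall@K figure like those above could be computed, here is a minimal sketch. It assumes recall is the fraction of a question's gold columns found in the top-K ranked columns, averaged over questions; the card does not state the exact definition used, and the function names are hypothetical.

```python
def column_recall_at_k(ranked_columns, gold_columns, k):
    """Fraction of gold columns that appear in the top-k ranked columns."""
    gold = set(gold_columns)
    if not gold:
        raise ValueError("gold_columns must not be empty")
    top_k = set(ranked_columns[:k])
    return len(gold & top_k) / len(gold)


def mean_recall_at_k(examples, k):
    """Average per-question recall@k.

    examples: list of (ranked_columns, gold_columns) pairs, one per question.
    ASSUMPTION: questions are weighted equally (macro average).
    """
    scores = [column_recall_at_k(ranked, gold, k) for ranked, gold in examples]
    return sum(scores) / len(scores)
```

For example, with a ranking `["a.x", "a.y", "b.z"]` and gold columns `["a.y", "b.z"]`, recall@2 is 0.5 and recall@3 is 1.0.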

Usage (vLLM embedding server)

vllm serve thanhdath/embedding-0.6b-spider2.0 --task embed --port 8001 --max-model-len 4096

Score each column as the dot product between the question embedding and the embedding of its column description (table.column ; Table meaning … ; Column meaning … ; type … ; has values …), then take the top-K columns.
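The scoring step can be sketched in Python against the server started above. This is a minimal sketch under stated assumptions: it uses the OpenAI-compatible `/v1/embeddings` endpoint that vLLM exposes, the helper names `embed` and `top_k_columns` are hypothetical, and the question is sent as raw text because the card does not say whether a query instruction prefix was used in training.

```python
import json
import urllib.request


def embed(texts, base_url="http://localhost:8001",
          model="thanhdath/embedding-0.6b-spider2.0"):
    """Fetch embeddings from the vLLM server's /v1/embeddings endpoint."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    # Restore input order using the "index" field of each returned item.
    rows = sorted(body["data"], key=lambda row: row["index"])
    return [row["embedding"] for row in rows]


def top_k_columns(question_vec, column_vecs, column_names, k):
    """Rank columns by dot product with the question and keep the best k."""
    scores = [sum(q * c for q, c in zip(question_vec, vec))
              for vec in column_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(column_names[i], scores[i]) for i in order[:k]]


# Usage, assuming the server is running and `descriptions` holds one
# column-description string per entry of `names`:
#   question_vec = embed(["How many orders were placed in 2020?"])[0]
#   column_vecs = embed(descriptions)
#   print(top_k_columns(question_vec, column_vecs, names, k=50))
```

For large schemas, batching the description requests and caching their embeddings per database would avoid re-embedding columns for every question.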

Model size: 0.6B params (Safetensors), tensor type BF16.