pipe-sql-1.5b / docs /pipe-sql-fine-tuning-design-doc.md
Zheyuan Zhao
Add design doc: pipe-sql-fine-tuning-design-doc.md
e145226 verified

Design Document: Large-Scale Incremental Pipe SQL Synthesis & Specialized Fine-Tuning for Small Language Models (1.5B–7B)

1. Executive Summary

Pipe SQL syntax β€” where queries are written as linear chains of |> operators rather than nested, inside-out clause blocks β€” is now production-ready in BigQuery (GA February 2025), Apache Spark 4.0+, and Databricks Runtime 16.2+. This linear structure maps naturally onto autoregressive token generation, making it a compelling target for specialized Text-to-SQL models.

This document presents a complete system for:

  1. Synthesizing a large corpus of semantically validated pipe SQL queries from existing standard SQL benchmarks via a custom AST-based decompiler.
  2. Fine-tuning small language models (1.5B–7B) using an incremental, pipe-by-pipe training strategy that teaches models to build queries step-by-step rather than emit them in one shot.

The core thesis: by converting the Text-to-SQL problem from "generate a complex nested structure" into "append the next correct pipe operator," we can train models at a fraction of frontier cost while achieving competitive execution accuracy.


2. Why Pipe SQL?

2.1 The Problem with Standard SQL Generation

Standard SQL requires the model to produce clauses in an order that is the inverse of logical execution:

SELECT department, AVG(salary) AS avg_sal    -- Step 4: project
FROM employees                                -- Step 1: scan
WHERE hire_date > '2020-01-01'                -- Step 2: filter
GROUP BY department                           -- Step 3: aggregate
HAVING AVG(salary) > 80000                    -- Step 5: post-agg filter
ORDER BY avg_sal DESC                         -- Step 6: sort

An LLM generating this left-to-right must write SELECT department, AVG(salary) before it has committed to the WHERE filter or GROUP BY clause. This inversion causes three well-documented failure modes:

  • Schema hallucination: referencing columns eliminated by an earlier aggregation or join that the model hasn't generated yet.
  • Alias invisibility: standard SQL forbids referencing a SELECT alias in the WHERE clause of the same block, leading to invalid queries.
  • Nesting complexity: correlated subqueries and multi-level CTEs require the model to maintain deep structural state across its entire generation window.

2.2 How Pipe Syntax Solves This

Pipe syntax linearizes the query into a sequence of transformations, each consuming the output of the previous step:

FROM employees
|> WHERE hire_date > '2020-01-01'
|> AGGREGATE AVG(salary) AS avg_sal GROUP BY department
|> WHERE avg_sal > 80000
|> ORDER BY avg_sal DESC

This provides three structural guarantees that directly benefit autoregressive generation:

  1. The Prefix Property: every prefix up to any |> boundary is a valid, independently executable query. The model never needs to look ahead.
  2. Local scope: each operator only sees columns produced by its immediate predecessor. The active schema is always well-defined and narrow.
  3. Linear data flow: no nesting, no back-references. The generation order matches the logical execution order.

2.3 Supported Pipe Operators

The following operators are supported across BigQuery and Spark, which are the primary target engines:

Operator Purpose Replaces (Standard SQL)
FROM table Entry point FROM clause
|> SELECT Column projection SELECT
|> EXTEND expr AS alias Add computed column (preserves existing) Inline expression in SELECT
|> WHERE Row filtering at any point in the pipeline WHERE, HAVING, QUALIFY
|> AGGREGATE expr GROUP BY cols Aggregation SELECT ... GROUP BY
|> JOIN table ON cond Join (all types: INNER, LEFT, etc.) JOIN clause
|> ORDER BY Sorting ORDER BY
|> LIMIT n Row limiting LIMIT
|> SET col = expr Replace column values N/A (new)
|> DROP col Remove columns N/A (new)
|> RENAME old AS new Rename columns AS alias
|> DISTINCT Deduplicate SELECT DISTINCT

3. Pipe Query Syntax Landscape

Multiple pipe-like query systems exist. This section surveys the landscape to justify our choice of GoogleSQL pipe syntax as the canonical training target.

3.1 The Same Query in Every Syntax

"From employees, filter Chicago office, average salary by department, keep departments with avg > 80K, sort descending."

Standard SQL:

SELECT department, AVG(salary) AS avg_salary
FROM employees
WHERE office = 'Chicago'
GROUP BY department
HAVING AVG(salary) > 80000
ORDER BY avg_salary DESC;

GoogleSQL Pipe (BigQuery / Spark):

FROM employees
|> WHERE office = 'Chicago'
|> AGGREGATE AVG(salary) AS avg_salary GROUP BY department
|> WHERE avg_salary > 80000
|> ORDER BY avg_salary DESC;

Snowflake Flow (->>):

SELECT * FROM employees WHERE office = 'Chicago'
->> SELECT department, AVG(salary) AS avg_salary FROM $1 GROUP BY department
->> SELECT * FROM $2 WHERE avg_salary > 80000
->> SELECT * FROM $3 ORDER BY avg_salary DESC;

PRQL:

from employees
filter office == 'Chicago'
group {department} (
  aggregate { avg_salary = average salary }
)
filter avg_salary > 80000
sort {-avg_salary}

KQL (Kusto):

employees
| where office == 'Chicago'
| summarize avg_salary = avg(salary) by department
| where avg_salary > 80000
| sort by avg_salary desc

Malloy:

run: duckdb.table('employees') -> {
  where: office = 'Chicago'
  group_by: department
  aggregate: avg_salary is avg(salary)
} -> {
  where: avg_salary > 80000
  order_by: avg_salary desc
}

dplyr (R):

employees |>
  filter(office == "Chicago") |>
  group_by(department) |>
  summarize(avg_salary = mean(salary)) |>
  filter(avg_salary > 80000) |>
  arrange(desc(avg_salary))

Polars (Python):

(
    pl.scan_csv("employees.csv")
    .filter(pl.col("office") == "Chicago")
    .group_by("department")
    .agg(pl.col("salary").mean().alias("avg_salary"))
    .filter(pl.col("avg_salary") > 80000)
    .sort("avg_salary", descending=True)
    .collect()
)

3.2 Comparison Matrix

System Pipe Symbol SQL-Compatible? Compiles to SQL? Native Engine Adoption
GoogleSQL Pipe |> Yes (extension) N/A (IS SQL) BigQuery, Spark, Databricks Very High
Snowflake Flow ->> Yes (chains full stmts) N/A (IS SQL) Snowflake High
KQL (Kusto) | No (separate language) No Azure Data Explorer Very High
PRQL Newlines No (separate language) Yes None (compiler only) Medium
Malloy -> No (separate language) Yes None (compiles to SQL) Low-Medium
dplyr (R) %>% / |> No (R code) Yes (via dbplyr) R in-memory Very High
Polars .method() No (Python) No Rust engine High
Logica N/A (predicates) No (logic language) Yes None (compiler only) Low

3.3 Key Distinctions

GoogleSQL Pipe vs. Snowflake Flow: Snowflake's ->> chains entire SQL statements together, referencing prior results via positional $1, $2 parameters. GoogleSQL pipes individual operators within a single query. Snowflake's approach is more like Unix pipes between programs; GoogleSQL's is like method chaining within a single program.

GoogleSQL Pipe vs. KQL: KQL (Microsoft Azure Data Explorer) is the spiritual predecessor β€” its | where, | summarize, | extend operators clearly inspired GoogleSQL pipe syntax. However, KQL is an entirely separate language that only runs on Kusto-engine databases. GoogleSQL pipe syntax stays within SQL itself.

GoogleSQL Pipe vs. PRQL / Malloy: Both are full language replacements that compile down to SQL. They offer cleaner syntax but require a compilation step and have no native engine support (except ClickHouse's experimental PRQL support). GoogleSQL pipe syntax is SQL β€” it runs natively without compilation.

GoogleSQL Pipe vs. dplyr / Polars: These are DataFrame API approaches in R and Python respectively. They solve the same readability problem at the programming language level. The Databricks team explicitly cited DataFrame APIs as inspiration for pipe SQL.

3.4 BigQuery vs. Spark: Detailed Differences

While BigQuery and Spark share the same |> symbol and core operators, they diverge in several ways:

Operators only in BigQuery:

Operator Purpose
|> RENAME Rename columns directly
|> CALL Invoke table-valued functions in the pipe chain
|> WITH Inline CTEs within the pipe
|> WINDOW Standalone window function operator (deprecated in favor of EXTEND)
|> MATCH_RECOGNIZE Pattern matching on row sequences
GROUP AND ORDER BY Shorthand inside AGGREGATE that also orders the output

Features only in Spark:

Feature Purpose
SEMI JOIN / ANTI JOIN Explicit semi/anti join keywords (BigQuery requires WHERE EXISTS)
NATURAL JOIN / LATERAL JOIN Additional join types
Standalone OFFSET OFFSET without requiring LIMIT (BigQuery requires LIMIT before OFFSET)

Behavioral differences:

Behavior BigQuery Spark
Lateral references in EXTEND Allowed (later column can reference earlier alias in same EXTEND) Not allowed (each projection is independent)
Default NULL ordering (ASC) NULLs first NULLs last
Default NULL ordering (DESC) NULLs last NULLs first

3.5 Why GoogleSQL Pipe Syntax Is the Training Target

  1. It IS SQL. Unlike PRQL, KQL, or Malloy, GoogleSQL pipe syntax requires no compilation step. The generated output is directly executable.
  2. Multi-engine support. BigQuery, Spark, and Databricks all support it natively. No other pipe syntax has this breadth.
  3. BigQuery is a superset. Training on BigQuery's dialect covers all Spark operators. The only gap is Spark's SEMI/ANTI/NATURAL/LATERAL join types, which can be expressed differently in BigQuery syntax.
  4. SQLGlot round-trip. SQLGlot can parse GoogleSQL pipe syntax back to standard SQL for any of its 30+ supported dialects. This enables the dual-execution validation loop (Section 6) and deployment-time transpilation to any target database (Section 12.2).
  5. Pretraining signal. BigQuery has been GA since February 2025. Frontier LLMs trained after this date will have GoogleSQL pipe syntax in their training data, providing a foundation for fine-tuning.

4. The Data Problem: Why a Decompiler Is Essential

4.1 The Fine-Tuning Data Bottleneck

Fine-tuning a language model to generate pipe SQL requires a large corpus of (natural language question, database schema, pipe SQL query) triples that are semantically correct β€” meaning they return the right answer on the target database. The challenge:

  • No pipe SQL training data exists. All major Text-to-SQL benchmarks (Spider, BIRD-SQL, WikiSQL, KaggleDBQA) contain exclusively standard SQL. Combined, Spider 1.0 (7K train) and BIRD-SQL (9.4K train) provide roughly 16K standard SQL queries β€” zero in pipe syntax.
  • LLM-based generation is unreliable. Using a frontier model (GPT-4o, Claude) to generate pipe SQL has three problems: (a) high cost at scale, (b) the model has limited pipe SQL in its training data, and (c) there is no efficient way to validate correctness without executing every generated query.
  • Manual annotation is infeasible. Writing thousands of pipe SQL queries by hand is prohibitively expensive and error-prone.

4.2 The Decompiler Solution

A decompiler is a deterministic program that transforms standard SQL (which we have in abundance and with verified correctness) into semantically equivalent pipe SQL. This is the only approach that simultaneously satisfies all three requirements for training data:

Requirement LLM Generation Manual Writing Decompiler
Scale (50K+ queries) Expensive ($$$) Infeasible Free (compute only)
Correctness guarantee No (needs validation) Error-prone Deterministic (provable)
Reproducibility Non-deterministic N/A Fully reproducible
Speed ~1 query/sec ~5 min/query ~1000 queries/sec

The decompiler takes a known-correct standard SQL query from an established benchmark, parses it into an AST, and mechanically transforms it into an equivalent pipe SQL query. Because the transformation is rule-based and structure-preserving, the output is guaranteed to be semantically equivalent to the input β€” no execution-based validation is strictly required (though we perform it anyway as a safety net).

4.3 Why Not Just Use Standard SQL for Training?

One might ask: why not fine-tune on standard SQL and transpile at inference time? Because the entire point is to exploit the structural advantages of pipe syntax during generation. A model trained on standard SQL still suffers from the nesting and inversion problems described in Section 2.1. The model must learn to think in pipes β€” to decompose a question into a linear sequence of transformations β€” and this requires training on pipe SQL directly.

4.4 Data Augmentation Strategy

Starting from the ~16K seed queries in Spider 1.0 + BIRD-SQL:

  1. Decompile all seed queries β†’ ~16K pipe SQL equivalents
  2. Schema-aware augmentation: for each query, generate variants by substituting table/column names from the same schema family (e.g., swap employees.salary for staff.compensation)
  3. Complexity augmentation: compose simple queries into multi-step pipes (e.g., combine a filter query and an aggregation query into a single pipeline)
  4. Synthetic NL paraphrasing: use a language model to rephrase the natural language question while keeping the SQL unchanged

Target: 50K–100K validated pipe SQL training pairs.


5. Decompiler Architecture

5.1 Why SQLGlot (and Its Limitations)

SQLGlot is the best available open-source SQL parser/transpiler, supporting 30+ dialects and providing a rich AST. However, its pipe syntax support is one-directional only:

  • Pipe β†’ Standard: SQLGlot can parse pipe syntax and decompose it into CTE-based standard SQL. This works.
  • Standard β†’ Pipe: SQLGlot has no generator for pipe syntax output. Pipe nodes are destroyed at parse time and replaced with CTEs. There is no to_pipe() method, no pipe expression nodes in the AST, and no reverse transformation.

Therefore, we must build a custom decompiler on top of SQLGlot's parser and AST infrastructure. SQLGlot provides the parsing, qualification, and optimization layers; we add the pipe emission layer.

5.2 Transformation Pipeline

Standard SQL (string)
    β”‚
    β–Ό
[SQLGlot Parse] ──► AST (language-agnostic)
    β”‚
    β–Ό
[SQLGlot Qualify] ──► Fully qualified AST
    β”‚                  (all columns resolved to table.column,
    β”‚                   all aliases expanded, star expressions resolved)
    β”‚
    β–Ό
[Custom Pipe Emitter] ──► Pipe SQL (string)

5.3 Pipe Emitter Transformation Rules

The custom emitter walks the qualified AST and applies the following rules in order:

Rule 1: FROM extraction

SELECT ... FROM table_expr β†’ FROM table_expr

Extract the FROM clause as the pipe entry point. If the FROM contains joins, emit them as separate |> JOIN operators.

Rule 2: JOIN linearization

FROM a JOIN b ON a.id = b.id JOIN c ON b.id = c.id
β†’
FROM a
|> JOIN b ON a.id = b.id
|> JOIN c ON b.id = c.id

Rule 3: WHERE promotion

WHERE condition β†’ |> WHERE condition

Placed immediately after all FROM/JOIN operators.

Rule 4: GROUP BY + aggregation fusion

SELECT col, AGG(expr) ... GROUP BY col
β†’
|> AGGREGATE AGG(expr) AS alias GROUP BY col

The SELECT list is decomposed: aggregate expressions go into |> AGGREGATE, non-aggregate computed expressions become |> EXTEND operators placed before the aggregation.

Rule 5: HAVING β†’ post-aggregation WHERE

HAVING AGG(x) > threshold
β†’
|> WHERE agg_alias > threshold

Since |> AGGREGATE produces named output columns, the HAVING condition is rewritten to reference those aliases.

Rule 6: Window function extraction

SELECT ..., ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) AS rn
β†’
|> EXTEND ROW_NUMBER() OVER (PARTITION BY x ORDER BY y) AS rn

Window functions are emitted as |> EXTEND operators after all filtering and aggregation.

Rule 7: QUALIFY β†’ post-window WHERE

QUALIFY rn = 1 β†’ |> WHERE rn = 1

Rule 8: ORDER BY / LIMIT passthrough

ORDER BY col DESC LIMIT 10
β†’
|> ORDER BY col DESC
|> LIMIT 10

Rule 9: Subquery unrolling Correlated and non-correlated subqueries in WHERE or SELECT are "unrolled" into preceding pipe segments. Scalar subqueries become |> JOIN + |> EXTEND patterns. EXISTS/IN subqueries become |> JOIN (semi-join) patterns.

Rule 10: CTE inlining CTEs are unrolled into the main pipeline. Each CTE becomes a named sub-pipeline that feeds into the final query via |> JOIN or direct substitution.

5.4 Handling Ambiguity and Edge Cases

Not every standard SQL query maps cleanly to a single pipe representation. The emitter follows a canonical ordering convention:

FROM β†’ JOINs β†’ WHERE (pre-agg) β†’ EXTEND (computed cols) β†’
AGGREGATE β†’ WHERE (post-agg) β†’ EXTEND (windows) β†’
WHERE (post-window) β†’ SELECT (final projection) β†’
ORDER BY β†’ LIMIT

When multiple valid orderings exist, the canonical order ensures deterministic output, which is critical for training data consistency.


6. Semantic Validation: The Dual-Execution Loop

Even though the decompiler is deterministic, bugs in transformation rules can silently produce semantically incorrect pipe SQL. We validate every synthesized query through dual execution:

                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚  Standard SQL (gold) β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   Execute on DB     │──► Result Set A
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   Pipe SQL (synth)   β”‚
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
               β”‚  Transpile back to standard  β”‚
               β”‚  SQL via SQLGlot             β”‚
               │  (pipe→standard is supported)│
               β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                    β”‚   Execute on DB     │──► Result Set B
                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

                    Result Set A == Result Set B?

Comparison method:

  • Both result sets are loaded into DataFrames.
  • Rows are sorted deterministically (by all columns) to eliminate non-deterministic ordering.
  • Column types are coerced to common types (e.g., DECIMAL vs. FLOAT tolerance).
  • For large results (>100K rows), row-level SHA-256 hashes are aggregated for constant-time comparison.

Queries that fail validation are quarantined with diagnostic metadata (AST diff, execution error) for manual review and decompiler rule refinement.


7. Incremental Training Strategy

7.1 Why Incremental?

Standard fine-tuning teaches the model to emit the entire query in one shot. This wastes the structural advantage of pipe syntax. Instead, we train the model to generate one pipe operator at a time, conditioned on the growing prefix. This mirrors how a human analyst would build a query: start with the data source, filter, transform, aggregate, filter again.

7.2 Trajectory Decomposition

Each N-operator pipe query is decomposed into N training samples:

Example: "Which departments in the Chicago office have average salaries above $80K?"

Step Input (prompt) Output (completion)
1 Question + Schema FROM employees
2 Question + Schema + FROM employees |> WHERE office = 'Chicago'
3 Question + Schema + prefix(1–2) |> AGGREGATE AVG(salary) AS avg_sal GROUP BY department
4 Question + Schema + prefix(1–3) |> WHERE avg_sal > 80000
5 Question + Schema + prefix(1–4) |> SELECT department, avg_sal

This produces a 5:1 amplification of training data: a single pipe query yields 5 supervised examples. For 50K pipe queries with an average of 4 operators each, this produces ~200K training samples.

7.3 Dataset Format (JSONL)

{
  "messages": [
    {
      "role": "system",
      "content": "You are a SQL assistant that builds pipe SQL queries incrementally. Given a question, schema, and the query built so far, emit only the next pipe operator."
    },
    {
      "role": "user",
      "content": "Question: Which departments have avg salary > 80K?\nSchema: employees(id INT, name TEXT, department TEXT, salary DECIMAL, office TEXT)\nQuery so far: FROM employees\n|> WHERE office = 'Chicago'"
    },
    {
      "role": "assistant",
      "content": "|> AGGREGATE AVG(salary) AS avg_sal GROUP BY department"
    }
  ]
}

7.4 Prefix Executability as a Training Signal

Because every prefix is independently executable (the Prefix Property), we can verify each intermediate step during data generation. If step K's prefix doesn't execute successfully against the database, the entire trajectory is flagged. This catches decompiler errors that might only manifest mid-pipeline.


8. Fine-Tuning Configuration

8.1 Base Model Selection

Model Parameters Why
Qwen-2.5-Coder-7B (primary) 7.6B 82.0 on Spider (standard SQL); strong code/SQL pretraining; 128K context
Llama-3.2-3B (lightweight) 3.2B Cost-efficient inference; suitable for simpler schemas; 128K context
Qwen-2.5-Coder-1.5B (edge) 1.5B Speculative decoding draft model; edge deployment

Qwen-2.5-Coder-7B is the primary target because it already has strong SQL capabilities from pretraining, meaning fewer training steps are needed to adapt it to pipe syntax.

8.2 Training Parameters

Parameter Value Rationale
Method QLoRA (4-bit quantization) 7B model fits in ~11GB VRAM on RTX 4090 (24GB)
LoRA rank 64–128 Pipe SQL is a novel syntax underrepresented in pretraining; higher rank captures structural patterns better
LoRA alpha 2Γ— rank Standard scaling rule
LoRA target modules q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj Full attention + MLP adaptation
Learning rate 1e-4 (with cosine decay) Standard for QLoRA
Batch size 8 (effective, with gradient accumulation) Balance between stability and speed
Context window 4096 tokens Accommodates schema (1–3K tokens) + prefix + next operator
Epochs 3–5 Monitor validation loss; early stop on plateau
Warmup 5% of total steps Prevent early divergence

8.3 Loss Masking

Only the assistant completion (the next pipe operator) contributes to the loss. The system prompt, user message (question + schema + prefix), and any padding tokens are masked with label ID -100.


9. Agentic Inference: Tool Calls Between Pipes

At inference time, the model operates within an agent loop. After generating each pipe operator, external tools are invoked to ground the next generation step in reality.

9.1 Tool 1: Schema Propagation (Dry Run)

When: After every operator. How: Execute DESCRIBE or a dry-run API call on the current prefix. Returns: The list of columns and their types available for the next operator. Why: Prevents the model from referencing columns that were dropped by an earlier |> AGGREGATE or |> SELECT. This is the single most impactful tool for preventing schema hallucination.

9.2 Tool 2: Sample Rows

When: After |> WHERE and |> JOIN operators (where data content matters). How: Execute [prefix] |> LIMIT 5 against the database. Returns: 5 rows of actual data. Why: Lets the model verify literal values (e.g., is it 'Furniture' or 'FURNITURE'?), date formats, and NULL patterns. Eliminates constant hallucination.

9.3 Tool 3: Syntax Validation

When: After generating any operator candidate. How: Parse [prefix] |> [candidate] with SQLGlot. Returns: Success or error with location. Why: Catches syntax errors (missing commas, invalid keywords) before the model proceeds, avoiding cascading errors in subsequent operators.

9.4 Agent Loop

Input: (question, schema)
prefix = ""

while not DONE:
    candidate = model.generate(question, schema, active_columns, prefix)

    if candidate == "<END>":
        break

    if not syntax_valid(prefix + candidate):
        candidate = model.retry(question, schema, prefix, error_msg)

    prefix = prefix + "\n" + candidate
    active_columns = dry_run(prefix)

    if needs_data_grounding(candidate):
        sample = execute(prefix + " |> LIMIT 5")
        # Feed sample into next generation context

return prefix

10. Post-SFT Reinforcement: Group Relative Policy Optimization (GRPO)

After supervised fine-tuning, we apply GRPO to improve reasoning quality. GRPO (introduced in DeepSeek-Math) eliminates the need for a separate critic model by using group-level baselines, making it practical for small-scale training.

10.1 Reward Signals

For each generated pipe query, compute a composite reward:

Signal Type Weight Description
Execution Binary (0/1) 0.3 Does the complete query execute without error?
Result correctness Continuous (0–1) 0.5 Does the result match the gold standard? (F1 over result set rows)
Schema adherence Binary (0/1) 0.1 Does every referenced column exist in the active schema at that pipe stage?
Operator structure Continuous (0–1) 0.1 Does the query use the expected operator types? (e.g., question implies aggregation β†’ model uses |> AGGREGATE)

10.2 GRPO Procedure

  1. For each training prompt (question + schema), generate K=8 complete pipe queries using the SFT model.
  2. Score each query using the reward signals above.
  3. Compute the group mean and standard deviation of rewards.
  4. For each query, compute advantage = (reward - group_mean) / group_std.
  5. Update the policy to increase the probability of above-average completions and decrease below-average ones.

This is particularly effective for Text-to-SQL because the reward signal (execution correctness) is cheap to compute β€” just run the query.


11. Performance Targets

11.1 Realistic Benchmarks

Performance targets are calibrated against published results as of early 2026:

Benchmark Base 7B (standard SQL) Specialized Pipe 7B (target) Current 7B SOTA (standard SQL) Frontier model reference
BIRD-SQL (dev) ~35% EX 65–70% EX 70.4% (Arctic-R1-7B w/ GRPO) ~78% (GPT-4o pipeline)
Spider 1.0 (test) ~70% EX 80–85% EX ~82% (Qwen-2.5-Coder-7B) ~86% (GPT-4o)

Note on Spider 2.0: This benchmark involves enterprise-scale databases with 1000+ columns and complex multi-tool workflows. Even GPT-4o scores only ~13% on Spider 2.0-Lite. We do not set a target here β€” Spider 2.0 requires agentic capabilities beyond single-query generation.

11.2 Why These Targets Are Achievable

  • BIRD-SQL 65–70%: Arctic-Text2SQL-R1-7B already achieves 70.4% with standard SQL + GRPO. Pipe syntax should provide a comparable or slightly improved structural advantage, as the model doesn't need to manage nesting.
  • Spider 1.0 80–85%: Qwen-2.5-Coder-7B already hits 82% with standard SQL. Fine-tuning on pipe syntax should at minimum match this, with incremental training providing an additional boost through better step-by-step reasoning.

12. Deployment

12.1 Serving Stack

  • vLLM for inference with automatic prefix caching (APC) enabled.
  • Schema caching: database schemas (often 1–3K tokens) are prepended to every request. With prefix caching, repeated queries against the same database save 70–90% on schema tokens.
  • Speculative decoding (optional): use the Qwen-2.5-Coder-1.5B model as a draft for the 7B model. Expected speedup: 1.5–1.8x for standard draft-verify; 2–3x if using EAGLE-style speculator trained on the 7B model's hidden states.

12.2 Output Processing

The generated pipe SQL must be transpiled to the target engine's dialect before execution. Since SQLGlot can parse pipe syntax and emit standard SQL for any of its 30+ supported dialects, this is a single function call:

import sqlglot

pipe_query = model.generate(question, schema)
executable = sqlglot.transpile(pipe_query, read="bigquery", write="postgres")[0]
cursor.execute(executable)

This means the model generates in a single canonical pipe syntax, and deployment supports any target database.


13. Risks and Mitigations

Risk Impact Mitigation
Decompiler bugs produce incorrect pipe SQL Corrupted training data Dual-execution validation loop (Section 6) catches all semantic errors
16K seed queries insufficient for fine-tuning Underfitting on complex patterns Data augmentation (Section 4.4) expands to 50–100K queries; trajectory decomposition (Section 7.2) amplifies to 200K+ samples
Pipe syntax unseen in pretraining Model struggles with novel tokens LoRA rank 64–128 provides sufficient capacity; |> is a simple 2-token sequence, not a complex new grammar
Engine-specific pipe differences (BigQuery vs. Spark) Portability issues Train on canonical (BigQuery) syntax only; transpile at deployment via SQLGlot
Incremental generation is slower than one-shot Latency at inference time Speculative decoding + prefix caching offset the overhead; accuracy gains justify the tradeoff

14. Implementation Roadmap

Phase 1: Decompiler

  • Build pipe emitter on top of SQLGlot's qualified AST
  • Implement transformation rules (Section 5.3)
  • Validate against Spider 1.0 + BIRD-SQL with dual-execution loop
  • Target: 90%+ of seed queries successfully decompiled and validated

Phase 2: Data Pipeline

  • Run decompiler over all seed queries
  • Apply augmentation strategies (schema substitution, complexity composition, NL paraphrasing)
  • Generate trajectory-decomposed JSONL training files
  • Target: 50K+ pipe queries β†’ 200K+ training samples

Phase 3: Supervised Fine-Tuning

  • Fine-tune Qwen-2.5-Coder-7B with QLoRA
  • Evaluate on held-out BIRD-SQL dev split
  • Ablate: incremental vs. one-shot training, LoRA rank, context window size
  • Target: match or exceed base model's standard SQL accuracy

Phase 4: GRPO Reinforcement

  • Implement execution-based reward function
  • Run GRPO on the SFT checkpoint
  • Evaluate on full BIRD-SQL dev and Spider 1.0 test
  • Target: 5–10% EX improvement over SFT-only model

Phase 5: Agentic Integration & Deployment

  • Build agent loop with dry-run, sample rows, and syntax validation tools
  • Integrate with vLLM serving
  • End-to-end evaluation on production schemas
  • Target: sub-2-second latency for 4-operator queries on RTX 4090

Appendix A: Pipe SQL Engine Support Matrix

Engine Status Pipe Symbol Notes
BigQuery (GoogleSQL) GA (Feb 2025) |> 20+ operators; most complete implementation
Apache Spark 4.0+ GA |> Full operator set; mirrors BigQuery
Databricks Runtime 16.2+ GA |> Same as Spark
Firebolt Supported |> Subset of operators
DuckDB Community extension only |> Regex-based preprocessor; not production-grade
Snowflake Different concept ->> Chains full SQL statements, not operators
PostgreSQL / MySQL / SQL Server Not supported N/A β€”

Appendix B: SQLGlot Pipe Syntax Capabilities

Capability Supported?
Parse pipe syntax β†’ AST Yes (12 operators)
Generate pipe syntax from AST No
Pipe β†’ standard SQL transpilation Yes (via CTE decomposition at parse time)
Standard SQL β†’ pipe transpilation No (requires custom decompiler)
Optimizer/qualify on pipe input Yes (operates on CTE-decomposed AST)
Supported pipe operators (parsing) SELECT, WHERE, AGGREGATE, EXTEND, JOIN, ORDER BY, LIMIT, AS, PIVOT, UNPIVOT, TABLESAMPLE, set ops
Known broken operators (v29.x) SET, DROP, RENAME, DISTINCT, CALL, WITH