Snowflake/Arctic-Text2SQL-R1-7B Fine-tuned for NL2SQL++ v8

This model is a fine-tuned version of Snowflake/Arctic-Text2SQL-R1-7B on the NL2SQL++ v8 dataset with code-with-thought reasoning.

Model Details

Base Model: Snowflake/Arctic-Text2SQL-R1-7B
Task: Text-to-SQL generation
Dataset: NL2SQL++ v8 with code-with-thought reasoning
Fine-tuning Method: LoRA (Low-Rank Adaptation) with Unsloth
Quantization: 16-bit merged weights
Training Dataset Size: (5000, 1) examples
Validation Dataset Size: (0, 0) examples

Training Configuration

output_dir: trainer_output
per_device_train_batch_size: 8
num_train_epochs: 3.0
max_steps: -1
learning_rate: 1e-05
lr_scheduler_type: SchedulerType.COSINE
lr_scheduler_kwargs: None
warmup_steps: 0.1
optim: OptimizerNames.ADAMW_TORCH_FUSED
optim_args: None
weight_decay: 0.01
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
optim_target_modules: None
gradient_accumulation_steps: 4
average_tokens_across_devices: True
max_grad_norm: 1.0
label_smoothing_factor: 0.0
bf16: True
fp16: False
bf16_full_eval: False
fp16_full_eval: False
tf32: None
gradient_checkpointing: True
gradient_checkpointing_kwargs: None
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
use_liger_kernel: False
liger_kernel_config: None
use_cache: False
neftune_noise_alpha: None
torch_empty_cache_steps: None
auto_find_batch_size: False
logging_strategy: IntervalStrategy.STEPS
logging_steps: 3
logging_first_step: False
log_on_each_node: True
logging_nan_inf_filter: True
include_num_input_tokens_seen: no
log_level: passive
log_level_replica: warning
disable_tqdm: False
report_to: ['wandb']
run_name: None
project: huggingface
trackio_space_id: trackio
eval_strategy: IntervalStrategy.STEPS
eval_steps: 50
eval_delay: 0
per_device_eval_batch_size: 8
prediction_loss_only: False
eval_on_start: False
eval_do_concat_batches: True
eval_use_gather_object: False
eval_accumulation_steps: 10
include_for_metrics: []
batch_eval_metrics: False
save_only_model: False
save_strategy: SaveStrategy.BEST
save_steps: 50
save_on_each_node: False
save_total_limit: 2
enable_jit_checkpoint: False
push_to_hub: False
hub_token: None
hub_private_repo: None
hub_model_id: None
hub_strategy: HubStrategy.EVERY_SAVE
hub_always_push: False
hub_revision: None
load_best_model_at_end: True
metric_for_best_model: eval_loss
greater_is_better: False
ignore_data_skip: False
restore_callback_states_from_checkpoint: False
full_determinism: False
seed: 3407
data_seed: None
use_cpu: False
accelerator_config: AcceleratorConfig(split_batches=False, dispatch_batches=None, even_batches=True, use_seedable_sampler=True, non_blocking=False, gradient_accumulation_kwargs=None, use_configured_state=False)
parallelism_config: None
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_pin_memory: True
dataloader_persistent_workers: False
dataloader_prefetch_factor: None
remove_unused_columns: True
label_names: None
train_sampling_strategy: random
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: None
ddp_backend: None
ddp_timeout: 1800
fsdp: []
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
deepspeed: None
debug: []
skip_memory_metrics: True
do_train: False
do_eval: True
do_predict: False
resume_from_checkpoint: None
warmup_ratio: 0.1
logging_dir: None
local_rank: -1
model_init_kwargs: None
chat_template_path: None
dataset_text_field: text
dataset_kwargs: None
dataset_num_proc: None
eos_token: None
pad_token: None
max_length: 1024
packing: False
packing_strategy: bfd
padding_free: False
pad_to_multiple_of: None
eval_packing: None
completion_only_loss: None
assistant_only_loss: False
loss_type: nll
activation_offloading: False
vllm_sampling_params: None
unsloth_num_chunks: -1
unsloth_logit_chunk_multiplier: None
unsloth_grpo_mini_batch: None
max_seq_length: 12500
model_name: Snowflake/Arctic-Text2SQL-R1-7B
train_batch_size: 4
val_batch_size: 1
num_epochs: 1.5
lora_use_rslora: True
lora_r: 64
lora_alpha: 128
lora_dropout: 0.1
model_specs: ArcticSpecs(instruction_part='<|im_start|>user', response_part='<|im_start|>assistant', chat_template=None)

Train Dataset Example

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user

You are an expert in SQL++ query generation. Given a document schema and a natural language query, generate a valid SQL++ query.

Task Instructions:
- Backtick-quote field names that are reserved keywords or contain spaces/special characters.
  WRONG: SELECT value, Enrollment (K-12) ...
  RIGHT: SELECT `value`, `Enrollment (K-12)` ...

- SUBSTR is 0-based: SUBSTR(str, 0, 4) returns the first 4 characters. Use this for year extraction from date strings.
  WRONG: SUBSTR(dob, 1, 4) = '1990'
  RIGHT: SUBSTR(dob, 0, 4) = '1990'

- Only use keyspaces and fields present in the schema; do not infer array, object, or foreign-key structure unless the schema shows it.
  WRONG: UNNEST t.tags AS tag (when `tags` is a plain string in schema)
  RIGHT: WHERE t.tags = 'sports'

- Never use CAST(); it is not supported in SQL++.
  WRONG: CAST(price AS FLOAT)
  RIGHT: TO_NUMBER(price)

- Use the exact field named in the question; do not substitute a related variant.
  WRONG: question asks for `revenue`, query uses `total_sales`
  RIGHT: query uses `revenue`

- When similar fields exist, prefer the one whose name most literally matches the question; use sample values to distinguish (e.g., `type` vs `types`, `id` vs `uuid`).
  WRONG: question asks for "account type", query uses `types` (samples: [1,2,3])
  RIGHT: uses `type` (samples: ["savings","checking"])

- Prefer a direct count or pre-aggregated field over computing it from related records when one exists.
  WRONG: (SELECT COUNT(*) FROM reviews r WHERE r.product_id = p.id) >= 3
  RIGHT: WHERE p.review_count >= 3

- Wrap string fields in TO_NUMBER() before numeric aggregation or ordering.
  WRONG: AVG(p.score)  when score is stored as "8.5"
  RIGHT: AVG(TO_NUMBER(p.score))

- If one collection contains all needed fields and filters, do not join.
  WRONG: FROM orders o JOIN orders o2 ON ...
  RIGHT: FROM orders o WHERE o.status = 'shipped'

- Use DISTINCT when unique values are requested or when a join could produce duplicates.
  WRONG: SELECT c.id FROM customers c JOIN orders o ON c.id = o.customer_id
  RIGHT: SELECT DISTINCT c.id ...

- For yes/no questions, return a single existence answer, not matching rows.
  WRONG: SELECT e.name FROM employees e WHERE e.dept = 'HR'
  RIGHT: SELECT COUNT(*) > 0 FROM employees e WHERE e.dept = 'HR'

- When listing entities with no specified attribute, return the entity identifier.
  WRONG: question says "list employees", query returns SELECT e.name
  RIGHT: SELECT e.id

- No colon after FROM.
  WRONG: FROM: orders o
  RIGHT: FROM orders o

- Every alias in a statement must be unique. Couchbase does not allow the same alias to be assigned more than once, even across subqueries or when referencing the same collection.
  WRONG: SELECT * FROM orders o WHERE o.id IN (SELECT RAW o.ref_id FROM orders o WHERE ...)
  RIGHT: SELECT * FROM orders o WHERE o.id IN (SELECT RAW o2.ref_id FROM orders o2 WHERE ...)

- Match literal types to schema field types; quote string-typed fields even when values look numeric.
  WRONG: WHERE zip_code = 10001  (zip_code type is string in the schema)
  RIGHT: WHERE zip_code = "10001"

Bucket Name: 
Scope Name: 

Collection Schema:
{"`bird_training_bucket`.`video_games`.`region_sales`": {"Flavor": "", "properties": {"game_platform_id": {"samples": [585], "type": "number"}, "num_sales": {"samples": [0], "type": "number"}, "region_id": {"samples": [1], "type": "number"}}, "type": "object"}, "`bird_training_bucket`.`video_games`.`region`": {"Flavor": "", "properties": {"id": {"samples": [1], "type": "number"}, "region_name": {"samples": ["Europe"], "type": "string"}}, "type": "object"}, "`bird_training_bucket`.`video_games`.`game_platform`": {"Flavor": "", "properties": {"game_publisher_id": {"samples": [5017], "type": "number"}, "id": {"samples": [3357], "type": "number"}, "platform_id": {"samples": [7], "type": "number"}, "release_year": {"samples": [1995], "type": "number"}}, "type": "object"}, "`bird_training_bucket`.`video_games`.`game_publisher`": {"Flavor": "", "properties": {"game_id": {"samples": [65], "type": "number"}, "id": {"samples": [289], "type": "number"}, "publisher_id": {"samples": [11], "type": "number"}}, "type": "object"}, "`bird_training_bucket`.`video_games`.`publisher`": {"Flavor": "", "properties": {"id": {"samples": [62], "type": "number"}, "publisher_name": {"samples": ["Banpresto"], "type": "string"}}, "type": "object"}, "`bird_training_bucket`.`video_games`.`platform`": {"Flavor": "", "properties": {"id": {"samples": [5], "type": "number"}, "platform_name": {"samples": ["GEN"], "type": "string"}}, "type": "object"}, "`bird_training_bucket`.`video_games`.`game`": {"Flavor": "", "properties": {"game_name": {"samples": ["3rd Super Robot Wars Z: Tengok..."], "type": "string"}, "genre_id": {"samples": [1], "type": "number"}, "id": {"samples": [64], "type": "number"}}, "type": "object"}, "`bird_training_bucket`.`video_games`.`genre`": {"Flavor": "", "properties": {"genre_name": {"samples": ["Puzzle"], "type": "string"}, "id": {"samples": [6], "type": "number"}}, "type": "object"}}

Natural Language Query:
game refers to game_name; on the SCD platform refers to platform_name = 'SCD'
Provide the games that can be played on the SCD platform.

SQL++ Query:
<|im_end|>
<|im_start|>assistant
<think>
Step 1 — Read the Question
I start from the natural language request: provide the games that can be played on the SCD platform, where game refers to game_name and SCD refers to platform_name = 'SCD'.

From this, the query needs to:
- Retrieve game names.
- Restrict results to rows associated with a specific platform name, SCD.
- Use the video game domain entities: game, platform, and the linking structures that connect games to platforms.
- Return only the requested game_name field, not extra descriptive columns.
- No aggregation, no ranking, and no temporal logic are implied.

Step 2 — Yes/No Collection Scan
`bird_training_bucket`.`video_games`.`region_sales` -> No: the question is about games on a platform, not regional sales.
`bird_training_bucket`.`video_games`.`region` -> No: regions are unrelated to identifying games on the SCD platform.
`bird_training_bucket`.`video_games`.`game_platform` -> Yes: this looks like the bridge that can connect games to platforms.
`bird_training_bucket`.`video_games`.`game_publisher` -> Yes: this may be part of the path from game-related rows to games.
`bird_training_bucket`.`video_games`.`publisher` -> No: publisher names are not asked for and do not identify platform availability.
`bird_training_bucket`.`video_games`.`platform` -> Yes: the filter is explicitly on platform_name = 'SCD'.
`bird_training_bucket`.`video_games`.`game` -> Yes: the output is game_name.
`bird_training_bucket`.`video_games`.`genre` -> No: genre information is not part of the request.

Step 3 — Inventory Each Yes Collection
Full path: `bird_training_bucket`.`video_games`.`game_platform`
Holds: id (row identifier), platform_id (numeric reference to a platform), game_publisher_id (numeric reference to a game-publisher link), release_year (when the platform release is recorded)
Does NOT hold: game_name, platform_name, or any direct readable game title; those must come from other collections.

Full path: `bird_training_bucket`.`video_games`.`game_publisher`
Holds: id (row identifier), game_id (numeric reference to a game), publisher_id (numeric reference to a publisher)
Does NOT hold: platform_name or game_name as a literal text field; it only provides the path from a platform-linked row to a game.

Full path: `bird_training_bucket`.`video_games`.`platform`
Holds: id (platform identifier), platform_name (human-readable platform label)
Does NOT hold: game_name or any game linkage directly; it only identifies the platform and supports the filter.

Full path: `bird_training_bucket`.`video_games`.`game`
Holds: id (game identifier), game_name (human-readable game title), genre_id (genre reference)
Does NOT hold: platform_name or platform_id; it is the destination for the final output but not the source of the platform filter.

Step 4 — Assign Collection Roles
Primary collection: `bird_training_bucket`.`video_games`.`game_platform` — because Step 3 shows it contains the platform-related identifier and the bridge field needed to reach the game through another collection.

Secondary collections:
- `bird_training_bucket`.`video_games`.`platform` — needed because the primary does not contain platform_name, which is required for filtering to SCD.
- `bird_training_bucket`.`video_games`.`game_publisher` — needed because the primary does not contain game_id, and this link is necessary to reach the game row.
- `bird_training_bucket`.`video_games`.`game` — needed because the final output field game_name exists here, not in the bridge collections.

Rejected Yes collections: none. Each Yes collection contributes a required part of the actual path to the output or filter.

Step 5 — Map Every Required Field
Output fields:
- T4.game_name -> output as game_name

Filter fields:
- T2.platform_name = 'SCD' — keep only rows where the platform label is SCD

Join keys:
- T1.platform_id = T2.id — the platform reference in the bridge matches the platform document identifier
- T1.game_publisher_id = T3.id — the bridge row points to a game_publisher row by its identifier
- T3.game_id = T4.id — the game_publisher row points to the game row by its identifier

Nested paths: none beyond these direct fields.

Step 6 — Decide Join and UNNEST Strategy
Join 1: `game_platform` ↔ `platform`
- Relationship: each bridge row records which platform it belongs to.
- Key: T1.platform_id = T2.id
- Type: INNER — the query only wants games for an actual matching SCD platform, so unmatched rows must not survive.

Join 2: result ↔ `game_publisher`
- Relationship: the bridge row connects onward to a game-publisher row.
- Key: T1.game_publisher_id = T3.id
- Type: INNER — only rows with a valid downstream link should remain, since the output depends on reaching a game.

Join 3: result ↔ `game`
- Relationship: the game-publisher row identifies the game.
- Key: T3.game_id = T4.id
- Type: INNER — the query needs the game name, so only matched game rows survive.

There are no UNNEST operations because no array fields are involved.

Step 7 — Separate ON from WHERE
ON conditions:
- T1.platform_id = T2.id — defines the relationship between the bridge and the platform table.
- T1.game_publisher_id = T3.id — defines the relationship between the bridge and the game_publisher table.
- T3.game_id = T4.id — defines the relationship between game_publisher and game.

WHERE conditions:
- T2.platform_name = 'SCD' — filters by platform value, not by relationship.

Step 8 — Handle Arrays
No arrays involved. All referenced fields are scalar numeric or string fields, so UNNEST, ANY...SATISFIES, ARRAY_LENGTH, and array projection are not needed.

Step 9 — Filters and Aggregation
WHERE conditions:
- T2.platform_name = 'SCD' — only rows for the SCD platform are kept.

Aggregation:
- None. The query is a straightforward filtered lookup returning game names.

Step 10 — Output Shape
SELECT fields:
- T4.game_name AS game_name

DISTINCT:
- Not used. The statement returns all matching rows as produced by the joins, with no deduplication step.

ORDER BY:
- Not used.

LIMIT:
- Not used.

RAW:
- Not used.

Step 11 — Alias Usage
- T1 -> `bird_training_bucket`.`video_games`.`game_platform`, the bridge table starting point.
- T2 -> `bird_training_bucket`.`video_games`.`platform`, used for the platform_name filter.
- T3 -> `bird_training_bucket`.`video_games`.`game_publisher`, used to reach the game identifier.
- T4 -> `bird_training_bucket`.`video_games`.`game`, used for the final game_name output.

Disambiguation:
- T2.platform_name is the platform label being filtered as SCD.
- T4.game_name is the game title being returned.
- The repeated id fields are distinguished by alias: T1.platform_id, T2.id, T1.game_publisher_id, T3.id, T3.game_id, and T4.id each serve different links in the chain.
</think>

The correct SQL++ query is:

<sqlpp>
SELECT T4.`game_name` FROM `bird_training_bucket`.`video_games`.`game_platform` AS T1 INNER JOIN `bird_training_bucket`.`video_games`.`platform` AS T2 ON T1.`platform_id` = T2.`id` INNER JOIN `bird_training_bucket`.`video_games`.`game_publisher` AS T3 ON T1.`game_publisher_id` = T3.`id` INNER JOIN `bird_training_bucket`.`video_games`.`game` AS T4 ON T3.`game_id` = T4.`id` WHERE T2.`platform_name` = 'SCD'
</sqlpp>
<|im_end|>