Snowflake/Arctic-Text2SQL-R1-7B Fine-tuned for NL2SQL++ v8

This model is a fine-tuned version of Snowflake/Arctic-Text2SQL-R1-7B on the NL2SQL++ v8 dataset with code-with-thought reasoning.

Model Details

Base Model: Snowflake/Arctic-Text2SQL-R1-7B
Task: Text-to-SQL generation
Dataset: NL2SQL++ v8 with code-with-thought reasoning
Fine-tuning Method: LoRA (Low-Rank Adaptation) with Unsloth
Quantization: 16-bit merged weights
Training Dataset Size: (39243, 1) examples
Validation Dataset Size: (1062, 1) examples

Training Configuration

output_dir: trainer_output
overwrite_output_dir: False
do_train: False
do_eval: True
do_predict: False
eval_strategy: IntervalStrategy.STEPS
prediction_loss_only: False
per_device_train_batch_size: 8
per_device_eval_batch_size: 8
per_gpu_train_batch_size: None
per_gpu_eval_batch_size: None
gradient_accumulation_steps: 2
eval_accumulation_steps: 10
eval_delay: 0
torch_empty_cache_steps: None
learning_rate: 1e-05
weight_decay: 0.01
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
max_grad_norm: 1.0
num_train_epochs: 3.0
max_steps: -1
lr_scheduler_type: SchedulerType.COSINE
lr_scheduler_kwargs: {}
warmup_ratio: 0.1
warmup_steps: 0
log_level: passive
log_level_replica: warning
log_on_each_node: True
logging_dir: trainer_output/runs/May16_17-37-18_ip-172-31-10-229.ap-northeast-1.compute.internal
logging_strategy: IntervalStrategy.STEPS
logging_first_step: False
logging_steps: 3
logging_nan_inf_filter: True
save_strategy: SaveStrategy.BEST
save_steps: 50
save_total_limit: 2
save_safetensors: True
save_on_each_node: False
save_only_model: False
restore_callback_states_from_checkpoint: False
no_cuda: False
use_cpu: False
use_mps_device: False
seed: 3407
data_seed: None
jit_mode_eval: False
bf16: True
fp16: False
fp16_opt_level: O1
half_precision_backend: auto
bf16_full_eval: False
fp16_full_eval: False
tf32: None
local_rank: 0
ddp_backend: None
tpu_num_cores: None
tpu_metrics_debug: False
debug: []
dataloader_drop_last: False
eval_steps: 50
dataloader_num_workers: 0
dataloader_prefetch_factor: None
past_index: -1
run_name: None
disable_tqdm: False
remove_unused_columns: True
label_names: None
load_best_model_at_end: True
metric_for_best_model: eval_loss
greater_is_better: False
ignore_data_skip: False
fsdp: []
fsdp_min_num_params: 0
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
fsdp_transformer_layer_cls_to_wrap: None
accelerator_config: AcceleratorConfig(split_batches=False, dispatch_batches=None, even_batches=True, use_seedable_sampler=True, non_blocking=False, gradient_accumulation_kwargs=None, use_configured_state=False)
parallelism_config: None
deepspeed: None
label_smoothing_factor: 0.0
optim: OptimizerNames.ADAMW_TORCH_FUSED
optim_args: None
adafactor: False
group_by_length: False
length_column_name: length
report_to: ['wandb']
project: huggingface
trackio_space_id: trackio
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: None
dataloader_pin_memory: True
dataloader_persistent_workers: False
skip_memory_metrics: True
use_legacy_prediction_loop: False
push_to_hub: False
resume_from_checkpoint: None
hub_model_id: None
hub_strategy: HubStrategy.EVERY_SAVE
hub_token: None
hub_private_repo: None
hub_always_push: False
hub_revision: None
gradient_checkpointing: True
gradient_checkpointing_kwargs: None
include_inputs_for_metrics: False
include_for_metrics: []
eval_do_concat_batches: True
fp16_backend: auto
push_to_hub_model_id: None
push_to_hub_organization: None
push_to_hub_token: None
_n_gpu: 1
mp_parameters:
auto_find_batch_size: False
full_determinism: False
torchdynamo: None
ray_scope: last
ddp_timeout: 1800
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
include_tokens_per_second: False
include_num_input_tokens_seen: no
neftune_noise_alpha: None
optim_target_modules: None
batch_eval_metrics: False
eval_on_start: False
use_liger_kernel: False
liger_kernel_config: None
eval_use_gather_object: False
average_tokens_across_devices: True
model_init_kwargs: None
chat_template_path: None
dataset_text_field: text
dataset_kwargs: None
dataset_num_proc: None
eos_token: None
pad_token: None
max_length: 1024
packing: False
packing_strategy: bfd
padding_free: False
pad_to_multiple_of: None
eval_packing: None
completion_only_loss: None
assistant_only_loss: False
loss_type: nll
activation_offloading: False
vllm_sampling_params: None
unsloth_num_chunks: -1
unsloth_logit_chunk_multiplier: None
unsloth_grpo_mini_batch: None
max_seq_length: 5500
model_name: Snowflake/Arctic-Text2SQL-R1-7B
train_batch_size: 8
val_batch_size: 1
num_epochs: 2
lora_use_rslora: True
lora_r: 64
lora_alpha: 128
lora_dropout: 0.1

Train Dataset Example

<|im_start|>system
You are an expert in SQL++ query generation. Given a document schema and a natural language query, generate a valid SQL++ query.

Task Instructions:
- Backtick-quote field names that are reserved keywords or contain spaces/special characters.
  WRONG: SELECT value, Enrollment (K-12) ...
  RIGHT: SELECT `value`, `Enrollment (K-12)` ...

- SUBSTR is 0-based: SUBSTR(str, 0, 4) returns the first 4 characters. Use this for year extraction from date strings.
  WRONG: SUBSTR(dob, 1, 4) = '1990'
  RIGHT: SUBSTR(dob, 0, 4) = '1990'

- Only use keyspaces and fields present in the schema; do not infer array, object, or foreign-key structure unless the schema shows it.
  WRONG: UNNEST t.tags AS tag (when `tags` is a plain string in schema)
  RIGHT: WHERE t.tags = 'sports'

- Never use CAST(); it is not supported in SQL++.
  WRONG: CAST(price AS FLOAT)
  RIGHT: TO_NUMBER(price)

- Use the exact field named in the question; do not substitute a related variant.
  WRONG: question asks for `revenue`, query uses `total_sales`
  RIGHT: query uses `revenue`

- When similar fields exist, prefer the one whose name most literally matches the question; use sample values to distinguish (e.g., `type` vs `types`, `id` vs `uuid`).
  WRONG: question asks for "account type", query uses `types` (samples: [1,2,3])
  RIGHT: uses `type` (samples: ["savings","checking"])

- Prefer a direct count or pre-aggregated field over computing it from related records when one exists.
  WRONG: (SELECT COUNT(*) FROM reviews r WHERE r.product_id = p.id) >= 3
  RIGHT: WHERE p.review_count >= 3

- Wrap string fields in TO_NUMBER() before numeric aggregation or ordering.
  WRONG: AVG(p.score)  when score is stored as "8.5"
  RIGHT: AVG(TO_NUMBER(p.score))

- If one collection contains all needed fields and filters, do not join.
  WRONG: FROM orders o JOIN orders o2 ON ...
  RIGHT: FROM orders o WHERE o.status = 'shipped'

- Use DISTINCT when unique values are requested or when a join could produce duplicates.
  WRONG: SELECT c.id FROM customers c JOIN orders o ON c.id = o.customer_id
  RIGHT: SELECT DISTINCT c.id ...

- For yes/no questions, return a single existence answer, not matching rows.
  WRONG: SELECT e.name FROM employees e WHERE e.dept = 'HR'
  RIGHT: SELECT COUNT(*) > 0 FROM employees e WHERE e.dept = 'HR'

- When listing entities with no specified attribute, return the entity identifier.
  WRONG: question says "list employees", query returns SELECT e.name
  RIGHT: SELECT e.id

- No colon after FROM.
  WRONG: FROM: orders o
  RIGHT: FROM orders o

- Every alias in a statement must be unique. Couchbase does not allow the same alias to be assigned more than once, even across subqueries or when referencing the same collection.
  WRONG: SELECT * FROM orders o WHERE o.id IN (SELECT RAW o.ref_id FROM orders o WHERE ...)
  RIGHT: SELECT * FROM orders o WHERE o.id IN (SELECT RAW o2.ref_id FROM orders o2 WHERE ...)

- Match literal types to schema field types; quote string-typed fields even when values look numeric.
  WRONG: WHERE zip_code = 10001  (zip_code type is string in the schema)
  RIGHT: WHERE zip_code = "10001"
<|im_end|>
<|im_start|>user

Bucket Name: training_bucket
Scope Name: Publishing_House_and_Book_Distribution_scope

Collection Schema:
{}

Natural Language Query:
Help me find all blog posts published on our platform that have received comments from users, where I need to see the full post content including title, author information, publication date, category tags, and view count, along with all the comment details such as commenter name, comment text, timestamp, and any replies to those comments, and also include posts that haven't received any comments yet so I can see which content isn't engaging our audience.

The SQL++ query you generated:

```sql++
-- NOTE: Collections 'blog_posts' and 'comments' do not exist in the specified scope.
-- Without collection schemas, this query cannot be corrected.

SELECT bp.*, c.commenter_name, c.comment_text, c.timestamp AS comment_timestamp, c.replies
FROM `training_bucket`.`Publishing_House_and_Book_Distribution_scope`.blog_posts AS bp
LEFT JOIN `training_bucket`.`Publishing_House_and_Book_Distribution_scope`.comments AS c ON META(bp).id = c.post_id;

returned an error: first_error_code: 12003 | first_error_message: Keyspace not found in CB datastore: default:training_bucket.Publishing_House_and_Book_Distribution_scope.blog_posts (near line 5, column 6) | finderr: [{'$1': [{'applies_to': 'Server', 'code': 12003, 'description': 'Keyspace not found in CB datastore: «details»', 'reason': ['A keyspace referenced in the statement did not exist.'], 'user_action': ['Check all expected keyspaces have been created and are correctly referenced in the statement.'], 'user_error': 'Yes'}]}]

Please analyze the error and provide a corrected SQL++ query. <|im_end|> <|im_start|>assistant

**Error Analysis:** Error code 12003 indicates 'Publishing_House_and_Book_Distribution_scope' with 'blog_posts' and 'comments' does not exist. After checking system:keyspaces, there is no traditional blog system, but 'Educational_Learning_Management_Ecosystem_scope' has 'discussion_posts' which can represent blog posts and comments in a hierarchical structure. The natural language requires LEFT JOIN (focus keyword) to include posts without comments, allowing identification of non-engaging content. The corrected query uses a self-join on discussion_posts where parent posts (parent_id IS MISSING/NULL) are left-joined with their child comments (posts with a parent_id), retrieving all post details and associated comment information.

Corrected SQL++ query:

SELECT p.*, p.`author` AS author_information, p.`published_at` AS publication_date, p.`tags` AS category_tags, p.`views` AS view_count, c.`author` AS commenter_name, c.`content` AS comment_text, c.`created_at` AS comment_timestamp FROM `training_bucket`.`Educational_Learning_Management_Ecosystem_scope`.`discussion_posts` AS p LEFT JOIN `training_bucket`.`Educational_Learning_Management_Ecosystem_scope`.`discussion_posts` AS c ON META(p).`id` = c.`parent_id` WHERE p.`parent_id` IS MISSING OR p.`parent_id` IS NULL;

<|im_end|>