Instructions to use jastorj/snowflake_arctic_text2sql_r1_7b-nl2sqlpp-16bit-v5.7.8_phase_1-cw-5K with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Local Apps
- Unsloth Studio new
How to use jastorj/snowflake_arctic_text2sql_r1_7b-nl2sqlpp-16bit-v5.7.8_phase_1-cw-5K with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for jastorj/snowflake_arctic_text2sql_r1_7b-nl2sqlpp-16bit-v5.7.8_phase_1-cw-5K to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for jastorj/snowflake_arctic_text2sql_r1_7b-nl2sqlpp-16bit-v5.7.8_phase_1-cw-5K to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for jastorj/snowflake_arctic_text2sql_r1_7b-nl2sqlpp-16bit-v5.7.8_phase_1-cw-5K to start chatting
Load model with FastModel
pip install unsloth from unsloth import FastModel model, tokenizer = FastModel.from_pretrained( model_name="jastorj/snowflake_arctic_text2sql_r1_7b-nl2sqlpp-16bit-v5.7.8_phase_1-cw-5K", max_seq_length=2048, )
Snowflake/Arctic-Text2SQL-R1-7B Fine-tuned for NL2SQL++ v8
This model is a fine-tuned version of Snowflake/Arctic-Text2SQL-R1-7B on the NL2SQL++ v8 dataset with code-with-thought reasoning.
Model Details
- Base Model: Snowflake/Arctic-Text2SQL-R1-7B
- Task: Text-to-SQL generation
- Dataset: NL2SQL++ v8 with code-with-thought reasoning
- Fine-tuning Method: LoRA (Low-Rank Adaptation) with Unsloth
- Quantization: 16-bit merged weights
- Training Dataset Size: (39243, 1) examples
- Validation Dataset Size: (1062, 1) examples
Training Configuration
- output_dir: trainer_output
- overwrite_output_dir: False
- do_train: False
- do_eval: True
- do_predict: False
- eval_strategy: IntervalStrategy.STEPS
- prediction_loss_only: False
- per_device_train_batch_size: 8
- per_device_eval_batch_size: 8
- per_gpu_train_batch_size: None
- per_gpu_eval_batch_size: None
- gradient_accumulation_steps: 2
- eval_accumulation_steps: 10
- eval_delay: 0
- torch_empty_cache_steps: None
- learning_rate: 1e-05
- weight_decay: 0.01
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- max_grad_norm: 1.0
- num_train_epochs: 3.0
- max_steps: -1
- lr_scheduler_type: SchedulerType.COSINE
- lr_scheduler_kwargs: {}
- warmup_ratio: 0.1
- warmup_steps: 0
- log_level: passive
- log_level_replica: warning
- log_on_each_node: True
- logging_dir: trainer_output/runs/May16_17-37-18_ip-172-31-10-229.ap-northeast-1.compute.internal
- logging_strategy: IntervalStrategy.STEPS
- logging_first_step: False
- logging_steps: 3
- logging_nan_inf_filter: True
- save_strategy: SaveStrategy.BEST
- save_steps: 50
- save_total_limit: 2
- save_safetensors: True
- save_on_each_node: False
- save_only_model: False
- restore_callback_states_from_checkpoint: False
- no_cuda: False
- use_cpu: False
- use_mps_device: False
- seed: 3407
- data_seed: None
- jit_mode_eval: False
- bf16: True
- fp16: False
- fp16_opt_level: O1
- half_precision_backend: auto
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- local_rank: 0
- ddp_backend: None
- tpu_num_cores: None
- tpu_metrics_debug: False
- debug: []
- dataloader_drop_last: False
- eval_steps: 50
- dataloader_num_workers: 0
- dataloader_prefetch_factor: None
- past_index: -1
- run_name: None
- disable_tqdm: False
- remove_unused_columns: True
- label_names: None
- load_best_model_at_end: True
- metric_for_best_model: eval_loss
- greater_is_better: False
- ignore_data_skip: False
- fsdp: []
- fsdp_min_num_params: 0
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- fsdp_transformer_layer_cls_to_wrap: None
- accelerator_config: AcceleratorConfig(split_batches=False, dispatch_batches=None, even_batches=True, use_seedable_sampler=True, non_blocking=False, gradient_accumulation_kwargs=None, use_configured_state=False)
- parallelism_config: None
- deepspeed: None
- label_smoothing_factor: 0.0
- optim: OptimizerNames.ADAMW_TORCH_FUSED
- optim_args: None
- adafactor: False
- group_by_length: False
- length_column_name: length
- report_to: ['wandb']
- project: huggingface
- trackio_space_id: trackio
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: None
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- skip_memory_metrics: True
- use_legacy_prediction_loop: False
- push_to_hub: False
- resume_from_checkpoint: None
- hub_model_id: None
- hub_strategy: HubStrategy.EVERY_SAVE
- hub_token: None
- hub_private_repo: None
- hub_always_push: False
- hub_revision: None
- gradient_checkpointing: True
- gradient_checkpointing_kwargs: None
- include_inputs_for_metrics: False
- include_for_metrics: []
- eval_do_concat_batches: True
- fp16_backend: auto
- push_to_hub_model_id: None
- push_to_hub_organization: None
- push_to_hub_token: None
- _n_gpu: 1
- mp_parameters:
- auto_find_batch_size: False
- full_determinism: False
- torchdynamo: None
- ray_scope: last
- ddp_timeout: 1800
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- include_tokens_per_second: False
- include_num_input_tokens_seen: no
- neftune_noise_alpha: None
- optim_target_modules: None
- batch_eval_metrics: False
- eval_on_start: False
- use_liger_kernel: False
- liger_kernel_config: None
- eval_use_gather_object: False
- average_tokens_across_devices: True
- model_init_kwargs: None
- chat_template_path: None
- dataset_text_field: text
- dataset_kwargs: None
- dataset_num_proc: None
- eos_token: None
- pad_token: None
- max_length: 1024
- packing: False
- packing_strategy: bfd
- padding_free: False
- pad_to_multiple_of: None
- eval_packing: None
- completion_only_loss: None
- assistant_only_loss: False
- loss_type: nll
- activation_offloading: False
- vllm_sampling_params: None
- unsloth_num_chunks: -1
- unsloth_logit_chunk_multiplier: None
- unsloth_grpo_mini_batch: None
- max_seq_length: 5500
- model_name: Snowflake/Arctic-Text2SQL-R1-7B
- train_batch_size: 8
- val_batch_size: 1
- num_epochs: 2
- lora_use_rslora: True
- lora_r: 64
- lora_alpha: 128
- lora_dropout: 0.1
Train Dataset Example
<|im_start|>system
You are an expert in SQL++ query generation. Given a document schema and a natural language query, generate a valid SQL++ query.
Task Instructions:
- Backtick-quote field names that are reserved keywords or contain spaces/special characters.
WRONG: SELECT value, Enrollment (K-12) ...
RIGHT: SELECT `value`, `Enrollment (K-12)` ...
- SUBSTR is 0-based: SUBSTR(str, 0, 4) returns the first 4 characters. Use this for year extraction from date strings.
WRONG: SUBSTR(dob, 1, 4) = '1990'
RIGHT: SUBSTR(dob, 0, 4) = '1990'
- Only use keyspaces and fields present in the schema; do not infer array, object, or foreign-key structure unless the schema shows it.
WRONG: UNNEST t.tags AS tag (when `tags` is a plain string in schema)
RIGHT: WHERE t.tags = 'sports'
- Never use CAST(); it is not supported in SQL++.
WRONG: CAST(price AS FLOAT)
RIGHT: TO_NUMBER(price)
- Use the exact field named in the question; do not substitute a related variant.
WRONG: question asks for `revenue`, query uses `total_sales`
RIGHT: query uses `revenue`
- When similar fields exist, prefer the one whose name most literally matches the question; use sample values to distinguish (e.g., `type` vs `types`, `id` vs `uuid`).
WRONG: question asks for "account type", query uses `types` (samples: [1,2,3])
RIGHT: uses `type` (samples: ["savings","checking"])
- Prefer a direct count or pre-aggregated field over computing it from related records when one exists.
WRONG: (SELECT COUNT(*) FROM reviews r WHERE r.product_id = p.id) >= 3
RIGHT: WHERE p.review_count >= 3
- Wrap string fields in TO_NUMBER() before numeric aggregation or ordering.
WRONG: AVG(p.score) when score is stored as "8.5"
RIGHT: AVG(TO_NUMBER(p.score))
- If one collection contains all needed fields and filters, do not join.
WRONG: FROM orders o JOIN orders o2 ON ...
RIGHT: FROM orders o WHERE o.status = 'shipped'
- Use DISTINCT when unique values are requested or when a join could produce duplicates.
WRONG: SELECT c.id FROM customers c JOIN orders o ON c.id = o.customer_id
RIGHT: SELECT DISTINCT c.id ...
- For yes/no questions, return a single existence answer, not matching rows.
WRONG: SELECT e.name FROM employees e WHERE e.dept = 'HR'
RIGHT: SELECT COUNT(*) > 0 FROM employees e WHERE e.dept = 'HR'
- When listing entities with no specified attribute, return the entity identifier.
WRONG: question says "list employees", query returns SELECT e.name
RIGHT: SELECT e.id
- No colon after FROM.
WRONG: FROM: orders o
RIGHT: FROM orders o
- Every alias in a statement must be unique. Couchbase does not allow the same alias to be assigned more than once, even across subqueries or when referencing the same collection.
WRONG: SELECT * FROM orders o WHERE o.id IN (SELECT RAW o.ref_id FROM orders o WHERE ...)
RIGHT: SELECT * FROM orders o WHERE o.id IN (SELECT RAW o2.ref_id FROM orders o2 WHERE ...)
- Match literal types to schema field types; quote string-typed fields even when values look numeric.
WRONG: WHERE zip_code = 10001 (zip_code type is string in the schema)
RIGHT: WHERE zip_code = "10001"
<|im_end|>
<|im_start|>user
Bucket Name: training_bucket
Scope Name: Publishing_House_and_Book_Distribution_scope
Collection Schema:
{}
Natural Language Query:
Help me find all blog posts published on our platform that have received comments from users, where I need to see the full post content including title, author information, publication date, category tags, and view count, along with all the comment details such as commenter name, comment text, timestamp, and any replies to those comments, and also include posts that haven't received any comments yet so I can see which content isn't engaging our audience.
The SQL++ query you generated:
```sql++
-- NOTE: Collections 'blog_posts' and 'comments' do not exist in the specified scope.
-- Without collection schemas, this query cannot be corrected.
SELECT bp.*, c.commenter_name, c.comment_text, c.timestamp AS comment_timestamp, c.replies
FROM `training_bucket`.`Publishing_House_and_Book_Distribution_scope`.blog_posts AS bp
LEFT JOIN `training_bucket`.`Publishing_House_and_Book_Distribution_scope`.comments AS c ON META(bp).id = c.post_id;
returned an error: first_error_code: 12003 | first_error_message: Keyspace not found in CB datastore: default:training_bucket.Publishing_House_and_Book_Distribution_scope.blog_posts (near line 5, column 6) | finderr: [{'$1': [{'applies_to': 'Server', 'code': 12003, 'description': 'Keyspace not found in CB datastore: «details»', 'reason': ['A keyspace referenced in the statement did not exist.'], 'user_action': ['Check all expected keyspaces have been created and are correctly referenced in the statement.'], 'user_error': 'Yes'}]}]
Please analyze the error and provide a corrected SQL++ query. <|im_end|> <|im_start|>assistant
**Error Analysis:** Error code 12003 indicates 'Publishing_House_and_Book_Distribution_scope' with 'blog_posts' and 'comments' does not exist. After checking system:keyspaces, there is no traditional blog system, but 'Educational_Learning_Management_Ecosystem_scope' has 'discussion_posts' which can represent blog posts and comments in a hierarchical structure. The natural language requires LEFT JOIN (focus keyword) to include posts without comments, allowing identification of non-engaging content. The corrected query uses a self-join on discussion_posts where parent posts (parent_id IS MISSING/NULL) are left-joined with their child comments (posts with a parent_id), retrieving all post details and associated comment information.Corrected SQL++ query:
SELECT p.*, p.`author` AS author_information, p.`published_at` AS publication_date, p.`tags` AS category_tags, p.`views` AS view_count, c.`author` AS commenter_name, c.`content` AS comment_text, c.`created_at` AS comment_timestamp FROM `training_bucket`.`Educational_Learning_Management_Ecosystem_scope`.`discussion_posts` AS p LEFT JOIN `training_bucket`.`Educational_Learning_Management_Ecosystem_scope`.`discussion_posts` AS c ON META(p).`id` = c.`parent_id` WHERE p.`parent_id` IS MISSING OR p.`parent_id` IS NULL;
<|im_end|>
- Downloads last month
- 121
Model tree for jastorj/snowflake_arctic_text2sql_r1_7b-nl2sqlpp-16bit-v5.7.8_phase_1-cw-5K
Base model
Qwen/Qwen2.5-7B