Instructions to use Sohoxic/snowflake_arctic_text2sql_r1_7b-nl2sqlpp-16bit-v5.7.5_sft_5k-cw-12K with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Local Apps
- Unsloth Studio new
How to use Sohoxic/snowflake_arctic_text2sql_r1_7b-nl2sqlpp-16bit-v5.7.5_sft_5k-cw-12K with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for Sohoxic/snowflake_arctic_text2sql_r1_7b-nl2sqlpp-16bit-v5.7.5_sft_5k-cw-12K to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for Sohoxic/snowflake_arctic_text2sql_r1_7b-nl2sqlpp-16bit-v5.7.5_sft_5k-cw-12K to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for Sohoxic/snowflake_arctic_text2sql_r1_7b-nl2sqlpp-16bit-v5.7.5_sft_5k-cw-12K to start chatting
Load model with FastModel
pip install unsloth from unsloth import FastModel model, tokenizer = FastModel.from_pretrained( model_name="Sohoxic/snowflake_arctic_text2sql_r1_7b-nl2sqlpp-16bit-v5.7.5_sft_5k-cw-12K", max_seq_length=2048, )
Snowflake/Arctic-Text2SQL-R1-7B Fine-tuned for NL2SQL++ v8
This model is a fine-tuned version of Snowflake/Arctic-Text2SQL-R1-7B on the NL2SQL++ v8 dataset with code-with-thought reasoning.
Model Details
- Base Model: Snowflake/Arctic-Text2SQL-R1-7B
- Task: Text-to-SQL generation
- Dataset: NL2SQL++ v8 with code-with-thought reasoning
- Fine-tuning Method: LoRA (Low-Rank Adaptation) with Unsloth
- Quantization: 16-bit merged weights
- Training Dataset Size: (5000, 1) examples
- Validation Dataset Size: (0, 0) examples
Training Configuration
- output_dir: trainer_output
- per_device_train_batch_size: 8
- num_train_epochs: 3.0
- max_steps: -1
- learning_rate: 1e-05
- lr_scheduler_type: SchedulerType.COSINE
- lr_scheduler_kwargs: None
- warmup_steps: 0.1
- optim: OptimizerNames.ADAMW_TORCH_FUSED
- optim_args: None
- weight_decay: 0.01
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- optim_target_modules: None
- gradient_accumulation_steps: 4
- average_tokens_across_devices: True
- max_grad_norm: 1.0
- label_smoothing_factor: 0.0
- bf16: True
- fp16: False
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- gradient_checkpointing: True
- gradient_checkpointing_kwargs: None
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- use_liger_kernel: False
- liger_kernel_config: None
- use_cache: False
- neftune_noise_alpha: None
- torch_empty_cache_steps: None
- auto_find_batch_size: False
- logging_strategy: IntervalStrategy.STEPS
- logging_steps: 3
- logging_first_step: False
- log_on_each_node: True
- logging_nan_inf_filter: True
- include_num_input_tokens_seen: no
- log_level: passive
- log_level_replica: warning
- disable_tqdm: False
- report_to: ['wandb']
- run_name: None
- project: huggingface
- trackio_space_id: trackio
- eval_strategy: IntervalStrategy.STEPS
- eval_steps: 50
- eval_delay: 0
- per_device_eval_batch_size: 8
- prediction_loss_only: False
- eval_on_start: False
- eval_do_concat_batches: True
- eval_use_gather_object: False
- eval_accumulation_steps: 10
- include_for_metrics: []
- batch_eval_metrics: False
- save_only_model: False
- save_strategy: SaveStrategy.BEST
- save_steps: 50
- save_on_each_node: False
- save_total_limit: 2
- enable_jit_checkpoint: False
- push_to_hub: False
- hub_token: None
- hub_private_repo: None
- hub_model_id: None
- hub_strategy: HubStrategy.EVERY_SAVE
- hub_always_push: False
- hub_revision: None
- load_best_model_at_end: True
- metric_for_best_model: eval_loss
- greater_is_better: False
- ignore_data_skip: False
- restore_callback_states_from_checkpoint: False
- full_determinism: False
- seed: 3407
- data_seed: None
- use_cpu: False
- accelerator_config: AcceleratorConfig(split_batches=False, dispatch_batches=None, even_batches=True, use_seedable_sampler=True, non_blocking=False, gradient_accumulation_kwargs=None, use_configured_state=False)
- parallelism_config: None
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- dataloader_prefetch_factor: None
- remove_unused_columns: True
- label_names: None
- train_sampling_strategy: random
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: None
- ddp_backend: None
- ddp_timeout: 1800
- fsdp: []
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- deepspeed: None
- debug: []
- skip_memory_metrics: True
- do_train: False
- do_eval: True
- do_predict: False
- resume_from_checkpoint: None
- warmup_ratio: 0.1
- logging_dir: None
- local_rank: -1
- model_init_kwargs: None
- chat_template_path: None
- dataset_text_field: text
- dataset_kwargs: None
- dataset_num_proc: None
- eos_token: None
- pad_token: None
- max_length: 1024
- packing: False
- packing_strategy: bfd
- padding_free: False
- pad_to_multiple_of: None
- eval_packing: None
- completion_only_loss: None
- assistant_only_loss: False
- loss_type: nll
- activation_offloading: False
- vllm_sampling_params: None
- unsloth_num_chunks: -1
- unsloth_logit_chunk_multiplier: None
- unsloth_grpo_mini_batch: None
- max_seq_length: 12500
- model_name: Snowflake/Arctic-Text2SQL-R1-7B
- train_batch_size: 4
- val_batch_size: 1
- num_epochs: 1.5
- lora_use_rslora: True
- lora_r: 64
- lora_alpha: 128
- lora_dropout: 0.1
- model_specs: ArcticSpecs(instruction_part='<|im_start|>user', response_part='<|im_start|>assistant', chat_template=None)
Train Dataset Example
<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
You are an expert in SQL++ query generation. Given a document schema and a natural language query, generate a valid SQL++ query.
Task Instructions:
- Backtick-quote field names that are reserved keywords or contain spaces/special characters.
WRONG: SELECT value, Enrollment (K-12) ...
RIGHT: SELECT `value`, `Enrollment (K-12)` ...
- SUBSTR is 0-based: SUBSTR(str, 0, 4) returns the first 4 characters. Use this for year extraction from date strings.
WRONG: SUBSTR(dob, 1, 4) = '1990'
RIGHT: SUBSTR(dob, 0, 4) = '1990'
- Only use keyspaces and fields present in the schema; do not infer array, object, or foreign-key structure unless the schema shows it.
WRONG: UNNEST t.tags AS tag (when `tags` is a plain string in schema)
RIGHT: WHERE t.tags = 'sports'
- Never use CAST(); it is not supported in SQL++.
WRONG: CAST(price AS FLOAT)
RIGHT: TO_NUMBER(price)
- Use the exact field named in the question; do not substitute a related variant.
WRONG: question asks for `revenue`, query uses `total_sales`
RIGHT: query uses `revenue`
- When similar fields exist, prefer the one whose name most literally matches the question; use sample values to distinguish (e.g., `type` vs `types`, `id` vs `uuid`).
WRONG: question asks for "account type", query uses `types` (samples: [1,2,3])
RIGHT: uses `type` (samples: ["savings","checking"])
- Prefer a direct count or pre-aggregated field over computing it from related records when one exists.
WRONG: (SELECT COUNT(*) FROM reviews r WHERE r.product_id = p.id) >= 3
RIGHT: WHERE p.review_count >= 3
- Wrap string fields in TO_NUMBER() before numeric aggregation or ordering.
WRONG: AVG(p.score) when score is stored as "8.5"
RIGHT: AVG(TO_NUMBER(p.score))
- If one collection contains all needed fields and filters, do not join.
WRONG: FROM orders o JOIN orders o2 ON ...
RIGHT: FROM orders o WHERE o.status = 'shipped'
- Use DISTINCT when unique values are requested or when a join could produce duplicates.
WRONG: SELECT c.id FROM customers c JOIN orders o ON c.id = o.customer_id
RIGHT: SELECT DISTINCT c.id ...
- For yes/no questions, return a single existence answer, not matching rows.
WRONG: SELECT e.name FROM employees e WHERE e.dept = 'HR'
RIGHT: SELECT COUNT(*) > 0 FROM employees e WHERE e.dept = 'HR'
- When listing entities with no specified attribute, return the entity identifier.
WRONG: question says "list employees", query returns SELECT e.name
RIGHT: SELECT e.id
- No colon after FROM.
WRONG: FROM: orders o
RIGHT: FROM orders o
- Every alias in a statement must be unique. Couchbase does not allow the same alias to be assigned more than once, even across subqueries or when referencing the same collection.
WRONG: SELECT * FROM orders o WHERE o.id IN (SELECT RAW o.ref_id FROM orders o WHERE ...)
RIGHT: SELECT * FROM orders o WHERE o.id IN (SELECT RAW o2.ref_id FROM orders o2 WHERE ...)
- Match literal types to schema field types; quote string-typed fields even when values look numeric.
WRONG: WHERE zip_code = 10001 (zip_code type is string in the schema)
RIGHT: WHERE zip_code = "10001"
Bucket Name:
Scope Name:
Collection Schema:
{"`bird_training_bucket`.`video_games`.`region_sales`": {"Flavor": "", "properties": {"game_platform_id": {"samples": [585], "type": "number"}, "num_sales": {"samples": [0], "type": "number"}, "region_id": {"samples": [1], "type": "number"}}, "type": "object"}, "`bird_training_bucket`.`video_games`.`region`": {"Flavor": "", "properties": {"id": {"samples": [1], "type": "number"}, "region_name": {"samples": ["Europe"], "type": "string"}}, "type": "object"}, "`bird_training_bucket`.`video_games`.`game_platform`": {"Flavor": "", "properties": {"game_publisher_id": {"samples": [5017], "type": "number"}, "id": {"samples": [3357], "type": "number"}, "platform_id": {"samples": [7], "type": "number"}, "release_year": {"samples": [1995], "type": "number"}}, "type": "object"}, "`bird_training_bucket`.`video_games`.`game_publisher`": {"Flavor": "", "properties": {"game_id": {"samples": [65], "type": "number"}, "id": {"samples": [289], "type": "number"}, "publisher_id": {"samples": [11], "type": "number"}}, "type": "object"}, "`bird_training_bucket`.`video_games`.`publisher`": {"Flavor": "", "properties": {"id": {"samples": [62], "type": "number"}, "publisher_name": {"samples": ["Banpresto"], "type": "string"}}, "type": "object"}, "`bird_training_bucket`.`video_games`.`platform`": {"Flavor": "", "properties": {"id": {"samples": [5], "type": "number"}, "platform_name": {"samples": ["GEN"], "type": "string"}}, "type": "object"}, "`bird_training_bucket`.`video_games`.`game`": {"Flavor": "", "properties": {"game_name": {"samples": ["3rd Super Robot Wars Z: Tengok..."], "type": "string"}, "genre_id": {"samples": [1], "type": "number"}, "id": {"samples": [64], "type": "number"}}, "type": "object"}, "`bird_training_bucket`.`video_games`.`genre`": {"Flavor": "", "properties": {"genre_name": {"samples": ["Puzzle"], "type": "string"}, "id": {"samples": [6], "type": "number"}}, "type": "object"}}
Natural Language Query:
game refers to game_name; on the SCD platform refers to platform_name = 'SCD'
Provide the games that can be played on the SCD platform.
SQL++ Query:
<|im_end|>
<|im_start|>assistant
<think>
Step 1 — Read the Question
I start from the natural language request: provide the games that can be played on the SCD platform, where game refers to game_name and SCD refers to platform_name = 'SCD'.
From this, the query needs to:
- Retrieve game names.
- Restrict results to rows associated with a specific platform name, SCD.
- Use the video game domain entities: game, platform, and the linking structures that connect games to platforms.
- Return only the requested game_name field, not extra descriptive columns.
- No aggregation, no ranking, and no temporal logic are implied.
Step 2 — Yes/No Collection Scan
`bird_training_bucket`.`video_games`.`region_sales` -> No: the question is about games on a platform, not regional sales.
`bird_training_bucket`.`video_games`.`region` -> No: regions are unrelated to identifying games on the SCD platform.
`bird_training_bucket`.`video_games`.`game_platform` -> Yes: this looks like the bridge that can connect games to platforms.
`bird_training_bucket`.`video_games`.`game_publisher` -> Yes: this may be part of the path from game-related rows to games.
`bird_training_bucket`.`video_games`.`publisher` -> No: publisher names are not asked for and do not identify platform availability.
`bird_training_bucket`.`video_games`.`platform` -> Yes: the filter is explicitly on platform_name = 'SCD'.
`bird_training_bucket`.`video_games`.`game` -> Yes: the output is game_name.
`bird_training_bucket`.`video_games`.`genre` -> No: genre information is not part of the request.
Step 3 — Inventory Each Yes Collection
Full path: `bird_training_bucket`.`video_games`.`game_platform`
Holds: id (row identifier), platform_id (numeric reference to a platform), game_publisher_id (numeric reference to a game-publisher link), release_year (when the platform release is recorded)
Does NOT hold: game_name, platform_name, or any direct readable game title; those must come from other collections.
Full path: `bird_training_bucket`.`video_games`.`game_publisher`
Holds: id (row identifier), game_id (numeric reference to a game), publisher_id (numeric reference to a publisher)
Does NOT hold: platform_name or game_name as a literal text field; it only provides the path from a platform-linked row to a game.
Full path: `bird_training_bucket`.`video_games`.`platform`
Holds: id (platform identifier), platform_name (human-readable platform label)
Does NOT hold: game_name or any game linkage directly; it only identifies the platform and supports the filter.
Full path: `bird_training_bucket`.`video_games`.`game`
Holds: id (game identifier), game_name (human-readable game title), genre_id (genre reference)
Does NOT hold: platform_name or platform_id; it is the destination for the final output but not the source of the platform filter.
Step 4 — Assign Collection Roles
Primary collection: `bird_training_bucket`.`video_games`.`game_platform` — because Step 3 shows it contains the platform-related identifier and the bridge field needed to reach the game through another collection.
Secondary collections:
- `bird_training_bucket`.`video_games`.`platform` — needed because the primary does not contain platform_name, which is required for filtering to SCD.
- `bird_training_bucket`.`video_games`.`game_publisher` — needed because the primary does not contain game_id, and this link is necessary to reach the game row.
- `bird_training_bucket`.`video_games`.`game` — needed because the final output field game_name exists here, not in the bridge collections.
Rejected Yes collections: none. Each Yes collection contributes a required part of the actual path to the output or filter.
Step 5 — Map Every Required Field
Output fields:
- T4.game_name -> output as game_name
Filter fields:
- T2.platform_name = 'SCD' — keep only rows where the platform label is SCD
Join keys:
- T1.platform_id = T2.id — the platform reference in the bridge matches the platform document identifier
- T1.game_publisher_id = T3.id — the bridge row points to a game_publisher row by its identifier
- T3.game_id = T4.id — the game_publisher row points to the game row by its identifier
Nested paths: none beyond these direct fields.
Step 6 — Decide Join and UNNEST Strategy
Join 1: `game_platform` ↔ `platform`
- Relationship: each bridge row records which platform it belongs to.
- Key: T1.platform_id = T2.id
- Type: INNER — the query only wants games for an actual matching SCD platform, so unmatched rows must not survive.
Join 2: result ↔ `game_publisher`
- Relationship: the bridge row connects onward to a game-publisher row.
- Key: T1.game_publisher_id = T3.id
- Type: INNER — only rows with a valid downstream link should remain, since the output depends on reaching a game.
Join 3: result ↔ `game`
- Relationship: the game-publisher row identifies the game.
- Key: T3.game_id = T4.id
- Type: INNER — the query needs the game name, so only matched game rows survive.
There are no UNNEST operations because no array fields are involved.
Step 7 — Separate ON from WHERE
ON conditions:
- T1.platform_id = T2.id — defines the relationship between the bridge and the platform table.
- T1.game_publisher_id = T3.id — defines the relationship between the bridge and the game_publisher table.
- T3.game_id = T4.id — defines the relationship between game_publisher and game.
WHERE conditions:
- T2.platform_name = 'SCD' — filters by platform value, not by relationship.
Step 8 — Handle Arrays
No arrays involved. All referenced fields are scalar numeric or string fields, so UNNEST, ANY...SATISFIES, ARRAY_LENGTH, and array projection are not needed.
Step 9 — Filters and Aggregation
WHERE conditions:
- T2.platform_name = 'SCD' — only rows for the SCD platform are kept.
Aggregation:
- None. The query is a straightforward filtered lookup returning game names.
Step 10 — Output Shape
SELECT fields:
- T4.game_name AS game_name
DISTINCT:
- Not used. The statement returns all matching rows as produced by the joins, with no deduplication step.
ORDER BY:
- Not used.
LIMIT:
- Not used.
RAW:
- Not used.
Step 11 — Alias Usage
- T1 -> `bird_training_bucket`.`video_games`.`game_platform`, the bridge table starting point.
- T2 -> `bird_training_bucket`.`video_games`.`platform`, used for the platform_name filter.
- T3 -> `bird_training_bucket`.`video_games`.`game_publisher`, used to reach the game identifier.
- T4 -> `bird_training_bucket`.`video_games`.`game`, used for the final game_name output.
Disambiguation:
- T2.platform_name is the platform label being filtered as SCD.
- T4.game_name is the game title being returned.
- The repeated id fields are distinguished by alias: T1.platform_id, T2.id, T1.game_publisher_id, T3.id, T3.game_id, and T4.id each serve different links in the chain.
</think>
The correct SQL++ query is:
<sqlpp>
SELECT T4.`game_name` FROM `bird_training_bucket`.`video_games`.`game_platform` AS T1 INNER JOIN `bird_training_bucket`.`video_games`.`platform` AS T2 ON T1.`platform_id` = T2.`id` INNER JOIN `bird_training_bucket`.`video_games`.`game_publisher` AS T3 ON T1.`game_publisher_id` = T3.`id` INNER JOIN `bird_training_bucket`.`video_games`.`game` AS T4 ON T3.`game_id` = T4.`id` WHERE T2.`platform_name` = 'SCD'
</sqlpp>
<|im_end|>
- Downloads last month
- 246
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support
Model tree for Sohoxic/snowflake_arctic_text2sql_r1_7b-nl2sqlpp-16bit-v5.7.5_sft_5k-cw-12K
Base model
Qwen/Qwen2.5-7B Finetuned
Qwen/Qwen2.5-Coder-7B Finetuned
Qwen/Qwen2.5-Coder-7B-Instruct Finetuned
Snowflake/Arctic-Text2SQL-R1-7B