sriram882004 committed on
Commit 8b270e3 · verified · 1 Parent(s): b76ac41

Update README.md

  1. README.md +143 -18
README.md CHANGED
@@ -1,29 +1,154 @@
- ## Method

- Our approach is structured in three phases to support Socratic SQL instruction for higher education.

- ### Phase 1: SQL-Instruct Corpus Construction

- We construct **SQL-Instruct**, a domain-specific Socratic instruction corpus, by mining high-quality interactions from Stack Overflow. This platform provides a rich source of real-world misconceptions, debugging challenges, and conceptual difficulties encountered by both students and practitioners, making it well-suited for training models that emphasize understanding over code replication.

- To ensure data quality, we filter SQL-tagged questions based on community impact. The resulting dataset reflects substantial engagement, with a cumulative reach of approximately **1.27 billion views** and an average of **128,535 views per question**. For each selected instance, we extract:
- - The core problem description
- - User-provided SQL attempts (when available)
- - Executable SQL blocks from the accepted solution

- This process yields **9,916 unique questions**, which are then transformed into Socratic instructional data using GPT-4o. We leverage GPT-4o for its strong reasoning capabilities to generate **pedagogical hints and guided reasoning steps**, ensuring that the dataset emphasizes conceptual scaffolding rather than direct answers.

- The dataset is intentionally skewed toward higher cognitive complexity, with:
- - **8,604 intermediate-level questions**
- - **629 advanced-level questions**

- Additionally, we identify a subset of **531 debugging tasks**, enabling models to learn how to guide students through error identification and correction in SQL queries.

- The corpus spans a wide range of SQL topics, with particular emphasis on:
  - JOIN operations
- - Aggregation and grouping
- - Query optimization and performance

- By selecting questions with a **median Stack Overflow score of 27**, we ensure that the underlying solutions, and therefore the derived instructional signals, are technically reliable.

- This corpus serves as the foundation for training models that prioritize **Socratic reasoning, misconception-aware feedback, and conceptual understanding** over direct SQL solution generation.
+ ---
+ license: mit
+ tags:
+ - text-to-sql
+ - education
+ - socratic-learning
+ - instruction-tuning
+ - sql
+ - STEM
+ - pedagogy
+ datasets:
+ - SQL-Instruct
+ ---

+ # SQL Socratic Models

+ ## Model Description

+ SQL Socratic Models are a collection of fine-tuned large language models designed for **Socratic SQL instruction in higher education**. Unlike standard Text-to-SQL systems, these models are trained to **guide learners through reasoning steps without producing final SQL solutions**, supporting conceptual understanding and active learning in STEM contexts.

+ Supported architectures:
+ - Phi-3
+ - Qwen2.5
+ - Gemma2

+ ---

+ ## Intended Use

+ These models are designed for:
+
+ - Teaching SQL concepts in higher education
+ - Supporting STEM learners through guided reasoning
+ - Providing step-by-step Socratic hints for SQL problems
+ - Assisting debugging and conceptual clarification
+
+ ### Important Constraint
+ The models are intentionally trained to:
+ - ✅ Provide reasoning steps and conceptual hints
+ - ❌ Avoid generating complete SQL solutions
+
+ This ensures alignment with pedagogical goals such as scaffolding and learner engagement.
+
+ ---
+
+ ## Training Data: SQL-Instruct Corpus
+
+ We construct **SQL-Instruct**, a domain-specific Socratic instruction corpus, by mining high-quality interactions from Stack Overflow. This platform captures real-world misconceptions, debugging challenges, and conceptual gaps encountered by learners and practitioners.
+
+ ### Data Collection
+
+ To ensure high-quality instructional signals, we filter SQL-tagged questions based on community impact. The resulting dataset has:
+
+ - **1.27 billion total views**
+ - **128,535 average views per question**
+
+ For each selected entry, we extract:
+ - Problem descriptions
+ - User-submitted SQL attempts
+ - Executable SQL from accepted solutions
+
+ This yields **9,916 unique questions**.
+
+ ---
+
+ ### Socratic Augmentation
+
+ Each example is transformed into a Socratic instructional format using GPT-4o, which generates:
+
+ - Guided reasoning steps
+ - Conceptual hints
+ - Question decomposition
+
+ This ensures the dataset emphasizes **instructional scaffolding rather than answer generation**.
+
+ ---
+
+ ### Dataset Composition
+
+ - **Intermediate questions:** 8,604
+ - **Advanced questions:** 629
+ - **Debugging tasks:** 531
+
+ The dataset emphasizes challenging reasoning scenarios, particularly:

  - JOIN operations
+ - Aggregations and grouping
+ - Query optimization
+
+ We further ensure reliability by selecting entries with a **median Stack Overflow score of 27**.
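The corpus statistics reported in the README can be cross-checked against each other. The following is a minimal sketch in Python that uses only the figures stated above; the variable names are illustrative, and treating the remainder as lower-difficulty questions is an assumption, since the README does not label that group.

```python
# Figures as reported in the README.
total_questions = 9_916
avg_views_per_question = 128_535
intermediate = 8_604
advanced = 629

# Average views times question count should reproduce the reported total reach.
implied_total_views = total_questions * avg_views_per_question

# Questions labelled neither intermediate nor advanced (assumed to be lower
# difficulty; the README does not state this explicitly).
remaining = total_questions - intermediate - advanced

print(f"implied total views: {implied_total_views / 1e9:.2f} billion")
print(f"intermediate + advanced share: {(intermediate + advanced) / total_questions:.1%}")
print(f"remaining questions: {remaining}")
```

The implied total is consistent with the reported 1.27 billion views. The 531 debugging tasks are described in the earlier README text as a subset, so they are not subtracted here.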
+
+ ---
+
+ ## Training Procedure
+
+ ### Phase 2: Fine-Tuning
+
+ We apply **Full Fine-Tuning (FFT)** on small, open-source LLMs under pedagogical constraints designed to:
+
+ - Encourage conceptual scaffolding
+ - Promote step-by-step reasoning
+ - Discourage direct SQL answer generation
+
+ ---
+
+ ## Evaluation
+
+ ### Phase 3 Metrics
+
+ Models are evaluated using:
+
+ - **BERTScore** → semantic alignment with expected reasoning
+ - **ROUGE-L** → detection of answer leakage (i.e., unintended full SQL generation)
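To illustrate why ROUGE-L can serve as a leakage signal: it is based on the longest common subsequence (LCS) between two texts, so a reply that reproduces the reference SQL scores close to 1, while a conceptual hint scores low. The sketch below is a plain-Python approximation under several assumptions that do not come from the README: whitespace tokenization, the balanced F1 variant, a hypothetical threshold of 0.5, and made-up example strings. The authors' actual evaluation code may differ.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over lowercased, whitespace-split tokens (an assumed tokenization)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


LEAKAGE_THRESHOLD = 0.5  # hypothetical cut-off, not taken from the README

# Illustrative strings only; they are not drawn from SQL-Instruct.
reference_sql = "SELECT name, COUNT(*) FROM orders GROUP BY name"
socratic_hint = "Which column should each row be grouped by before you count?"
leaky_reply = "SELECT name, COUNT(*) FROM orders GROUP BY name"

for reply in (socratic_hint, leaky_reply):
    score = rouge_l_f1(reply, reference_sql)
    print(f"ROUGE-L F1 = {score:.2f}, leak = {score >= LEAKAGE_THRESHOLD}")
```

Under this reading, a lower ROUGE-L against the reference solution is desirable, which is the opposite of how the metric is normally used in summarization.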
+
+ ---
+
+ ## Key Contributions
+
+ - Socratic SQL instruction tuning for higher education
+ - SQL-Instruct dataset derived from real-world misconceptions
+ - Multi-model fine-tuning across Phi-3, Qwen2.5, and Gemma2
+ - Evaluation framework balancing reasoning quality and answer leakage
+ - Ablation study identifying factors enabling:
+ - Misconception-based feedback
+ - Iterative guidance
+ - Instructor-like reasoning behavior
+
+ ---
+
+ ## Limitations
+
+ - Models may still occasionally generate partial SQL fragments
+ - Evaluation focuses on semantic similarity rather than full pedagogical outcomes
+ - Dataset is derived from Stack Overflow and may reflect community biases
+
+ ---
+
+ ## Ethical Considerations
+
+ These models are designed to support learning, not replace it. By avoiding full solution generation, they aim to:
+
+ - Encourage critical thinking
+ - Reduce over-reliance on AI-generated answers
+ - Support equitable access to SQL learning resources
+
+ ---
+
+ ## Usage

+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer

+ # Hub repo ids are "namespace/name"; phi3_rq4 is assumed to be a subfolder of the repo.
+ repo_id = "sriram882004/SQL-Socratic-Models"
+ model = AutoModelForCausalLM.from_pretrained(repo_id, subfolder="phi3_rq4")
+ tokenizer = AutoTokenizer.from_pretrained(repo_id, subfolder="phi3_rq4")