Shen-Pandi committed abe4da5 (verified) · 1 parent: b84780e

Upload README.md with huggingface_hub

Files changed (1): README.md ADDED (+78 −0)
---
language: en
pipeline_tag: text-generation
library_name: transformers
tags:
- llama
- data-management
- data-engineering
- migration
- sql
- reasoning
- grpo
- rlhf
license: other
base_model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
---

# Agentic Data 1

A specialized 8B-parameter reasoning model fine-tuned for data management, data engineering, and migration tasks.

## Model Details

- **Base**: DeepSeek-R1-Distill-Llama-8B
- **Training**: 3-stage pipeline (SFT QLoRA → Doc-Grounded SFT → GRPO Reinforcement Learning)
- **Format**: BF16 SafeTensors (compatible with PyTorch / Hugging Face Transformers)
- **Parameters**: 8B

## Training Pipeline

| Stage | Method | Data | Hardware |
|---------|-----------------------------|----------------------------------------------------|------------------------|
| Stage 1 | QLoRA SFT (3 versions) | 14,666 synthetic pairs + 7,558 doc-grounded chunks | Apple Silicon M-Series |
| Stage 2 | GRPO Reinforcement Learning | 100 reasoning prompts with reward functions | NVIDIA H100 80GB |

## Capabilities

- **SQL Dialect Conversion**: Oracle ↔ PostgreSQL ↔ T-SQL ↔ Snowflake ↔ BigQuery ↔ Databricks
- **ETL Pipeline Migration**: Informatica → dbt, DataStage → Spark, BODS → Airflow
- **Legacy System Modernization**: COBOL, JCL, SAS, ABAP → modern stacks
- **Data Quality & Governance**: assessment, validation, and compliance
- **Migration Lifecycle**: Discovery → Risk → Planning → Conversion → Verification
- **Step-by-Step Reasoning**: uses `<think>...</think>` tags for chain-of-thought reasoning

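As an illustration of the dialect-conversion task, here is a minimal Python sketch of a token-level Oracle → PostgreSQL rewrite. The mapping table and function names are hypothetical and for illustration only; the model itself performs full semantic conversion, not find-and-replace:

```python
import re

# Hypothetical Oracle -> PostgreSQL rewrites (illustrative, not exhaustive).
ORACLE_TO_POSTGRES = {
    r"\bNVL\s*\(": "COALESCE(",
    r"\bSYSDATE\b": "CURRENT_TIMESTAMP",
    r"\bROWNUM\b": "ROW_NUMBER() OVER ()",
}

def convert_oracle_sql(sql: str) -> str:
    """Apply simple token-level Oracle -> PostgreSQL rewrites."""
    for pattern, replacement in ORACLE_TO_POSTGRES.items():
        sql = re.sub(pattern, replacement, sql, flags=re.IGNORECASE)
    return sql

print(convert_oracle_sql("SELECT NVL(name, 'n/a'), SYSDATE FROM employees"))
# SELECT COALESCE(name, 'n/a'), CURRENT_TIMESTAMP FROM employees
```

Real PL/SQL → PL/pgSQL migration also has to handle packages, cursors, and exception semantics, which is exactly where a reasoning model goes beyond rule tables.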
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "DataManagement-AI/Agentic-Data-1",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("DataManagement-AI/Agentic-Data-1")

messages = [
    {"role": "system", "content": "You are Agentic Data 1, an expert data management and migration reasoning model. Think step-by-step before answering."},
    {"role": "user", "content": "Convert this Oracle PL/SQL stored procedure to PostgreSQL PL/pgSQL."},
]
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to(model.device)
outputs = model.generate(inputs, max_new_tokens=1500)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```
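
Because the model wraps its chain-of-thought in `<think>...</think>` tags, a small helper (hypothetical, not part of this repository) can separate the reasoning from the final answer:

```python
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Split generated text into (reasoning, answer) using <think> tags."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        # No think block emitted; treat everything as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()
    return reasoning, answer

reasoning, answer = split_reasoning(
    "<think>Oracle NVL maps to COALESCE in PostgreSQL.</think>Use COALESCE."
)
```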

## Benchmarks (SFT V3)

| Metric | Base Model | Agentic Data 1 | Improvement |
|------------------------|------------|----------------|-------------|
| Overall Score | 0.554 | **0.636** | +14.8% |
| Implementation Quality | 0.584 | **0.761** | +30.3% |
| Think-Tag Rate | 0% | **100%** | ∞ |
| Reasoning Quality | 0.534 | **0.622** | +16.5% |

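The Improvement column is relative, i.e. (Agentic − Base) / Base. A quick check in Python:

```python
def rel_improvement(base: float, new: float) -> float:
    """Relative improvement of `new` over `base`, as a percentage."""
    return round((new - base) / base * 100, 1)

print(rel_improvement(0.554, 0.636))  # 14.8  (Overall Score)
print(rel_improvement(0.584, 0.761))  # 30.3  (Implementation Quality)
print(rel_improvement(0.534, 0.622))  # 16.5  (Reasoning Quality)
```

The Think-Tag Rate row is shown as ∞ because the baseline is 0%, so a relative improvement is undefined.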
## License

For research and educational purposes.