sweelol committed 059b1c1 (verified) · Parent(s): 63f4f09

Update README.md

Files changed (1): README.md (+165 -6)

README.md CHANGED
@@ -1,29 +1,188 @@
-
 ---
 license: apache-2.0
 tags:
 - sweelol-ai
 - text-generation
 - gemma
 - distillation
 - pruning
 - lora
 - prompt-tuning
 ---

- # {model_name}

 ## Model Description

 This model is part of the **Sweelol AI Hub** collection, resulting from experiments in efficient fine-tuning and knowledge distillation on the Gemma-3-270m architecture using the Databricks Dolly-15k dataset on Kaggle TPUs/GPUs.

- **Full Research Notebook & Benchmark Results:** [Link to your final Kaggle Benchmark notebook here]

 **Key Details:**
 * **Base Model:** `google/gemma-3-270m`
 * **Training Data:** Databricks Dolly-15k (subset)
- * **Fine-Tuning Method:** {method_description}
- * **Purpose:** {purpose}

- This is a placeholder README. A detailed model card with full results and usage instructions will be added shortly.

 ---
 license: apache-2.0
 tags:
 - sweelol-ai
+ - gemma
+ - google
 - text-generation
 - gemma
 - distillation
 - pruning
 - lora
 - prompt-tuning
+ - instruction-tuning
+ datasets:
+ - databricks/databricks-dolly-15k
+ language:
+ - en
+ base_model:
+ - google/gemma-3-270m
+ library_name: transformers
+ pipeline_tag: text-generation
+
 ---

+ # sweelol/kd-gemma3-pruned-dolly
+
+ This model is part of the **Sweelol AI Hub**, a research project focused on efficient fine-tuning of modern language models on Kaggle accelerators.
+
+ **Full Research Notebook & Benchmark Results:** [Coming soon]
+
+ It was produced through experiments in efficient fine-tuning, optimization strategies, and knowledge distillation on the Gemma-3-270m architecture, using the Databricks Dolly-15k dataset on Kaggle TPUs/GPUs.
+
+ - **Developed by:** Sweelol AI
+ - **Shared by:** Sweelol AI
+ - **Model type:** Causal Language Model
+ - **Language(s) (NLP):** English
+ - **License:** Apache 2.0
+ - **Base Model:** `google/gemma-3-270m`

 ## Model Description

 This model is part of the **Sweelol AI Hub** collection, resulting from experiments in efficient fine-tuning and knowledge distillation on the Gemma-3-270m architecture using the Databricks Dolly-15k dataset on Kaggle TPUs/GPUs.

 **Key Details:**
 * **Base Model:** `google/gemma-3-270m`
 * **Training Data:** Databricks Dolly-15k (subset)
+ * **Fine-Tuning Method:** `Knowledge Distillation` (see the sketch below)
+ * **Purpose:** `Knowledge Distillation on TPU`
+
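The distillation training code itself is not included on this card. Purely as an illustration, the following is a minimal, hypothetical sketch of the kind of objective the `Knowledge Distillation` method above refers to: a temperature-scaled KL term between teacher and student logits blended with the usual next-token cross-entropy. The temperature, loss weighting, and teacher/student pairing shown here are assumptions, not documented values.

```python
# Hypothetical sketch of a knowledge-distillation objective (not the actual training code).
# Assumptions: temperature=2.0 and alpha=0.5 are illustrative only.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend soft-target KL (teacher -> student) with hard-label cross-entropy."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard next-token cross-entropy against the labels.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha * kd + (1.0 - alpha) * ce

# Tiny self-contained check with random tensors (batch=2, seq=4, vocab=8).
student = torch.randn(2, 4, 8)
teacher = torch.randn(2, 4, 8)
labels = torch.randint(0, 8, (2, 4))
print(distillation_loss(student, teacher, labels))
```

In practice the teacher would be a larger or un-pruned checkpoint and the student this 270M model, but that pairing is likewise an assumption rather than something documented on this card.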
+ ### Model Sources
+
+ - **Repository:** `https://huggingface.co/sweelol/kd-gemma3-pruned-dolly`
+ - **GitHub:**
+
+ ## Uses
+
+ ### How to Get Started with the Model
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ # For PEFT adapters (LoRA, Prompt Tuning) you will also need:
+ # from peft import PeftModel
+
+ # Repository ID of this specialized model
+ model_id = "sweelol/kd-gemma3-pruned-dolly"
+
+ # For full-model checkpoints such as this one:
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # For PEFT models (LoRA, Prompt Tuning), load the base model first:
+ # base_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m", torch_dtype="auto")
+ # model = PeftModel.from_pretrained(base_model, model_id)
+ # tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # Example usage:
+ prompt = "Instruction:\nWhat is the capital of France?\n\nResponse:\n"
+ inputs = tokenizer(prompt, return_tensors="pt")
+
+ generate_ids = model.generate(**inputs, max_new_tokens=50)
+ result = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+
+ print(result)
+ ```
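The prompt template used during fine-tuning is not documented on this card. Since databricks-dolly-15k records carry `instruction`, `context`, and `response` fields, a small helper along these lines (a hypothetical sketch mirroring the Instruction/Response layout in the example above, with an assumed "Context:" section) can be used to build prompts:

```python
# Hypothetical prompt builder; the "Context:" section is an assumption based on the
# databricks-dolly-15k fields, not on documented training code.
def build_prompt(instruction: str, context: str = "") -> str:
    if context:
        return f"Instruction:\n{instruction}\n\nContext:\n{context}\n\nResponse:\n"
    return f"Instruction:\n{instruction}\n\nResponse:\n"

prompt = build_prompt("Summarize the passage.", "Gemma 3 270M is a small open model from Google.")
```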
+
+ ## Evaluation
+
+ ### Testing Data & Metrics
+
+ The model was evaluated with the `lm-evaluation-harness` on five diverse **MMLU** subsets (academic reasoning) and **HellaSwag** (common-sense reasoning). The primary metric is zero-shot accuracy, computed on a 200-sample subset of each task's test split.
+
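The exact harness invocation is not given on this card. Below is a minimal, hypothetical sketch of how such a run could look with the `lm_eval` Python API; the specific MMLU task names, batch size, and dtype are assumptions and may vary across harness versions.

```python
# Hypothetical evaluation sketch using lm-evaluation-harness (pip install lm-eval).
# The card only states: five MMLU subsets, HellaSwag, zero-shot, 200 samples per task.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=sweelol/kd-gemma3-pruned-dolly,dtype=auto",
    tasks=[
        "hellaswag",
        "mmlu_high_school_computer_science",
        "mmlu_formal_logic",
        "mmlu_professional_law",
        "mmlu_high_school_mathematics",
        "mmlu_abstract_algebra",
    ],
    num_fewshot=0,   # zero-shot accuracy
    limit=200,       # 200-sample subset of each task's test split
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```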
+ ### Results
+
+ This table summarizes the final benchmark scores for the `sweelol/kd-gemma3-pruned-dolly` model. It is compared against the original, un-tuned baseline model.
+
+ | Benchmark Task | Sweelol KD-Pruned | Baseline (Gemma-3-270m) |
+ | :--- | :--- | :--- |
+ | **Average MMLU (5 tasks)** | **23.98%** | **24.88%** |
+ | HellaSwag (Common Sense) | 33.00% | 43.50% |
+ | *MMLU Sub-task Breakdown:* | | |
+ | MMLU - High School Computer Science | 26.00% | 24.00% |
+ | MMLU - Formal Logic | 25.40% | 25.40% |
+ | MMLU - Professional Law | 25.00% | 27.00% |
+ | MMLU - High School Mathematics | 21.50% | 26.00% |
+ | MMLU - Abstract Algebra | 22.00% | 22.00% |
+
+ #### Summary of Findings
+
+ * **Mixed Performance:** The knowledge-distillation and pruning process produced a model with a mixed performance profile.
+ * **Strengths:** It showed a small improvement on **MMLU High School Computer Science** (26.00% vs. 24.00%), suggesting the fine-tuning was effective for that specific domain.
+ * **Weaknesses:** The model dropped noticeably on **HellaSwag** (33.00% vs. 43.50%) and **High School Mathematics** (21.50% vs. 26.00%) compared to the baseline. This indicates that the distillation process, while teaching the target task, may have eroded some of the model's broader, pre-trained common-sense and numerical reasoning abilities (an effect often described as an "alignment tax" or catastrophic forgetting).
+
+ *Full comparative results with other techniques can be found in our main research notebook linked at the top of this card.*
+
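The model name and `pruning` tag indicate that pruning was applied alongside distillation, but the actual recipe (criterion, sparsity level, targeted layers) is not documented here. As an illustration only, the sketch below applies unstructured L1 magnitude pruning to the linear layers of the base model using PyTorch's built-in utilities; the 30% sparsity value is an assumption.

```python
# Hypothetical magnitude-pruning sketch (not the recipe used for this checkpoint).
# Assumption: 30% unstructured L1 pruning of every nn.Linear weight matrix.
import torch.nn as nn
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m", torch_dtype="auto")

for module in model.modules():
    if isinstance(module, nn.Linear):
        # Zero out the 30% smallest-magnitude weights in this layer.
        prune.l1_unstructured(module, name="weight", amount=0.3)
        # Make the pruning permanent (remove the mask re-parametrization).
        prune.remove(module, "weight")
```

A pruned model prepared this way would then typically be fine-tuned, for example with the distillation objective sketched earlier, to recover quality on the target data.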
+ ### Description
+
+ Gemma is a family of lightweight, state-of-the-art open models from Google,
+ built from the same research and technology used to create the Gemini models.
+ Gemma 3 models are multimodal, handling text and image input and generating text
+ output, with open weights for both pre-trained variants and instruction-tuned
+ variants. Gemma 3 has a large, 128K context window, multilingual support in over
+ 140 languages, and is available in more sizes than previous versions. Gemma 3
+ models are well-suited for a variety of text generation and image understanding
+ tasks, including question answering, summarization, and reasoning. Their
+ relatively small size makes it possible to deploy them in environments with
+ limited resources such as laptops, desktops or your own cloud infrastructure,
+ democratizing access to state of the art AI models and helping foster innovation
+ for everyone.
+
+ ### Inputs and outputs
+
+ - **Input:**
+   - Text string, such as a question, a prompt, or a document to be summarized
+   - Images, normalized to 896 x 896 resolution and encoded to 256 tokens
+     each, for the 4B, 12B, and 27B sizes.
+   - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and
+     32K tokens for the 1B and 270M sizes.
+
+ - **Output:**
+   - Generated text in response to the input, such as an answer to a
+     question, analysis of image content, or a summary of a document
+   - Total output context up to 128K tokens for the 4B, 12B, and 27B sizes,
+     and 32K tokens for the 1B and 270M sizes per request, subtracting the
+     request input tokens
+
+ ### Citation
+
+ ```bibtex
+ @article{gemma_2025,
+     title={Gemma 3},
+     url={https://arxiv.org/abs/2503.19786},
+     publisher={Google DeepMind},
+     author={Gemma Team},
+     year={2025}
+ }
+ ```
+
+ ## Model Data
+
+ Data used for model training and how the data was processed.
+
+ ### Training Dataset
+
+ These models were trained on a dataset of text data that includes a wide variety
+ of sources. The 27B model was trained with 14 trillion tokens, the 12B model was
+ trained with 12 trillion tokens, the 4B model was trained with 4 trillion tokens,
+ the 1B with 2 trillion tokens, and the 270M with 6 trillion tokens. The
+ knowledge cutoff date for the training data was August 2024. Here are the key
+ components:
+
+ - Web Documents: A diverse collection of web text ensures the model is
+   exposed to a broad range of linguistic styles, topics, and vocabulary. The
+   training dataset includes content in over 140 languages.
+ - Code: Exposing the model to code helps it to learn the syntax and
+   patterns of programming languages, which improves its ability to generate
+   code and understand code-related questions.
+ - Mathematics: Training on mathematical text helps the model learn logical
+   reasoning, symbolic representation, and to address mathematical queries.
+ - Images: A wide range of images enables the model to perform image
+   analysis and visual data extraction tasks.

+ The combination of these diverse data sources is crucial for training a powerful
+ multimodal model that can handle a wide variety of different tasks and data
+ formats.