sweelol committed 059b1c1 (verified) · Parent(s): 63f4f09

Update README.md

Files changed (1): README.md (+165 -6)

README.md CHANGED
@@ -1,29 +1,188 @@
-
 ---
 license: apache-2.0
 tags:
 - sweelol-ai
 - text-generation
 - gemma
 - distillation
 - pruning
 - lora
 - prompt-tuning
 ---

- # {model_name}

 ## Model Description

 This model is part of the **Sweelol AI Hub** collection, resulting from experiments in efficient fine-tuning and knowledge distillation on the Gemma-3-270m architecture using the Databricks Dolly-15k dataset on Kaggle TPUs/GPUs.

- **Full Research Notebook & Benchmark Results:** [Link to your final Kaggle Benchmark notebook here]

 **Key Details:**
 * **Base Model:** `google/gemma-3-270m`
 * **Training Data:** Databricks Dolly-15k (subset)
- * **Fine-Tuning Method:** {method_description}
- * **Purpose:** {purpose}

- This is a placeholder README. A detailed model card with full results and usage instructions will be added shortly.

 ---
 license: apache-2.0
 tags:
 - sweelol-ai
+ - gemma
+ - google
 - text-generation
 - gemma
 - distillation
 - pruning
 - lora
 - prompt-tuning
+ - instruction-tuning
+ datasets:
+ - databricks/databricks-dolly-15k
+ language:
+ - en
+ base_model:
+ - google/gemma-3-270m
+ library_name: transformers
+ pipeline_tag: text-generation
+
 ---

+ # sweelol/kd-gemma3-pruned-dolly
+
+ This model is part of the **Sweelol AI Hub**, a research project focused on efficient fine-tuning of modern language models on Kaggle accelerators.
+
+ **Full Research Notebook & Benchmark Results:** [Coming soon]
+
+ It was produced through experiments in efficient fine-tuning, optimization strategies, and knowledge distillation on the Gemma-3-270m architecture, using the Databricks Dolly-15k dataset on Kaggle TPUs/GPUs.
+
+ - **Developed by:** Sweelol AI
+ - **Shared by:** Sweelol AI
+ - **Model type:** Causal Language Model
+ - **Language(s) (NLP):** English
+ - **License:** Apache 2.0
+ - **Base Model:** `google/gemma-3-270m`

 ## Model Description

 This model is part of the **Sweelol AI Hub** collection, resulting from experiments in efficient fine-tuning and knowledge distillation on the Gemma-3-270m architecture using the Databricks Dolly-15k dataset on Kaggle TPUs/GPUs.

 **Key Details:**
 * **Base Model:** `google/gemma-3-270m`
 * **Training Data:** Databricks Dolly-15k (subset)
+ * **Fine-Tuning Method:** `Knowledge Distillation` (see the sketch below)
+ * **Purpose:** `Knowledge Distillation on TPU`
+
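The distillation training code itself is not included on this card. Purely as an illustration, the following is a minimal, hypothetical sketch of the kind of objective the `Knowledge Distillation` method above refers to: a temperature-scaled KL term between teacher and student logits blended with the usual next-token cross-entropy. The temperature, loss weighting, and teacher/student pairing shown here are assumptions, not documented values.

```python
# Hypothetical sketch of a knowledge-distillation objective (not the actual training code).
# Assumptions: temperature=2.0 and alpha=0.5 are illustrative only.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Blend soft-target KL (teacher -> student) with hard-label cross-entropy."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: standard next-token cross-entropy against the labels.
    ce = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha * kd + (1.0 - alpha) * ce

# Tiny self-contained check with random tensors (batch=2, seq=4, vocab=8).
student = torch.randn(2, 4, 8)
teacher = torch.randn(2, 4, 8)
labels = torch.randint(0, 8, (2, 4))
print(distillation_loss(student, teacher, labels))
```

In practice the teacher would be a larger or un-pruned checkpoint and the student this 270M model, but that pairing is likewise an assumption rather than something documented on this card.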
+ ### Model Sources
+
+ - **Repository:** `https://huggingface.co/sweelol/kd-gemma3-pruned-dolly`
+ - **GitHub:**
+
+ ## Uses
+
+ ### How to Get Started with the Model
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ # For PEFT adapters (LoRA, Prompt Tuning) you will also need:
+ # from peft import PeftModel
+
+ # Repository ID of this specialized model
+ model_id = "sweelol/kd-gemma3-pruned-dolly"
+
+ # For full-model checkpoints such as this one:
+ model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # For PEFT models (LoRA, Prompt Tuning), load the base model first:
+ # base_model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m", torch_dtype="auto")
+ # model = PeftModel.from_pretrained(base_model, model_id)
+ # tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+ # Example usage:
+ prompt = "Instruction:\nWhat is the capital of France?\n\nResponse:\n"
+ inputs = tokenizer(prompt, return_tensors="pt")
+
+ generate_ids = model.generate(**inputs, max_new_tokens=50)
+ result = tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+
+ print(result)
+ ```
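The prompt template used during fine-tuning is not documented on this card. Since databricks-dolly-15k records carry `instruction`, `context`, and `response` fields, a small helper along these lines (a hypothetical sketch mirroring the Instruction/Response layout in the example above, with an assumed "Context:" section) can be used to build prompts:

```python
# Hypothetical prompt builder; the "Context:" section is an assumption based on the
# databricks-dolly-15k fields, not on documented training code.
def build_prompt(instruction: str, context: str = "") -> str:
    if context:
        return f"Instruction:\n{instruction}\n\nContext:\n{context}\n\nResponse:\n"
    return f"Instruction:\n{instruction}\n\nResponse:\n"

prompt = build_prompt("Summarize the passage.", "Gemma 3 270M is a small open model from Google.")
```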
+
+ ## Evaluation
+
+ ### Testing Data & Metrics
+
+ The model was evaluated with the `lm-evaluation-harness` on five diverse **MMLU** subsets (academic reasoning) and **HellaSwag** (common-sense reasoning). The primary metric is zero-shot accuracy, computed on a 200-sample subset of each task's test split.
+
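The exact harness invocation is not given on this card. Below is a minimal, hypothetical sketch of how such a run could look with the `lm_eval` Python API; the specific MMLU task names, batch size, and dtype are assumptions and may vary across harness versions.

```python
# Hypothetical evaluation sketch using lm-evaluation-harness (pip install lm-eval).
# The card only states: five MMLU subsets, HellaSwag, zero-shot, 200 samples per task.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=sweelol/kd-gemma3-pruned-dolly,dtype=auto",
    tasks=[
        "hellaswag",
        "mmlu_high_school_computer_science",
        "mmlu_formal_logic",
        "mmlu_professional_law",
        "mmlu_high_school_mathematics",
        "mmlu_abstract_algebra",
    ],
    num_fewshot=0,   # zero-shot accuracy
    limit=200,       # 200-sample subset of each task's test split
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```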
+ ### Results
+
+ This table summarizes the final benchmark scores for the `sweelol/kd-gemma3-pruned-dolly` model. It is compared against the original, un-tuned baseline model.
+
+ | Benchmark Task | Sweelol KD-Pruned | Baseline (Gemma-3-270m) |
+ | :--- | :--- | :--- |
+ | **Average MMLU (5 tasks)** | **23.98%** | **24.88%** |
+ | HellaSwag (Common Sense) | 33.00% | 43.50% |
+ | *MMLU Sub-task Breakdown:* | | |
+ | MMLU - High School Computer Science | 26.00% | 24.00% |
+ | MMLU - Formal Logic | 25.40% | 25.40% |
+ | MMLU - Professional Law | 25.00% | 27.00% |
+ | MMLU - High School Mathematics | 21.50% | 26.00% |
+ | MMLU - Abstract Algebra | 22.00% | 22.00% |
+
+ #### Summary of Findings
+
+ * **Mixed Performance:** The knowledge-distillation and pruning process produced a model with a mixed performance profile.
+ * **Strengths:** It showed a small improvement on **MMLU High School Computer Science** (26.00% vs. 24.00%), suggesting the fine-tuning was effective for that specific domain.
+ * **Weaknesses:** The model dropped noticeably on **HellaSwag** (33.00% vs. 43.50%) and **High School Mathematics** (21.50% vs. 26.00%) compared to the baseline. This indicates that the distillation process, while teaching the target task, may have eroded some of the model's broader, pre-trained common-sense and numerical reasoning abilities (an effect often described as an "alignment tax" or catastrophic forgetting).
+
+ *Full comparative results with other techniques can be found in our main research notebook linked at the top of this card.*
+
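The model name and `pruning` tag indicate that pruning was applied alongside distillation, but the actual recipe (criterion, sparsity level, targeted layers) is not documented here. As an illustration only, the sketch below applies unstructured L1 magnitude pruning to the linear layers of the base model using PyTorch's built-in utilities; the 30% sparsity value is an assumption.

```python
# Hypothetical magnitude-pruning sketch (not the recipe used for this checkpoint).
# Assumption: 30% unstructured L1 pruning of every nn.Linear weight matrix.
import torch.nn as nn
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-3-270m", torch_dtype="auto")

for module in model.modules():
    if isinstance(module, nn.Linear):
        # Zero out the 30% smallest-magnitude weights in this layer.
        prune.l1_unstructured(module, name="weight", amount=0.3)
        # Make the pruning permanent (remove the mask re-parametrization).
        prune.remove(module, "weight")
```

A pruned model prepared this way would then typically be fine-tuned, for example with the distillation objective sketched earlier, to recover quality on the target data.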
+ ### Description
+
+ Gemma is a family of lightweight, state-of-the-art open models from Google,
+ built from the same research and technology used to create the Gemini models.
+ Gemma 3 models are multimodal, handling text and image input and generating text
+ output, with open weights for both pre-trained variants and instruction-tuned
+ variants. Gemma 3 has a large, 128K context window, multilingual support in over
+ 140 languages, and is available in more sizes than previous versions. Gemma 3
+ models are well-suited for a variety of text generation and image understanding
+ tasks, including question answering, summarization, and reasoning. Their
+ relatively small size makes it possible to deploy them in environments with
+ limited resources such as laptops, desktops or your own cloud infrastructure,
+ democratizing access to state of the art AI models and helping foster innovation
+ for everyone.
+
+ ### Inputs and outputs
+
+ - **Input:**
+   - Text string, such as a question, a prompt, or a document to be summarized
+   - Images, normalized to 896 x 896 resolution and encoded to 256 tokens
+     each, for the 4B, 12B, and 27B sizes.
+   - Total input context of 128K tokens for the 4B, 12B, and 27B sizes, and
+     32K tokens for the 1B and 270M sizes.
+
+ - **Output:**
+   - Generated text in response to the input, such as an answer to a
+     question, analysis of image content, or a summary of a document
+   - Total output context up to 128K tokens for the 4B, 12B, and 27B sizes,
+     and 32K tokens for the 1B and 270M sizes per request, subtracting the
+     request input tokens
+
+ ### Citation
+
+ ```bibtex
+ @article{gemma_2025,
+     title={Gemma 3},
+     url={https://arxiv.org/abs/2503.19786},
+     publisher={Google DeepMind},
+     author={Gemma Team},
+     year={2025}
+ }
+ ```
+
+ ## Model Data
+
+ Data used for model training and how the data was processed.
+
+ ### Training Dataset
+
+ These models were trained on a dataset of text data that includes a wide variety
+ of sources. The 27B model was trained with 14 trillion tokens, the 12B model was
+ trained with 12 trillion tokens, the 4B model was trained with 4 trillion tokens,
+ the 1B with 2 trillion tokens, and the 270M with 6 trillion tokens. The
+ knowledge cutoff date for the training data was August 2024. Here are the key
+ components:
+
+ - Web Documents: A diverse collection of web text ensures the model is
+   exposed to a broad range of linguistic styles, topics, and vocabulary. The
+   training dataset includes content in over 140 languages.
+ - Code: Exposing the model to code helps it to learn the syntax and
+   patterns of programming languages, which improves its ability to generate
+   code and understand code-related questions.
+ - Mathematics: Training on mathematical text helps the model learn logical
+   reasoning, symbolic representation, and to address mathematical queries.
+ - Images: A wide range of images enables the model to perform image
+   analysis and visual data extraction tasks.

+ The combination of these diverse data sources is crucial for training a powerful
+ multimodal model that can handle a wide variety of different tasks and data
+ formats.