---
language: en
license: apache-2.0
tags:
- pretraining
- small-language-model
- educational
- causal-lm
datasets:
- HuggingFaceTB/smollm-corpus
---

# Quark-50m-Instruct

**Quark-50m-Instruct** is a causal (decoder-only) language model with approximately **50 million parameters**, trained from scratch on 5 billion tokens from the `smollm-corpus` dataset. It is designed to be lightweight, fast, and suitable for resource-constrained environments (e.g., an RTX 3070 with 8 GB VRAM), while retaining good text understanding and generation capabilities for its size.

The name *Quark* reflects its compact and elementary nature; it is intended for on-device applications, lightweight conversational assistants, or as a base for domain-specific fine-tuning.

## Model Details

| Property          | Value                                    |
|-------------------|------------------------------------------|
| Architecture      | Transformer decoder-only (SmolLM-style)  |
| Parameters        | ~50 M (effective, with weight tying)     |
| Context length    | 2048 tokens                              |
| Vocabulary size   | 49,152 (cosmo2 tokenizer)                |
| Model identifier  | `OvercastLab/Quark-50m-Instruct`         |
| Primary language  | English (training data)                  |

## Architecture Details

The model follows the style of **SmolLM** and **Qwen2.5**, with the following characteristics:

- **Grouped-Query Attention (GQA)** – ratio `n_heads / n_kv_heads = 3` to reduce the KV-cache footprint (see the sketch after the table below).
- **SwiGLU** activation in the feed-forward networks, with intermediate dimension `d_ff = 1024`.
- **RMSNorm** applied before attention and the FFN (pre-normalization).
- **Rotary Positional Embeddings (RoPE)** with `theta = 10,000`.
- **Weight tying** – the input embedding and output projection share weights.
- **Bias** – only on the QKV projections (`qkv_bias = True`) for better numerical stability.

| Component    | Configuration                        |
|--------------|--------------------------------------|
| `d_model`    | 384                                  |
| `n_layers`   | 24                                   |
| `n_heads`    | 6                                    |
| `n_kv_heads` | 2                                    |
| `head_dim`   | 64                                   |
| `d_ff`       | 1024                                 |
| `dropout`    | 0.0 (no dropout during pretraining)  |

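With `n_heads = 6` and `n_kv_heads = 2`, each key/value head is shared by a group of 3 query heads. The snippet below is a minimal, self-contained PyTorch sketch of that GQA pattern using the dimensions from the table; it is illustrative only (the projection names and toy input are placeholders) and is not this model's actual attention implementation.

```python
# Minimal GQA sketch with d_model=384, n_heads=6, n_kv_heads=2, head_dim=64.
# Illustrative only; not the model's own attention code.
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, n_heads, n_kv_heads, head_dim = 384, 6, 2, 64
group_size = n_heads // n_kv_heads  # 3 query heads share each KV head

batch, seq = 1, 16
x = torch.randn(batch, seq, d_model)

# QKV projections carry a bias, matching `qkv_bias = True` above
q_proj = nn.Linear(d_model, n_heads * head_dim, bias=True)
k_proj = nn.Linear(d_model, n_kv_heads * head_dim, bias=True)
v_proj = nn.Linear(d_model, n_kv_heads * head_dim, bias=True)

q = q_proj(x).view(batch, seq, n_heads, head_dim).transpose(1, 2)     # (1, 6, 16, 64)
k = k_proj(x).view(batch, seq, n_kv_heads, head_dim).transpose(1, 2)  # (1, 2, 16, 64)
v = v_proj(x).view(batch, seq, n_kv_heads, head_dim).transpose(1, 2)

# Repeat each KV head so it serves its group of 3 query heads
k = k.repeat_interleave(group_size, dim=1)  # (1, 6, 16, 64)
v = v.repeat_interleave(group_size, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 6, 16, 64])
```
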
## Training Data

The model was pretrained on **5 billion tokens** sampled from the [`HuggingFaceTB/smollm-corpus`](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus) dataset with the following distribution:

| Sub-dataset         | Percentage | Tokens (billions) | Main content                                        |
|---------------------|------------|-------------------|-----------------------------------------------------|
| `cosmopedia-v2`     | 60%        | 3.0               | Synthetic textbooks, educational articles, stories  |
| `fineweb-edu-dedup` | 40%        | 2.0               | Web pages filtered for educational quality          |

Data were tokenized using the `HuggingFaceTB/cosmo2-tokenizer` (vocabulary size 49,152), with the EOS token appended to each document. Training sequences have a fixed length of **2048** tokens (with packing).

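To make the preprocessing concrete, here is a minimal sketch of the tokenize-and-pack step described above (tokenize each document, append the EOS token, concatenate, and cut into fixed 2048-token sequences). The `pack` helper is hypothetical and assumes the tokenizer defines an EOS token; it is not the actual training pipeline.

```python
# Sketch of EOS-terminated tokenization and sequence packing (illustrative only).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/cosmo2-tokenizer")
seq_len = 2048

def pack(documents):
    """Concatenate EOS-terminated documents into fixed-length token sequences."""
    buffer = []
    for doc in documents:
        buffer.extend(tokenizer(doc)["input_ids"] + [tokenizer.eos_token_id])
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]

docs = ["A short example document.", "Another document. " * 2000]
for seq in pack(docs):
    print(len(seq))  # always 2048
```
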
## Training Procedure

- **Framework**: PyTorch with `torch.compile` and `GradScaler` for mixed precision.
- **Precision**: `bfloat16` (Ampere RTX 3070).
- **Optimizer**: AdamW (`β₁=0.9`, `β₂=0.95`, weight decay = 0.1).
- **Learning rate**: `3e-4` with linear warmup for 1,000 steps, then cosine decay to `3e-5` (see the sketch after this list).
- **Effective batch size**: 64 sequences × 2048 tokens = **131,072 tokens per step**.
  - Micro-batch of 4 sequences, with gradient accumulation over 16 steps.
- **Gradient clipping**: 1.0.
- **Total steps**: approximately 38,000 (to reach 5B tokens).

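The learning-rate schedule can be written out explicitly. The sketch below reproduces the numbers stated above (peak `3e-4`, 1,000 warmup steps, cosine decay to `3e-5` over roughly 38,000 steps); it is a plain reimplementation of the stated schedule, not the training code itself.

```python
# Sketch of the stated LR schedule: linear warmup to 3e-4 over 1,000 steps,
# then cosine decay to 3e-5 by ~38,000 steps. Illustrative only.
import math

max_lr, min_lr = 3e-4, 3e-5
warmup_steps, total_steps = 1_000, 38_000

def lr_at(step: int) -> float:
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps  # linear warmup
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))  # cosine decay

print(f"{lr_at(0):.1e}, {lr_at(999):.1e}, {lr_at(38_000):.1e}")
# 3.0e-07, 3.0e-04, 3.0e-05
```
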
## Usage

You can load and use the model directly with the `transformers` library:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OvercastLab/Quark-50m-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Greedy generation of up to 100 new tokens from a short prompt
input_text = "The theory of relativity"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0]))
```
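By default `generate` decodes greedily (unless the model's generation config says otherwise). Continuing from the snippet above, sampling can be enabled with the standard `do_sample`, `temperature`, and `top_p` arguments; the values here are illustrative, not tuned recommendations for this model:

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,    # sample instead of greedy decoding
    temperature=0.8,   # illustrative value
    top_p=0.95,        # illustrative value
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```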