ThingsAI committed
Commit 9136f06 · verified · 1 Parent(s): d6f2218

Upload folder using huggingface_hub

README.md CHANGED
@@ -1,87 +1,57 @@
  ---
- language: en
- license: apache-2.0
  tags:
- - pretraining
- - small-language-model
- - educational
- - causal-lm
- datasets:
- - HuggingFaceTB/smollm-corpus
  ---

- # Quark-50m-Instruct

- **Quark-50m-Instruct** is a causal (decoder-only) language model with approximately **50 million parameters**, trained from scratch on 5 billion tokens from the `smollm-corpus` dataset. It is designed to be lightweight, fast, and suitable for resource-constrained environments (e.g., RTX 3070, 8 GB VRAM), while retaining good text understanding and generation capabilities.

- The name *Quark* reflects its compact and elementary nature, ideal for on-device applications, lightweight conversational assistants, or as a base for domain-specific fine-tuning.

- ## Model Details
-
- | Property | Value |
- |------------------------|--------------------------------------|
- | Architecture | Transformer decoder-only (SmolLM-style) |
- | Parameters | ~50 M (effective, with weight tying) |
- | Context length | 2048 tokens |
- | Vocabulary size | 49,152 (cosmo2 tokenizer) |
- | Model identifier | `OvercastLab/Quark-50m-Instruct` |
- | Primary language | English (training data) |
-
- ## Architecture Details
-
- The model follows the style of **SmolLM** and **Qwen2.5** with the following characteristics:
-
- - **Grouped-Query Attention (GQA)** – ratio `n_heads / n_kv_heads = 3` to reduce KV cache footprint.
- - **SwiGLU** activation in feed-forward networks, with intermediate dimension `d_ff = 1024`.
- - **RMSNorm** applied before attention and FFN (pre-normalization).
- - **Rotary Positional Embeddings (RoPE)** with `theta = 10,000`.
- - **Weight tying** – input embedding and output projection share weights.
- - **Bias** – only on QKV projections (`qkv_bias = True`) for better numerical stability.

- | Component | Configuration |
- |----------------------|---------------------------------------|
- | `d_model` | 384 |
- | `n_layers` | 24 |
- | `n_heads` | 6 |
- | `n_kv_heads` | 2 |
- | `head_dim` | 64 |
- | `d_ff` | 1024 |
- | `dropout` | 0.0 (no dropout during pretraining) |
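
As a rough sanity check on the "~50 M (effective, with weight tying)" figure in the removed card, the component table above is enough for a back-of-the-envelope parameter count (a minimal sketch; variable names are illustrative and small bias terms are ignored):

```python
# Rough parameter count from the table above (d_model=384, n_layers=24,
# n_heads=6, n_kv_heads=2, head_dim=64, d_ff=1024, vocab=49152, tied embeddings).
d_model, n_layers, n_heads, n_kv_heads, head_dim, d_ff, vocab = 384, 24, 6, 2, 64, 1024, 49152

embed = vocab * d_model                          # tied with the LM head, counted once
attn = d_model * (n_heads * head_dim)            # q_proj
attn += 2 * d_model * (n_kv_heads * head_dim)    # k_proj + v_proj (GQA: only 2 KV heads)
attn += (n_heads * head_dim) * d_model           # o_proj
mlp = 3 * d_model * d_ff                         # SwiGLU: gate, up, down projections
norms = 2 * d_model                              # two RMSNorm weights per block

total = embed + n_layers * (attn + mlp + norms) + d_model  # plus the final RMSNorm
print(f"{total / 1e6:.1f} M parameters")                   # prints ~56.6 M
```

This lands closer to 56–57 M, consistent with the ~113 MB bfloat16 `model.safetensors` below (about 2 bytes per parameter), so the card's "~50 M" reads as a loose round figure.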

- ## Training Data

- The model was pretrained on **5 billion tokens** sampled from the [`HuggingFaceTB/smollm-corpus`](https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus) dataset with the following distribution:

- | Sub-dataset | Percentage | Tokens (billions) | Main content |
- |---------------------------|------------|-------------------|----------------------------------------------|
- | `cosmopedia-v2` | 60% | 3.0 | Synthetic textbooks, educational articles, stories |
- | `fineweb-edu-dedup` | 40% | 2.0 | Web pages filtered for educational quality |

- Data were tokenized using the `HuggingFaceTB/cosmo2-tokenizer` (vocabulary size 49,152), with the EOS token appended to each document. Training sequences have a fixed length of **2048** tokens (with packing).
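
The tokenize, append-EOS, and pack step described above is the standard recipe; a minimal illustration (not code from the original repo, and the toy documents are stand-ins) looks like this:

```python
from transformers import AutoTokenizer

# Minimal packing sketch: tokenize each document, append EOS as a boundary,
# concatenate everything, then cut the stream into fixed 2048-token sequences.
tok = AutoTokenizer.from_pretrained("HuggingFaceTB/cosmo2-tokenizer")
docs = ["First toy document.", "Second toy document."]  # stand-ins for corpus text

stream = []
for doc in docs:
    stream.extend(tok(doc, add_special_tokens=False)["input_ids"])
    stream.append(tok.eos_token_id)  # assumes the tokenizer defines an EOS token, as this one does

seq_len = 2048
# With the toy docs above this yields no full sequence; real corpus text fills many.
sequences = [stream[i:i + seq_len] for i in range(0, len(stream) - seq_len + 1, seq_len)]
```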

- ## Training Procedure

- - **Framework**: PyTorch with `torch.compile` and `GradScaler` for mixed precision.
- - **Precision**: `bfloat16` (Ampere RTX 3070).
- - **Optimizer**: AdamW (`β₁=0.9`, `β₂=0.95`, weight decay = 0.1).
- - **Learning rate**: `3e-4` with linear warmup for 1,000 steps, then cosine decay to `3e-5`.
- - **Effective batch size**: 64 sequences × 2048 tokens = **131,072 tokens per step**.
- - Micro-batch: 4 sequences, gradient accumulation over 16 steps.
- - **Gradient clipping**: 1.0.
- - **Total steps**: approximately 38,000 (to reach 5B tokens).
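
The step count is consistent with the batch math above (5e9 tokens / 131,072 tokens per step ≈ 38,000 steps). A minimal sketch of the stated schedule, assuming the warmup and decay described in the list (this is not the original training code):

```python
import math

# Linear warmup to 3e-4 over 1,000 steps, then cosine decay to 3e-5 by step ~38,000.
peak_lr, min_lr, warmup_steps, total_steps = 3e-4, 3e-5, 1_000, 38_000

def lr_at(step: int) -> float:
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for s in (0, 500, 1_000, 19_500, 38_000):
    print(s, f"{lr_at(s):.2e}")  # 0, mid-warmup, peak, mid-decay, floor
```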

- ## Usage

- You can load and use the model directly with the `transformers` library:

- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- model_name = "OvercastLab/Quark-50m-Instruct"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForCausalLM.from_pretrained(model_name)
-
- input_text = "The theory of relativity"
- inputs = tokenizer(input_text, return_tensors="pt")
- outputs = model.generate(**inputs, max_new_tokens=100)
- print(tokenizer.decode(outputs[0]))
- ```

  ---
+ library_name: transformers
+ model_name: sft_conv
  tags:
+ - generated_from_trainer
+ - trl
+ - sft
+ licence: license
  ---

+ # Model Card for sft_conv

+ This model is a fine-tuned version of [None](https://huggingface.co/None).
+ It has been trained using [TRL](https://github.com/huggingface/trl).

+ ## Quick start

+ ```python
+ from transformers import pipeline
+
+ question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
+ generator = pipeline("text-generation", model="None", device="cuda")
+ output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
+ print(output["generated_text"])
+ ```
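
The `model="None"` in the quick start is a TRL placeholder and needs to be replaced with this repository's id (or a local path) before running. Equivalently, a minimal sketch without `pipeline`, assuming the SFT run saved a chat template with the tokenizer (the repo path below is a placeholder):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "path/or/repo-id-of-this-model"  # placeholder; substitute the real id
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

messages = [{"role": "user", "content": "Which would you choose: the past or the future?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```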
 
 
 
 
+ ## Training procedure
+
+ This model was trained with SFT.

+ ### Framework versions

+ - TRL: 1.2.0
+ - Transformers: 5.6.1
+ - Pytorch: 2.4.1+cu124
+ - Datasets: 4.8.4
+ - Tokenizers: 0.22.2

+ ## Citations

+ Cite TRL as:
+
+ ```bibtex
+ @software{vonwerra2020trl,
+ title = {{TRL: Transformers Reinforcement Learning}},
+ author = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},
+ license = {Apache-2.0},
+ url = {https://github.com/huggingface/trl},
+ year = {2020}
+ }
+ ```
config.json CHANGED
@@ -1,30 +1,34 @@
  {
- "architectures": ["LlamaForCausalLM"],
- "model_type": "llama",
- "vocab_size": 49152,
  "hidden_size": 384,
  "intermediate_size": 1024,
- "num_hidden_layers": 24,
  "num_attention_heads": 6,
  "num_key_value_heads": 2,
- "head_dim": 64,
- "hidden_act": "silu",
- "max_position_embeddings": 2048,
- "initializer_range": 0.02,
  "rms_norm_eps": 1e-05,
- "rope_theta": 10000.0,
- "rope_scaling": null,
  "rope_interleaved": false,
- "attention_bias": true,
- "attention_dropout": 0.0,
- "mlp_bias": false,
  "tie_word_embeddings": true,
- "torch_dtype": "bfloat16",
- "bos_token_id": 1,
- "eos_token_id": 2,
- "pad_token_id": 2,
- "use_cache": true,
- "pretraining_tp": 1,
- "is_llama_config": true,
- "transformers_version": "4.42.3"
  }

  {
+ "architectures": [
+ "LlamaForCausalLM"
+ ],
+ "attention_bias": true,
+ "attention_dropout": 0.0,
+ "bos_token_id": 0,
+ "dtype": "bfloat16",
+ "eos_token_id": 0,
+ "head_dim": 64,
+ "hidden_act": "silu",
  "hidden_size": 384,
+ "initializer_range": 0.02,
  "intermediate_size": 1024,
+ "is_llama_config": true,
+ "max_position_embeddings": 2048,
+ "mlp_bias": false,
+ "model_type": "llama",
  "num_attention_heads": 6,
+ "num_hidden_layers": 24,
  "num_key_value_heads": 2,
+ "pad_token_id": 0,
+ "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_interleaved": false,
+ "rope_parameters": {
+ "rope_theta": 10000.0,
+ "rope_type": "default"
+ },
  "tie_word_embeddings": true,
+ "transformers_version": "5.6.1",
+ "use_cache": false,
+ "vocab_size": 49152
  }
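
Beyond the alphabetical reordering, the notable changes in this file are the special-token ids (bos/eos/pad move from 1/2/2 to 0/0/0), `use_cache` switching to `false`, `torch_dtype` becoming `dtype`, and the RoPE settings nesting under `rope_parameters`. A minimal sketch of confirming what `transformers` actually loads (the path is a placeholder for this repo or a local clone):

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("path/or/repo-id-of-this-model")  # placeholder
print(cfg.model_type, cfg.num_hidden_layers, cfg.num_key_value_heads)  # llama 24 2
print(cfg.bos_token_id, cfg.eos_token_id, cfg.pad_token_id)            # 0 0 0
```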
generation_config.json ADDED
@@ -0,0 +1,11 @@
+ {
+ "_from_model_config": true,
+ "bos_token_id": 0,
+ "eos_token_id": [
+ 0,
+ 2
+ ],
+ "pad_token_id": 0,
+ "transformers_version": "5.6.1",
+ "use_cache": true
+ }
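
The new generation config registers two end-of-sequence ids (0 and 2), so `generate()` stops on either token. A minimal sketch of inspecting it (path is a placeholder):

```python
from transformers import GenerationConfig

gen_cfg = GenerationConfig.from_pretrained("path/or/repo-id-of-this-model")  # placeholder
print(gen_cfg.eos_token_id)  # [0, 2]
print(gen_cfg.pad_token_id)  # 0
```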
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:8ab7ac8c0adebf58cf54da3fa735f0a12e74fcaf7a404837825723eb355df2b3
- size 113346352

  version https://git-lfs.github.com/spec/v1
+ oid sha256:431dbd275b83cb41bd28cdd1bb6d9c30e87ed1c5da31957e70867c9cc095efa7
+ size 113367352
tokenizer_config.json CHANGED
@@ -24,9 +24,10 @@
  "<jupyter_script>",
  "<empty_output>"
  ],
- "is_local": false,
  "model_max_length": 1000000000000000019884624838656,
- "pad_token": null,
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>",
  "vocab_size": 49152

  "<jupyter_script>",
  "<empty_output>"
  ],
+ "is_local": true,
+ "local_files_only": false,
  "model_max_length": 1000000000000000019884624838656,
+ "pad_token": "<|endoftext|>",
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>",
  "vocab_size": 49152
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:7111b6742ad6cc5ab900057295f3d9c66f5ee720c6c73b73b0f9abad6b7f195c
+ size 5304