mamounyosef committed on
Commit 9031dc1 · verified · 1 Parent(s): 6988873

Update README.md

Files changed (1)
  1. README.md +17 -7
README.md CHANGED

@@ -9,7 +9,9 @@ tags:
 - commit-message-generation
 - code-summarization
 - generated_from_trainer
-license: apache-2.0
+license: cc-by-nc-4.0
+datasets:
+- Maxscha/commitbench
 language:
 - en
 ---
@@ -33,7 +35,8 @@ This model is a **QLoRA (4-bit quantized LoRA)** adapter trained on the Qwen2.5-
 - **Developed by:** Mamoun Yosef
 - **Model type:** Causal Language Model (Decoder-only Transformer) with LoRA adapters
 - **Language(s):** English
-- **License:** Apache 2.0
+- **License:** CC BY-NC 4.0 (non-commercial for this trained adapter)
+- **Base model license:** Apache 2.0 (`Qwen/Qwen2.5-Coder-0.5B`)
 - **Finetuned from model:** Qwen/Qwen2.5-Coder-0.5B
 
 ### Model Sources
@@ -41,6 +44,12 @@ This model is a **QLoRA (4-bit quantized LoRA)** adapter trained on the Qwen2.5-
 - **Repository:** [commit-message-llm](https://github.com/mamounyosef/commit-message-llm)
 - **Base Model:** [Qwen/Qwen2.5-Coder-0.5B](https://huggingface.co/Qwen/Qwen2.5-Coder-0.5B)
 
+## License and Usage
+
+- This adapter was trained using **CommitBench** (`Maxscha/commitbench`), licensed **CC BY-NC 4.0**.
+- This trained adapter is therefore **non-commercial use only**.
+- The base model (`Qwen/Qwen2.5-Coder-0.5B`) remains licensed under **Apache-2.0**.
+
 ## Uses
 
 ### Direct Use
@@ -88,6 +97,7 @@ Can be integrated into:
 - Diffs from non-programming languages
 - Extremely large diffs (>8000 characters)
 - Commit messages requiring deep domain knowledge beyond code structure
+- Commercial usage of this trained adapter
 
 ## Bias, Risks, and Limitations
 
@@ -174,9 +184,9 @@ print(message)
 **Preprocessing:**
 - Removed trivial messages (fix, update, wip, etc.)
 - Filtered out reference-only commits (fix #123)
-- Removed placeholder tokens (<HASH>, <URL>)
+- Removed placeholder tokens (`<HASH>`, `<URL>`)
 - Kept diffs between 50-8000 characters
-- Required messages with semantic content (3 words)
+- Required messages with semantic content (>=3 words)
 
 **Final dataset sizes:**
 - Training: 120,000 samples
@@ -197,10 +207,10 @@ Prompt tokens (diff + separator) are masked with label `-100` so loss is compute
 
 #### Preprocessing
 
-1. Normalize newlines (CRLF LF)
+1. Normalize newlines (CRLF -> LF)
 2. Tokenize diff + separator + message
 3. Mask prompt labels to `-100`
-4. Truncate to max_length=512 tokens
+4. Truncate to `max_length=512` tokens
 5. Append EOS token to target
 
 #### Training Hyperparameters
@@ -257,7 +267,7 @@ Prompt tokens (diff + separator) are masked with label `-100` so loss is compute
 - **Loss:** Cross-entropy loss on commit message tokens
 - **Perplexity:** exp(loss), measures model confidence
 - Lower perplexity = better prediction quality
-- Perplexity 17 is strong for this task
+- Perplexity ~17 is strong for this task
 
 ### Results
273