windlx committed on
Commit 2df0e60 · verified · 1 Parent(s): 8a6094e

Add model card

Files changed (1)
  1. README.md +68 -0
README.md ADDED
@@ -0,0 +1,68 @@
---
language:
- en
- zh
license: mit
tags:
- url-classification
- binary-classification
- autoresearch
datasets:
- iowacat
metrics:
- accuracy: 0.9962
model_index:
- name: url-classifier
  results:
  - task:
      type: text-classification
      name: URL Binary Classification
    dataset:
      type: iowacat
      name: URL Classification Dataset
    metrics:
    - type: accuracy
      value: 0.9962
---

# URL Classifier — Autoresearch

Binary classifier that predicts whether a URL is a **list page (A)** or a **detail page (B)**.

## Model Details

- **Architecture**: Custom transformer (Autoresearch framework)
- **Parameters**: ~161M
- **Depth**: 4 layers
- **Model dim**: 384
- **Vocab**: cl100k_base (100,277 tokens)
- **Max seq len**: 64
- **Training time**: 5 minutes on an RTX 4060 Laptop GPU
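
The ~161M figure can look large for a 4-layer, 384-wide model. A rough count suggests that most of it would sit in vocabulary-sized tables rather than in the transformer blocks. The sketch below is back-of-the-envelope arithmetic, not a description of the actual checkpoint: it assumes an untied input embedding and output head, two value-embedding tables (one per alternating layer), a 4x MLP expansion, and no biases. None of these details are stated in this card.

```python
# Back-of-the-envelope parameter count. Only VOCAB, DIM and DEPTH come from
# the model card; every other choice below is an assumption.
VOCAB = 100_277   # cl100k_base vocabulary size
DIM = 384         # model dimension
DEPTH = 4         # transformer layers

# Assumption: separate (untied) token embedding and output projection.
token_embedding = VOCAB * DIM
output_head = VOCAB * DIM

# Assumption: value embeddings on every other layer, each a VOCAB x DIM table.
value_embeddings = (DEPTH // 2) * VOCAB * DIM

# Assumption: per layer, 4*DIM^2 for attention (Q, K, V, output projections)
# plus 8*DIM^2 for an MLP with 4x expansion, no biases.
blocks = DEPTH * (4 * DIM * DIM + 8 * DIM * DIM)

total = token_embedding + output_head + value_embeddings + blocks
table_share = (total - blocks) / total

print(f"total parameters: {total / 1e6:.1f}M")
print(f"share in vocabulary-sized tables: {table_share:.1%}")
```

Under these assumptions the total comes to about 161.1M, which is consistent with the stated figure but does not confirm the breakdown.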

## Training

Trained with the Autoresearch framework, which combines:
- **Muon** optimizer for attention/MLP layers
- **AdamW** for embeddings
- **Sliding window attention** (SSSL pattern)
- **Value embeddings** on alternating layers

Final loss: ~0.002 | Accuracy: **99.62%**
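
To make the sliding window bullet concrete, here is a minimal sketch of per-layer causal masks for an `SSSL` pattern. It assumes that `S` means a short window of half the context (32 of 64 tokens), that `L` means full causal attention, and that each letter maps to one layer. The card does not define the letters or the window sizes, and `window_mask` and `WINDOW_SIZES` are hypothetical names.

```python
# Sketch of causal sliding-window masks for an "SSSL" layer pattern.
# Assumption: S = short window (half the context), L = full causal context.
MAX_SEQ_LEN = 64
PATTERN = "SSSL"  # one letter per layer for the 4-layer model
WINDOW_SIZES = {"S": MAX_SEQ_LEN // 2, "L": MAX_SEQ_LEN}  # hypothetical sizes


def window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    A query sees itself and at most window - 1 earlier positions.
    """
    return [
        [0 <= q - k < window for k in range(seq_len)]
        for q in range(seq_len)
    ]


masks = [window_mask(MAX_SEQ_LEN, WINDOW_SIZES[letter]) for letter in PATTERN]

for layer, (letter, mask) in enumerate(zip(PATTERN, masks)):
    visible = sum(mask[-1])  # keys visible to the last query position
    print(f"layer {layer} ({letter}): last token attends to {visible} positions")
```

With a 64-token context the short windows only matter for URLs longer than 32 tokens; shorter inputs see the same keys in every layer.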

## Usage

```python
from src.prepare import Tokenizer

# Load the tokenizer from its default directory
tokenizer = Tokenizer.from_directory()
# Encode a URL into token ids
ids = tokenizer.encode("https://example.com/product/123")
# Pass ids through the model and its class_head to get the class prediction
```
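
Because the model has a maximum sequence length of 64, token ids for long URLs have to be cut down before they reach the model. A minimal sketch of that step follows. The helper name `fit_to_length`, the choice to keep the first 64 tokens, and the padding id of 0 are all assumptions; the card does not say how the training pipeline truncates or pads.

```python
MAX_SEQ_LEN = 64  # from the model card


def fit_to_length(ids: list[int], max_len: int = MAX_SEQ_LEN, pad_id: int = 0) -> list[int]:
    """Truncate or right-pad a list of token ids to exactly max_len entries.

    pad_id=0 is a placeholder, not the real padding token of this model.
    """
    clipped = ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


# Made-up token ids for illustration.
short = fit_to_length([11, 22, 33])
long = fit_to_length(list(range(100)))
print(len(short), len(long))
```

If the real pipeline keeps the end of the URL instead of the start, or uses a dedicated padding token, the helper would need to change accordingly.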

## Class Labels

| Label | Meaning |
|-------|---------|
| 0 | A — List page |
| 1 | B — Detail page |
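
A small helper can turn the two outputs of the classification head into the labels in this table. This is a sketch only: it assumes the head returns two raw logits ordered as label 0, then label 1, and `decode_prediction` is a hypothetical name, not part of the repository.

```python
import math

# Label ids and meanings as given in the Class Labels table.
LABELS = {0: "A (list page)", 1: "B (detail page)"}


def decode_prediction(logits: list[float]) -> tuple[int, str, float]:
    """Map two raw logits to (label id, label name, softmax probability)."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    probs = [e / sum(exps) for e in exps]
    label_id = max(range(len(probs)), key=probs.__getitem__)
    return label_id, LABELS[label_id], probs[label_id]


# Made-up logits for illustration, not real model output.
label_id, name, confidence = decode_prediction([-1.2, 3.4])
print(label_id, name, round(confidence, 3))
```

Returning the probability alongside the label makes it easy to flag low-confidence URLs for manual review.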