windlx committed on
Commit 095e819 · verified · 1 Parent(s): e9073c6

Add model card

Files changed (1)
  1. README.md +36 -27
README.md CHANGED
@@ -7,28 +7,29 @@ tags:
  - url-classification
  - binary-classification
  - autoresearch
- datasets:
- - iowacat
+ - multi-domain
  metrics:
- - accuracy: 0.9962
+ - accuracy
  model_index:
- - name: url-classifier
+ - name: url-classifier-v2
  results:
  - task:
  type: text-classification
- name: URL Binary Classification
+ name: URL Binary Classification (Multi-Domain)
  dataset:
- type: iowacat
- name: URL Classification Dataset
+ type: "synthetic-diverse (26 domains)"
+ name: URL Classification Diverse Dataset
  metrics:
  - type: accuracy
- value: 0.9962
+ value: 1.0000
  ---
 
- # URL Classifier — Autoresearch
+ # URL Classifier v2 — Autoresearch (Multi-Domain)
 
  Binary classifier that predicts whether a URL is a **list page (A)** or a **detail page (B)**.
 
+ Trained on **26 diverse domains** across e-commerce, recruitment, news, social, video, travel, education, and tech documentation — significantly improved generalization over the v1 single-domain model.
+
  ## Model Details
 
  - **Architecture**: Custom transformer (Autoresearch framework)
@@ -37,32 +38,40 @@ Binary classifier that predicts whether a URL is a **list page (A)** or a **deta
  - **Model dim**: 384
  - **Vocab**: cl100k_base (100,277 tokens)
  - **Max seq len**: 64
- - **Training time**: 5 minutes on RTX 4060 Laptop
-
- ## Training
+ - **Training**: 30 min on RTX 4060 Laptop
+ - **Training samples**: 2,600 (A=1,300, B=1,300)
+ - **Training accuracy**: 100%
 
- Trained with the Autoresearch framework, which combines:
- - **Muon** optimizer for attention/MLP layers
- - **AdamW** for embeddings
- - **Sliding window attention** (SSSL pattern)
- - **Value embeddings** for alternating layers
+ ## Supported Domains
 
- Final loss: ~0.002 | Accuracy: **99.62%**
+ | Category | Domains |
+ |----------|---------|
+ | E-commerce | Amazon, JD, Taobao, Tmall, Pinduoduo |
+ | Recruitment | Zhilian, BOSS, Lagou |
+ | News | Sina, NetEase, Tencent News, 36kr |
+ | Social | Zhihu, Douban, Xiaohongshu, Reddit |
+ | Video | YouTube, Bilibili |
+ | Travel | Ctrip, Qunar, Mafengwo |
+ | Education | icourse163, imooc |
+ | Tech Docs | GitHub, ReadTheDocs, MDN |
 
  ## Usage
 
- ```python
- from src.prepare import Tokenizer
-
- tokenizer = Tokenizer.from_directory()
- # Encode a URL
- ids = tokenizer.encode("https://example.com/product/123")
- # Run through model + class_head for classification
+ ```bash
+ pip install torch tiktoken
+ python src/infer.py "https://example.com/product/123" # detail page
+ python src/infer.py "https://example.com/search?q=foo" # list page
  ```
 
  ## Class Labels
 
  | Label | Meaning |
  |-------|---------|
- | 0 | A List page |
- | 1 | B Detail page |
+ | 0 (A) | List page — search results, category pages, rankings |
+ | 1 (B) | Detail page — product page, article, profile, video |
+
+ ## Limitations
+
+ - Bilibili ranking pages may be misclassified as detail pages
+ - Very short URLs or URL shorteners may have lower accuracy
+ - Third-party evaluation accuracy (~55%) indicates room for improvement with real-world labeled data
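
The gap between the reported training accuracy (100%) and the third-party evaluation accuracy (~55%) is easier to interpret next to a trivial baseline. The sketch below is not part of this commit. It shows one way to score predictions on a held-out labeled set and compare them with always predicting the most common label. The function names `accuracy` and `majority_baseline` and the sample labels are hypothetical, chosen only for illustration; label ids follow the Class Labels table (0 = A, list page; 1 = B, detail page).

```python
from collections import Counter
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty evaluation set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def majority_baseline(y_true: Sequence[int]) -> float:
    """Accuracy obtained by always predicting the most common label."""
    if not y_true:
        raise ValueError("cannot score an empty evaluation set")
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Hypothetical held-out labels and predictions, for illustration only.
# 0 = A (list page), 1 = B (detail page).
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 1, 0, 1]

print(f"accuracy:          {accuracy(y_true, y_pred):.3f}")
print(f"majority baseline: {majority_baseline(y_true):.3f}")
```

On a balanced set such as the 1,300/1,300 training split, the majority baseline is 50%. If the third-party set is similarly balanced (an assumption; the card does not say), an accuracy of ~55% is only slightly above chance.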