Telmo committed
Commit b7e2589 · verified · 1 Parent(s): 248e89c

Upload folder using huggingface_hub
README.md ADDED
@@ -0,0 +1,136 @@
---
license: apache-2.0
language:
- multilingual
- en
- de
- fr
- es
- pt
- nl
base_model: distilbert-base-multilingual-cased
tags:
- token-classification
- semantic-parsing
- hypergraph
- nlp
pipeline_tag: token-classification
library_name: transformers
---
# Atom Classifier

A multilingual token classifier for **semantic hypergraph parsing**. It classifies each token in a sentence into one of 39 semantic atom types/subtypes, serving as the first (alpha) stage of the [Alpha-Beta semantic hypergraph parser](https://github.com/hyperquest-hq/hyperbase-parser-ab).

## Model Details

- **Architecture:** DistilBertForTokenClassification
- **Base model:** distilbert-base-multilingual-cased
- **Labels:** 39 semantic atom types
- **Max sequence length:** 512

## Label Taxonomy

Atoms are typed according to the [Semantic Hyperedge (SH) notation system](https://hyperquest.ai/hyperbase/manual/notation/). The 7 main types and their subtypes are:
### Concepts (C)
| Label | Description |
|-------|-------------|
| `C` | Generic concept |
| `Cc` | Common noun |
| `Cp` | Proper noun |
| `Ca` | Adjective (as concept) |
| `Ci` | Pronoun |
| `Cd` | Determiner (as concept) |
| `Cm` | Nominal modifier |
| `Cw` | Interrogative word |
| `C#` | Number |

### Predicates (P)
| Label | Description |
|-------|-------------|
| `P` | Generic predicate |
| `Pd` | Declarative predicate |
| `P!` | Imperative predicate |

### Modifiers (M)
| Label | Description |
|-------|-------------|
| `M` | Generic modifier |
| `Ma` | Adjective modifier |
| `Mc` | Conceptual modifier |
| `Md` | Determiner modifier |
| `Me` | Adverbial modifier |
| `Mi` | Infinitive particle |
| `Mj` | Conjunctional modifier |
| `Ml` | Particle |
| `Mm` | Modal (auxiliary verb) |
| `Mn` | Negation |
| `Mp` | Possessive modifier |
| `Ms` | Superlative modifier |
| `Mt` | Prepositional modifier |
| `Mv` | Verbal modifier |
| `Mw` | Specifier |
| `M#` | Number modifier |
| `M=` | Comparative modifier |
| `M^` | Degree modifier |

### Builders (B)
| Label | Description |
|-------|-------------|
| `B` | Generic builder |
| `Bp` | Possessive builder |
| `Br` | Relational builder (preposition) |

### Triggers (T)
| Label | Description |
|-------|-------------|
| `T` | Generic trigger |
| `Tt` | Temporal trigger |
| `Tv` | Verbal trigger |

### Conjunctions (J)
| Label | Description |
|-------|-------------|
| `J` | Generic conjunction |
| `Jr` | Relational conjunction |

### Special
| Label | Description |
|-------|-------------|
| `X` | Excluded token (punctuation, etc.) |
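The seven groups above total 39 labels, matching the model's classification head. A quick, self-contained sanity check (not part of the model card; the variable and function names below are illustrative) reconstructs the full label set from the tables:

```python
# The 39 atom labels from the taxonomy above, grouped by main type.
ATOM_LABELS = {
    "C": ["C", "Cc", "Cp", "Ca", "Ci", "Cd", "Cm", "Cw", "C#"],
    "P": ["P", "Pd", "P!"],
    "M": ["M", "Ma", "Mc", "Md", "Me", "Mi", "Mj", "Ml", "Mm", "Mn",
          "Mp", "Ms", "Mt", "Mv", "Mw", "M#", "M=", "M^"],
    "B": ["B", "Bp", "Br"],
    "T": ["T", "Tt", "Tv"],
    "J": ["J", "Jr"],
    "X": ["X"],
}

def main_type(label: str) -> str:
    # Every subtype label starts with its main-type letter (e.g. "Mt" -> "M").
    return label[0]

all_labels = [label for group in ATOM_LABELS.values() for label in group]
print(len(all_labels))  # -> 39  (9 + 3 + 18 + 3 + 3 + 2 + 1)
```

Collapsing a subtype to its main type this way is handy when coarse-grained evaluation is enough.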
## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("hyperquest/atom-classifier")
model = AutoModelForTokenClassification.from_pretrained("hyperquest/atom-classifier")

sentence = "Berlin is the capital of Germany."
encoded = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
offset_mapping = encoded.pop("offset_mapping")  # the model forward pass does not accept this key

with torch.no_grad():
    outputs = model(**encoded)

predictions = outputs.logits.argmax(-1)[0].tolist()
word_ids = encoded.word_ids(0)

# Print the predicted atom label for each subword token, skipping special tokens
for idx, word_id in enumerate(word_ids):
    if word_id is not None:
        start, end = offset_mapping[0][idx].tolist()
        label = model.config.id2label[predictions[idx]]
        print(f"{sentence[start:end]:15s} -> {label}")
```
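The loop above emits one label per *subword*; in practice you usually want one label per word. A common convention, sketched below with stubbed inputs (this helper is not part of the model card), is to keep the prediction of each word's first subword:

```python
# Sketch: collapse subword predictions to one label per word by keeping
# the first subword's prediction. Inputs mirror the usage example above,
# stubbed here for illustration.

def word_level_labels(word_ids, predictions, id2label):
    """Return one label per word, taken from the word's first subword."""
    labels = []
    seen = set()
    for idx, word_id in enumerate(word_ids):
        if word_id is None or word_id in seen:
            continue  # skip special tokens and continuation subwords
        seen.add(word_id)
        labels.append(id2label[predictions[idx]])
    return labels

# Stub data: "Berlin is the capital", with "Berlin" split into two subwords.
word_ids = [None, 0, 0, 1, 2, 3, None]       # [CLS] Ber ##lin is the capital [SEP]
predictions = [21, 2, 2, 0, 14, 17, 21]      # argmax ids per subword
id2label = {0: "Pd", 2: "Cp", 14: "Cd", 17: "Cc", 21: "X"}

print(word_level_labels(word_ids, predictions, id2label))
# -> ['Cp', 'Pd', 'Cd', 'Cc']
```

The label ids in the stub follow the `id2label` mapping from this repository's `config.json`.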

## Intended Use

This model is designed as the first stage of the Alpha-Beta semantic hypergraph parser (`hyperbase-parser-ab`). It assigns atom types to tokens, which a rule-based grammar in the beta stage then combines into nested hypergraph structures.

## Part of

- [hyperbase](https://github.com/hyperquest-hq/hyperbase) -- Semantic Hypergraph toolkit
- [hyperbase-parser-ab](https://github.com/hyperquest-hq/hyperbase-parser-ab) -- Alpha-Beta parser
config.json ADDED
@@ -0,0 +1,107 @@
{
  "_name_or_path": "distilbert-base-multilingual-cased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForTokenClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "Pd",
    "1": "Ma",
    "2": "Cp",
    "3": "M^",
    "4": "Br",
    "5": "Cw",
    "6": "Mm",
    "7": "Tt",
    "8": "Ml",
    "9": "Bp",
    "10": "Ci",
    "11": "Ms",
    "12": "Md",
    "13": "P!",
    "14": "Cd",
    "15": "Mi",
    "16": "C",
    "17": "Cc",
    "18": "Ca",
    "19": "Mj",
    "20": "M=",
    "21": "X",
    "22": "Mp",
    "23": "Cm",
    "24": "Mt",
    "25": "Me",
    "26": "Mv",
    "27": "Jr",
    "28": "M",
    "29": "Tv",
    "30": "J",
    "31": "M#",
    "32": "B",
    "33": "Mc",
    "34": "Mn",
    "35": "Mw",
    "36": "C#",
    "37": "T",
    "38": "P"
  },
  "initializer_range": 0.02,
  "label2id": {
    "B": 32,
    "Bp": 9,
    "Br": 4,
    "C": 16,
    "C#": 36,
    "Ca": 18,
    "Cc": 17,
    "Cd": 14,
    "Ci": 10,
    "Cm": 23,
    "Cp": 2,
    "Cw": 5,
    "J": 30,
    "Jr": 27,
    "M": 28,
    "M#": 31,
    "M=": 20,
    "M^": 3,
    "Ma": 1,
    "Mc": 33,
    "Md": 12,
    "Me": 25,
    "Mi": 15,
    "Mj": 19,
    "Ml": 8,
    "Mm": 6,
    "Mn": 34,
    "Mp": 22,
    "Ms": 11,
    "Mt": 24,
    "Mv": 26,
    "Mw": 35,
    "P": 38,
    "P!": 13,
    "Pd": 0,
    "T": 37,
    "Tt": 7,
    "Tv": 29,
    "X": 21
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.49.0",
  "vocab_size": 119547
}
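The `id2label` and `label2id` tables in a config like this one must be exact inverses of each other, or label lookups will silently disagree. A minimal consistency check, shown here on a small excerpt of the mapping above, might look like:

```python
# Excerpt of the mapping from config.json above; the full table has 39 entries.
id2label = {0: "Pd", 1: "Ma", 2: "Cp", 3: "M^", 4: "Br"}
label2id = {"Pd": 0, "Ma": 1, "Cp": 2, "M^": 3, "Br": 4}

# Inverting one table must reproduce the other, with no duplicate labels.
assert {v: k for k, v in id2label.items()} == label2id
assert len(set(id2label.values())) == len(id2label)
print("mappings consistent")
```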
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2567ad7bcd298490a578acda9a82ce1e52ed074c0c89bc36f5c6d328874b045e
size 539068644
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
{
  "add_prefix_space": true,
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_lower_case": false,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "DistilBertTokenizer",
  "unk_token": "[UNK]"
}
training_args.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a8e983a187e2f77ff8ef15fb1de6aff11f9fdd325c133ae7fcc631c464bb7cbd
size 5240
vocab.txt ADDED
The diff for this file is too large to render. See raw diff