Telmo committed
Commit b7e2589 · verified · 1 Parent(s): 248e89c

Upload folder using huggingface_hub
README.md ADDED
@@ -0,0 +1,136 @@
---
license: apache-2.0
language:
- multilingual
- en
- de
- fr
- es
- pt
- nl
base_model: distilbert-base-multilingual-cased
tags:
- token-classification
- semantic-parsing
- hypergraph
- nlp
pipeline_tag: token-classification
library_name: transformers
---
# Atom Classifier

A multilingual token classifier for **semantic hypergraph parsing**. It classifies each token in a sentence into one of 39 semantic atom types/subtypes, serving as the first (alpha) stage of the [Alpha-Beta semantic hypergraph parser](https://github.com/hyperquest-hq/hyperbase-parser-ab).

## Model Details

- **Architecture:** DistilBertForTokenClassification
- **Base model:** distilbert-base-multilingual-cased
- **Labels:** 39 semantic atom types
- **Max sequence length:** 512

## Label Taxonomy

Atoms are typed according to the [Semantic Hyperedge (SH) notation system](https://hyperquest.ai/hyperbase/manual/notation/). The 7 main types and their subtypes are:
### Concepts (C)
| Label | Description |
|-------|-------------|
| `C` | Generic concept |
| `Cc` | Common noun |
| `Cp` | Proper noun |
| `Ca` | Adjective (as concept) |
| `Ci` | Pronoun |
| `Cd` | Determiner (as concept) |
| `Cm` | Nominal modifier |
| `Cw` | Interrogative word |
| `C#` | Number |

### Predicates (P)
| Label | Description |
|-------|-------------|
| `P` | Generic predicate |
| `Pd` | Declarative predicate |
| `P!` | Imperative predicate |

### Modifiers (M)
| Label | Description |
|-------|-------------|
| `M` | Generic modifier |
| `Ma` | Adjective modifier |
| `Mc` | Conceptual modifier |
| `Md` | Determiner modifier |
| `Me` | Adverbial modifier |
| `Mi` | Infinitive particle |
| `Mj` | Conjunctional modifier |
| `Ml` | Particle |
| `Mm` | Modal (auxiliary verb) |
| `Mn` | Negation |
| `Mp` | Possessive modifier |
| `Ms` | Superlative modifier |
| `Mt` | Prepositional modifier |
| `Mv` | Verbal modifier |
| `Mw` | Specifier |
| `M#` | Number modifier |
| `M=` | Comparative modifier |
| `M^` | Degree modifier |

### Builders (B)
| Label | Description |
|-------|-------------|
| `B` | Generic builder |
| `Bp` | Possessive builder |
| `Br` | Relational builder (preposition) |

### Triggers (T)
| Label | Description |
|-------|-------------|
| `T` | Generic trigger |
| `Tt` | Temporal trigger |
| `Tv` | Verbal trigger |

### Conjunctions (J)
| Label | Description |
|-------|-------------|
| `J` | Generic conjunction |
| `Jr` | Relational conjunction |

### Special
| Label | Description |
|-------|-------------|
| `X` | Excluded token (punctuation, etc.) |
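The seven groups above total 39 labels, matching the model's classification head. A quick, self-contained sanity check (not part of the model card; the variable and function names below are illustrative) reconstructs the full label set from the tables:

```python
# The 39 atom labels from the taxonomy above, grouped by main type.
ATOM_LABELS = {
    "C": ["C", "Cc", "Cp", "Ca", "Ci", "Cd", "Cm", "Cw", "C#"],
    "P": ["P", "Pd", "P!"],
    "M": ["M", "Ma", "Mc", "Md", "Me", "Mi", "Mj", "Ml", "Mm", "Mn",
          "Mp", "Ms", "Mt", "Mv", "Mw", "M#", "M=", "M^"],
    "B": ["B", "Bp", "Br"],
    "T": ["T", "Tt", "Tv"],
    "J": ["J", "Jr"],
    "X": ["X"],
}

def main_type(label: str) -> str:
    # Every subtype label starts with its main-type letter (e.g. "Mt" -> "M").
    return label[0]

all_labels = [label for group in ATOM_LABELS.values() for label in group]
print(len(all_labels))  # -> 39  (9 + 3 + 18 + 3 + 3 + 2 + 1)
```

Collapsing a subtype to its main type this way is handy when coarse-grained evaluation is enough.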
## Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("hyperquest/atom-classifier")
model = AutoModelForTokenClassification.from_pretrained("hyperquest/atom-classifier")

sentence = "Berlin is the capital of Germany."
encoded = tokenizer(sentence, return_tensors="pt", return_offsets_mapping=True)
offset_mapping = encoded.pop("offset_mapping")  # the model forward pass does not accept this key

with torch.no_grad():
    outputs = model(**encoded)

predictions = outputs.logits.argmax(-1)[0].tolist()
word_ids = encoded.word_ids(0)

# Print the predicted atom label for each subword token, skipping special tokens
for idx, word_id in enumerate(word_ids):
    if word_id is not None:
        start, end = offset_mapping[0][idx].tolist()
        label = model.config.id2label[predictions[idx]]
        print(f"{sentence[start:end]:15s} -> {label}")
```
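The loop above emits one label per *subword*; in practice you usually want one label per word. A common convention, sketched below with stubbed inputs (this helper is not part of the model card), is to keep the prediction of each word's first subword:

```python
# Sketch: collapse subword predictions to one label per word by keeping
# the first subword's prediction. Inputs mirror the usage example above,
# stubbed here for illustration.

def word_level_labels(word_ids, predictions, id2label):
    """Return one label per word, taken from the word's first subword."""
    labels = []
    seen = set()
    for idx, word_id in enumerate(word_ids):
        if word_id is None or word_id in seen:
            continue  # skip special tokens and continuation subwords
        seen.add(word_id)
        labels.append(id2label[predictions[idx]])
    return labels

# Stub data: "Berlin is the capital", with "Berlin" split into two subwords.
word_ids = [None, 0, 0, 1, 2, 3, None]       # [CLS] Ber ##lin is the capital [SEP]
predictions = [21, 2, 2, 0, 14, 17, 21]      # argmax ids per subword
id2label = {0: "Pd", 2: "Cp", 14: "Cd", 17: "Cc", 21: "X"}

print(word_level_labels(word_ids, predictions, id2label))
# -> ['Cp', 'Pd', 'Cd', 'Cc']
```

The label ids in the stub follow the `id2label` mapping from this repository's `config.json`.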

## Intended Use

This model is designed as the first stage of the Alpha-Beta semantic hypergraph parser (`hyperbase-parser-ab`). It assigns atom types to tokens, which a rule-based grammar in the beta stage then combines into nested hypergraph structures.

## Part of

- [hyperbase](https://github.com/hyperquest-hq/hyperbase) -- Semantic Hypergraph toolkit
- [hyperbase-parser-ab](https://github.com/hyperquest-hq/hyperbase-parser-ab) -- Alpha-Beta parser
config.json ADDED
@@ -0,0 +1,107 @@
{
  "_name_or_path": "distilbert-base-multilingual-cased",
  "activation": "gelu",
  "architectures": [
    "DistilBertForTokenClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
    "0": "Pd",
    "1": "Ma",
    "2": "Cp",
    "3": "M^",
    "4": "Br",
    "5": "Cw",
    "6": "Mm",
    "7": "Tt",
    "8": "Ml",
    "9": "Bp",
    "10": "Ci",
    "11": "Ms",
    "12": "Md",
    "13": "P!",
    "14": "Cd",
    "15": "Mi",
    "16": "C",
    "17": "Cc",
    "18": "Ca",
    "19": "Mj",
    "20": "M=",
    "21": "X",
    "22": "Mp",
    "23": "Cm",
    "24": "Mt",
    "25": "Me",
    "26": "Mv",
    "27": "Jr",
    "28": "M",
    "29": "Tv",
    "30": "J",
    "31": "M#",
    "32": "B",
    "33": "Mc",
    "34": "Mn",
    "35": "Mw",
    "36": "C#",
    "37": "T",
    "38": "P"
  },
  "initializer_range": 0.02,
  "label2id": {
    "B": 32,
    "Bp": 9,
    "Br": 4,
    "C": 16,
    "C#": 36,
    "Ca": 18,
    "Cc": 17,
    "Cd": 14,
    "Ci": 10,
    "Cm": 23,
    "Cp": 2,
    "Cw": 5,
    "J": 30,
    "Jr": 27,
    "M": 28,
    "M#": 31,
    "M=": 20,
    "M^": 3,
    "Ma": 1,
    "Mc": 33,
    "Md": 12,
    "Me": 25,
    "Mi": 15,
    "Mj": 19,
    "Ml": 8,
    "Mm": 6,
    "Mn": 34,
    "Mp": 22,
    "Ms": 11,
    "Mt": 24,
    "Mv": 26,
    "Mw": 35,
    "P": 38,
    "P!": 13,
    "Pd": 0,
    "T": 37,
    "Tt": 7,
    "Tv": 29,
    "X": 21
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "output_past": true,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.49.0",
  "vocab_size": 119547
}
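The `id2label` and `label2id` tables in a config like this one must be exact inverses of each other, or label lookups will silently disagree. A minimal consistency check, shown here on a small excerpt of the mapping above, might look like:

```python
# Excerpt of the mapping from config.json above; the full table has 39 entries.
id2label = {0: "Pd", 1: "Ma", 2: "Cp", 3: "M^", 4: "Br"}
label2id = {"Pd": 0, "Ma": 1, "Cp": 2, "M^": 3, "Br": 4}

# Inverting one table must reproduce the other, with no duplicate labels.
assert {v: k for k, v in id2label.items()} == label2id
assert len(set(id2label.values())) == len(id2label)
print("mappings consistent")
```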
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2567ad7bcd298490a578acda9a82ce1e52ed074c0c89bc36f5c6d328874b045e
size 539068644
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,57 @@
{
  "add_prefix_space": true,
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_lower_case": false,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "DistilBertTokenizer",
  "unk_token": "[UNK]"
}
training_args.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a8e983a187e2f77ff8ef15fb1de6aff11f9fdd325c133ae7fcc631c464bb7cbd
size 5240
vocab.txt ADDED
The diff for this file is too large to render. See raw diff