Commit 0efa21a by jqueguiner
1 Parent(s): 29af53e

feat: init
NuZero_token_token_metrics.txt ADDED
@@ -0,0 +1,35 @@
+ ##############################################
+ step: final
+ Table for all datasets except CrossNER
+ ACE 2004 : 38.5%
+ ACE 2005 : 39.6%
+ AnatEM : 48.8%
+ Broad Tweet Corpus : 64.5%
+ CoNLL 2003 : 66.0%
+ FabNER : 35.8%
+ FindVehicle : 48.8%
+ GENIA_NER : 59.4%
+ HarveyNER : 28.0%
+ MultiNERD : 61.8%
+ Ontonotes : 39.7%
+ PolyglotNER : 48.5%
+ TweetNER7 : 52.3%
+ WikiANN en : 69.6%
+ WikiNeural : 75.0%
+ bc2gm : 64.6%
+ bc4chemd : 64.6%
+ bc5cdr : 74.1%
+ ncbi : 74.9%
+ Average : 55.5%
+
+ Table for zero-shot benchmark
+ CrossNER_AI : 59.1%
+ CrossNER_literature : 72.4%
+ CrossNER_music : 76.0%
+ CrossNER_politics : 83.1%
+ CrossNER_science : 66.6%
+ mit-movie : 65.2%
+ mit-restaurant : 53.6%
+ Average : 68.0%
+ ##############################################
+
README.md ADDED
@@ -0,0 +1,108 @@
+ ---
+ license: mit
+ datasets:
+ - numind/NuNER
+ library_name: gliner
+ language:
+ - en
+ pipeline_tag: token-classification
+ tags:
+ - entity recognition
+ - NER
+ - named entity recognition
+ - zero shot
+ - zero-shot
+ ---
+
+ NuNER Zero is a zero-shot Named Entity Recognition (NER) model (see [NuNER](https://huggingface.co/collections/numind/nuner-token-classification-and-ner-backbones-65e1f6e14639e2a465af823b) for the few-shot setting).
+
+ NuNER Zero uses the [GLiNER](https://huggingface.co/papers/2311.08526) architecture: its input is a concatenation of entity types and text.
+
+ Unlike GLiNER, NuNER Zero is a token classifier, which allows it to detect arbitrarily long entities.
+
+ NuNER Zero was trained on the [NuNER v2.0](https://huggingface.co/numind/NuNER-v2.0) dataset, which combines subsets of the Pile and C4 annotated via LLMs using [NuNER's procedure](https://huggingface.co/papers/2402.15343).
+
+ NuNER Zero is (at the time of its release) the best compact zero-shot NER model (+3.1% token-level F1 score over GLiNER-large-v2.1 on GLiNER's benchmark).
+
+ <p align="left">
+ <img src="zero_shot_performance_unzero_token.png" width="600">
+ </p>
+
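+ The comparison above is reported as token-level F1. As a minimal illustrative sketch (not the benchmark's actual evaluation code), micro-averaged token-level F1 can be computed from gold and predicted token labels like this:
+
+ ```python
+ # Illustrative example: micro-averaged token-level F1.
+ # Each annotation is a set of (token_index, entity_type) pairs.
+ gold = {(0, "person"), (1, "person"), (5, "organization")}
+ pred = {(0, "person"), (5, "organization"), (6, "organization")}
+
+ tp = len(gold & pred)  # correctly labeled tokens
+ precision = tp / len(pred) if pred else 0.0
+ recall = tp / len(gold) if gold else 0.0
+ f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
+
+ print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")  # P=0.67 R=0.67 F1=0.67
+ ```
+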
+ ## Installation & Usage
+
+ ```
+ pip install gliner
+ ```
+
+ **NuZero requires labels to be lower-cased.**
+
+ ```python
+ from gliner import GLiNER
+
+ def merge_entities(entities, text):
+     # NuNER Zero tags individual tokens, so multi-token entities come back as
+     # adjacent pieces; merge consecutive predictions of the same label into one span.
+     if not entities:
+         return []
+     merged = []
+     current = entities[0]
+     for next_entity in entities[1:]:
+         if next_entity['label'] == current['label'] and (next_entity['start'] == current['end'] + 1 or next_entity['start'] == current['end']):
+             current['text'] = text[current['start']: next_entity['end']].strip()
+             current['end'] = next_entity['end']
+         else:
+             merged.append(current)
+             current = next_entity
+     # Append the last entity
+     merged.append(current)
+     return merged
+
+
+ model = GLiNER.from_pretrained("numind/NuNerZero")
+
+ # NuZero requires labels to be lower-cased!
+ labels = ["organization", "initiative", "project"]
+ labels = [l.lower() for l in labels]
+
+ text = "At the annual technology summit, the keynote address was delivered by a senior member of the Association for Computing Machinery Special Interest Group on Algorithms and Computation Theory, which recently launched an expansive initiative titled 'Quantum Computing and Algorithmic Innovations: Shaping the Future of Technology'. This initiative explores the implications of quantum mechanics on next-generation computing and algorithm design and is part of a broader effort that includes the 'Global Computational Science Advancement Project'. The latter focuses on enhancing computational methodologies across scientific disciplines, aiming to set new benchmarks in computational efficiency and accuracy."
+
+ entities = model.predict_entities(text, labels)
+
+ entities = merge_entities(entities, text)
+
+ for entity in entities:
+     print(entity["text"], "=>", entity["label"])
+ ```
+
+ ```
+ Association for Computing Machinery Special Interest Group on Algorithms and Computation Theory => organization
+ Quantum Computing and Algorithmic Innovations: Shaping the Future of Technology => initiative
+ Global Computational Science Advancement Project => project
+ ```
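+
+ Recent `gliner` releases also expose a confidence threshold on prediction; raising it trades recall for precision. A brief sketch, assuming the standard GLiNER `threshold` keyword argument:
+
+ ```python
+ # Stricter decoding: keep only predictions the model is more confident about.
+ entities = model.predict_entities(text, labels, threshold=0.7)
+ entities = merge_entities(entities, text)
+ ```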
+
+ ## Fine-tuning
+
+ A fine-tuning script can be found [here](https://colab.research.google.com/drive/1-hk5AIdX-TZdyes1yx-0qzS34YYEf3d2?usp=sharing).
+
+
+ ## Citation
+ ### This work
+ ```bibtex
+ @misc{bogdanov2024nuner,
+       title={NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data},
+       author={Sergei Bogdanov and Alexandre Constantin and Timothée Bernard and Benoit Crabbé and Etienne Bernard},
+       year={2024},
+       eprint={2402.15343},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL}
+ }
+ ```
+ ### Previous work
+ ```bibtex
+ @misc{zaratiana2023gliner,
+       title={GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer},
+       author={Urchade Zaratiana and Nadi Tomeh and Pierre Holat and Thierry Charnois},
+       year={2023},
+       eprint={2311.08526},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL}
+ }
+ ```
gliner_config.json ADDED
@@ -0,0 +1,27 @@
+ {
+   "lr_encoder": "1e-5",
+   "lr_others": "5e-5",
+   "num_steps": 60000,
+   "warmup_ratio": 0.1,
+   "train_batch_size": 4,
+   "gradient_accumulation_steps": 2,
+   "eval_every": 2500,
+   "max_width": 1,
+   "model_name": "microsoft/deberta-v3-large",
+   "fine_tune": true,
+   "subtoken_pooling": "first",
+   "hidden_size": 768,
+   "span_mode": "marker",
+   "dropout": 0.4,
+   "root_dir": "ablation_backbone",
+   "train_data": "NuMinds_custom_data_mix.json",
+   "prev_path": "none",
+   "size_sup": -1,
+   "max_types": 25,
+   "shuffle_types": true,
+   "random_drop": true,
+   "max_neg_type_ratio": 1,
+   "max_len": 384,
+   "name": "large",
+   "log_dir": "logs"
+ }
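The file above records the training settings shipped with this checkpoint (backbone microsoft/deberta-v3-large, 60,000 training steps, max sequence length 384). A minimal sketch for inspecting it locally; note that the learning rates are stored as JSON strings and need casting before reuse:

```python
import json

# Load the training configuration that ships alongside the checkpoint.
with open("gliner_config.json") as f:
    cfg = json.load(f)

# Learning rates are stored as strings in this file; cast them before reuse.
lr_encoder = float(cfg["lr_encoder"])  # 1e-05
lr_others = float(cfg["lr_others"])    # 5e-05

print(cfg["model_name"], cfg["num_steps"], cfg["max_len"], lr_encoder, lr_others)
```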
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:96a89110ff7d5d029a1b1bc1236dc46b9e01202ca807a8d319fd4fe3009403f5
+ size 1795685762
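The weights are stored via Git LFS; the pointer above records their SHA-256 digest and size (about 1.8 GB). As an illustrative sketch (not a check performed by gliner itself), a downloaded copy can be verified against that digest:

```python
import hashlib

# Hash the downloaded weights in chunks and compare with the LFS pointer's digest.
expected = "96a89110ff7d5d029a1b1bc1236dc46b9e01202ca807a8d319fd4fe3009403f5"
h = hashlib.sha256()
with open("pytorch_model.bin", "rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
assert h.hexdigest() == expected, "checksum mismatch"
```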
zero_shot_performance_unzero_token.png ADDED