emanuelaboros committed
Commit
c36a6f3
·
1 Parent(s): 209e610
.DS_Store ADDED
Binary file (6.15 kB)
 
README.md CHANGED
@@ -1,3 +1,222 @@
1
- ---
2
- license: agpl-3.0
3
- ---
 
1
+ ---
2
+ library_name: transformers
3
+ language:
4
+ - en
5
+ - fr
6
+ - de
7
+ tags:
8
+ - v1.0.0
+ license: agpl-3.0
+ ---
10
+
11
+ # Model Card for `impresso-project/ner-stacked-bert-multilingual-light`
12
+
13
+ The **Impresso NER model** is a multilingual named entity recognition model trained for historical document processing. It is based on a stacked Transformer architecture and is designed to identify fine-grained and coarse-grained entity types in digitized historical texts, including names, titles, and locations.
14
+
15
+ ## Model Details
16
+
17
+ ### Model Description
18
+
19
+ - **Developed by:** The [Impresso team](https://impresso-project.ch) at EPFL. Impresso is an interdisciplinary research project focused on historical media analysis across languages, time, and modalities, funded by the Swiss National Science Foundation ([CRSII5_173719](http://p3.snf.ch/project-173719), [CRSII5_213585](https://data.snf.ch/grants/grant/213585)) and the Luxembourg National Research Fund (grant No. 17498891).
20
+ - **Model type:** Stacked BERT-based token classification for named entity recognition
21
+ - **Languages:** French, German, English (with support for multilingual historical texts)
22
+ - **License:** [AGPL v3+](https://github.com/impresso/impresso-pyindexation/blob/master/LICENSE)
23
+ - **Finetuned from:** [`dbmdz/bert-medium-historic-multilingual-cased`](https://huggingface.co/dbmdz/bert-medium-historic-multilingual-cased)
24
+
25
+
26
+ ### Model Architecture
27
+
28
+ The model architecture consists of the following components:
29
+ - A **pre-trained BERT encoder** (multilingual historic BERT) as the base.
30
+ - **One or two Transformer encoder layers** stacked on top of the BERT encoder.
31
+ - A **Conditional Random Field (CRF)** decoder layer to model label dependencies.
32
+ - **Learned absolute positional embeddings** for improved handling of noisy inputs.
33
+
34
+ These additional Transformer layers help mitigate the effects of OCR noise, spelling variation, and non-standard linguistic usage found in historical documents. The entire stack is fine-tuned end-to-end for token classification.
35
+
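+ A minimal sketch of this stacked design is shown below. It is an illustration only, assuming `transformers` and `torch`; the released implementation lives in `modeling_stacked.py` in this repository, uses multi-task classification heads, and differs in detail (the CRF decoder is omitted here for brevity).
+
+ ```python
+ import torch.nn as nn
+ from transformers import AutoModel
+
+ class StackedTokenClassifier(nn.Module):
+     """Simplified sketch: BERT encoder + stacked Transformer layers + token classifier."""
+
+     def __init__(self, base="dbmdz/bert-medium-historic-multilingual-cased", num_labels=11):
+         super().__init__()
+         self.encoder = AutoModel.from_pretrained(base)  # pre-trained historic multilingual BERT
+         hidden = self.encoder.config.hidden_size
+         layer = nn.TransformerEncoderLayer(
+             d_model=hidden, nhead=self.encoder.config.num_attention_heads
+         )
+         self.stack = nn.TransformerEncoder(layer, num_layers=2)  # the additional "stacked" layers
+         self.classifier = nn.Linear(hidden, num_labels)  # per-token logits (e.g., 11 coarse BIO labels)
+
+     def forward(self, input_ids, attention_mask):
+         hidden_states = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
+         # nn.TransformerEncoder expects (seq_len, batch, hidden) by default
+         hidden_states = self.stack(hidden_states.transpose(0, 1)).transpose(0, 1)
+         return self.classifier(hidden_states)
+ ```
+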
36
+ ### Entity Types Supported
37
+
38
+ The model supports both coarse-grained and fine-grained entity types defined in the HIPE-2020/2022 guidelines. The output format of the model includes structured predictions with contextual and semantic details. Each prediction is a dictionary with the following fields:
39
+
40
+ ```python
41
+ {
42
+ 'type': 'pers' | 'org' | 'loc' | 'time' | 'prod',
43
+ 'confidence_ner': float, # Confidence score
44
+ 'surface': str, # Surface form in text
45
+ 'lOffset': int, # Start character offset
46
+ 'rOffset': int, # End character offset
47
+ 'name': str, # Optional: full name (for persons)
48
+ 'title': str, # Optional: title (for persons)
49
+ 'function': str # Optional: function (if detected)
50
+ }
51
+ ```
52
+
53
+
54
+ #### Coarse-Grained Entity Types:
55
+ - **pers**: Person entities (individuals, collectives, authors)
56
+ - **org**: Organizations (administrative, enterprise, press agencies)
57
+ - **prod**: Products (media)
58
+ - **time**: Time expressions (absolute dates)
59
+ - **loc**: Locations (towns, regions, countries, physical, facilities)
60
+
61
+ If they appear in the text surrounding an entity, the model also returns **person-specific attributes** (see the usage sketch below) such as:
62
+ - `name`: canonical full name
63
+ - `title`: honorific or title (e.g., "king", "chancellor")
64
+ - `function`: role or function in context (if available)
65
+
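+ A minimal usage sketch for reading these fields from the pipeline output (field names as documented above; `ner_pipeline` and `sentence` are assumed to be defined as in the quickstart example further below):
+
+ ```python
+ entities = ner_pipeline(sentence)  # list of prediction dicts with the fields listed above
+ persons = [
+     {
+         "surface": e["surface"],
+         "name": e.get("name", e["surface"]),   # fall back to the surface form if no name was attached
+         "title": e.get("title"),               # optional honorific, e.g. "roi", "chancelier"
+         "span": (e["lOffset"], e["rOffset"]),
+     }
+     for e in entities
+     if e["type"] == "pers"
+ ]
+ ```
+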
66
+ ### Model Sources
67
+
68
+ - **Repository:** https://huggingface.co/impresso-project/ner-stacked-bert-multilingual
69
+ - **Paper:** [CoNLL 2020](https://aclanthology.org/2020.conll-1.35/)
70
+ - **Demo:** [Impresso project](https://impresso-project.ch)
71
+
72
+ ## Uses
73
+
74
+ ### Direct Use
75
+
76
+ The model is intended to be used directly through the Hugging Face `pipeline` API for token classification, via the custom `generic-ner` task, on historical texts (see *How to Get Started with the Model* below).
77
+
78
+ ### Downstream Use
79
+
80
+ Can be used for downstream tasks such as:
81
+ - Historical information extraction
82
+ - Biographical reconstruction
83
+ - Place and person mention detection across historical archives
84
+
85
+ ### Out-of-Scope Use
86
+
87
+ - Not suitable for contemporary named entity recognition in domains such as social media or modern news.
88
+ - Not optimized for OCR-free modern corpora.
89
+
90
+ ## Bias, Risks, and Limitations
91
+
92
+ Because it is trained on historical documents, the model may reflect historical biases and inaccuracies. It may also underperform on contemporary texts and on languages other than French, German, and English.
93
+
94
+ ### Recommendations
95
+
96
+ - Users should be cautious of historical and typographical biases.
97
+ - Consider post-processing to filter false positives caused by OCR noise (a minimal sketch follows below).
98
+
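+ A minimal sketch of such a filter, assuming `entities` holds the list of prediction dicts returned by the pipeline (the threshold is illustrative, not a recommended setting):
+
+ ```python
+ MIN_CONFIDENCE = 50.0  # hypothetical threshold on the percentage score; tune on held-out data
+ filtered = [e for e in entities if e["confidence_ner"] >= MIN_CONFIDENCE]
+ ```
+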
99
+ ## How to Get Started with the Model
100
+
101
+ ```python
102
+ from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline
103
+
104
+ MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual-light"
105
+
106
+ tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
107
+
108
+ ner_pipeline = pipeline("generic-ner", model=MODEL_NAME, tokenizer=tokenizer, trust_remote_code=True, device='cpu')
109
+
110
+ sentence = "En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité. À la cour du roi Philippe VI, les murs du Louvre étaient animés par les rapports sombres venus de Paris et des villes environnantes. La peste ne montrait aucun signe de répit, et le chancelier Guillaume de Nogaret, le conseiller le plus fidèle du roi, portait le lourd fardeau de gérer la survie du royaume."
111
+ entities = ner_pipeline(sentence)
112
+ print(entities)
113
+ ```
114
+ #### Example Output
115
+
116
+ ```python
117
+ [
118
+ {'type': 'time', 'confidence_ner': 85.0, 'surface': "an 1348", 'lOffset': 0, 'rOffset': 12},
119
+ {'type': 'loc', 'confidence_ner': 90.75, 'surface': "Europe", 'lOffset': 69, 'rOffset': 75},
120
+ {'type': 'loc', 'confidence_ner': 75.45, 'surface': "Royaume de France", 'lOffset': 80, 'rOffset': 97},
121
+ {'type': 'pers', 'confidence_ner': 85.27, 'surface': "roi Philippe VI", 'lOffset': 181, 'rOffset': 196, 'title': "roi", 'name': "roi Philippe VI"},
122
+ {'type': 'loc', 'confidence_ner': 30.59, 'surface': "Louvre", 'lOffset': 210, 'rOffset': 216},
123
+ {'type': 'loc', 'confidence_ner': 94.46, 'surface': "Paris", 'lOffset': 266, 'rOffset': 271},
124
+ {'type': 'pers', 'confidence_ner': 96.1, 'surface': "chancelier Guillaume de Nogaret", 'lOffset': 350, 'rOffset': 381, 'title': "chancelier", 'name': "Guillaume de Nogaret"},
125
+ {'type': 'loc', 'confidence_ner': 49.35, 'surface': "Royaume", 'lOffset': 80, 'rOffset': 87},
126
+ {'type': 'loc', 'confidence_ner': 24.18, 'surface': "France", 'lOffset': 91, 'rOffset': 97}
127
+ ]
128
+ ```
129
+
130
+ ## Training Details
131
+
132
+ ### Training Data
133
+
134
+ The model was trained on the Impresso HIPE-2020 dataset, a subset of the [HIPE-2022 corpus](https://github.com/hipe-eval/HIPE-2022-data), which includes richly annotated OCR-transcribed historical newspaper content.
135
+
136
+ ### Training Procedure
137
+
138
+ #### Preprocessing
139
+
140
+ OCR content was cleaned and segmented. Entity types follow the HIPE-2020 typology.
141
+
142
+ #### Training Hyperparameters
143
+
144
+ - **Training regime:** Mixed precision (fp16)
145
+ - **Epochs:** 5
146
+ - **Max sequence length:** 512
147
+ - **Base model:** `dbmdz/bert-medium-historic-multilingual-cased`
148
+ - **Stacked Transformer layers:** 2
149
+
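+ The exact training script is not part of this repository; the following is a hedged sketch of `TrainingArguments` consistent with the values above (batch size and learning rate are assumptions, not documented settings):
+
+ ```python
+ from transformers import TrainingArguments
+
+ training_args = TrainingArguments(
+     output_dir="ner-stacked-bert",     # hypothetical output path
+     num_train_epochs=5,                # as listed above
+     fp16=True,                         # mixed-precision training
+     per_device_train_batch_size=16,    # assumption: not stated in this card
+     learning_rate=5e-5,                # assumption: a common default
+     evaluation_strategy="epoch",
+     save_strategy="epoch",
+ )
+ ```
+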
150
+ #### Speeds, Sizes, Times
151
+
152
+ - **Model size:** ~500MB
153
+ - **Training time:** ~1h on 1 GPU (NVIDIA TITAN X)
154
+
155
+ ## Evaluation
156
+
157
+ #### Testing Data
158
+
159
+ Held-out portion of HIPE-2020 (French, German)
160
+
161
+ #### Metrics
162
+
163
+ - F1-score (micro, macro)
164
+ - Entity-level precision/recall
165
+
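+ For illustration, entity-level scores can be computed with the `seqeval` library (an assumption; this is not the official HIPE evaluation tooling):
+
+ ```python
+ from seqeval.metrics import f1_score, precision_score, recall_score
+
+ y_true = [["B-pers", "I-pers", "O", "B-loc", "O"]]
+ y_pred = [["B-pers", "I-pers", "O", "O", "O"]]
+ print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
+ ```
+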
166
+ ### Results
167
+
168
+ | Language | Precision | Recall | F1-score |
169
+ |----------|-----------|--------|----------|
170
+ | French | 84.2 | 81.6 | 82.9 |
171
+ | German | 82.0 | 78.7 | 80.3 |
172
+
173
+ #### Summary
174
+
175
+ The model performs robustly on noisy, OCR-derived historical content and supports fine-grained entity typologies.
176
+
177
+ ## Environmental Impact
178
+
179
+ - **Hardware Type:** NVIDIA TITAN X (Pascal, 12GB)
180
+ - **Hours used:** ~1 hour
181
+ - **Compute Provider:** EPFL, Switzerland
182
+ - **Carbon Emitted:** ~0.022 kg CO₂eq (estimated)
183
+
184
+ ## Technical Specifications
185
+
186
+ ### Model Architecture and Objective
187
+
188
+ Stacked BERT architecture with multitask token classification head supporting HIPE-type entity labels.
189
+
190
+ ### Compute Infrastructure
191
+
192
+ #### Hardware
193
+
194
+ 1x NVIDIA TITAN X (Pascal, 12GB)
195
+
196
+ #### Software
197
+
198
+ - Python 3.11
199
+ - PyTorch 2.0
200
+ - Transformers 4.36
201
+
202
+ ## Citation
203
+
204
+ **BibTeX:**
205
+
206
+ ```bibtex
207
+ @inproceedings{boros2020alleviating,
208
+ title={Alleviating digitization errors in named entity recognition for historical documents},
209
+ author={Boros, Emanuela and Hamdi, Ahmed and Pontes, Elvys Linhares and Cabrera-Diego, Luis-Adri{\'a}n and Moreno, Jose G and Sidere, Nicolas and Doucet, Antoine},
210
+ booktitle={Proceedings of the 24th conference on computational natural language learning},
211
+ pages={431--441},
212
+ year={2020}
213
+ }
214
+ ```
215
+
216
+ ## Contact
217
+
218
+ - Website: [https://impresso-project.ch](https://impresso-project.ch)
219
+
220
+ <p align="center">
221
+ <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
222
+ </p>
__init__.py ADDED
File without changes
config.json ADDED
@@ -0,0 +1,139 @@
1
+ {
2
+ "_name_or_path": "experiments/model_dbmdz_bert_medium_historic_multilingual_cased_max_sequence_length_512_epochs_5_run_multitask.baseline.False2025/",
3
+ "architectures": [
4
+ "ExtendedMultitaskTimeModelForTokenClassification"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "auto_map": {
8
+ "AutoConfig": "configuration_stacked.ImpressoConfig",
9
+ "AutoModelForTokenClassification": "modeling_stacked.ExtendedMultitaskTimeModelForTokenClassification"
10
+ },
11
+ "classifier_dropout": null,
12
+ "custom_pipelines": {
13
+ "generic-ner": {
14
+ "impl": "generic_ner.ExtendedMultitaskTimeModelForTokenClassificationPipeline",
15
+ "pt": "AutoModelForTokenClassification"
16
+ }
17
+ },
18
+ "hidden_act": "gelu",
19
+ "hidden_dropout_prob": 0.1,
20
+ "hidden_size": 512,
21
+ "initializer_range": 0.02,
22
+ "intermediate_size": 2048,
23
+ "label_map": {
24
+ "NE-COARSE-LIT": {
25
+ "I-pers": 0,
26
+ "I-prod": 1,
27
+ "B-prod": 2,
28
+ "B-loc": 3,
29
+ "I-time": 4,
30
+ "B-pers": 5,
31
+ "B-org": 6,
32
+ "B-time": 7,
33
+ "I-loc": 8,
34
+ "O": 9,
35
+ "I-org": 10
36
+ },
37
+ "NE-FINE-COMP": {
38
+ "I-comp.title": 0,
39
+ "B-comp.title": 1,
40
+ "I-comp.function": 2,
41
+ "I-comp.name": 3,
42
+ "B-comp.function": 4,
43
+ "O": 5,
44
+ "B-comp.name": 6
45
+ }
46
+ },
47
+ "layer_norm_eps": 1e-12,
48
+ "max_position_embeddings": 512,
49
+ "model_type": "stacked_bert",
50
+ "num_attention_heads": 8,
51
+ "num_hidden_layers": 8,
52
+ "pad_token_id": 0,
53
+ "position_embedding_type": "absolute",
54
+ "pretrained_config": {
55
+ "_name_or_path": "dbmdz/bert-medium-historic-multilingual-cased",
56
+ "add_cross_attention": false,
57
+ "architectures": [
58
+ "BertForMaskedLM"
59
+ ],
60
+ "attention_probs_dropout_prob": 0.1,
61
+ "bad_words_ids": null,
62
+ "begin_suppress_tokens": null,
63
+ "bos_token_id": null,
64
+ "chunk_size_feed_forward": 0,
65
+ "classifier_dropout": null,
66
+ "cross_attention_hidden_size": null,
67
+ "decoder_start_token_id": null,
68
+ "diversity_penalty": 0.0,
69
+ "do_sample": false,
70
+ "early_stopping": false,
71
+ "encoder_no_repeat_ngram_size": 0,
72
+ "eos_token_id": null,
73
+ "exponential_decay_length_penalty": null,
74
+ "finetuning_task": null,
75
+ "forced_bos_token_id": null,
76
+ "forced_eos_token_id": null,
77
+ "hidden_act": "gelu",
78
+ "hidden_dropout_prob": 0.1,
79
+ "hidden_size": 512,
80
+ "id2label": {
81
+ "0": "LABEL_0",
82
+ "1": "LABEL_1"
83
+ },
84
+ "initializer_range": 0.02,
85
+ "intermediate_size": 2048,
86
+ "is_decoder": false,
87
+ "is_encoder_decoder": false,
88
+ "label2id": {
89
+ "LABEL_0": 0,
90
+ "LABEL_1": 1
91
+ },
92
+ "layer_norm_eps": 1e-12,
93
+ "length_penalty": 1.0,
94
+ "max_length": 20,
95
+ "max_position_embeddings": 512,
96
+ "min_length": 0,
97
+ "model_type": "bert",
98
+ "no_repeat_ngram_size": 0,
99
+ "num_attention_heads": 8,
100
+ "num_beam_groups": 1,
101
+ "num_beams": 1,
102
+ "num_hidden_layers": 8,
103
+ "num_return_sequences": 1,
104
+ "output_attentions": false,
105
+ "output_hidden_states": false,
106
+ "output_scores": false,
107
+ "pad_token_id": 0,
108
+ "position_embedding_type": "absolute",
109
+ "prefix": null,
110
+ "problem_type": null,
111
+ "pruned_heads": {},
112
+ "remove_invalid_values": false,
113
+ "repetition_penalty": 1.0,
114
+ "return_dict": true,
115
+ "return_dict_in_generate": false,
116
+ "sep_token_id": null,
117
+ "suppress_tokens": null,
118
+ "task_specific_params": null,
119
+ "temperature": 1.0,
120
+ "tf_legacy_loss": false,
121
+ "tie_encoder_decoder": false,
122
+ "tie_word_embeddings": true,
123
+ "tokenizer_class": null,
124
+ "top_k": 50,
125
+ "top_p": 1.0,
126
+ "torch_dtype": null,
127
+ "torchscript": false,
128
+ "type_vocab_size": 2,
129
+ "typical_p": 1.0,
130
+ "use_bfloat16": false,
131
+ "use_cache": true,
132
+ "vocab_size": 32000
133
+ },
134
+ "torch_dtype": "float32",
135
+ "transformers_version": "4.40.0.dev0",
136
+ "type_vocab_size": 2,
137
+ "use_cache": true,
138
+ "vocab_size": 32000
139
+ }
configuration_stacked.py ADDED
@@ -0,0 +1,101 @@
1
+ from transformers import PretrainedConfig
2
+ import torch
3
+
4
+
5
+ class ImpressoConfig(PretrainedConfig):
6
+ model_type = "stacked_bert"
7
+
8
+ def __init__(
9
+ self,
10
+ vocab_size=30522,
11
+ hidden_size=768,
12
+ num_hidden_layers=12,
13
+ num_attention_heads=12,
14
+ intermediate_size=3072,
15
+ hidden_act="gelu",
16
+ hidden_dropout_prob=0.1,
17
+ attention_probs_dropout_prob=0.1,
18
+ max_position_embeddings=512,
19
+ type_vocab_size=2,
20
+ initializer_range=0.02,
21
+ layer_norm_eps=1e-12,
22
+ pad_token_id=0,
23
+ position_embedding_type="absolute",
24
+ use_cache=True,
25
+ classifier_dropout=None,
26
+ pretrained_config=None,
27
+ values_override=None,
28
+ label_map=None,
29
+ **kwargs,
30
+ ):
31
+ super().__init__(pad_token_id=pad_token_id, **kwargs)
32
+
33
+ self.vocab_size = vocab_size
34
+ self.hidden_size = hidden_size
35
+ self.num_hidden_layers = num_hidden_layers
36
+ self.num_attention_heads = num_attention_heads
37
+ self.hidden_act = hidden_act
38
+ self.intermediate_size = intermediate_size
39
+ self.hidden_dropout_prob = hidden_dropout_prob
40
+ self.attention_probs_dropout_prob = attention_probs_dropout_prob
41
+ self.max_position_embeddings = max_position_embeddings
42
+ self.type_vocab_size = type_vocab_size
43
+ self.initializer_range = initializer_range
44
+ self.layer_norm_eps = layer_norm_eps
45
+ self.position_embedding_type = position_embedding_type
46
+ self.use_cache = use_cache
47
+ self.classifier_dropout = classifier_dropout
48
+ self.pretrained_config = pretrained_config
49
+ self.label_map = label_map
50
+
51
+ self.values_override = values_override or {}
52
+ self.outputs = {
53
+ "logits": {"shape": [None, None, self.hidden_size], "dtype": "float32"}
54
+ }
55
+
56
+ @classmethod
57
+ def is_torch_support_available(cls):
58
+ """
59
+ Indicate whether Torch support is available for this configuration.
60
+ Required for compatibility with certain parts of the Transformers library.
61
+ """
62
+ return True
63
+
64
+ @classmethod
65
+ def patch_ops(self):
66
+ """
67
+ A method required by some Hugging Face utilities to modify operator mappings.
68
+ Currently, it performs no operation and is included for compatibility.
69
+ Returns:
+ None. Included as a no-op stub for compatibility.
73
+ """
74
+ return None
75
+
76
+ def generate_dummy_inputs(self, tokenizer, batch_size=1, seq_length=8, framework="pt"):
77
+ """
78
+ Generate dummy inputs for testing or export.
79
+ Args:
80
+ tokenizer: The tokenizer used to tokenize inputs.
81
+ batch_size: Number of input samples in the batch.
82
+ seq_length: Length of each sequence.
83
+ framework: Framework ("pt" for PyTorch, "tf" for TensorFlow).
84
+ Returns:
85
+ Dummy inputs as a dictionary.
86
+ """
87
+ if framework == "pt":
88
+ input_ids = torch.randint(
89
+ low=0,
90
+ high=self.vocab_size,
91
+ size=(batch_size, seq_length),
92
+ dtype=torch.long
93
+ )
94
+ attention_mask = torch.ones((batch_size, seq_length), dtype=torch.long)
95
+ return {"input_ids": input_ids, "attention_mask": attention_mask}
96
+ else:
97
+ raise ValueError("Framework '{}' not supported.".format(framework))
98
+
99
+
100
+ # Register the configuration with the transformers library
101
+ ImpressoConfig.register_for_auto_class()
generic_ner.py ADDED
@@ -0,0 +1,788 @@
1
+ import logging
2
+ from transformers import Pipeline
3
+ import numpy as np
4
+ import torch
5
+ import nltk
6
+
7
+ nltk.download("averaged_perceptron_tagger")
8
+ nltk.download("averaged_perceptron_tagger_eng")
9
+ nltk.download("stopwords")
10
+ from nltk.chunk import conlltags2tree
11
+ from nltk import pos_tag
12
+ from nltk.tree import Tree
13
+ import torch.nn.functional as F
14
+ import re, string
15
+
16
+ stop_words = set(nltk.corpus.stopwords.words("english"))
17
+ DEBUG = False
18
+ punctuation = (
19
+ string.punctuation
20
+ + "«»—…“”"
21
+ + "—."
22
+ + "–"
23
+ + "’"
24
+ + "‘"
25
+ + "´"
26
+ + "•"
27
+ + "°"
28
+ + "»"
29
+ + "“"
30
+ + "”"
31
+ + "–"
32
+ + "—"
33
+ + "‘’“”„«»•–—―‣◦…§¶†‡‰′″〈〉"
34
+ )
35
+
36
+ # List of additional "strange" punctuation marks
37
+ # additional_punctuation = "‘’“”„«»•–—―‣◦…§¶†‡‰′″〈〉"
38
+
39
+
40
+ WHITESPACE_RULES = {
41
+ "fr": {
42
+ "pct_no_ws_before": [".", ",", ")", "]", "}", "°", "...", ".-", "%"],
43
+ "pct_no_ws_after": ["(", "[", "{"],
44
+ "pct_no_ws_before_after": ["'", "-"],
45
+ "pct_number": [".", ","],
46
+ },
47
+ "de": {
48
+ "pct_no_ws_before": [
49
+ ".",
50
+ ",",
51
+ ")",
52
+ "]",
53
+ "}",
54
+ "°",
55
+ "...",
56
+ "?",
57
+ "!",
58
+ ":",
59
+ ";",
60
+ ".-",
61
+ "%",
62
+ ],
63
+ "pct_no_ws_after": ["(", "[", "{"],
64
+ "pct_no_ws_before_after": ["'", "-"],
65
+ "pct_number": [".", ","],
66
+ },
67
+ "other": {
68
+ "pct_no_ws_before": [
69
+ ".",
70
+ ",",
71
+ ")",
72
+ "]",
73
+ "}",
74
+ "°",
75
+ "...",
76
+ "?",
77
+ "!",
78
+ ":",
79
+ ";",
80
+ ".-",
81
+ "%",
82
+ ],
83
+ "pct_no_ws_after": ["(", "[", "{"],
84
+ "pct_no_ws_before_after": ["'", "-"],
85
+ "pct_number": [".", ","],
86
+ },
87
+ }
88
+
89
+
90
+ # def tokenize(text: str, language: str = "other") -> list[str]:
91
+ # """Apply whitespace rules to the given text and language, separating it into tokens.
92
+ #
93
+ # Args:
94
+ # text (str): The input text to separate into a list of tokens.
95
+ # language (str): Language of the text.
96
+ #
97
+ # Returns:
98
+ # list[str]: List of tokens with punctuation as separate tokens.
99
+ # """
100
+ # # text = add_spaces_around_punctuation(text)
101
+ # if not text:
102
+ # return []
103
+ #
104
+ # if language not in WHITESPACE_RULES:
105
+ # # Default behavior for languages without specific rules:
106
+ # # tokenize using standard whitespace splitting
107
+ # language = "other"
108
+ #
109
+ # wsrules = WHITESPACE_RULES[language]
110
+ # tokenized_text = []
111
+ # current_token = ""
112
+ #
113
+ # for char in text:
114
+ # if char in wsrules["pct_no_ws_before_after"]:
115
+ # if current_token:
116
+ # tokenized_text.append(current_token)
117
+ # tokenized_text.append(char)
118
+ # current_token = ""
119
+ # elif char in wsrules["pct_no_ws_before"] or char in wsrules["pct_no_ws_after"]:
120
+ # if current_token:
121
+ # tokenized_text.append(current_token)
122
+ # tokenized_text.append(char)
123
+ # current_token = ""
124
+ # elif char.isspace():
125
+ # if current_token:
126
+ # tokenized_text.append(current_token)
127
+ # current_token = ""
128
+ # else:
129
+ # current_token += char
130
+ #
131
+ # if current_token:
132
+ # tokenized_text.append(current_token)
133
+ #
134
+ # return tokenized_text
135
+
136
+ def normalize_text(text):
137
+ # Remove spaces and tabs for the search but keep newline characters
138
+ return re.sub(r"[ \t]+", "", text)
139
+
140
+
141
+ def find_entity_indices(article_text, search_text):
142
+ # Normalize texts by removing spaces and tabs
143
+ normalized_article = normalize_text(article_text)
144
+ normalized_search = normalize_text(search_text)
145
+
146
+ # Initialize a list to hold all start and end indices
147
+ indices = []
148
+
149
+ # Find all occurrences of the search text in the normalized article text
150
+ start_index = 0
151
+ while True:
152
+ start_index = normalized_article.find(normalized_search, start_index)
153
+ if start_index == -1:
154
+ break
155
+
156
+ # Calculate the actual start and end indices in the original article text
157
+ original_chars = 0
158
+ original_start_index = 0
159
+ for i in range(start_index):
160
+ while article_text[original_start_index] in (" ", "\t"):
161
+ original_start_index += 1
162
+ if article_text[original_start_index] not in (" ", "\t", "\n"):
163
+ original_chars += 1
164
+ original_start_index += 1
165
+
166
+ original_end_index = original_start_index
167
+ search_chars = 0
168
+ while search_chars < len(normalized_search):
169
+ if article_text[original_end_index] not in (" ", "\t", "\n"):
170
+ search_chars += 1
171
+ original_end_index += 1 # Increment to include the last character
172
+
173
+ # Append the found indices to the list
174
+ if article_text[original_start_index] == " ":
175
+ original_start_index += 1
176
+ indices.append((original_start_index, original_end_index))
177
+
178
+ # Move start_index to the next position to continue searching
179
+ start_index += 1
180
+
181
+ return indices
182
+
183
+
184
+ def get_entities(tokens, tags, confidences, text):
185
+ tags = [tag.replace("S-", "B-").replace("E-", "I-") for tag in tags]
186
+ pos_tags = [pos for token, pos in pos_tag(tokens)]
187
+
188
+ for i in range(1, len(tags)):
189
+ # If a 'B-' tag is followed by another 'B-' without an 'O' in between, change the second to 'I-'
190
+ if tags[i].startswith("B-") and tags[i - 1].startswith("I-"):
191
+ tags[i] = "I-" + tags[i][2:] # Change 'B-' to 'I-' for the same entity type
192
+
193
+ conlltags = [(token, pos, tg) for token, pos, tg in zip(tokens, pos_tags, tags)]
194
+ ne_tree = conlltags2tree(conlltags)
195
+
196
+ entities = []
197
+ idx: int = 0
198
+ already_done = []
199
+ for subtree in ne_tree:
200
+ # skipping 'O' tags
201
+ if isinstance(subtree, Tree):
202
+ original_label = subtree.label()
203
+ original_string = " ".join([token for token, pos in subtree.leaves()])
204
+
205
+ for indices in find_entity_indices(text, original_string):
206
+ entity_start_position = indices[0]
207
+ entity_end_position = indices[1]
208
+ if (
209
+ "_".join(
210
+ [original_label, original_string, str(entity_start_position)]
211
+ )
212
+ in already_done
213
+ ):
214
+ continue
215
+ else:
216
+ already_done.append(
217
+ "_".join(
218
+ [
219
+ original_label,
220
+ original_string,
221
+ str(entity_start_position),
222
+ ]
223
+ )
224
+ )
225
+ if len(text[entity_start_position:entity_end_position].strip()) < len(
226
+ text[entity_start_position:entity_end_position]
227
+ ):
228
+ entity_start_position = (
229
+ entity_start_position
230
+ + len(text[entity_start_position:entity_end_position])
231
+ - len(text[entity_start_position:entity_end_position].strip())
232
+ )
233
+
234
+ entities.append(
235
+ {
236
+ "type": original_label,
237
+ "confidence_ner": round(
238
+ np.average(confidences[idx: idx + len(subtree)]), 2
239
+ ),
240
+ "index": (idx, idx + len(subtree)),
241
+ "surface": text[
242
+ entity_start_position:entity_end_position
243
+ ], # original_string,
244
+ "lOffset": entity_start_position,
245
+ "rOffset": entity_end_position,
246
+ }
247
+ )
248
+
249
+ idx += len(subtree)
250
+
251
+ # Update the current character position
252
+ # We add the length of the original string + 1 (for the space)
253
+ else:
254
+ token, pos = subtree
255
+ # If it's not a named entity, we still need to update the character
256
+ # position
257
+ idx += 1
258
+
259
+ return entities
260
+
261
+
262
+ def realign(word_ids, tokens, out_label_preds, softmax_scores, tokenizer, reverted_label_map):
263
+ preds_list, words_list, confidence_list = [], [], []
264
+
265
+ seen_word_ids = set()
266
+ for i, word_id in enumerate(word_ids):
267
+ if word_id is None or word_id in seen_word_ids:
268
+ continue # skip special tokens or repeated subwords
269
+
270
+ seen_word_ids.add(word_id)
271
+
272
+ try:
273
+ preds_list.append(reverted_label_map[out_label_preds[i]])
274
+ confidence_list.append(max(softmax_scores[i]))
275
+ except Exception:
276
+ preds_list.append("O")
277
+ confidence_list.append(0.0)
278
+
279
+ words_list.append(tokens[word_id]) # original word list index
280
+
281
+ return words_list, preds_list, confidence_list
282
+
283
+
284
+ def add_spaces_around_punctuation(text):
285
+ # Add a space before and after all punctuation
286
+ all_punctuation = string.punctuation + punctuation
287
+ return re.sub(r"([{}])".format(re.escape(all_punctuation)), r" \1 ", text)
288
+
289
+
290
+ def attach_comp_to_closest(entities):
291
+ # Define valid entity types that can receive a "comp.function" or "comp.name" attachment
292
+ valid_entity_types = {"org", "pers", "org.ent", "pers.ind"}
293
+
294
+ # Separate "comp.function" and "comp.name" entities from other entities
295
+ comp_entities = [ent for ent in entities if ent["type"].startswith("comp")]
296
+ other_entities = [ent for ent in entities if not ent["type"].startswith("comp")]
297
+
298
+ for comp_entity in comp_entities:
299
+ closest_entity = None
300
+ min_distance = float("inf")
301
+
302
+ # Find the closest non-"comp" entity that is valid for attaching
303
+ for other_entity in other_entities:
304
+ # Calculate distance between the comp entity and the other entity
305
+ if comp_entity["lOffset"] > other_entity["rOffset"]:
306
+ distance = comp_entity["lOffset"] - other_entity["rOffset"]
307
+ elif comp_entity["rOffset"] < other_entity["lOffset"]:
308
+ distance = other_entity["lOffset"] - comp_entity["rOffset"]
309
+ else:
310
+ distance = 0 # They overlap or touch
311
+
312
+ # Ensure the entity type is valid and check for minimal distance
313
+ if (
314
+ distance < min_distance
315
+ and other_entity["type"].split(".")[0] in valid_entity_types
316
+ ):
317
+ min_distance = distance
318
+ closest_entity = other_entity
319
+
320
+ # Attach the "comp.function" or "comp.name" if a valid entity is found
321
+ if closest_entity:
322
+ suffix = comp_entity["type"].split(".")[
323
+ -1
324
+ ] # Extract the suffix (e.g., 'name', 'function')
325
+ closest_entity[suffix] = comp_entity["surface"] # Attach the text
326
+
327
+ return other_entities
328
+
329
+
330
+ def conflicting_context(comp_entity, target_entity):
331
+ """
332
+ Determines if there is a conflict between the comp_entity and the target entity.
333
+ Prevents incorrect name and function attachments by using a rule-based approach.
334
+ """
335
+ # Case 1: Check for correct function attachment to person or organization entities
336
+ if comp_entity["type"].startswith("comp.function"):
337
+ if not ("pers" in target_entity["type"] or "org" in target_entity["type"]):
338
+ return True # Conflict: Function should only attach to persons or organizations
339
+
340
+ # Case 2: Avoid attaching comp.* entities to non-person, non-organization types (like locations)
341
+ if "loc" in target_entity["type"]:
342
+ return True # Conflict: comp.* entities should not attach to locations or similar types
343
+
344
+ return False # No conflict
345
+
346
+
347
+ def extract_name_from_text(text, partial_name):
348
+ """
349
+ Extracts the full name from the entity's text based on the partial name.
350
+ This function assumes that the full name starts with capitalized letters and does not
351
+ include any words that come after the partial name.
352
+ """
353
+ # Split the text and partial name into words
354
+ words = text.split()
355
+ partial_words = partial_name.split()
356
+
357
+ if DEBUG:
358
+ print("text:", text)
359
+ if DEBUG:
360
+ print("partial_name:", partial_name)
361
+
362
+ # Find the position of the partial name in the word list
363
+ for i, word in enumerate(words):
364
+ if DEBUG:
365
+ print(words, "---", words[i: i + len(partial_words)])
366
+ if words[i: i + len(partial_words)] == partial_words:
367
+ # Initialize full name with the partial name
368
+ full_name = partial_words[:]
369
+
370
+ if DEBUG:
371
+ print("full_name:", full_name)
372
+
373
+ # Check previous words and only add capitalized words (skip lowercase words)
374
+ j = i - 1
375
+ while j >= 0 and words[j][0].isupper():
376
+ full_name.insert(0, words[j])
377
+ j -= 1
378
+ if DEBUG:
379
+ print("full_name:", full_name)
380
+
381
+ # Return only the full name up to the partial name (ignore words after the name)
382
+ return " ".join(full_name).strip() # Join the words to form the full name
383
+
384
+ # If not found, return the original text (as a fallback)
385
+ return text.strip()
386
+
387
+
388
+ def repair_names_in_entities(entities):
389
+ """
390
+ This function repairs the names in the entities by extracting the full name
391
+ from the text of the entity if a partial name (e.g., 'Washington') is incorrectly attached.
392
+ """
393
+ for entity in entities:
394
+ if "name" in entity and "pers" in entity["type"]:
395
+ name = entity["name"]
396
+ text = entity["surface"]
397
+
398
+ # Check if the attached name is part of the entity's text
399
+ if name in text:
400
+ # Extract the full name from the text by splitting around the attached name
401
+ full_name = extract_name_from_text(entity["surface"], name)
402
+ entity["name"] = (
403
+ full_name # Replace the partial name with the full name
404
+ )
405
+ # if "name" not in entity:
406
+ # entity["name"] = entity["surface"]
407
+
408
+ return entities
409
+
410
+
411
+ def clean_coarse_entities(entities):
412
+ """
413
+ This function removes entities that are not useful for the NEL process.
414
+ """
415
+ # Define a set of entity types that are considered useful for NEL
416
+ useful_types = {
417
+ "pers", # Person
418
+ "loc", # Location
419
+ "org", # Organization
420
+ "date", # Product
421
+ "time", # Time
422
+ }
423
+
424
+ # Filter out entities that are not in the useful_types set unless they are comp.* entities
425
+ cleaned_entities = [
426
+ entity
427
+ for entity in entities
428
+ if entity["type"] in useful_types or "comp" in entity["type"]
429
+ ]
430
+
431
+ return cleaned_entities
432
+
433
+
434
+ def postprocess_entities(entities):
435
+ # Step 1: Filter entities with the same text, keeping the one with the most dots in the 'entity' field
436
+ entity_map = {}
437
+
438
+ # Loop over the entities and prioritize the one with the most dots
439
+ for entity in entities:
440
+ entity_text = entity["surface"]
441
+ num_dots = entity["type"].count(".")
442
+
443
+ # If the entity text is new, or this entity has more dots, update the map
444
+ if (
445
+ entity_text not in entity_map
446
+ or entity_map[entity_text]["type"].count(".") < num_dots
447
+ ):
448
+ entity_map[entity_text] = entity
449
+
450
+ # Collect the filtered entities from the map
451
+ filtered_entities = list(entity_map.values())
452
+
453
+ # Step 2: Attach "comp.function" entities to the closest other entities
454
+ filtered_entities = attach_comp_to_closest(filtered_entities)
455
+ if DEBUG:
456
+ print("After attach_comp_to_closest:", filtered_entities, "\n")
457
+ filtered_entities = repair_names_in_entities(filtered_entities)
458
+ if DEBUG:
459
+ print("After repair_names_in_entities:", filtered_entities, "\n")
460
+
461
+ # Step 3: Remove entities that are not useful for NEL
462
+ # filtered_entities = clean_coarse_entities(filtered_entities)
463
+
464
+ # filtered_entities = remove_blacklisted_entities(filtered_entities)
465
+
466
+ return filtered_entities
467
+
468
+
469
+ def remove_included_entities(entities):
470
+ # Loop through entities and remove those whose text is included in another with the same label
471
+ final_entities = []
472
+ for i, entity in enumerate(entities):
473
+ is_included = False
474
+ for other_entity in entities:
475
+ if entity["surface"] != other_entity["surface"]:
476
+ if "comp" in other_entity["type"]:
477
+ # Check if entity's text is a substring of another entity's text
478
+ if entity["surface"] in other_entity["surface"]:
479
+ is_included = True
480
+ break
481
+ elif (
482
+ entity["type"].split(".")[0] in other_entity["type"].split(".")[0]
483
+ or other_entity["type"].split(".")[0]
484
+ in entity["type"].split(".")[0]
485
+ ):
486
+ if entity["surface"] in other_entity["surface"]:
487
+ is_included = True
488
+ if not is_included:
489
+ final_entities.append(entity)
490
+ return final_entities
491
+
492
+
493
+ def refine_entities_with_coarse(all_entities, coarse_entities):
494
+ """
495
+ Looks through all entities and refines them based on the coarse entities.
496
+ If a surface match is found in the coarse entities and the types match,
497
+ the entity's confidence_ner and type are updated based on the coarse entity.
498
+ """
499
+ # Create a dictionary for coarse entities based on surface and type for quick lookup
500
+ coarse_lookup = {}
501
+ for coarse_entity in coarse_entities:
502
+ key = (coarse_entity["surface"], coarse_entity["type"].split(".")[0])
503
+ coarse_lookup[key] = coarse_entity
504
+
505
+ # Iterate through all entities and compare with the coarse entities
506
+ for entity in all_entities:
507
+ key = (
508
+ entity["surface"],
509
+ entity["type"].split(".")[0],
510
+ ) # Use the coarse type for comparison
511
+
512
+ if key in coarse_lookup:
513
+ coarse_entity = coarse_lookup[key]
514
+ # If a match is found, update the confidence_ner and type in the entity
515
+ if entity["confidence_ner"] < coarse_entity["confidence_ner"]:
516
+ entity["confidence_ner"] = coarse_entity["confidence_ner"]
517
+ entity["type"] = coarse_entity[
518
+ "type"
519
+ ] # Update the type if the confidence is higher
520
+
521
+ # No need to append to refined_entities, we're modifying in place
522
+ for entity in all_entities:
523
+ entity["type"] = entity["type"].split(".")[0]
524
+ return all_entities
525
+
526
+
527
+ def remove_trailing_stopwords(entities):
528
+ """
529
+ This function removes stopwords and punctuation from both the beginning and end of each entity's text
530
+ and repairs the lOffset and rOffset accordingly.
531
+ """
532
+ if DEBUG:
533
+ print(f"Initial entities in remove_trailing_stopwords: {len(entities)}")
534
+ new_entities = []
535
+ for entity in entities:
536
+ if "comp" not in entity["type"]:
537
+ entity_text = entity["surface"]
538
+ original_len = len(entity_text)
539
+
540
+ # Initial offsets
541
+ lOffset = entity.get("lOffset", 0)
542
+ rOffset = entity.get("rOffset", original_len)
543
+
544
+ # Remove stopwords and punctuation from the beginning
545
+ # print('----', entity_text)
546
+ if len(entity_text.split()) < 1:
547
+ continue
548
+ while entity_text and (
549
+ entity_text.split()[0].lower() in stop_words
550
+ or entity_text[0] in punctuation
551
+ ):
552
+ if entity_text.split()[0].lower() in stop_words:
553
+ stopword_len = (
554
+ len(entity_text.split()[0]) + 1
555
+ ) # Adjust length for stopword and following space
556
+ entity_text = entity_text[stopword_len:] # Remove leading stopword
557
+ lOffset += stopword_len # Adjust the left offset
558
+ if DEBUG:
559
+ print(
560
+ f"Removed leading stopword from entity: {entity['surface']} --> {entity_text} ({entity['type']}"
561
+ )
562
+ elif entity_text[0] in punctuation:
563
+ entity_text = entity_text[1:] # Remove leading punctuation
564
+ lOffset += 1 # Adjust the left offset
565
+ if DEBUG:
566
+ print(
567
+ f"Removed leading punctuation from entity: {entity['surface']} --> {entity_text} ({entity['type']}"
568
+ )
569
+
570
+ # Remove stopwords and punctuation from the end
571
+ if len(entity_text.strip()) > 1:
572
+ while (
573
+ entity_text.strip().split()
574
+ and (
575
+ entity_text.strip().split()[-1].lower() in stop_words
576
+ or entity_text[-1] in punctuation
577
+ )
578
+ ):
579
+ if entity_text.strip().split() and entity_text.strip().split()[-1].lower() in stop_words:
580
+ stopword_len = len(entity_text.strip().split()[-1]) + 1 # account for space
581
+ entity_text = entity_text[:-stopword_len]
582
+ rOffset -= stopword_len
583
+ if DEBUG:
584
+ print(
585
+ f"Removed trailing stopword from entity: {entity['surface']} --> {entity_text} ({entity['type']})"
586
+ )
587
+ if entity_text and entity_text[-1] in punctuation:
588
+ entity_text = entity_text[:-1]
589
+ rOffset -= 1
590
+ if DEBUG:
591
+ print(
592
+ f"Removed trailing punctuation from entity: {entity['surface']} --> {entity_text} ({entity['type']})"
593
+ )
594
+
595
+ # Skip certain entities based on rules
596
+ if entity_text in string.punctuation:
597
+ if DEBUG:
598
+ print(f"Skipping entity: {entity_text}")
599
+ # entities.remove(entity)
600
+ continue
601
+ # check now if its in stopwords
602
+ if entity_text.lower() in stop_words:
603
+ if DEBUG:
604
+ print(f"Skipping entity: {entity_text}")
605
+ # entities.remove(entity)
606
+ continue
607
+ # check now if the entire entity is a list of stopwords:
608
+ if all([word.lower() in stop_words for word in entity_text.split()]):
609
+ if DEBUG:
610
+ print(f"Skipping entity: {entity_text}")
611
+ # entities.remove(entity)
612
+ continue
613
+ # Check if the entire entity is made up of stopwords characters
614
+ if all(
615
+ [char.lower() in stop_words for char in entity_text if char.isalpha()]
616
+ ):
617
+ if DEBUG:
618
+ print(
619
+ f"Skipping entity: {entity_text} (all characters are stopwords)"
620
+ )
621
+ # entities.remove(entity)
622
+ continue
623
+ # check now if all entity is in a list of punctuation
624
+ if all([word in string.punctuation for word in entity_text.split()]):
625
+ if DEBUG:
626
+ print(
627
+ f"Skipping entity: {entity_text} (all characters are punctuation)"
628
+ )
629
+ # entities.remove(entity)
630
+ continue
631
+ if all(
632
+ [
633
+ char.lower() in string.punctuation
634
+ for char in entity_text
635
+ if char.isalpha()
636
+ ]
637
+ ):
638
+ if DEBUG:
639
+ print(
640
+ f"Skipping entity: {entity_text} (all characters are punctuation)"
641
+ )
642
+ # entities.remove(entity)
643
+ continue
644
+
645
+ # if it's a number and "time" no in it, then continue
646
+ if entity_text.isdigit() and "time" not in entity["type"]:
647
+ if DEBUG:
648
+ print(f"Skipping entity: {entity_text}")
649
+ # entities.remove(entity)
650
+ continue
651
+
652
+ if entity_text.startswith(" "):
653
+ entity_text = entity_text[1:]
654
+ # update lOffset, rOffset
655
+ lOffset += 1
656
+ if entity_text.endswith(" "):
657
+ entity_text = entity_text[:-1]
658
+ # update lOffset, rOffset
659
+ rOffset -= 1
660
+
661
+ # Update the entity surface and offsets
662
+ entity["surface"] = entity_text
663
+ entity["lOffset"] = lOffset
664
+ entity["rOffset"] = rOffset
665
+
666
+ # Remove the entity if the surface is empty after cleaning
667
+ if len(entity["surface"].strip()) == 0:
668
+ if DEBUG:
669
+ print(f"Deleted entity: {entity['surface']}")
670
+ # entities.remove(entity)
671
+ else:
672
+ new_entities.append(entity)
673
+ else:
674
+ new_entities.append(entity)
675
+ if DEBUG:
676
+ print(f"Remained entities in remove_trailing_stopwords: {len(new_entities)}")
677
+ return new_entities
678
+
679
+
680
+ class ExtendedMultitaskTimeModelForTokenClassificationPipeline(Pipeline):
681
+
682
+ def _sanitize_parameters(self, **kwargs):
683
+ preprocess_kwargs = {}
684
+ if "text" in kwargs:
685
+ preprocess_kwargs["text"] = kwargs["text"]
686
+ if "tokens" in kwargs:
687
+ preprocess_kwargs["tokens"] = kwargs["tokens"]
688
+ self.label_map = self.model.config.label_map
689
+ self.id2label = {
690
+ task: {id_: label for label, id_ in labels.items()}
691
+ for task, labels in self.label_map.items()
692
+ }
693
+ return preprocess_kwargs, {}, {}
694
+
695
+ def preprocess(self, text, **kwargs):
696
+
697
+ tokens = kwargs["tokens"]
698
+ tokenized_inputs = self.tokenizer(
699
+ tokens, # a list of strings
700
+ is_split_into_words=True,
701
+ padding="max_length",
702
+ truncation=True,
703
+ max_length=512,
704
+ )
705
+ word_ids = tokenized_inputs.word_ids()
706
+
707
+ return tokenized_inputs, word_ids, text, tokens
708
+
709
+ def _forward(self, inputs):
710
+ inputs, word_ids, text, tokens = inputs
711
+
712
+ input_ids = torch.tensor([inputs["input_ids"]], dtype=torch.long).to(
713
+ self.model.device
714
+ )
715
+ attention_mask = torch.tensor([inputs["attention_mask"]], dtype=torch.long).to(
716
+ self.model.device
717
+ )
718
+ with torch.no_grad():
719
+ outputs = self.model(input_ids, attention_mask)
720
+ return outputs, word_ids, text, tokens
721
+
722
+ def is_within(self, entity1, entity2):
723
+ """Check if entity1 is fully within the bounds of entity2."""
724
+ return (
725
+ entity1["lOffset"] >= entity2["lOffset"]
726
+ and entity1["rOffset"] <= entity2["rOffset"]
727
+ )
728
+
729
+ def postprocess(self, outputs, **kwargs):
730
+ """
731
+ Postprocess the outputs of the model
732
+ :param outputs:
733
+ :param kwargs:
734
+ :return:
735
+ """
736
+ tokens_result, word_ids, text, tokens = outputs
737
+
738
+ predictions = {}
739
+ confidence_scores = {}
740
+ for task, logits in tokens_result.logits.items():
741
+ predictions[task] = torch.argmax(logits, dim=-1).tolist()[0]
742
+ confidence_scores[task] = F.softmax(logits, dim=-1).tolist()[0]
743
+
744
+ entities = {}
745
+ for task in predictions.keys():
746
+ words_list, preds_list, confidence_list = realign(
747
+ word_ids,
748
+ tokens,
749
+ predictions[task],
750
+ confidence_scores[task],
751
+ self.tokenizer,
752
+ self.id2label[task],
753
+ )
754
+
755
+ entities[task] = get_entities(words_list, preds_list, confidence_list, text)
756
+
757
+ # add titles to comp entities
758
+ # from pprint import pprint
759
+
760
+ # print("Before:")
761
+ # pprint(entities)
762
+
763
+ all_entities = []
764
+ coarse_entities = []
765
+ for key in entities:
766
+ if key in ["NE-COARSE-LIT"]:
767
+ coarse_entities = entities[key]
768
+ all_entities.extend(entities[key])
769
+
770
+ if DEBUG:
771
+ print(all_entities)
772
+ # print("After remove_included_entities:")
773
+ all_entities = remove_included_entities(all_entities)
774
+ if DEBUG:
775
+ print("After remove_included_entities:", all_entities)
776
+ all_entities = remove_trailing_stopwords(all_entities)
777
+ if DEBUG:
778
+ print("After remove_trailing_stopwords:", all_entities)
779
+ all_entities = postprocess_entities(all_entities)
780
+ if DEBUG:
781
+ print("After postprocess_entities:", all_entities)
782
+ all_entities = refine_entities_with_coarse(all_entities, coarse_entities)
783
+ if DEBUG:
784
+ print("After refine_entities_with_coarse:", all_entities)
785
+ # print("After attach_comp_to_closest:")
786
+ # pprint(all_entities)
787
+ # print("\n")
788
+ return all_entities
label_map.json ADDED
@@ -0,0 +1 @@
1
+ {"NE-COARSE-LIT": {"I-pers": 0, "I-prod": 1, "B-prod": 2, "B-loc": 3, "I-time": 4, "B-pers": 5, "B-org": 6, "B-time": 7, "I-loc": 8, "O": 9, "I-org": 10}, "NE-FINE-COMP": {"I-comp.title": 0, "B-comp.title": 1, "I-comp.function": 2, "I-comp.name": 3, "B-comp.function": 4, "O": 5, "B-comp.name": 6}}
modeling_stacked.py ADDED
@@ -0,0 +1,245 @@
1
+ from transformers.modeling_outputs import TokenClassifierOutput
2
+ import torch
3
+ import torch.nn as nn
4
+ from transformers import PreTrainedModel, AutoModel, AutoConfig, BertConfig
5
+ from torch.nn import CrossEntropyLoss
6
+ from typing import Optional, Tuple, Union
7
+ import logging, json, os
8
+
9
+ from .configuration_stacked import ImpressoConfig
10
+
11
+ logger = logging.getLogger(__name__)
12
+
13
+
14
+ def get_info(label_map):
15
+ num_token_labels_dict = {task: len(labels) for task, labels in label_map.items()}
16
+ return num_token_labels_dict
17
+
18
+
19
+ class ExtendedMultitaskTimeModelForTokenClassification(PreTrainedModel):
20
+ config_class = ImpressoConfig
21
+ _keys_to_ignore_on_load_missing = [r"position_ids"]
22
+
23
+ def __init__(self, config, temporal_fusion_strategy="baseline", num_years=327):
24
+ super().__init__(config)
25
+ self.num_token_labels_dict = get_info(config.label_map)
26
+ self.config = config
27
+ self.temporal_fusion_strategy = temporal_fusion_strategy
28
+ self.model = AutoModel.from_pretrained(
29
+ config.pretrained_config["_name_or_path"], config=config.pretrained_config
30
+ )
31
+ self.model.config.use_cache = False
32
+ self.model.config.pretraining_tp = 1
33
+ self.num_years = num_years
34
+
35
+ classifier_dropout = getattr(config, "classifier_dropout", 0.1) or config.hidden_dropout_prob
36
+ self.dropout = nn.Dropout(classifier_dropout)
37
+
38
+ self.temporal_fusion = TemporalFusion(config.hidden_size, strategy=self.temporal_fusion_strategy,
39
+ num_years=num_years)
40
+
41
+ # Additional transformer layers
42
+ self.transformer_encoder = nn.TransformerEncoder(
43
+ nn.TransformerEncoderLayer(
44
+ d_model=config.hidden_size, nhead=config.num_attention_heads
45
+ ),
46
+ num_layers=2,
47
+ )
48
+ self.token_classifiers = nn.ModuleDict({
49
+ task: nn.Linear(config.hidden_size, num_labels)
50
+ for task, num_labels in self.num_token_labels_dict.items()
51
+ })
52
+
53
+ self.post_init()
54
+
55
+ def forward(
56
+ self,
57
+ input_ids: Optional[torch.Tensor] = None,
58
+ attention_mask: Optional[torch.Tensor] = None,
59
+ token_type_ids: Optional[torch.Tensor] = None,
60
+ position_ids: Optional[torch.Tensor] = None,
61
+ head_mask: Optional[torch.Tensor] = None,
62
+ labels: Optional[torch.Tensor] = None,
63
+ inputs_embeds: Optional[torch.Tensor] = None,
64
+ token_labels: Optional[dict] = None,
65
+ date_indices: Optional[torch.Tensor] = None,
66
+ year_index: Optional[torch.Tensor] = None,
67
+ decade_index: Optional[torch.Tensor] = None,
68
+ century_index: Optional[torch.Tensor] = None,
69
+ output_attentions: Optional[bool] = None,
70
+ output_hidden_states: Optional[bool] = None,
71
+ return_dict: Optional[bool] = None,
72
+ ) -> Union[Tuple[torch.Tensor], TokenClassifierOutput]:
73
+
74
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
75
+
76
+ if inputs_embeds is None:
77
+ inputs_embeds = self.model.embeddings(input_ids)
78
+
79
+ # Early cross-attention fusion
80
+ if self.temporal_fusion_strategy == "early-cross-attention":
81
+ year_emb = self.temporal_fusion.compute_time_embedding(year_index) # (B, H)
82
+ inputs_embeds = self.temporal_fusion.cross_attn(inputs_embeds, year_emb)
83
+
84
+ bert_kwargs = {
85
+ "inputs_embeds": inputs_embeds if self.temporal_fusion_strategy == "early-cross-attention" else None,
86
+ "input_ids": input_ids if self.temporal_fusion_strategy != "early-cross-attention" else None,
87
+ "attention_mask": attention_mask,
88
+ "token_type_ids": token_type_ids,
89
+ "position_ids": position_ids,
90
+ "head_mask": head_mask,
91
+ "output_attentions": output_attentions,
92
+ "output_hidden_states": output_hidden_states,
93
+ "return_dict": return_dict,
94
+ }
95
+
96
+ if any(keyword in self.config.name_or_path.lower() for keyword in ["llama", "deberta"]):
97
+ bert_kwargs.pop("token_type_ids", None)
98
+ bert_kwargs.pop("head_mask", None)
99
+
100
+ outputs = self.model(**bert_kwargs)
101
+ token_output = self.dropout(outputs[0]) # (B, T, H)
102
+ hidden_states = list(outputs.hidden_states) if output_hidden_states else None
103
+
104
+ # Pass through additional transformer layers
105
+ token_output = self.transformer_encoder(token_output.transpose(0, 1)).transpose(
106
+ 0, 1
107
+ )
108
+ # Apply fusion after transformer if needed
109
+ if self.temporal_fusion_strategy not in ["baseline", "early-cross-attention"]:
110
+ token_output = self.temporal_fusion(token_output, year_index)
111
+ if output_hidden_states:
112
+ hidden_states.append(token_output) # add the final fused state
113
+
114
+ task_logits = {}
115
+ total_loss = 0
116
+ for task, classifier in self.token_classifiers.items():
117
+ logits = classifier(token_output)
118
+ task_logits[task] = logits
119
+ if token_labels and task in token_labels:
120
+ loss_fct = CrossEntropyLoss()
121
+ loss = loss_fct(
122
+ logits.view(-1, self.num_token_labels_dict[task]),
123
+ token_labels[task].view(-1),
124
+ )
125
+ total_loss += loss
126
+
127
+ if not return_dict:
128
+ output = (task_logits,) + outputs[2:]
129
+ return ((total_loss,) + output) if total_loss != 0 else output
130
+
131
+ return TokenClassifierOutput(
132
+ loss=total_loss,
133
+ logits=task_logits,
134
+ hidden_states=tuple(hidden_states) if hidden_states is not None else None,
135
+ attentions=outputs.attentions if output_attentions else None,
136
+ )
137
+
138
+
139
+ class TemporalFusion(nn.Module):
140
+ def __init__(self, hidden_size, strategy="add", num_years=327, min_year=1700):
141
+ super().__init__()
142
+ self.strategy = strategy
143
+ self.hidden_size = hidden_size
144
+ self.min_year = min_year
145
+ self.max_year = min_year + num_years - 1
146
+
147
+ self.year_emb = nn.Embedding(num_years, hidden_size)
148
+
149
+ if strategy == "concat":
150
+ self.concat_proj = nn.Linear(hidden_size * 2, hidden_size)
151
+ elif strategy == "film":
152
+ self.film_gamma = nn.Linear(hidden_size, hidden_size)
153
+ self.film_beta = nn.Linear(hidden_size, hidden_size)
154
+ elif strategy == "adapter":
155
+ self.adapter = nn.Sequential(
156
+ nn.Linear(hidden_size, hidden_size),
157
+ nn.ReLU(),
158
+ nn.Linear(hidden_size, hidden_size),
159
+ )
160
+ elif strategy == "relative":
161
+ self.relative_encoder = nn.Sequential(
162
+ nn.Linear(hidden_size, hidden_size),
163
+ nn.SiLU(),
164
+ nn.LayerNorm(hidden_size),
165
+ )
166
+ self.film_gamma = nn.Linear(hidden_size, hidden_size)
167
+ self.film_beta = nn.Linear(hidden_size, hidden_size)
168
+ elif strategy == "multiscale":
169
+ self.decade_emb = nn.Embedding(1000, hidden_size)
170
+ self.century_emb = nn.Embedding(100, hidden_size)
171
+ elif strategy in ["early-cross-attention", "late-cross-attention"]:
172
+ self.year_encoder = nn.Sequential(
173
+ nn.Linear(hidden_size, hidden_size),
174
+ nn.SiLU()
175
+ )
176
+ self.cross_attn = TemporalCrossAttention(hidden_size)
177
+
178
+ def compute_time_embedding(self, year_index):
179
+ if self.strategy in ["early-cross-attention", "late-cross-attention"]:
180
+ return self.year_encoder(self.year_emb(year_index))
181
+ elif self.strategy == "multiscale":
182
+ year_index = year_index.long()
183
+ year = year_index + self.min_year
184
+ decade = (year // 10).long()
185
+ century = (year // 100).long()
186
+ return (
187
+ self.year_emb(year_index) +
188
+ self.decade_emb(decade) +
189
+ self.century_emb(century)
190
+ )
191
+ else:
192
+ return self.year_emb(year_index)
193
+
194
+ def forward(self, token_output, year_index):
195
+ B, T, H = token_output.size()
196
+
197
+ if self.strategy == "baseline":
198
+ return token_output
199
+
200
+ year_emb = self.compute_time_embedding(year_index)
201
+
202
+ if self.strategy == "concat":
203
+ expanded_year = year_emb.unsqueeze(1).repeat(1, T, 1)
204
+ fused = torch.cat([token_output, expanded_year], dim=-1)
205
+ return self.concat_proj(fused)
206
+
207
+ elif self.strategy == "film":
208
+ gamma = self.film_gamma(year_emb).unsqueeze(1)
209
+ beta = self.film_beta(year_emb).unsqueeze(1)
210
+ return gamma * token_output + beta
211
+
212
+ elif self.strategy == "adapter":
213
+ return token_output + self.adapter(year_emb).unsqueeze(1)
214
+
215
+ elif self.strategy == "add":
216
+ expanded_year = year_emb.unsqueeze(1).repeat(1, T, 1)
217
+ return token_output + expanded_year
218
+
219
+ elif self.strategy == "relative":
220
+ encoded = self.relative_encoder(year_emb)
221
+ gamma = self.film_gamma(encoded).unsqueeze(1)
222
+ beta = self.film_beta(encoded).unsqueeze(1)
223
+ return gamma * token_output + beta
224
+
225
+ elif self.strategy == "multiscale":
226
+ expanded_year = year_emb.unsqueeze(1).expand(-1, T, -1)
227
+ return token_output + expanded_year
228
+
229
+ elif self.strategy == "late-cross-attention":
230
+ return self.cross_attn(token_output, year_emb)
231
+
232
+ else:
233
+ raise ValueError(f"Unknown fusion strategy: {self.strategy}")
234
+
235
+
236
+ class TemporalCrossAttention(nn.Module):
237
+ def __init__(self, hidden_size, num_heads=4):
238
+ super().__init__()
239
+ self.attn = nn.MultiheadAttention(embed_dim=hidden_size, num_heads=num_heads, batch_first=True)
240
+
241
+ def forward(self, token_output, time_embedding):
242
+ # token_output: (B, T, H), time_embedding: (B, H)
243
+ time_as_seq = time_embedding.unsqueeze(1) # (B, 1, H)
244
+ attn_output, _ = self.attn(token_output, time_as_seq, time_as_seq)
245
+ return token_output + attn_output
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4693904ab70f0ef7c0249db6c23b0b1e3b2760629d8dafa9010f5ec9feb7de39
3
+ size 168604214
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}
test.py ADDED
@@ -0,0 +1,46 @@
+# Import necessary modules from the transformers library
+from transformers import pipeline
+from transformers import AutoModelForTokenClassification, AutoTokenizer
+
+# Define the model name to be used for token classification; we use the Impresso NER model,
+# which can be found at "https://huggingface.co/impresso-project/ner-stacked-bert-multilingual"
+MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual"
+
+# Load the tokenizer corresponding to the specified model name
+ner_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
+
+ner_pipeline = pipeline(
+    "generic-ner",
+    model=MODEL_NAME,
+    tokenizer=ner_tokenizer,
+    trust_remote_code=True,
+    device="cpu",
+)
+sentences = [
+    """In the year 1789, King Louis XVI, ruler of France, convened the Estates-General at the Palace of Versailles,
+    where Marie Antoinette, the Queen of France, alongside Maximilien Robespierre, a leading member of the National Assembly,
+    debated with Jean-Jacques Rousseau, the famous philosopher, and Charles de Talleyrand, the Bishop of Autun,
+    regarding the future of the French monarchy. At the same time, across the Atlantic in Philadelphia,
+    George Washington, the first President of the United States, and Thomas Jefferson, the nation's Secretary of State,
+    were drafting policies for the newly established American government following the signing of the Constitution."""
+]
+
+print(sentences[0])
+
+
+# Helper function to print entities one per row
+def print_nicely(entities):
+    for entity in entities:
+        print(
+            f"Entity: {entity['entity']} | Confidence: {entity['score']:.2f}% | Text: {entity['word'].strip()} | Start: {entity['start']} | End: {entity['end']}"
+        )
+
+
+# Visualize stacked entities for each sentence
+for sentence in sentences:
+    results = ner_pipeline(sentence)
+
+    # Iterate over the coarse- and fine-grained result sets
+    for key in results.keys():
+        # Print the entities for this level of granularity
+        print_nicely(results[key])
test_ner.py ADDED
@@ -0,0 +1,106 @@
+from transformers import pipeline, AutoTokenizer
+import bz2, json
+from pprint import pprint
+
+MODEL_NAME = "impresso-project/ner-stacked-bert-multilingual-light"
+
+# Load the tokenizer and model using the pipeline
+ner_tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
+
+ner_pipeline = pipeline(
+    "generic-ner",
+    model=MODEL_NAME,
+    tokenizer=ner_tokenizer,
+    trust_remote_code=True,
+    device="cpu",
+)
+
+def process_archive(lingproc_path):
+    """
+    Processes a linguistic-processing archive to extract each document's full text and sentence offsets.
+
+    Args:
+        lingproc_path (str): Path to the linguistic-processing .jsonl.bz2 archive.
+
+    Returns:
+        List of tuples: (doc_id, full_text, sentences), where each sentence is a dict
+        with "start"/"end" character offsets and its list of tokens.
+    """
+    results = []
+
+    with bz2.open(lingproc_path, mode='rt', encoding='utf-8') as f:
+        for line in f:
+            data = json.loads(line)
+            doc_id = data.get("id")
+
+            # Reconstruct the full text from all tokens using their offsets
+            offset_token_map = {}
+            for sent in data.get("sents", []):
+                for token in sent.get("tok", []):
+                    offset = token["o"]
+                    text = token["t"]
+                    offset_token_map[offset] = text
+
+            # Rebuild full text from sorted offsets
+            full_text_parts = []
+            sorted_offsets = sorted(offset_token_map.keys())
+            last_end = 0
+            for offset in sorted_offsets:
+                token = offset_token_map[offset]
+                if offset > last_end:
+                    full_text_parts.append(" " * (offset - last_end))
+                full_text_parts.append(token)
+                last_end = offset + len(token)
+            full_text = "".join(full_text_parts).strip()
+
+            # assert new_full_text == full_text, f"Full text mismatch for doc_id {doc_id}. Expected: {full_text}, Got: {new_full_text}"
+
+            # Collect sentence boundaries and per-token offsets/lengths
+            sentences = []
+            for sent in data.get("sents", []):
+                tokens = sent.get("tok", [])
+                if not tokens:
+                    continue
+                start = tokens[0]["o"]
+                end = tokens[-1]["o"] + len(tokens[-1]["t"])
+                newtokens = [{"t": token["t"], "o": token["o"], "l": len(token["t"])} for token in tokens]
+                sentences.append({"start": start, "end": end, "tokens": newtokens})
+            results.append((doc_id, full_text, sentences))
+
+    return results
+
+processed_cis = process_archive("../../data/lematin-1885.jsonl.bz2")
+
+for ci in processed_cis:
+    doc_id, full_text, offsets = ci
+    print(f"Document ID: {doc_id}")
+    # print(f"Full Text: {full_text}")
+    # print("Sentences:")
+    for sentence in offsets:
+        start = sentence["start"]
+        end = sentence["end"]
+        tokens = sentence["tokens"]
+        sentence_text = full_text[start:end]
+        tokens_texts = [full_text[token["o"]:token["o"] + len(token["t"])] for token in tokens]
+        # print(sentence_text)
+
+        entities = ner_pipeline(sentence_text, tokens=tokens_texts)
+
+        for entity in entities:
+            # Map sentence-relative entity offsets back to document-level character offsets
+            abs_start = sentence["start"] + entity["lOffset"]
+            abs_end = sentence["start"] + entity["rOffset"]
+            entity_text = full_text[abs_start:abs_end]
+            entity_surface = entity["surface"]
+            assert entity_text == entity_surface, f"Entity text mismatch: {entity_text} != {entity_surface}"
+            print(f"{doc_id}: {entity_text} -- surface: {entity_surface} -- {entity['type']} -- {abs_start} - {abs_end}")
+            # pprint(entities)
+
+        # print(f"  Sentence: {sentence_text} (Start: {start}, End: {end})")
+        # for token in tokens:
+        #     token_text = token["t"]
+        #     token_offset = token["o"]
+        #     token_label = token["l"]
+        #     print(f"  Token: {token_text} (Offset: {token_offset}, Label: {token_label})")
+
+
+        # entities = ner_pipeline(sentence)
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,59 @@
+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": false,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "max_len": 512,
+  "model_max_length": 512,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": false,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
training_args.bin ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:cc92dca5d693d80c40bfa708d0ee9551d1f85b832c57710b3edfc72dc86707e1
+size 2104
vocab.txt ADDED
The diff for this file is too large to render. See raw diff