metadata
language:
  - es
metrics:
  - precision
  - accuracy
  - f1
  - recall
base_model:
  - MMG/xlm-roberta-large-ner-spanish
pipeline_tag: token-classification
library_name: transformers
widget:
  - text: >-
      lorem ... SENTENCIA DEL 31 DE ENERO DE 2024 ... que la sentencia que
      antecede fue dada y firmada por los jueces...
    example_title: 'Spanish Legal Text Date NER'

Model Card

🤖 NER Model 🧑‍⚖️

📅 Date Extraction for Sentencias from DR 🇩🇴

Choose a PDF or DOCX file to extract text, clean it, and perform Named Entity Recognition (NER) for date extraction.

Model Details

Model Description

This is a Named Entity Recognition (NER) model that identifies and extracts date entities from Spanish-language legal documents from the Dominican Republic. It is based on MMG/xlm-roberta-large-ner-spanish and was fine-tuned on boletines judiciales (judicial bulletins).

source

  • Developed by: Victor Fernandez, Alejandro Gomez, Karol Gutierrez, Nathan Dahlberg, Bree Shi, Dr. Charlotte Alexander
  • Model type: NER
  • Language(s) (NLP): Spanish
  • License:
  • Finetuned from model: MMG/xlm-roberta-large-ner-spanish, which is a derivative of FacebookAI/xlm-roberta-large

Model Sources

  • Repository: Coming Soon
  • Paper: Coming Soon
  • Demo: Try it out

Uses

This NER model is intended for use in processing and analyzing legal documents from the Dominican Republic to extract date-related information. It is particularly useful for legal professionals, researchers, and organizations that need to automate the extraction of dates for case management, compliance, and archival purposes.

Direct Use

  • Legal professionals working with documents in Spanish
  • Researchers analyzing legal texts in Spanish

Out-of-Scope Use

  • Extraction of non-date entities (e.g., persons, locations, organizations)
  • High-risk or critical applications

Bias, Risks, and Limitations

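  • The model was trained on only 3 boletines judiciales, so it may not generalize to other document types or periods.
  • Date formats that differ from those in the training data may be missed.
  • Predictions can be wrong: dates may be missed, truncated, or falsely detected.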

Recommendations

  • Human QA and due diligence should follow the NER extraction.

How to Get Started with the Model

Use the code below to get started with the model.

import json
import logging

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

logger = logging.getLogger(__name__)

REPO = "agomez302/nlp-dr-ner"

class NerProcessor:
    def __init__(self):
        self.deployed_tokenizer = AutoTokenizer.from_pretrained(REPO)
        self.deployed_model = AutoModelForTokenClassification.from_pretrained(REPO)
        self.deployed_ner_pipeline = pipeline(
                "ner",
                model=self.deployed_model,
                tokenizer=self.deployed_tokenizer,
                aggregation_strategy="simple"
        )

    def process_text(self, text):
        """Runs the NER model on text and returns a JSON string."""
        try:
            chunks = self.split_text_with_overlap(text)
            all_predictions = []
            for chunk in chunks:
                preds = self.deployed_ner_pipeline(chunk)
                all_predictions.extend(preds)
            all_predictions = self.deduplicate_entities(all_predictions)
            
            formatted_output = {
                "entities": self.run_predictions(all_predictions)
            }
            
            return json.dumps(formatted_output)

        except Exception as e:
            logger.error(f"Failed to run NER model on extracted text: {e}")
            raise
        
    def split_text_with_overlap(self, text, max_tokens=450, overlap=50):
        """Split text into chunks with overlap to handle long sequences."""
        if not text:
            return []
        max_tokens = min(max_tokens, 512)
        
        tokenizer = self.deployed_tokenizer
        tokens = tokenizer.encode(text, truncation=False)
        
        if len(tokens) <= max_tokens:
            return [text]
            
        chunks = []
        i = 0
        while i < len(tokens):
            chunk = tokenizer.decode(tokens[i:i + max_tokens], skip_special_tokens=True)
            chunks.append(chunk)
            i += max_tokens - overlap
        return chunks

    def deduplicate_entities(self, predictions):
        """Remove duplicate entities from overlapping chunks."""
        unique = []
        seen = set()
        for entity in predictions:
            key = (entity['entity_group'], entity['word'], entity['start'], entity['end'])
            if key not in seen:
                unique.append(entity)
                seen.add(key)
        return unique
    
    def run_predictions(self, predictions: list):
        """Format predictions for output, converting float32 to regular float."""
        try:
            processed_predictions = []
            for pred in predictions:
                pred_dict = dict(pred)
                pred_dict['score'] = float(pred_dict['score'])
                processed_predictions.append(pred_dict)

            return processed_predictions
                
        except Exception as e:
            logging.error(f"Failed to process predictions: {e}")
            raise


def main():
  text = "SENTENCIA DEL 31 DE ENERO DE 2024 ... que la sentencia que antecede fue dada y firmada por los jueces que figuran en ella, en la fecha arriba indicada. www.poderjudicial.gob.do\n"
  ner_processor = NerProcessor()
  ner_output = ner_processor.process_text(text)
  print(ner_output)

if __name__ == '__main__':
  main()
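
One caveat about the code above: the `start` and `end` values returned by the pipeline are character offsets into the string it was given, which here is a single chunk. The same date found in two overlapping chunks will therefore usually carry different offsets, and a deduplication key built from chunk-local offsets may not catch it. The following is a minimal sketch of one way to handle this, assuming the character position at which each chunk starts in the full text is known. The function name `merge_chunk_predictions` and the `chunk_offset` values are hypothetical and are not part of the class above.

```python
def merge_chunk_predictions(chunk_results):
    """Shift chunk-local entity offsets to document offsets, then deduplicate.

    chunk_results: list of (chunk_offset, predictions) pairs, where
    chunk_offset is the character index at which the chunk starts in the
    full text (an assumption: the class above does not track this value).
    """
    merged = []
    seen = set()
    for chunk_offset, predictions in chunk_results:
        for pred in predictions:
            start = pred["start"] + chunk_offset
            end = pred["end"] + chunk_offset
            key = (pred["entity_group"], start, end)
            if key in seen:
                continue  # same span already reported by an earlier chunk
            seen.add(key)
            merged.append({**pred, "start": start, "end": end})
    return sorted(merged, key=lambda p: p["start"])


# Two overlapping chunks that both contain the same date.
chunk_results = [
    (0, [{"entity_group": "DATE", "word": "31 DE ENERO DE 2024",
          "start": 14, "end": 33, "score": 0.99}]),
    (10, [{"entity_group": "DATE", "word": "31 DE ENERO DE 2024",
           "start": 4, "end": 23, "score": 0.98}]),
]
entities = merge_chunk_predictions(chunk_results)
```

When duplicates collide, this sketch keeps the first prediction seen; keeping the one with the highest score would be an equally reasonable choice.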

Sample output

{
  "entities": [
    {
      "entity_group": "DATE",
      "score": 0.9878288507461548,
      "word": "veintitrés (23) días del mes de mayo del año dos mil veintitrés (2023)",
      "start": 290,
      "end": 360
    },
    {
      "entity_group": "DATE",
      "score": 0.9994959831237793,
      "word": "23 de mayo del año 2023",
      "start": 1058,
      "end": 1081
    }
  ]
}
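
Since human review is recommended after extraction, one option is to route low-confidence predictions to a reviewer. The sketch below parses the JSON string and splits entities by score. The function name `split_by_confidence` and the 0.99 threshold are illustrative assumptions, not values recommended by the model authors.

```python
import json


def split_by_confidence(ner_json, threshold=0.99):
    """Split entities into auto-accepted and needs-review lists by score."""
    entities = json.loads(ner_json)["entities"]
    accepted = [e for e in entities if e["score"] >= threshold]
    needs_review = [e for e in entities if e["score"] < threshold]
    return accepted, needs_review


# The two entities from the sample output, serialized the way process_text does.
ner_json = json.dumps({
    "entities": [
        {"entity_group": "DATE", "score": 0.9878288507461548,
         "word": "veintitrés (23) días del mes de mayo del año dos mil veintitrés (2023)",
         "start": 290, "end": 360},
        {"entity_group": "DATE", "score": 0.9994959831237793,
         "word": "23 de mayo del año 2023",
         "start": 1058, "end": 1081},
    ]
})
accepted, needs_review = split_by_confidence(ner_json)
```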

Training Details

Training Data

The training data consists of a JSON Lines (.jsonl) file specifically designed for Named Entity Recognition (NER) tasks in Spanish legal texts. Each entry in the dataset includes the text and corresponding entities labeled with their respective types.

{"text": "SENTENCIA DEL 31 DE ENERO DE 2024 ... que la sentencia que antecede fue dada y firmada por los jueces que figuran en ella, en la fecha arriba indicada. www.poderjudicial.gob.do\n", "entities": [{"start": 113, "end": 132, "label": "DATE"}, {"start": 271, "end": 292, "label": "DATE"}, {"start": 2009, "end": 2029, "label": "DATE"}, {"start": 2246, "end": 2265, "label": "DATE"}, {"start": 3083, "end": 3102, "label": "DATE"}, {"start": 3281, "end": 3300, "label": "DATE"}, {"start": 3479, "end": 3497, "label": "DATE"}, {"start": 3569, "end": 3588, "label": "DATE"}, {"start": 3872, "end": 3891, "label": "DATE"}, {"start": 7936, "end": 7955, "label": "DATE"}]}
// and so forth with further json lines
  • Dataset Path: ner_dataset.jsonl

  • Description: The dataset contains legal documents annotated with date entities. Character-level DATE spans are converted to B-DATE and I-DATE token tags during preprocessing. This focused annotation helps the model recognize date-related entities within legal texts.

  • Data Preprocessing:

    • Chunking: The text is split into chunks short enough for the model to process. Consecutive chunks overlap so that an entity near a boundary appears whole in at least one chunk.
    • Tokenization: The AutoTokenizer from Hugging Face is used to tokenize the text, aligning labels with tokenized inputs while handling special tokens and padding appropriately.
    • Filtering: Chunks that begin with partial entities are discarded to maintain the integrity of entity recognition.
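
The record format shown above stores each entity as character offsets into `text`. As a quick sanity check on such a file, the labeled substrings can be recovered by slicing. This is a minimal sketch: the helper name `iter_labeled_spans` is hypothetical, and the one-line record is a shortened, made-up example rather than a row from ner_dataset.jsonl.

```python
import io
import json


def iter_labeled_spans(lines):
    """Yield (label, span_text) for every entity in a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        text = record["text"]
        for ent in record["entities"]:
            yield ent["label"], text[ent["start"]:ent["end"]]


# In-memory stand-in for a .jsonl file (illustrative record only).
sample = io.StringIO(
    json.dumps({
        "text": "SENTENCIA DEL 31 DE ENERO DE 2024",
        "entities": [{"start": 14, "end": 33, "label": "DATE"}],
    }) + "\n"
)
spans = list(iter_labeled_spans(sample))
```

With a real file, the same function can be applied to an open file handle, for example `iter_labeled_spans(open("ner_dataset.jsonl", encoding="utf-8"))`.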

Training Procedure

The training procedure involves fine-tuning a pre-trained XLM-RoBERTa model for the specific NER task. The process is orchestrated through the NerFinetuner class, which manages data loading, preprocessing, model training, evaluation, and saving.

Preprocessing

  1. Loading the Dataset: The dataset is loaded using the datasets library's load_dataset function, targeting the train split from the specified JSON Lines file.

  2. Chunking Texts: Texts are divided into chunks of a maximum of 128 tokens with an overlap of 50 tokens to preserve entity continuity. Entities are adjusted to align with the chunked text segments. Chunks with incomplete entity annotations are filtered out to ensure consistency.

  3. Tokenization and Label Alignment: The AutoTokenizer tokenizes the text, and labels are aligned with the tokenized output. Special tokens and padding are handled by assigning a label of -100 to ignore them during training.
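
Steps 2 and 3 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the NerFinetuner implementation: the function names, the label id order in `LABEL2ID`, and the use of character positions for chunk boundaries are all hypothetical. Token offsets are passed in as (start, end) character pairs, with (0, 0) standing for special tokens and padding, which is how fast tokenizers report them in their offset mapping.

```python
LABEL2ID = {"O": 0, "B-DATE": 1, "I-DATE": 2}  # assumed id order
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def entities_in_chunk(entities, chunk_start, chunk_end):
    """Re-base entities to a chunk; return None if any entity is cut off."""
    kept = []
    for ent in entities:
        if ent["end"] <= chunk_start or ent["start"] >= chunk_end:
            continue  # entirely outside this chunk
        if ent["start"] < chunk_start or ent["end"] > chunk_end:
            return None  # partially covered: the whole chunk is discarded
        kept.append({**ent,
                     "start": ent["start"] - chunk_start,
                     "end": ent["end"] - chunk_start})
    return kept


def align_labels(offsets, entities):
    """Map character-level entity spans to one label id per token."""
    labels = []
    previous = None  # index of the entity the previous token belonged to
    for tok_start, tok_end in offsets:
        if tok_start == tok_end:  # special token or padding
            labels.append(IGNORE_INDEX)
            previous = None
            continue
        current = None
        for i, ent in enumerate(entities):
            if tok_start < ent["end"] and tok_end > ent["start"]:
                current = i
                break
        if current is None:
            labels.append(LABEL2ID["O"])
        elif current == previous:
            labels.append(LABEL2ID["I-DATE"])
        else:
            labels.append(LABEL2ID["B-DATE"])
        previous = current
    return labels


# "DEL 31 DE ENERO" with one token per word, wrapped in two special tokens.
offsets = [(0, 0), (0, 3), (4, 6), (7, 9), (10, 15), (0, 0)]
labels = align_labels(offsets, [{"start": 4, "end": 15, "label": "DATE"}])
```

Real subword tokenizers split words into several pieces; the overlap test in `align_labels` handles that case the same way, labeling the first piece of an entity B-DATE and the rest I-DATE.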

Training Hyperparameters

  • Training regime:
    • Output Directory: ./_ner_results
    • Evaluation Strategy: Evaluates the model at the end of each epoch (eval_strategy: epoch)
    • Save Strategy: Saves the model at the end of each epoch (save_strategy: epoch)
    • Learning Rate: 2e-5
    • Training Batch Size: 16 per device
    • Evaluation Batch Size: 16 per device
    • Number of Epochs: 10
    • Weight Decay: 0.01
    • Mixed Precision: Enabled using FP16 (fp16: True)
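
For reference, these settings could be written as keyword arguments like this. The mapping to `transformers.TrainingArguments` parameter names is an assumption based on the names quoted above, and the variable name `training_kwargs` is hypothetical. Note that older transformers releases call the first strategy parameter `evaluation_strategy` rather than `eval_strategy`.

```python
# Hyperparameters from this section, keyed by the assumed
# transformers.TrainingArguments parameter names.
training_kwargs = {
    "output_dir": "./_ner_results",
    "eval_strategy": "epoch",   # evaluate at the end of each epoch
    "save_strategy": "epoch",   # save a checkpoint at the end of each epoch
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "num_train_epochs": 10,
    "weight_decay": 0.01,
    "fp16": True,               # mixed precision; requires a supporting GPU
}

# With transformers installed, these would be passed as:
#   TrainingArguments(**training_kwargs)
```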

Evaluation

Testing Data, Factors & Metrics

Testing Data

Testing data was held out from the original dataset with a random 80/20 split in code:

  # Split the dataset into training and testing sets (e.g., 80% train, 20% test)
  split_dataset = dataset.train_test_split(test_size=0.2)
  train_dataset = split_dataset["train"]
  validation_dataset = split_dataset["test"]

Factors

  • Entity Type: The primary focus is on detecting DATE entities within the legal texts.

Metrics

The evaluation employs the following metrics using the seqeval library:

  • Precision: The fraction of predicted entities that are correct.
  • Recall: The fraction of true entities that the model finds.
  • F1 Score: The harmonic mean of precision and recall.
  • Accuracy: The fraction of tokens assigned the correct label.
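
To make the difference between entity-level and token-level scoring concrete, here is a simplified pure-Python sketch of these metrics for a single BIO-tagged sequence. It is not the seqeval implementation: the function names are hypothetical, and stray I- tags with no preceding B- tag are simply ignored here, which may differ from how seqeval treats them.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from a BIO tag sequence."""
    spans = set()
    start, ent_type = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and ent_type == tag[2:]
        if start is not None and not inside:
            spans.add((ent_type, start, i))
            start, ent_type = None, None
        if tag.startswith("B-"):
            start, ent_type = i, tag[2:]
    return spans


def entity_prf(true_tags, pred_tags):
    """Entity-level precision, recall and F1 using exact span matches."""
    true_spans = extract_entities(true_tags)
    pred_spans = extract_entities(pred_tags)
    correct = len(true_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(true_spans) if true_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def token_accuracy(true_tags, pred_tags):
    """Fraction of tokens whose predicted tag equals the true tag."""
    matches = sum(t == p for t, p in zip(true_tags, pred_tags))
    return matches / len(true_tags)


# The first predicted date is one token too short, so it does not count.
true_tags = ["O", "B-DATE", "I-DATE", "I-DATE", "O", "B-DATE", "O"]
pred_tags = ["O", "B-DATE", "I-DATE", "O", "O", "B-DATE", "O"]
precision, recall, f1 = entity_prf(true_tags, pred_tags)
accuracy = token_accuracy(true_tags, pred_tags)
```

In this example only one of the two predicted dates matches exactly, so precision, recall and F1 are all 0.5, while token accuracy is 6/7 because a single token is mislabeled. A truncated date therefore costs much more at the entity level than at the token level.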

Results

Coming Soon

Summary

Model Architecture and Objective

The model architecture is based on XLMRobertaForTokenClassification, a transformer-based model from Hugging Face tailored for token classification tasks such as NER.

  • Base Model: MMG/xlm-roberta-large-ner-spanish
  • Number of Labels: 3 (O, B-DATE, I-DATE)
  • Label Mappings:
    • O: Outside of any entity
    • B-DATE: Beginning of a date entity
    • I-DATE: Inside of a date entity

Citation

BibTeX: Coming Soon

APA: Coming Soon

Glossary

sentencia - a document that is a formal judicial decision or judgment issued by a court at the conclusion of a legal proceeding.

More Information

This was developed as part of the HAAG Fall 2024 NLP-DR cohort under Dr. Charlotte Alexander and Bree Shi.

Model Card Contact

Reach out to the HAAG team or Dr. Alexander at Georgia Tech with any inquiries.