---
license: apache-2.0
datasets:
  - darrow-ai/LegalLensNER
language:
  - en
metrics:
  - f1
pipeline_tag: token-classification
library_name: sklearn
tags:
  - ner
  - legal
  - crf
---

Model Card for the LegalLens NER CRF

Conditional Random Field (CRF) model for named entity recognition with handcrafted features. The named entities recognized are Violation-on, Violation-by, and Law. The dataset uses the BIO tagging format. The model achieves a macro-F1 score of 0.32.
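
For illustration, a BIO-tagged sentence looks like the following; the tokens and labels are made up, and the exact label strings may differ from those used in the dataset.

```python
# Hypothetical BIO example: B-<entity> opens a span, I-<entity> continues it,
# and O marks tokens outside any entity. Labels here are illustrative only.
tokens   = ["Acme", "Corp", "violated", "the", "Consumer", "Protection", "Act", "."]
ner_tags = ["B-Violation-by", "I-Violation-by", "O", "O", "B-Law", "I-Law", "I-Law", "O"]
```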

Model Details

Model Description

The model was developed for the LegalLens 2024 shared task, held as part of the Natural Legal Language Processing (NLLP) 2024 workshop. It uses handcrafted features to identify named entities annotated in the BIO format.

  • Developed by: Shashank M Chakravarthy
  • Funded by [optional]: NA
  • Shared by [optional]: NA
  • Model type: Statistical Model
  • Language(s) (NLP): English
  • License: Apache 2.0 License
  • Finetuned from model [optional]: NA

Model Sources [optional]

Uses

The model detects named entities in unstructured text. It can be extended to other entity types by modifying the handcrafted features, as sketched below.
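
As a sketch of such an extension, new entries could be added to the feature dictionary built in word2features() (shown later in this card). The feature names below are hypothetical, and any change to the features requires retraining the CRF.

```python
# Illustrative only: extra handcrafted features one might add to word2features().
# These names are hypothetical and are not part of the released model.
def extra_features(word):
    return {
        'word.istitle()': word.istitle(),  # capitalised words often start proper names
        'word.isupper()': word.isupper(),  # all-caps tokens such as statute acronyms
        'word.length': len(word),          # crude word-shape information
    }

# Inside word2features():  features.update(extra_features(word))
```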

Direct Use

The model can be applied directly to any unstructured text after some preprocessing (tokenization and POS tagging). The repository files include the evaluation script.
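
A minimal sketch of that preprocessing, assuming NLTK tokenization and the sent2features helper defined in the script below; the sentence is made up.

```python
# Minimal preprocessing sketch for raw text (assumes the NLTK resources below
# and the sent2features() helper from the evaluation script are available).
import nltk
from nltk import pos_tag, word_tokenize

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

text = "The company allegedly violated the Clean Air Act."
tokens = word_tokenize(text)
sent = list(zip(tokens, [tag for _, tag in pos_tag(tokens)]))  # (token, POS) pairs

# X = [sent2features(sent)]
# crf.predict(X)  # returns one list of BIO labels per sentence
```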

Downstream Use [optional]

Out-of-Scope Use

This model is handcrafted for detecting violations and laws in text. It can be applied to other legal text that contains similar entities.

Bias, Risks, and Limitations

The main limitation comes from the handcrafted features: the model can only capture patterns that the feature set encodes.

Recommendations

If the text used for prediction is not preprocessed properly (in particular, if POS tags are missing), the model will not perform as it is designed to.

How to Get Started with the Model

Use the code below to get started with the model.

Load libraries

import ast
import pandas as pd
import joblib
import nltk
from nltk import pos_tag
import string
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer

Download the required NLTK resources (nltk.download skips resources that are already present)

nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download("averaged_perceptron_tagger")

Class for grouping tokens as sentences (redundant if text processed directly)

class getsentence(object):   
    '''
    This class is used to get the sentences from the dataset.
    Converts from BIO format to sentences using their sentence numbers
    '''
    def __init__(self, data):
        self.n_sent = 1.0
        self.data = data
        self.empty = False
        self.grouped = self.data.groupby("sentence_num").apply(self._agg_func)
        self.sentences = [s for s in self.grouped]
   
    def _agg_func(self, s):
        return [(w, p) for w, p in zip(s["token"].values.tolist(),
                                       s["pos_tag"].values.tolist())]

Creates features for words in a sentence (code can be reduced using iteration)

def word2features(sent, i):
    '''
    This method is used to extract features from the words in the sentence.
    The main features extracted are:
    - word.lower(): The word in lowercase
    - word.isdigit(): If the word is a digit
    - word.punct(): If the word is a punctuation
    - postag: The pos tag of the word
    - word.lemma(): The lemma of the word
    - word.stem(): The stem of the word
    The features (not all) are also extracted for the 4 previous and 4 next words.
    '''
    wordnet_lemmatizer = WordNetLemmatizer()
    porter_stemmer = PorterStemmer()
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word.isdigit()': word.isdigit(),
        # Check if it is a punctuation mark
        'word.punct()': word in string.punctuation,
        'postag': postag,
        # Lemma of the word
        'word.lemma()': wordnet_lemmatizer.lemmatize(word),
        # Stem of the word
        'word.stem()': porter_stemmer.stem(word)
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.isdigit()': word1.isdigit(),
            '-1:word.punct()': word1 in string.punctuation,
            '-1:postag': postag1
        })
        if i - 2 >= 0:
            features.update({
                '-2:word.lower()': sent[i-2][0].lower(),
                '-2:word.isdigit()': sent[i-2][0].isdigit(),
                '-2:word.punct()': sent[i-2][0] in string.punctuation,
                '-2:postag': sent[i-2][1]
            })
        if i - 3 >= 0:
            features.update({
                '-3:word.lower()': sent[i-3][0].lower(),
                '-3:word.isdigit()': sent[i-3][0].isdigit(),
                '-3:word.punct()': sent[i-3][0] in string.punctuation,
                '-3:postag': sent[i-3][1]
            })
        if i - 4 >= 0:
            features.update({
                '-4:word.lower()': sent[i-4][0].lower(),
                '-4:word.isdigit()': sent[i-4][0].isdigit(),
                '-4:word.punct()': sent[i-4][0] in string.punctuation,
                '-4:postag': sent[i-4][1]
            })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.isdigit()': word1.isdigit(),
            '+1:word.punct()': word1 in string.punctuation,
            '+1:postag': postag1
        })
        if i + 2 < len(sent):
            features.update({
                '+2:word.lower()': sent[i+2][0].lower(),
                '+2:word.isdigit()': sent[i+2][0].isdigit(),
                '+2:word.punct()': sent[i+2][0] in string.punctuation,
                '+2:postag': sent[i+2][1]
            })
        if i + 3 < len(sent):
            features.update({
                '+3:word.lower()': sent[i+3][0].lower(),
                '+3:word.isdigit()': sent[i+3][0].isdigit(),
                '+3:word.punct()': sent[i+3][0] in string.punctuation,
                '+3:postag': sent[i+3][1]
            })
        if i + 4 < len(sent):
            features.update({
                '+4:word.lower()': sent[i+4][0].lower(),
                '+4:word.isdigit()': sent[i+4][0].isdigit(),
                '+4:word.punct()': sent[i+4][0] in string.punctuation,
                '+4:postag': sent[i+4][1]
            })
    else:
        features['EOS'] = True

    return features

Obtain features for a given sentence

def sent2features(sent):
    '''
    This method is used to extract features from the sentence.
    '''
    return [word2features(sent, i) for i in range(len(sent))]
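
To see what the CRF consumes, the features for a made-up (token, POS) sentence can be inspected:

```python
# Inspect the feature dictionaries produced for a made-up sentence.
toy_sent = [("Acme", "NNP"), ("violated", "VBD"), ("the", "DT"), ("law", "NN")]
feats = sent2features(toy_sent)
print(feats[0]['word.lower()'])  # 'acme'
print(feats[0]['BOS'])           # True (first token of the sentence)
print(feats[1]['-1:postag'])     # 'NNP' (POS tag of the previous token)
```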

Load file from your directory

df_eval = pd.read_excel("testset_NER_LegalLens.xlsx")

Parse the stringified token lists and create POS tags for each token

df_eval["tokens"] = df_eval["tokens"].apply(ast.literal_eval)
df_eval['pos_tags'] = df_eval['tokens'].apply(lambda x: [tag[1]
                                                         for tag in pos_tag(x)])
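
The tokens column of the spreadsheet stores each token list as a string, so ast.literal_eval parses it back into a Python list (the example value below is made up):

```python
import ast

# Example of what ast.literal_eval does to a stringified token list.
ast.literal_eval("['The', 'company', 'settled', '.']")
# -> ['The', 'company', 'settled', '.']
```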

Aggregate tokens to sentences

data_eval = []
for i in range(len(df_eval)):
    for j in range(len(df_eval["tokens"][i])):
        data_eval.append(
            {
                "sentence_num": i+1,
                "id": df_eval["id"][i],
                "token": df_eval["tokens"][i][j],
                "pos_tag": df_eval["pos_tags"][i][j],
            }
        )
data_eval = pd.DataFrame(data_eval)
getter = getsentence(data_eval)
sentences_eval = getter.sentences
X_eval = [sent2features(s) for s in sentences_eval]

Load the model from your directory, run prediction, and save the results

crf = joblib.load("../models/crf.pkl")
y_pred_eval = crf.predict(X_eval)
print("NER tags predicted.")
df_eval["ner_tags"] = y_pred_eval
df_eval.drop(columns=["pos_tags"], inplace=True)
print("Saving the predictions...")
df_eval.to_csv("predictions_NERLens.csv", index=False)
print("Predictions saved.")

Training Details

Training Data

[darrow-ai/LegalLensNER](https://huggingface.co/datasets/darrow-ai/LegalLensNER)

Training Procedure

The dataset was first parsed to recover the token lists, and POS tags were generated for each token. The CRF was then trained on the handcrafted features using a CPU; training takes roughly 20-30 minutes on this dataset.
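
The training code itself is not reproduced in this card; a minimal sketch, assuming the sklearn-crfsuite implementation and illustrative (not the actual) hyperparameters, might look like this:

```python
# Minimal training sketch (assumes sklearn-crfsuite; hyperparameter values are
# illustrative, not those of the released model).
import joblib
import sklearn_crfsuite

crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,                        # L1 regularisation strength (illustrative)
    c2=0.1,                        # L2 regularisation strength (illustrative)
    max_iterations=100,
    all_possible_transitions=True,
)
# X_train: sent2features output per sentence; y_train: BIO label sequences
# crf.fit(X_train, y_train)
# joblib.dump(crf, "crf.pkl")
```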

Preprocessing [optional]

For every token, POS tags were assigned using the NLTK library.

Training Hyperparameters

  • Training regime: NA

Speeds, Sizes, Times [optional]

NA

Evaluation

The model was evaluated using the macro-F1 score; it achieved 0.32 on unseen test data.
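
As a sketch, a macro-F1 of this kind can be computed with sklearn-crfsuite's metrics helper; the label sequences below are made up, and the exact scoring setup of the shared task may differ.

```python
from sklearn_crfsuite import metrics

# Made-up gold and predicted BIO sequences, for illustration only.
y_true = [["B-Law", "I-Law", "O"], ["B-Violation-by", "O", "O"]]
y_pred = [["B-Law", "O", "O"],     ["B-Violation-by", "O", "O"]]

labels = sorted({l for seq in y_true for l in seq if l != "O"})  # ignore 'O'
score = metrics.flat_f1_score(y_true, y_pred, average="macro", labels=labels)
print(f"macro-F1: {score:.2f}")
```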

Testing Data, Factors & Metrics

Testing Data

[darrow-ai/LegalLensNER](https://huggingface.co/datasets/darrow-ai/LegalLensNER)

Factors

[More Information Needed]

Metrics

Macro-F1 score, as it reflects performance across all entity classes equally and mitigates the boost that highly skewed entity distributions would otherwise give to the score.

Results

0.32 macro-F1 score on unseen data.

Summary

The model was designed and developed to tackle the NER task on unstructured legal text.

Model Examination [optional]

NA

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: 13th Gen Intel(R) Core(TM) i7-1365U
  • Hours used: 0.5 hours
  • Cloud Provider: NA
  • Compute Region: NA
  • Carbon Emitted: Unknown

Technical Specifications [optional]

Model Architecture and Objective

[More Information Needed]

Compute Infrastructure

[More Information Needed]

Hardware

[More Information Needed]

Software

[More Information Needed]

Citation [optional]

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Glossary [optional]

[More Information Needed]

More Information [optional]

[More Information Needed]

Model Card Authors [optional]

[More Information Needed]

Model Card Contact

[More Information Needed]