---
license: apache-2.0
datasets:
- darrow-ai/LegalLensNER
language:
- en
metrics:
- f1
pipeline_tag: token-classification
library_name: sklearn
tags:
- ner
- legal
- crf
---
# Model Card for Model ID

<!-- Provide a quick summary of what the model is/does. -->
A Conditional Random Field (CRF) model that performs named entity recognition using handcrafted features. It recognizes three entity types: Violation-on, Violation-by, and Law.
The dataset is in BIO format. The model achieves a macro-F1 score of 0.32.

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->
The model was developed for the LegalLens 2024 competition, held as part of Natural Legal Language Processing (NLLP) 2024. It uses handcrafted features to identify named
entities in BIO-tagged text.
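
To make the BIO format concrete, here is a minimal sketch of how BIO tags map back to entity spans. The tag string `B-LAW`/`I-LAW` and the helper name `bio_to_spans` are illustrative assumptions, not taken from the dataset or the evaluation script.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (label, text) entity spans."""
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; close any open one first
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag continues the entity that is currently open
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open entity
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label is not None:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


# Hypothetical example sentence and tags
tokens = ["The", "company", "breached", "the", "Clean", "Air", "Act", "."]
tags = ["O", "O", "O", "O", "B-LAW", "I-LAW", "I-LAW", "O"]
print(bio_to_spans(tokens, tags))  # [('LAW', 'Clean Air Act')]
```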


- **Developed by:** Shashank M Chakravarthy
- **Funded by [optional]:** NA
- **Shared by [optional]:** NA
- **Model type:** Statistical Model
- **Language(s) (NLP):** English
- **License:** Apache 2.0 License
- **Finetuned from model [optional]:** NA

### Model Sources [optional]

<!-- Provide the basic links for the model. -->

- **Repository:** NA
- **Paper [optional]:** https://aclanthology.org/2024.nllp-1.33.pdf
- **Demo [optional]:** NA

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
The model detects named entities in unstructured text. It can be extended to other entity types by modifying the handcrafted features.

### Direct Use

<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

The model can be used directly on any unstructured text after light preprocessing (tokenization and POS tagging). The repository files include the evaluation script.
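
As a rough sketch of that preprocessing, the helper below turns raw text into the list of `(token, POS tag)` pairs that the feature code in this card expects. The regex tokenizer and the name `prepare_sentence` are assumptions for illustration; the dataset itself ships pre-tokenized text, so a different tokenizer may match it better. The `tagger` argument is meant to be a callable such as `nltk.pos_tag`; a stand-in is used here so the example runs without NLTK data.

```python
import re


def prepare_sentence(text, tagger):
    """Split raw text into tokens and pair each token with a POS tag."""
    # Simple assumption: words and single punctuation marks are separate tokens
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # `tagger` must return a list of (token, tag) pairs, as nltk.pos_tag does
    return [(token, tag) for token, tag in tagger(tokens)]


def dummy_tagger(tokens):
    # Stand-in for nltk.pos_tag; tags every token as a noun
    return [(token, "NN") for token in tokens]


sentence = prepare_sentence("The firm violated the Act.", dummy_tagger)
print(sentence)
```

In real use, `nltk.pos_tag` would be passed in place of `dummy_tagger`, and the result could be handed to `sent2features` from the getting-started code.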

### Downstream Use [optional]

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->


### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
The model's features are handcrafted for detecting violations and laws in text. It can also be applied to other legal text that contains similar entities.

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

The model's main limitation stems from its reliance on handcrafted features.

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

If the input text is not preprocessed correctly, in particular if POS tags are missing, the model will not perform as it is designed to.

## How to Get Started with the Model

Use the code below to get started with the model.
### Load libraries
```
import ast
import pandas as pd
import joblib
import nltk
from nltk import pos_tag
import string
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
```

### Check if nltk modules are downloaded, if not download them
```
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download("averaged_perceptron_tagger")
# Note: newer NLTK releases (3.9+) may instead require "averaged_perceptron_tagger_eng"
```
### Class for grouping tokens into sentences (redundant if the text is processed directly)
```
class getsentence(object):   
    '''
    This class is used to get the sentences from the dataset.
    Converts from BIO format to sentences using their sentence numbers
    '''
    def __init__(self, data):
        self.n_sent = 1.0
        self.data = data
        self.empty = False
        self.grouped = self.data.groupby("sentence_num").apply(self._agg_func)
        self.sentences = [s for s in self.grouped]
   
    def _agg_func(self, s):
        return [(w, p) for w, p in zip(s["token"].values.tolist(),
                                       s["pos_tag"].values.tolist())]

```
### Creates features for words in a sentence (code can be reduced using iteration)
```
def word2features(sent, i):
    '''
    This method is used to extract features from the words in the sentence.
    The main features extracted are:
    - word.lower(): The word in lowercase
    - word.isdigit(): If the word is a digit
    - word.punct(): If the word is a punctuation
    - postag: The pos tag of the word
    - word.lemma(): The lemma of the word
    - word.stem(): The stem of the word
    The features (not all) are also extracted for the 4 previous and 4 next words.
    '''
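    # Note: the lemmatizer and stemmer are re-created on every call;
    # hoisting them to module level would speed up feature extraction
    wordnet_lemmatizer = WordNetLemmatizer()
    porter_stemmer = PorterStemmer()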
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word.isdigit()': word.isdigit(),
        # Check whether the token is punctuation
        'word.punct()': word in string.punctuation,
        'postag': postag,
        # Lemma of the word
        'word.lemma()': wordnet_lemmatizer.lemmatize(word),
        # Stem of the word
        'word.stem()': porter_stemmer.stem(word)
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.isdigit()': word1.isdigit(),
            '-1:word.punct()': word1 in string.punctuation,
            '-1:postag': postag1
        })
        if i - 2 >= 0:
            features.update({
                '-2:word.lower()': sent[i-2][0].lower(),
                '-2:word.isdigit()': sent[i-2][0].isdigit(),
                '-2:word.punct()': sent[i-2][0] in string.punctuation,
                '-2:postag': sent[i-2][1]
            })
        if i - 3 >= 0:
            features.update({
                '-3:word.lower()': sent[i-3][0].lower(),
                '-3:word.isdigit()': sent[i-3][0].isdigit(),
                '-3:word.punct()': sent[i-3][0] in string.punctuation,
                '-3:postag': sent[i-3][1]
            })
        if i - 4 >= 0:
            features.update({
                '-4:word.lower()': sent[i-4][0].lower(),
                '-4:word.isdigit()': sent[i-4][0].isdigit(),
                '-4:word.punct()': sent[i-4][0] in string.punctuation,
                '-4:postag': sent[i-4][1]
            })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.isdigit()': word1.isdigit(),
            '+1:word.punct()': word1 in string.punctuation,
            '+1:postag': postag1
        })
        if i + 2 < len(sent):
            features.update({
                '+2:word.lower()': sent[i+2][0].lower(),
                '+2:word.isdigit()': sent[i+2][0].isdigit(),
                '+2:word.punct()': sent[i+2][0] in string.punctuation,
                '+2:postag': sent[i+2][1]
            })
        if i + 3 < len(sent):
            features.update({
                '+3:word.lower()': sent[i+3][0].lower(),
                '+3:word.isdigit()': sent[i+3][0].isdigit(),
                '+3:word.punct()': sent[i+3][0] in string.punctuation,
                '+3:postag': sent[i+3][1]
            })
        if i + 4 < len(sent):
            features.update({
                '+4:word.lower()': sent[i+4][0].lower(),
                '+4:word.isdigit()': sent[i+4][0].isdigit(),
                '+4:word.punct()': sent[i+4][0] in string.punctuation,
                '+4:postag': sent[i+4][1]
            })
    else:
        features['EOS'] = True

    return features
```
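
As the heading notes, the repeated blocks can be collapsed with a loop. The sketch below builds only the neighbouring-token features and the `BOS`/`EOS` flags, producing the same keys as the code above; the function name `context_features` and its `window` parameter are illustrative assumptions, and the current-token features (lemma, stem, and so on) would still be added as before.

```python
import string


def context_features(sent, i, window=4):
    """Build neighbour-token features and BOS/EOS flags for token i."""
    features = {}
    for offset in range(1, window + 1):
        for sign, j in (("-", i - offset), ("+", i + offset)):
            # Skip neighbours that fall outside the sentence
            if 0 <= j < len(sent):
                word, postag = sent[j]
                prefix = f"{sign}{offset}:"
                features.update({
                    prefix + "word.lower()": word.lower(),
                    prefix + "word.isdigit()": word.isdigit(),
                    prefix + "word.punct()": word in string.punctuation,
                    prefix + "postag": postag,
                })
    if i == 0:
        features["BOS"] = True  # first token of the sentence
    if i == len(sent) - 1:
        features["EOS"] = True  # last token of the sentence
    return features


# Hypothetical three-token sentence of (token, POS tag) pairs
example_sent = [("The", "DT"), ("Act", "NNP"), (".", ".")]
middle = context_features(example_sent, 1)
print(sorted(middle))
```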
### Obtain features for a given sentence
```
def sent2features(sent):
    '''
    This method is used to extract features from the sentence.
    '''
    return [word2features(sent, i) for i in range(len(sent))]
```
### Load file from your directory
```
df_eval = pd.read_excel("testset_NER_LegalLens.xlsx")
```
### Parse the token lists and create POS tags for each token
```
df_eval["tokens"] = df_eval["tokens"].apply(ast.literal_eval)
df_eval['pos_tags'] = df_eval['tokens'].apply(lambda x: [tag[1]
                                                         for tag in pos_tag(x)])
```
### Aggregate tokens to sentences
```
data_eval = []
for i in range(len(df_eval)):
    for j in range(len(df_eval["tokens"][i])):
        data_eval.append(
            {
                "sentence_num": i+1,
                "id": df_eval["id"][i],
                "token": df_eval["tokens"][i][j],
                "pos_tag": df_eval["pos_tags"][i][j],
            }
        )
data_eval = pd.DataFrame(data_eval)
getter = getsentence(data_eval)
sentences_eval = getter.sentences
X_eval = [sent2features(s) for s in sentences_eval]
```
### Load model from your directory
```
crf = joblib.load("../models/crf.pkl")
y_pred_eval = crf.predict(X_eval)
print("NER tags predicted.")
df_eval["ner_tags"] = y_pred_eval
df_eval.drop(columns=["pos_tags"], inplace=True)
print("Saving the predictions...")
df_eval.to_csv("predictions_NERLens.csv", index=False)
print("Predictions saved.")
```
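
Before saving, it can help to check that each predicted tag sequence lines up with its tokens. The following is a small sketch for that; `align_predictions` is a hypothetical helper, and the sample tokens and tags are made up for illustration.

```python
def align_predictions(token_lists, tag_lists):
    """Pair every token with its predicted tag, checking lengths per sentence."""
    rows = []
    for sent_num, (tokens, tags) in enumerate(zip(token_lists, tag_lists), start=1):
        if len(tokens) != len(tags):
            raise ValueError(
                f"Sentence {sent_num}: {len(tokens)} tokens but {len(tags)} tags"
            )
        rows.extend((sent_num, token, tag) for token, tag in zip(tokens, tags))
    return rows


# Hypothetical data standing in for df_eval["tokens"] and y_pred_eval
sample_tokens = [["Acme", "broke", "the", "law"], ["No", "violation"]]
sample_tags = [["O", "O", "O", "O"], ["O", "O"]]
for row in align_predictions(sample_tokens, sample_tags):
    print(row)
```

With the variables from the code above, the call would look like `align_predictions(df_eval["tokens"], y_pred_eval)`.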

## Training Details

### Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

https://huggingface.co/datasets/darrow-ai/LegalLensNER

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
The token column was first parsed into its proper data type, and POS tags were created for each token in the text. The model was then trained
on a CPU using the handcrafted features. Training takes around 20-30 minutes on this dataset.
#### Preprocessing [optional]
POS tags were assigned to every token using the NLTK library.


#### Training Hyperparameters

- **Training regime:** NA <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

#### Speeds, Sizes, Times [optional]

<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
NA

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->
The model was evaluated using the macro-F1 score. It scored 0.32 on unseen test data.

### Testing Data, Factors & Metrics

#### Testing Data

<!-- This should link to a Dataset Card if possible. -->

https://huggingface.co/datasets/darrow-ai/LegalLensNER

#### Factors

<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

[More Information Needed]

#### Metrics

<!-- These are the evaluation metrics being used, ideally with a description of why. -->

The macro-F1 score is used because it weights every class equally, which gives a truer picture of model performance and mitigates the boost that highly skewed entities in the dataset would otherwise create.
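
The effect of class skew can be seen in a minimal token-level sketch. This is an illustration only: the competition's official scorer may work at the entity level, and the labels and the `macro_f1` helper below are assumptions.

```python
def macro_f1(y_true, y_pred):
    """Token-level macro-F1: the unweighted mean of per-label F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Hypothetical skewed sample: a model that predicts "O" everywhere
y_true = ["O"] * 8 + ["B-LAW", "I-LAW"]
y_pred = ["O"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.2f}, macro-F1 = {macro_f1(y_true, y_pred):.2f}")
```

Here accuracy is 0.80 even though no entity token is found, while macro-F1 drops to about 0.30 because the two entity labels each score zero.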

### Results

0.32 macro-F1 score on unseen data.

#### Summary

The model was designed and developed to tackle the NER task in unstructured text.

## Model Examination [optional]

<!-- Relevant interpretability work for the model goes here -->
NA

## Environmental Impact

<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** 13th Gen Intel(R) Core(TM) i7-1365U
- **Hours used:** 0.5 hours
- **Cloud Provider:** NA
- **Compute Region:** NA
- **Carbon Emitted:** Unknown

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]