
# finer-distilbert

## Model description

finer-distilbert is a fine-tuned DistilBERT model for Named Entity Recognition. It is a proof-of-concept model trained to recognize the top 4 entity types in the nlpaueb/finer-139 dataset. Due to limited time, the model has not undergone any hyperparameter tuning. The model's output structure matches the IOB2 annotation scheme of the original training dataset; a sketch of decoding these tags into entity spans follows the list. The label IDs are as follows:

- 0: `O`
- 1: `B-DebtInstrumentBasisSpreadOnVariableRate1`
- 2: `B-DebtInstrumentFaceAmount`
- 3: `I-DebtInstrumentFaceAmount`
- 4: `I-LineOfCreditFacilityMaximumBorrowingCapacity`
- 5: `B-DebtInstrumentInterestRateStatedPercentage`
- 6: `I-DebtInstrumentBasisSpreadOnVariableRate1`
- 7: `I-DebtInstrumentInterestRateStatedPercentage`
- 8: `B-LineOfCreditFacilityMaximumBorrowingCapacity`
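
To make the IOB2 scheme concrete, here is a minimal sketch of how a predicted tag sequence can be grouped into entity spans. The `decode_iob2` helper and the example tokens/tags are illustrative, not part of the model's API:

```python
def decode_iob2(tokens, tags):
    """Group IOB2 tags into (entity_type, text) pairs."""
    spans, current = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # B- always opens a new entity
            if current:
                spans.append(current)
            current = (tag[2:], [token])
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            # I- continues the open entity of the same type
            current[1].append(token)
        else:
            # O (or an inconsistent I-) closes any open entity
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return [(label, " ".join(toks)) for label, toks in spans]

# Illustrative tag sequence for "principal amount of $ 3,700,000"
tokens = ["principal", "amount", "of", "$", "3,700,000"]
tags = ["O", "O", "O", "O", "B-DebtInstrumentFaceAmount"]
print(decode_iob2(tokens, tags))
# [('DebtInstrumentFaceAmount', '3,700,000')]
```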

## Running the model

A basic example of how to run the model and obtain the predicted label for each token in each text:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Prepare the label mapping for reference
int2str = {
    0: 'O',
    1: 'B-DebtInstrumentBasisSpreadOnVariableRate1',
    2: 'B-DebtInstrumentFaceAmount',
    3: 'I-DebtInstrumentFaceAmount',
    4: 'I-LineOfCreditFacilityMaximumBorrowingCapacity',
    5: 'B-DebtInstrumentInterestRateStatedPercentage',
    6: 'I-DebtInstrumentBasisSpreadOnVariableRate1',
    7: 'I-DebtInstrumentInterestRateStatedPercentage',
    8: 'B-LineOfCreditFacilityMaximumBorrowingCapacity',
}
str2int = {v: k for k, v in int2str.items()}

# Load the fine-tuned model and the base tokenizer
model = AutoModelForTokenClassification.from_pretrained(
    "brolaurens/finer-distilbert", num_labels=len(int2str), id2label=int2str, label2id=str2int
)
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased", model_max_length=512)

# Input texts
texts = [
    "Of the amount drawn, $ 3,721,583 was used to pay the principal amount of $ 3,700,000 and accrued interest of $ 21,583 due under the Company 's Loan Agreement with Capital Preservation Solutions, LLC entered into on September 4, 2015."
]

# Tokenize the input; padding and truncation keep batches of varying length safe
model_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

# Obtain model output and map each token to its most likely label
predictions = model(**model_input).logits
predictions = predictions.argmax(dim=-1)
predicted_labels = [[int2str[x] for x in t] for t in predictions.tolist()]
```
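
To inspect the output, the token strings can be recovered from the encoded input and paired with the predicted labels. This continues the example above and uses only the variables already defined:

```python
# Pair each (sub)token with its predicted label for the first text,
# printing only the tokens tagged as part of an entity
tokens = tokenizer.convert_ids_to_tokens(model_input["input_ids"][0])
for token, label in zip(tokens, predicted_labels[0]):
    if label != "O":
        print(f"{token}\t{label}")
```

Note that DistilBERT's WordPiece tokenizer may split long numbers into several sub-tokens; with the fast tokenizer loaded above, `model_input.word_ids(0)` can be used to map sub-token predictions back to the original words.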

## Training parameters

The model was trained using the following hyperparameters:

- base_model: distilbert/distilbert-base-uncased
- learning_rate: 2e-5
- batch_size: 32
- epochs: 3
- optimizer: AdamW
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- loss_function: cross-entropy loss
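
The training script is not included in this repository. The following sketch shows how the hyperparameters above would map onto the Hugging Face `TrainingArguments` API, whose `Trainer` uses AdamW and cross-entropy loss by default for token classification; it is an assumed reconstruction, not the actual script:

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the configuration listed above;
# AdamW with these betas/epsilon and cross-entropy loss are Trainer defaults.
training_args = TrainingArguments(
    output_dir="finer-distilbert",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
)
```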