# finer-distilbert

## Model description

**finer-distilbert** is a fine-tuned DistilBERT model trained on the task of **Named Entity Recognition**. It is a proof-of-concept model trained to recognize the top 4 entity types in the nlpaueb/finer-139 dataset. Due to limited time, the model has not undergone any hyperparameter tuning. The model's output structure matches the **IOB2** annotation scheme of the original training dataset: each entity type has a `B-` (beginning) and an `I-` (inside) label, and `O` marks tokens outside any entity. The label ids are as follows:

```
0: O
1: B-DebtInstrumentBasisSpreadOnVariableRate1
2: B-DebtInstrumentFaceAmount
3: I-DebtInstrumentFaceAmount
4: I-LineOfCreditFacilityMaximumBorrowingCapacity
5: B-DebtInstrumentInterestRateStatedPercentage
6: I-DebtInstrumentBasisSpreadOnVariableRate1
7: I-DebtInstrumentInterestRateStatedPercentage
8: B-LineOfCreditFacilityMaximumBorrowingCapacity
```
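
To make the IOB2 scheme concrete, the following is a minimal sketch of how a sequence of these labels could be grouped into entity spans. The helper name `iob2_to_spans` is hypothetical and not part of this model or of any library. The sketch is lenient by assumption: an `I-` tag that does not continue an open span of the same type is treated as the start of a new span.

```python
from typing import List, Tuple


def iob2_to_spans(labels: List[str]) -> List[Tuple[str, int, int]]:
    """Group IOB2 labels into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    current_type, start = None, None
    for i, label in enumerate(labels):
        # "B-Foo" -> ("B", "Foo"); "O" -> ("O", "")
        prefix, _, entity = label.partition("-")
        # An I- tag continues a span only if it matches the open span's type
        continues = prefix == "I" and entity == current_type
        if not continues:
            # Close the open span, if any
            if current_type is not None:
                spans.append((current_type, start, i))
                current_type, start = None, None
            # B- opens a span; a stray I- is leniently treated as an opening tag
            if prefix in ("B", "I"):
                current_type, start = entity, i
    # Close a span that runs to the end of the sequence
    if current_type is not None:
        spans.append((current_type, start, len(labels)))
    return spans


example = [
    "O",
    "B-DebtInstrumentFaceAmount",
    "I-DebtInstrumentFaceAmount",
    "O",
    "B-DebtInstrumentInterestRateStatedPercentage",
]
spans = iob2_to_spans(example)
```

For the example sequence, `spans` holds one `DebtInstrumentFaceAmount` span covering positions 1 to 2 and one `DebtInstrumentInterestRateStatedPercentage` span at position 4.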

## Running the model

A basic example of how to run the model and obtain the predicted label for each token of each text:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Label mapping for reference
int2str = {
    0: 'O',
    1: 'B-DebtInstrumentBasisSpreadOnVariableRate1',
    2: 'B-DebtInstrumentFaceAmount',
    3: 'I-DebtInstrumentFaceAmount',
    4: 'I-LineOfCreditFacilityMaximumBorrowingCapacity',
    5: 'B-DebtInstrumentInterestRateStatedPercentage',
    6: 'I-DebtInstrumentBasisSpreadOnVariableRate1',
    7: 'I-DebtInstrumentInterestRateStatedPercentage',
    8: 'B-LineOfCreditFacilityMaximumBorrowingCapacity',
}
str2int = {v: k for k, v in int2str.items()}

# Load the fine-tuned model and the tokenizer of its base model
model = AutoModelForTokenClassification.from_pretrained(
    "brolaurens/finer-distilbert",
    num_labels=len(int2str),
    id2label=int2str,
    label2id=str2int,
)
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased", model_max_length=512)

# Input texts
texts = [
    "Of the amount drawn, $ 3,721,583 was used to pay the principal amount of $ 3,700,000 and accrued interest of $ 21,583 due under the Company 's Loan Agreement with Capital Preservation Solutions, LLC entered into on September 4, 2015."
]

# Tokenize input; padding and truncation allow several texts to be batched
model_input = tokenizer(texts, return_tensors='pt', padding=True, truncation=True)

# Obtain model output; gradients are not needed for inference
with torch.no_grad():
    logits = model(**model_input).logits

# Pick the highest-scoring label id per token.
# The result also covers special tokens ([CLS], [SEP]) and padding positions.
predictions = logits.argmax(dim=-1)
predicted_labels = [[int2str[x] for x in t] for t in predictions.tolist()]
```
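
The predictions above are per sub-word token, so a single word may receive several labels. A common way to get one label per word is to keep only the label of each word's first sub-word token. The sketch below illustrates this under a few assumptions: the helper name `word_level_labels` is hypothetical, the tokenizer is a fast tokenizer (so that `model_input.word_ids(batch_index=0)` is available), and "word" means a unit produced by the tokenizer's own pre-tokenization, which splits on whitespace and punctuation.

```python
from typing import List, Optional


def word_level_labels(word_ids: List[Optional[int]], token_labels: List[str]) -> List[str]:
    """Keep the label of the first sub-word token of each word.

    Positions whose word id is None (special tokens, padding) are skipped.
    """
    labels = []
    previous = None
    for word_id, label in zip(word_ids, token_labels):
        # A new word starts whenever the word id changes to a non-None value
        if word_id is not None and word_id != previous:
            labels.append(label)
        previous = word_id
    return labels


# Toy input shaped like the output of `model_input.word_ids(batch_index=0)`:
# [CLS], one-token word, two-token word, one-token word, [SEP]
toy_word_ids = [None, 0, 1, 1, 2, None]
toy_token_labels = [
    "O",
    "O",
    "B-DebtInstrumentFaceAmount",
    "I-DebtInstrumentFaceAmount",
    "O",
    "O",
]
toy_word_labels = word_level_labels(toy_word_ids, toy_token_labels)
```

With the real model, the same helper could be called as `word_level_labels(model_input.word_ids(batch_index=0), predicted_labels[0])`.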

## Training parameters

The model was trained using the following hyperparameters:

```
base_model: distilbert/distilbert-base-uncased
learning_rate: 2e-5
batch_size: 32
epochs: 3
optimizer: adamw
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
loss_function: cross entropy loss
```
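
For readers who want to reproduce a similar setup with the `transformers` `Trainer`, the values above could be expressed as keyword arguments for `transformers.TrainingArguments`, roughly as sketched below. This mapping is an assumption, not the original training script: in particular, it assumes that `batch_size` is the per-device batch size and that `adamw` refers to the PyTorch AdamW implementation (`optim="adamw_torch"`).

```python
# Hypothetical mapping of the listed hyperparameters onto
# transformers.TrainingArguments keyword names (an assumption, see above).
training_kwargs = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 32,  # assumes batch_size is per device
    "num_train_epochs": 3,
    "optim": "adamw_torch",  # assumes the PyTorch AdamW implementation
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
}

# Possible usage (the output directory name is a placeholder):
#   from transformers import TrainingArguments
#   args = TrainingArguments(output_dir="finer-distilbert-out", **training_kwargs)
```

No explicit loss setting appears in the sketch because token classification models in `transformers` compute a cross-entropy loss by default when labels are passed to the model.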