# finer-distilbert
## Model description
**finer-distilbert** is a DistilBERT model fine-tuned for **Named Entity Recognition**. It is a proof-of-concept model trained to recognize the top 4 entity types in the nlpaueb/finer-139 dataset. Due to time constraints, the model has not undergone any hyperparameter tuning. The model's output follows the **IOB2** annotation scheme of the original training dataset. The label ids are as follows:
```
0: O
1: B-DebtInstrumentBasisSpreadOnVariableRate1
2: B-DebtInstrumentFaceAmount
3: I-DebtInstrumentFaceAmount
4: I-LineOfCreditFacilityMaximumBorrowingCapacity
5: B-DebtInstrumentInterestRateStatedPercentage
6: I-DebtInstrumentBasisSpreadOnVariableRate1
7: I-DebtInstrumentInterestRateStatedPercentage
8: B-LineOfCreditFacilityMaximumBorrowingCapacity
```
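For reference, here is a minimal sketch of how IOB2 tags decode into entity spans: `B-` marks the first token of an entity and `I-` marks its continuation. The `iob2_to_spans` helper below is illustrative only and not part of the model:
```python
def iob2_to_spans(tokens, tags):
    """Group IOB2 tags into (entity_type, token_span) tuples."""
    spans = []
    start, ent = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):                 # a new entity begins
            if ent is not None:
                spans.append((ent, tokens[start:i]))
            start, ent = i, tag[2:]
        elif tag.startswith("I-") and ent == tag[2:]:
            continue                             # the current entity continues
        else:                                    # "O" or a stray "I-" tag
            if ent is not None:
                spans.append((ent, tokens[start:i]))
            start, ent = None, None
    if ent is not None:
        spans.append((ent, tokens[start:]))
    return spans

tokens = ["principal", "amount", "of", "$", "3,700,000"]
tags = ["O", "O", "O", "B-DebtInstrumentFaceAmount", "I-DebtInstrumentFaceAmount"]
print(iob2_to_spans(tokens, tags))
# [('DebtInstrumentFaceAmount', ['$', '3,700,000'])]
```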
## Running the model
A basic example of how to run the model and obtain the predicted label for each token of each text:
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Prepare the label mappings for reference
int2str = {
    0: 'O',
    1: 'B-DebtInstrumentBasisSpreadOnVariableRate1',
    2: 'B-DebtInstrumentFaceAmount',
    3: 'I-DebtInstrumentFaceAmount',
    4: 'I-LineOfCreditFacilityMaximumBorrowingCapacity',
    5: 'B-DebtInstrumentInterestRateStatedPercentage',
    6: 'I-DebtInstrumentBasisSpreadOnVariableRate1',
    7: 'I-DebtInstrumentInterestRateStatedPercentage',
    8: 'B-LineOfCreditFacilityMaximumBorrowingCapacity',
}
str2int = {v: k for k, v in int2str.items()}

# Load the fine-tuned model and the base DistilBERT tokenizer
model = AutoModelForTokenClassification.from_pretrained(
    "brolaurens/finer-distilbert", num_labels=len(int2str), id2label=int2str, label2id=str2int
)
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased", model_max_length=512)

# Input text
texts = [
    "Of the amount drawn, $ 3,721,583 was used to pay the principal amount of $ 3,700,000 and accrued interest of $ 21,583 due under the Company 's Loan Agreement with Capital Preservation Solutions, LLC entered into on September 4, 2015."
]

# Tokenize the input (truncate to the model's 512-token limit)
model_input = tokenizer(texts, return_tensors='pt', truncation=True)

# Run inference and take the highest-scoring label per token
with torch.no_grad():
    logits = model(**model_input).logits
predictions = logits.argmax(dim=-1)

# Map label ids back to IOB2 tag strings; note these are per subword
# token and include the special [CLS] and [SEP] tokens
predicted_labels = [[int2str[x] for x in t] for t in predictions.tolist()]
```
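The tags above are assigned per subword token. To align them back to whole words, the `word_ids()` method of the fast tokenizer (the default for DistilBERT) can be used; the sketch below keeps the tag of each word's first subword, a common convention for IOB2 outputs:
```python
# Align subword predictions back to words by keeping only the first
# subword of each word; skip [CLS]/[SEP] (word_id is None)
tokens = tokenizer.convert_ids_to_tokens(model_input["input_ids"][0].tolist())
word_ids = model_input.word_ids(batch_index=0)
seen = set()
for token, word_id, tag in zip(tokens, word_ids, predicted_labels[0]):
    if word_id is None or word_id in seen:
        continue  # special token or subword continuation
    seen.add(word_id)
    if tag != "O":
        print(token, tag)  # first subword of each tagged word
```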
## Training parameters
The model was trained using the following hyperparameters:
```
base_model: distilbert/distilbert-base-uncased
learning_rate: 2e-5
batch_size: 32
epochs: 3
optimizer: adamw
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
loss_function: cross-entropy loss
```
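The exact training script is not published with this card. For orientation, an equivalent fine-tuning run with these hyperparameters using the 🤗 `Trainer` API might look roughly like the sketch below; `tokenized_train` is a placeholder for a pre-tokenized, label-aligned split of nlpaueb/finer-139, and `int2str`/`str2int` are the mappings defined earlier:
```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert/distilbert-base-uncased",
    num_labels=9, id2label=int2str, label2id=str2int,
)
args = TrainingArguments(
    output_dir="finer-distilbert",
    learning_rate=2e-5,              # listed learning_rate
    per_device_train_batch_size=32,  # listed batch_size
    num_train_epochs=3,              # listed epochs
    adam_beta1=0.9,                  # AdamW is the Trainer default
    adam_beta2=0.999,                # optimizer; cross-entropy is the
    adam_epsilon=1e-8,               # default token-classification loss
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_train,  # placeholder: preprocessed dataset
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```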