# finer-distilbert

## Model description

**finer-distilbert** is a fine-tuned distilbert model trained on the task of **Named Entity Recognition**. It is a proof-of-concept model trained to recognize the top 4 entity types in the nlpaueb/finer-139 dataset. Due to limited time, the model has not undergone any hyperparameter tuning.

The model's output structure matches the **IOB2** annotation scheme of the original training dataset. The label ids are as follows:

```
0: O
1: B-DebtInstrumentBasisSpreadOnVariableRate1
2: B-DebtInstrumentFaceAmount
3: I-DebtInstrumentFaceAmount
4: I-LineOfCreditFacilityMaximumBorrowingCapacity
5: B-DebtInstrumentInterestRateStatedPercentage
6: I-DebtInstrumentBasisSpreadOnVariableRate1
7: I-DebtInstrumentInterestRateStatedPercentage
8: B-LineOfCreditFacilityMaximumBorrowingCapacity
```

## Running the model

A basic example of how to run the model and obtain the predicted label for each token of each text:

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Label mapping for reference
int2str = {
    0: 'O',
    1: 'B-DebtInstrumentBasisSpreadOnVariableRate1',
    2: 'B-DebtInstrumentFaceAmount',
    3: 'I-DebtInstrumentFaceAmount',
    4: 'I-LineOfCreditFacilityMaximumBorrowingCapacity',
    5: 'B-DebtInstrumentInterestRateStatedPercentage',
    6: 'I-DebtInstrumentBasisSpreadOnVariableRate1',
    7: 'I-DebtInstrumentInterestRateStatedPercentage',
    8: 'B-LineOfCreditFacilityMaximumBorrowingCapacity',
}
str2int = {v: k for k, v in int2str.items()}

# Load the model and tokenizer
model = AutoModelForTokenClassification.from_pretrained(
    "brolaurens/finer-distilbert",
    num_labels=len(int2str),
    id2label=int2str,
    label2id=str2int,
)
tokenizer = AutoTokenizer.from_pretrained(
    "distilbert/distilbert-base-uncased", model_max_length=512
)

# Input texts
texts = [
    "Of the amount drawn, $ 3,721,583 was used to pay the principal amount of $ 3,700,000 and accrued interest of $ 21,583 due under the Company 's Loan Agreement with Capital Preservation Solutions, LLC entered into on September 4, 2015."
]

# Tokenize the input
model_input = tokenizer(texts, return_tensors='pt', truncation=True, padding=True)

# Obtain model output (inference only, so no gradients are needed)
with torch.no_grad():
    logits = model(**model_input).logits

# Pick the highest-scoring label id per token and map it back to its label name
predictions = logits.argmax(dim=-1)
predicted_labels = [[int2str[x] for x in t] for t in predictions.tolist()]
```

## Training parameters

The model was trained using the following hyperparameters:

```
base_model: distilbert/distilbert-base-uncased
learning_rate: 2e-5
batch_size: 32
epochs: 3
optimizer: adamw
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
loss_function: cross entropy loss
```
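The hyperparameters above map directly onto `transformers.TrainingArguments`. The sketch below is not taken from the original training script; the `output_dir` is a placeholder, and AdamW with these betas/epsilon and cross entropy loss are the library defaults for token classification.

```python
from transformers import TrainingArguments

# Hedged sketch of the listed hyperparameters expressed as TrainingArguments.
# "finer-distilbert-output" is a hypothetical output directory.
training_args = TrainingArguments(
    output_dir="finer-distilbert-output",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    # AdamW and cross entropy loss are the transformers defaults for
    # AutoModelForTokenClassification, so no extra settings are needed here.
)
```

## Entity-level predictions

For entity spans rather than per-token labels, the model can also be run through the `transformers` token-classification pipeline, which groups IOB2 token labels into spans. This is a minimal sketch, assuming the uploaded model config contains the id2label mapping listed above; if it does not, the labels have to be set on the config first as in the manual example.

```python
from transformers import pipeline

# Sketch: aggregate sub-word predictions into entity spans.
ner = pipeline(
    "token-classification",
    model="brolaurens/finer-distilbert",
    tokenizer="distilbert/distilbert-base-uncased",
    aggregation_strategy="simple",
)
print(ner("Of the amount drawn, $ 3,721,583 was used to pay the principal amount of $ 3,700,000."))
```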