	# finer-distilbert

	## Model description

	finer-distilbert is a DistilBERT model fine-tuned for Named Entity Recognition (NER). It is a proof-of-concept model trained to recognize the top 4 entity types in the nlpaueb/finer-139 dataset. Due to limited time, the model has not undergone any hyperparameter tuning. The model's output follows the IOB2 annotation scheme of the original training dataset. The label ids are as follows:
	```
	0: O
	1: B-DebtInstrumentBasisSpreadOnVariableRate1
	2: B-DebtInstrumentFaceAmount
	3: I-DebtInstrumentFaceAmount
	4: I-LineOfCreditFacilityMaximumBorrowingCapacity
	5: B-DebtInstrumentInterestRateStatedPercentage
	6: I-DebtInstrumentBasisSpreadOnVariableRate1
	7: I-DebtInstrumentInterestRateStatedPercentage
	8: B-LineOfCreditFacilityMaximumBorrowingCapacity
	```
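To make the IOB2 scheme concrete, here is a minimal sketch of how a sequence of these label ids could be grouped into entity spans. The helper `iob2_to_spans` and the example id sequence are illustrative assumptions and not part of the model or its training code. Treating a stray `I-` tag as the start of a new span is a lenient choice made for this sketch.

```python
# Label ids as listed in the model description.
ID2LABEL = {
    0: "O",
    1: "B-DebtInstrumentBasisSpreadOnVariableRate1",
    2: "B-DebtInstrumentFaceAmount",
    3: "I-DebtInstrumentFaceAmount",
    4: "I-LineOfCreditFacilityMaximumBorrowingCapacity",
    5: "B-DebtInstrumentInterestRateStatedPercentage",
    6: "I-DebtInstrumentBasisSpreadOnVariableRate1",
    7: "I-DebtInstrumentInterestRateStatedPercentage",
    8: "B-LineOfCreditFacilityMaximumBorrowingCapacity",
}


def iob2_to_spans(labels):
    """Group IOB2 tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    current_type, start = None, None
    for i, label in enumerate(labels):
        prefix, _, entity = label.partition("-")
        # An I- tag of the same type extends the span that is currently open.
        if prefix == "I" and entity == current_type:
            continue
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            spans.append((current_type, start, i))
            current_type, start = None, None
        # B- opens a new span; a stray I- is treated leniently as an opening tag.
        if prefix in ("B", "I"):
            current_type, start = entity, i
    if current_type is not None:
        spans.append((current_type, start, len(labels)))
    return spans


# Hypothetical id sequence, hand-written for illustration (not real model output).
example_ids = [0, 2, 3, 0, 5, 7, 7]
example_labels = [ID2LABEL[i] for i in example_ids]
spans = iob2_to_spans(example_labels)
```

Each span is reported as token positions, so `(entity_type, 1, 3)` covers tokens 1 and 2.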

	## Running the model
	A basic example of how to run the model and obtain the predicted label of every token in each text:


	```python
	import torch
	from transformers import AutoTokenizer, AutoModelForTokenClassification

	# Label ids and their IOB2 tags, for reference
	int2str = {
	    0: 'O',
	    1: 'B-DebtInstrumentBasisSpreadOnVariableRate1',
	    2: 'B-DebtInstrumentFaceAmount',
	    3: 'I-DebtInstrumentFaceAmount',
	    4: 'I-LineOfCreditFacilityMaximumBorrowingCapacity',
	    5: 'B-DebtInstrumentInterestRateStatedPercentage',
	    6: 'I-DebtInstrumentBasisSpreadOnVariableRate1',
	    7: 'I-DebtInstrumentInterestRateStatedPercentage',
	    8: 'B-LineOfCreditFacilityMaximumBorrowingCapacity',
	}

	str2int = {v: k for k, v in int2str.items()}

	# Load the fine-tuned model and the tokenizer of its base model
	model = AutoModelForTokenClassification.from_pretrained(
	    "brolaurens/finer-distilbert",
	    num_labels=len(int2str),
	    id2label=int2str,
	    label2id=str2int,
	)

	tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased", model_max_length=512)

	# Input texts
	texts = [
	    "Of the amount drawn, $ 3,721,583 was used to pay the principal amount of $ 3,700,000 and accrued interest of $ 21,583 due under the Company 's Loan Agreement with Capital Preservation Solutions, LLC entered into on September 4, 2015."
	]

	# Tokenize input; padding and truncation allow batches of texts with different lengths
	model_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

	# Obtain model output without tracking gradients
	with torch.no_grad():
	    logits = model(**model_input).logits

	# Pick the highest-scoring label id for every token (this includes [CLS], [SEP] and padding tokens)
	predictions = logits.argmax(dim=2)
	predicted_labels = [[int2str[x] for x in t] for t in predictions.tolist()]
	```
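The predicted labels are per word piece and also cover the special tokens, so they usually need to be mapped back to words before use. The sketch below shows one common convention: drop the special tokens and label each word by its first word piece. The helper `merge_wordpieces` and the token and label lists are hand-written assumptions for illustration, not real model output.

```python
# Special tokens used by BERT-style uncased tokenizers.
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_wordpieces(tokens, labels):
    """Return (word, label) pairs, labelling each word by its first word piece."""
    words = []
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith("##") and words:
            # Continuation piece: append it to the previous word, keep that word's label.
            word, first_label = words[-1]
            words[-1] = (word + token[2:], first_label)
        else:
            words.append((token, label))
    return words


# Hypothetical word pieces and labels (illustrative only).
tokens = ["[CLS]", "accrued", "interest", "of", "$", "21", "##58", "##3", "[SEP]"]
labels = [
    "O", "O", "O", "O", "O",
    "B-DebtInstrumentFaceAmount",
    "I-DebtInstrumentFaceAmount",
    "I-DebtInstrumentFaceAmount",
    "O",
]
words = merge_wordpieces(tokens, labels)
```

With the example above, the tokens of the first text can be obtained with `tokenizer.convert_ids_to_tokens(model_input["input_ids"][0].tolist())` and paired with `predicted_labels[0]`.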

	## Training parameters

	The model was trained using the following hyperparameters:

	```
	base_model: distilbert/distilbert-base-uncased
	learning_rate: 2e-5
	batch_size: 32
	epochs: 3
	optimizer: adamw
	adam_beta1: 0.9
	adam_beta2: 0.999
	adam_epsilon: 1e-08
	loss_function: cross entropy loss
	```
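To illustrate what the optimizer settings mean, here is a minimal pure-Python sketch of a single AdamW update on one scalar parameter. It is not the training code used for this model. The weight decay value is an assumption, since the list above does not state one, and the parameter and gradient values are arbitrary.

```python
import math

# Values copied from the hyperparameter list above.
LEARNING_RATE = 2e-5
BETA1, BETA2, EPSILON = 0.9, 0.999, 1e-08
# Assumption: no weight decay is listed in the model card, so 0.0 is used here.
WEIGHT_DECAY = 0.0


def adamw_step(param, grad, m, v, step):
    """Apply one AdamW update to a scalar parameter; `step` starts at 1."""
    # Decoupled weight decay shrinks the parameter directly.
    param -= LEARNING_RATE * WEIGHT_DECAY * param
    # Exponential moving averages of the gradient and of its square.
    m = BETA1 * m + (1 - BETA1) * grad
    v = BETA2 * v + (1 - BETA2) * grad * grad
    # Bias correction compensates for m and v starting at zero.
    m_hat = m / (1 - BETA1 ** step)
    v_hat = v / (1 - BETA2 ** step)
    param -= LEARNING_RATE * m_hat / (math.sqrt(v_hat) + EPSILON)
    return param, m, v


# Arbitrary starting parameter and gradient, for illustration only.
new_param, m, v = adamw_step(param=0.5, grad=0.25, m=0.0, v=0.0, step=1)
change = 0.5 - new_param
```

On the first step the bias-corrected moments cancel the gradient's scale, so `change` is very close to the learning rate of 2e-5 whatever the gradient magnitude.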