Financial Entity Identification through NER and DistilBERT
1. Loading Dataset
The dataset used in this project is nlpaueb/finer-139, obtained from the Hugging Face Hub. It contains sentences from financial reports annotated for named entity recognition with 139 XBRL entity tags.
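Loading the dataset is a one-liner with the `datasets` library; the column names mentioned in the comment are the usual ones for this dataset but are worth verifying after loading:

```python
from datasets import load_dataset

# Load FiNER-139 from the Hugging Face Hub (downloaded on first use).
raw_datasets = load_dataset("nlpaueb/finer-139")

# The dataset ships with train/validation/test splits; each example
# contains pre-split word tokens and their encoded NER tag ids.
print(raw_datasets)
```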
2. Dataset Size Reduction
Because the full dataset is large, we reduce it to a manageable subset so that training completes in a reasonable time on available hardware. This step involves selecting smaller splits for training, validation, and testing.
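A subset can be taken with `Dataset.shuffle` and `Dataset.select`; the split sizes below are illustrative, not the exact ones used in this project:

```python
from datasets import load_dataset

raw_datasets = load_dataset("nlpaueb/finer-139")

# Shuffle before selecting so the subset stays representative;
# the sizes here are illustrative placeholders.
small_train = raw_datasets["train"].shuffle(seed=42).select(range(9000))
small_valid = raw_datasets["validation"].shuffle(seed=42).select(range(1000))
small_test = raw_datasets["test"].shuffle(seed=42).select(range(1000))
```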
3. Map Indices to Tags and Vice Versa
This section involves mapping indices to NER tag names and vice versa. These mappings are essential for converting between numerical indices and string representations of NER tags.
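The two mappings are plain dictionaries built from the dataset's tag list. The tag names below are a small illustrative sample; in the project the full list comes from the dataset's `ClassLabel` feature (e.g. `raw_datasets["train"].features["ner_tags"].feature.names`):

```python
# Illustrative FiNER-style tags; the real list has 139 entity types
# plus the "O" tag and B-/I- prefixes.
label_names = ["O", "B-Revenues", "I-Revenues", "B-Goodwill", "I-Goodwill"]

# index -> tag name, and tag name -> index
id2label = {i: name for i, name in enumerate(label_names)}
label2id = {name: i for i, name in enumerate(label_names)}

print(id2label[1])    # B-Revenues
print(label2id["O"])  # 0
```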
4. Mapping Encoded NER Tags to String Representations
Here, we convert the encoded NER tags to their string representations to facilitate better understanding and interpretation of the data.
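Decoding is a simple lookup through the `id2label` mapping from the previous section (again with illustrative tag names):

```python
# Illustrative mapping; the real one covers all FiNER-139 tags.
label_names = ["O", "B-Revenues", "I-Revenues"]
id2label = dict(enumerate(label_names))

# Convert one example's encoded tag ids back to readable strings.
encoded = [0, 1, 2, 0]
decoded = [id2label[i] for i in encoded]
print(decoded)  # ['O', 'B-Revenues', 'I-Revenues', 'O']
```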
5. Loading a Pre-trained Tokenizer
We load a pre-trained DistilBERT tokenizer from the Hugging Face Transformers library. The tokenizer splits the input text into subword tokens, a crucial preprocessing step for NER tasks.
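A sketch of the tokenizer setup; the checkpoint name is assumed (any DistilBERT checkpoint is loaded the same way). Because FiNER examples are already split into words, `is_split_into_words=True` preserves the word boundaries needed for label alignment:

```python
from transformers import AutoTokenizer

# Checkpoint name assumed; the project uses a DistilBERT tokenizer.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

encoding = tokenizer(
    ["Revenues", "increased", "to", "$", "5.3", "million"],
    is_split_into_words=True,
)
print(encoding.tokens())    # subword tokens, incl. [CLS]/[SEP]
print(encoding.word_ids())  # which word each subword came from
```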
6. Align Labels with Tokens
This section describes the process of aligning labels with tokens in tokenized sequences. It ensures that each label corresponds accurately to its respective token in the tokenized input sequence.
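Because a single word can be split into several subwords, there are more tokens than labels. A common scheme (one of several; another variant copies the word's tag to every subword) keeps the label on the first subword and masks the rest with -100, which the loss function ignores:

```python
def align_labels_with_tokens(labels, word_ids):
    """labels: one tag id per word; word_ids: one word index per subword
    token (None for special tokens), as returned by word_ids()."""
    new_labels = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            new_labels.append(-100)             # special token: ignored by loss
        elif word_id != previous:
            new_labels.append(labels[word_id])  # first subword keeps the tag
        else:
            new_labels.append(-100)             # later subwords are masked
        previous = word_id
    return new_labels

# Two words, the second split into two subwords, plus [CLS]/[SEP].
print(align_labels_with_tokens([1, 0], [None, 0, 1, 1, None]))
# [-100, 1, 0, -100, -100]
```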
7. Create Batches of Tokenized Input Data
We use a DataCollatorForTokenClassification to create batches of tokenized input data for token classification. The collator dynamically pads each batch to a common length, preparing the data for training and evaluation of the NER model.
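A sketch of the collator in action (checkpoint name and the toy token ids are assumptions). Input ids are padded with the tokenizer's pad token, and labels are padded with -100 so the padding never contributes to the loss:

```python
from transformers import AutoTokenizer, DataCollatorForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# Two features of unequal length; ids here are illustrative.
batch = data_collator([
    {"input_ids": [101, 7142, 102], "labels": [-100, 1, -100]},
    {"input_ids": [101, 7142, 2003, 102], "labels": [-100, 1, 0, -100]},
])
print(batch["labels"])  # second row padded with -100 to match lengths
```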
8. Evaluation Metrics
Here, we install and use the seqeval library to compute precision, recall, F1 score, and accuracy for evaluating the performance of the NER model.
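A sketch of a `compute_metrics` function built on seqeval (the label list is illustrative; the real one comes from the dataset features). The -100 positions introduced during label alignment and padding must be dropped before scoring:

```python
import numpy as np
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

# Illustrative label list; the real one covers all FiNER-139 tags.
label_names = ["O", "B-Revenues", "I-Revenues"]

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Drop -100 positions (special tokens / padding) before scoring.
    true_labels = [[label_names[l] for l in row if l != -100] for row in labels]
    true_preds = [
        [label_names[p] for p, l in zip(prow, lrow) if l != -100]
        for prow, lrow in zip(predictions, labels)
    ]
    return {
        "precision": precision_score(true_labels, true_preds),
        "recall": recall_score(true_labels, true_preds),
        "f1": f1_score(true_labels, true_preds),
        "accuracy": accuracy_score(true_labels, true_preds),
    }
```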
9. Setup Data Pipeline for Checkpointing
We set up a checkpointing pipeline that saves the model weights, configuration, and tokenizer files to a folder, from which they can be deployed to the Hugging Face Hub.
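A minimal sketch of the checkpoint folder setup; the folder name is hypothetical, and the commented calls assume the model and tokenizer defined in the later sections:

```python
# Hypothetical local folder; everything needed for deployment
# (weights, config, tokenizer files) is written here.
output_dir = "distilbert-finer139"

# After training (model and tokenizer are defined in later sections):
# model.save_pretrained(output_dir)
# tokenizer.save_pretrained(output_dir)
```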
10. Define Model
We define the NER model using AutoModelForTokenClassification from the Hugging Face Transformers library. The model is initialized with pre-trained DistilBERT weights and configured with the label mappings for token classification.
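A sketch of the model definition; the checkpoint name is assumed, and the label maps are illustrative stand-ins for the full FiNER-139 tag set built earlier:

```python
from transformers import AutoModelForTokenClassification

# Illustrative label maps; the real ones cover all FiNER-139 tags.
label_names = ["O", "B-Revenues", "I-Revenues"]
id2label = dict(enumerate(label_names))
label2id = {name: i for i, name in id2label.items()}

# A freshly initialized token-classification head is added on top
# of the pre-trained DistilBERT encoder.
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased",  # checkpoint name assumed
    id2label=id2label,
    label2id=label2id,
)
```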
11. Setting up Training Arguments
This section involves setting up training arguments such as learning rate, number of training epochs, and weight decay for training the NER model.
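The arguments are collected in a TrainingArguments object; the hyperparameter values below are illustrative, not the exact ones used in this project:

```python
from transformers import TrainingArguments

# Hyperparameter values are illustrative placeholders.
training_args = TrainingArguments(
    output_dir="distilbert-finer139",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
)
```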
12. Training the Model
We train the NER model using the defined model, training arguments, data collator, tokenizer, and evaluation metrics.
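A sketch of the training step, assembling the pieces from the previous sections (the names `model`, `training_args`, `data_collator`, `tokenizer`, and `compute_metrics` are assumed from those sketches, and the datasets are assumed to have been tokenized and label-aligned with `Dataset.map` beforehand):

```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train,   # tokenized, label-aligned subset
    eval_dataset=small_valid,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```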
13. Deployment and Conclusion
The final section concludes the project, mentioning the training duration, achieved accuracy, and deployment on Hugging Face. It also outlines any further steps or observations.
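Deployment can be sketched as a single call, assuming the `trainer` from the previous section and a prior `huggingface-cli login`:

```python
# Push the trained model, tokenizer, and config to the Hugging Face Hub;
# the repository name is derived from output_dir.
trainer.push_to_hub()
```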