Financial Entity Identification through NER and DistilBERT
1. Loading Dataset
The dataset used in this project is nlpaueb/finer-139, obtained from the Hugging Face Hub. It contains sentences from financial reports annotated for named entity recognition with 139 XBRL entity tags.
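Loading the dataset is a one-liner with the `datasets` library; the column names mentioned in the comment are the usual ones for this dataset but are worth verifying after loading:

```python
from datasets import load_dataset

# Load FiNER-139 from the Hugging Face Hub (downloaded on first use).
raw_datasets = load_dataset("nlpaueb/finer-139")

# The dataset ships with train/validation/test splits; each example
# contains pre-split word tokens and their encoded NER tag ids.
print(raw_datasets)
```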
2. Dataset Size Reduction
Because the full dataset is large, we reduce it to a manageable subset so that training completes in a reasonable time on available hardware. This step involves selecting smaller splits for training, validation, and testing.
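A subset can be taken with `Dataset.shuffle` and `Dataset.select`; the split sizes below are illustrative, not the exact ones used in this project:

```python
from datasets import load_dataset

raw_datasets = load_dataset("nlpaueb/finer-139")

# Shuffle before selecting so the subset stays representative;
# the sizes here are illustrative placeholders.
small_train = raw_datasets["train"].shuffle(seed=42).select(range(9000))
small_valid = raw_datasets["validation"].shuffle(seed=42).select(range(1000))
small_test = raw_datasets["test"].shuffle(seed=42).select(range(1000))
```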
3. Map Indices to Tags and Vice Versa
This section involves mapping indices to NER tag names and vice versa. These mappings are essential for converting between numerical indices and string representations of NER tags.
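The two mappings are plain dictionaries built from the dataset's tag list. The tag names below are a small illustrative sample; in the project the full list comes from the dataset's `ClassLabel` feature (e.g. `raw_datasets["train"].features["ner_tags"].feature.names`):

```python
# Illustrative FiNER-style tags; the real list has 139 entity types
# plus the "O" tag and B-/I- prefixes.
label_names = ["O", "B-Revenues", "I-Revenues", "B-Goodwill", "I-Goodwill"]

# index -> tag name, and tag name -> index
id2label = {i: name for i, name in enumerate(label_names)}
label2id = {name: i for i, name in enumerate(label_names)}

print(id2label[1])    # B-Revenues
print(label2id["O"])  # 0
```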
4. Mapping Encoded NER Tags to String Representations
Here, we convert the encoded NER tags to their string representations to facilitate better understanding and interpretation of the data.
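Decoding is a simple lookup through the `id2label` mapping from the previous section (again with illustrative tag names):

```python
# Illustrative mapping; the real one covers all FiNER-139 tags.
label_names = ["O", "B-Revenues", "I-Revenues"]
id2label = dict(enumerate(label_names))

# Convert one example's encoded tag ids back to readable strings.
encoded = [0, 1, 2, 0]
decoded = [id2label[i] for i in encoded]
print(decoded)  # ['O', 'B-Revenues', 'I-Revenues', 'O']
```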
5. Loading a Pre-trained Tokenizer
We load a pre-trained DistilBERT tokenizer from the Hugging Face Transformers library. The tokenizer splits the input text into subword tokens, a crucial preprocessing step for NER tasks.
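A sketch of the tokenizer setup; the checkpoint name is assumed (any DistilBERT checkpoint is loaded the same way). Because FiNER examples are already split into words, `is_split_into_words=True` preserves the word boundaries needed for label alignment:

```python
from transformers import AutoTokenizer

# Checkpoint name assumed; the project uses a DistilBERT tokenizer.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

encoding = tokenizer(
    ["Revenues", "increased", "to", "$", "5.3", "million"],
    is_split_into_words=True,
)
print(encoding.tokens())    # subword tokens, incl. [CLS]/[SEP]
print(encoding.word_ids())  # which word each subword came from
```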
6. Align Labels with Tokens
This section describes the process of aligning labels with tokens in tokenized sequences. It ensures that each label corresponds accurately to its respective token in the tokenized input sequence.
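Because a single word can be split into several subwords, there are more tokens than labels. A common scheme (one of several; another variant copies the word's tag to every subword) keeps the label on the first subword and masks the rest with -100, which the loss function ignores:

```python
def align_labels_with_tokens(labels, word_ids):
    """labels: one tag id per word; word_ids: one word index per subword
    token (None for special tokens), as returned by word_ids()."""
    new_labels = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            new_labels.append(-100)             # special token: ignored by loss
        elif word_id != previous:
            new_labels.append(labels[word_id])  # first subword keeps the tag
        else:
            new_labels.append(-100)             # later subwords are masked
        previous = word_id
    return new_labels

# Two words, the second split into two subwords, plus [CLS]/[SEP].
print(align_labels_with_tokens([1, 0], [None, 0, 1, 1, None]))
# [-100, 1, 0, -100, -100]
```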
7. Create Batches of Tokenized Input Data
We use a DataCollatorForTokenClassification to create batches of tokenized input data for token classification. The collator dynamically pads each batch to a common length, preparing the data for training and evaluation of the NER model.
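A sketch of the collator in action (checkpoint name and the toy token ids are assumptions). Input ids are padded with the tokenizer's pad token, and labels are padded with -100 so the padding never contributes to the loss:

```python
from transformers import AutoTokenizer, DataCollatorForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# Two features of unequal length; ids here are illustrative.
batch = data_collator([
    {"input_ids": [101, 7142, 102], "labels": [-100, 1, -100]},
    {"input_ids": [101, 7142, 2003, 102], "labels": [-100, 1, 0, -100]},
])
print(batch["labels"])  # second row padded with -100 to match lengths
```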
8. Evaluation Metrics
Here, we install and use the seqeval library to compute precision, recall, F1 score, and accuracy for evaluating the performance of the NER model.
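A sketch of a `compute_metrics` function built on seqeval (the label list is illustrative; the real one comes from the dataset features). The -100 positions introduced during label alignment and padding must be dropped before scoring:

```python
import numpy as np
from seqeval.metrics import accuracy_score, f1_score, precision_score, recall_score

# Illustrative label list; the real one covers all FiNER-139 tags.
label_names = ["O", "B-Revenues", "I-Revenues"]

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    # Drop -100 positions (special tokens / padding) before scoring.
    true_labels = [[label_names[l] for l in row if l != -100] for row in labels]
    true_preds = [
        [label_names[p] for p, l in zip(prow, lrow) if l != -100]
        for prow, lrow in zip(predictions, labels)
    ]
    return {
        "precision": precision_score(true_labels, true_preds),
        "recall": recall_score(true_labels, true_preds),
        "f1": f1_score(true_labels, true_preds),
        "accuracy": accuracy_score(true_labels, true_preds),
    }
```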
9. Setup Data Pipeline for Checkpointing
We set up a checkpointing pipeline that saves the model weights, configuration, and tokenizer files to a folder, from which they can be deployed to the Hugging Face Hub.
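A minimal sketch of the checkpoint folder setup; the folder name is hypothetical, and the commented calls assume the model and tokenizer defined in the later sections:

```python
# Hypothetical local folder; everything needed for deployment
# (weights, config, tokenizer files) is written here.
output_dir = "distilbert-finer139"

# After training (model and tokenizer are defined in later sections):
# model.save_pretrained(output_dir)
# tokenizer.save_pretrained(output_dir)
```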
10. Define Model
We define the NER model using AutoModelForTokenClassification from the Hugging Face Transformers library. The model is initialized with pre-trained DistilBERT weights and configured with the label mappings for token classification.
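A sketch of the model definition; the checkpoint name is assumed, and the label maps are illustrative stand-ins for the full FiNER-139 tag set built earlier:

```python
from transformers import AutoModelForTokenClassification

# Illustrative label maps; the real ones cover all FiNER-139 tags.
label_names = ["O", "B-Revenues", "I-Revenues"]
id2label = dict(enumerate(label_names))
label2id = {name: i for i, name in id2label.items()}

# A freshly initialized token-classification head is added on top
# of the pre-trained DistilBERT encoder.
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-uncased",  # checkpoint name assumed
    id2label=id2label,
    label2id=label2id,
)
```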
11. Setting up Training Arguments
This section involves setting up training arguments such as learning rate, number of training epochs, and weight decay for training the NER model.
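The arguments are collected in a TrainingArguments object; the hyperparameter values below are illustrative, not the exact ones used in this project:

```python
from transformers import TrainingArguments

# Hyperparameter values are illustrative placeholders.
training_args = TrainingArguments(
    output_dir="distilbert-finer139",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
    save_strategy="epoch",
)
```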
12. Training the Model
We train the NER model using the defined model, training arguments, data collator, tokenizer, and evaluation metrics.
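A sketch of the training step, assembling the pieces from the previous sections (the names `model`, `training_args`, `data_collator`, `tokenizer`, and `compute_metrics` are assumed from those sketches, and the datasets are assumed to have been tokenized and label-aligned with `Dataset.map` beforehand):

```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train,   # tokenized, label-aligned subset
    eval_dataset=small_valid,
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
```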
13. Deployment and Conclusion
The final section concludes the project, mentioning the training duration, achieved accuracy, and deployment on Hugging Face. It also outlines any further steps or observations.
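Deployment can be sketched as a single call, assuming the `trainer` from the previous section and a prior `huggingface-cli login`:

```python
# Push the trained model, tokenizer, and config to the Hugging Face Hub;
# the repository name is derived from output_dir.
trainer.push_to_hub()
```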