---
license: apache-2.0
base_model: bert-base-cased
tags:
- generated_from_trainer
- bert-finetuned
- Named Entity Recognition
- NER
datasets:
- conll2003
metrics:
- precision
- recall
- f1
- accuracy
model-index:
- name: bert-finetuned-ner
  results:
  - task:
      name: Token Classification
      type: token-classification
    dataset:
      name: conll2003
      type: conll2003
      config: conll2003
      split: validation
      args: conll2003
    metrics:
    - name: Precision
      type: precision
      value: 0.9346783529022656
    - name: Recall
      type: recall
      value: 0.9511948838774823
    - name: F1
      type: f1
      value: 0.9428642922679124
    - name: Accuracy
      type: accuracy
      value: 0.9863572143403779
pipeline_tag: token-classification
language:
- en
---

# bert-finetuned-ner

## Model Description
This model is a Named Entity Recognition (NER) model fine-tuned from `bert-base-cased` (PyTorch) on the CoNLL-2003 dataset. It identifies and classifies named entities in English text into four categories: persons (PER), organizations (ORG), locations (LOC), and miscellaneous entities (MISC).
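
The label inventory actually shipped with the checkpoint can be read from its configuration. A quick check, assuming the hosted model id used in the usage example below (depending on how the training script populated the config, this prints either the B-/I- entity tags or generic `LABEL_N` names):

```python
from transformers import AutoConfig

# Inspect the id-to-label mapping stored in the checkpoint's config.
config = AutoConfig.from_pretrained("Ashaduzzaman/bert-finetuned-ner")
print(config.id2label)
```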

## Intended Uses & Limitations
**Intended Uses:**
- **Text Analysis:** This model can be used for extracting named entities from unstructured text data, which is useful in various NLP tasks such as information retrieval, content categorization, and automated summarization.
- **English NER:** Designed specifically for extracting named entities from English text.

**Limitations:**
- **Language Dependency:** The model is trained on English data and may not perform well on texts in other languages.
- **Domain Specificity:** Performance may degrade on text from domains significantly different from the training data.
- **Error Propagation:** Incorrect predictions may propagate to downstream tasks, affecting overall performance.

## How to Use
To use this model, load it with the Hugging Face Transformers library. The example below runs inference through the `pipeline` API:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the tokenizer and model from the Hub
tokenizer = AutoTokenizer.from_pretrained("Ashaduzzaman/bert-finetuned-ner")
model = AutoModelForTokenClassification.from_pretrained("Ashaduzzaman/bert-finetuned-ner")

# Create a pipeline for NER. By default the output is per (sub-)token;
# pass aggregation_strategy="simple" to merge sub-word pieces into whole entities.
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

# Example inference
text = "Hugging Face Inc. is based in New York City."
entities = ner_pipeline(text)

print(entities)
```

### Troubleshooting
If the model isn't performing as expected, consider the following:
- Ensure the input text is in English; the model was trained only on English data.
- Filter out low-confidence predictions by thresholding the `score` field the pipeline returns for each entity (see the sketch below).
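
The NER pipeline returns a `score` for every predicted entity, so thresholding is a one-line filter on its output. A minimal sketch, continuing from the usage example above (the 0.90 cut-off is illustrative, not a value recommended by the author):

```python
# Drop predictions the model is not confident about.
# 0.90 is an arbitrary illustrative threshold; tune it on your own data.
CONFIDENCE_THRESHOLD = 0.90
confident_entities = [e for e in entities if e["score"] >= CONFIDENCE_THRESHOLD]
print(confident_entities)
```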

## Limitations and Bias
- **Bias in Data:** The model is trained on the CoNLL-2003 dataset, which may contain biases related to the sources of the text. The model might underperform on entities not well represented in the training data.
- **Overfitting:** The model may overfit to the specific entities present in the CoNLL-2003 dataset, affecting its generalization to new entities or text styles.

## Training Data
The model was trained on the CoNLL-2003 dataset, a widely used benchmark dataset for NER tasks. The dataset contains annotated text from news articles, with labels for persons, organizations, locations, and miscellaneous entities.
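
For reference, the dataset can be loaded directly from the Hugging Face Hub with the Datasets library; this is a generic illustration of the data format, not the author's preprocessing code:

```python
from datasets import load_dataset

# CoNLL-2003 as distributed on the Hugging Face Hub.
# Depending on your datasets version you may need trust_remote_code=True.
raw_datasets = load_dataset("conll2003")

example = raw_datasets["train"][0]
print(example["tokens"])    # the sentence as a list of words
print(example["ner_tags"])  # integer ids for O, B-PER, I-PER, B-ORG, ...
```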

## Training Procedure
The model was fine-tuned using the pre-trained BERT model (`bert-base-cased`) with a token classification head for NER. The training process involved:
- **Optimizer:** AdamW with betas=(0.9, 0.999) and epsilon=1e-08
- **Learning Rate:** 2e-05 with a linear learning-rate schedule
- **Batch Size:** 8 for both training and evaluation
- **Epochs:** 3
- **Evaluation:** Performance was measured on the CoNLL-2003 validation set using precision, recall, F1, and accuracy.

### Training Hyperparameters
The following hyperparameters were used; a minimal `Trainer` sketch that plugs these values in is shown after the list.
- **Learning Rate:** 2e-05
- **Batch Size (train/eval):** 8/8
- **Seed:** 42
- **Optimizer:** AdamW with betas=(0.9, 0.999) and epsilon=1e-08
- **LR Scheduler Type:** Linear
- **Number of Epochs:** 3
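
The sketch below maps these hyperparameters onto the standard Transformers `Trainer` recipe for token classification. It is an illustration under those assumptions, not the author's actual training script; details such as the label-alignment strategy or evaluation callbacks may differ.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# Label inventory of the Hugging Face conll2003 dataset.
label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
               "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-cased",
    num_labels=len(label_names),
    id2label=dict(enumerate(label_names)),
    label2id={label: i for i, label in enumerate(label_names)},
)

raw_datasets = load_dataset("conll2003")

def tokenize_and_align_labels(examples):
    """Tokenize pre-split words and align NER labels with word pieces."""
    tokenized = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    all_labels = []
    for i, labels in enumerate(examples["ner_tags"]):
        word_ids = tokenized.word_ids(batch_index=i)
        previous_word = None
        label_ids = []
        for word_id in word_ids:
            if word_id is None:
                label_ids.append(-100)             # special tokens: ignored by the loss
            elif word_id != previous_word:
                label_ids.append(labels[word_id])  # first sub-word keeps the word's label
            else:
                label_ids.append(-100)             # remaining sub-words: ignored
            previous_word = word_id
        all_labels.append(label_ids)
    tokenized["labels"] = all_labels
    return tokenized

tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)

# Hyperparameters from the list above.
args = TrainingArguments(
    output_dir="bert-finetuned-ner",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    seed=42,
    lr_scheduler_type="linear",  # the default schedule, shown for clarity
    eval_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer=tokenizer),
    tokenizer=tokenizer,
)

# The Trainer's default optimizer is AdamW with betas=(0.9, 0.999), eps=1e-8.
trainer.train()
```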

## Evaluation Results
The model was evaluated on the CoNLL-2003 validation set after each epoch, with performance measured using standard NER metrics:

| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1     | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| 0.076         | 1.0   | 1756 | 0.0657          | 0.9076    | 0.9337 | 0.9204 | 0.9819   |
| 0.0359        | 2.0   | 3512 | 0.0693          | 0.9265    | 0.9418 | 0.9341 | 0.9847   |
| 0.0222        | 3.0   | 5268 | 0.0599          | 0.9347    | 0.9512 | 0.9429 | 0.9864   |
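
Entity-level precision, recall, and F1 for CoNLL-style NER are typically computed with the `seqeval` metric, which scores whole entity spans (accuracy is computed per token). A small, self-contained illustration with placeholder label sequences (not the model's actual outputs):

```python
import evaluate

# Requires the seqeval package: pip install evaluate seqeval
seqeval = evaluate.load("seqeval")

references = [["B-ORG", "I-ORG", "O", "O", "B-LOC", "O"]]
predictions = [["B-ORG", "I-ORG", "O", "O", "B-LOC", "B-PER"]]

results = seqeval.compute(predictions=predictions, references=references)
print(results["overall_precision"], results["overall_recall"],
      results["overall_f1"], results["overall_accuracy"])
```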

## Framework Versions
- **Transformers:** 4.42.4
- **PyTorch:** 2.3.1+cu121
- **Datasets:** 2.21.0
- **Tokenizers:** 0.19.1