huseyincenik/conll_ner_with_bert

This model is a fine-tuned version of bert-base-uncased on the CoNLL-2003 dataset for Named Entity Recognition (NER).

Model description

This model has been trained to perform Named Entity Recognition (NER) and is based on the BERT architecture. It was fine-tuned on the CoNLL-2003 dataset, a standard dataset for NER tasks.

Intended uses & limitations

Intended Uses

  • Named Entity Recognition: This model is designed to identify and classify named entities in text into categories such as location (LOC), organization (ORG), person (PER), and miscellaneous (MISC).

Limitations

  • Domain Specificity: The model was fine-tuned on the CoNLL-2003 dataset, which consists of news articles. It may not generalize well to other domains or types of text not represented in the training data.
  • Subword Tokens: The model may occasionally tag subword tokens as entities, requiring post-processing to handle these cases.
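
One common mitigation for the subword issue (not part of the original card, just a standard pattern with the transformers pipeline) is to let the pipeline merge subword pieces back into whole-word entity spans via its aggregation_strategy argument; a minimal sketch:

from transformers import pipeline

# aggregation_strategy="simple" merges B-/I- subword predictions into whole-word
# entity spans, so partial word pieces are not reported as separate entities.
ner = pipeline(
    "token-classification",
    model="huseyincenik/conll_ner_with_bert",
    aggregation_strategy="simple",
)

# Hypothetical example sentence; the output is a list of dicts with keys such as
# entity_group, score, word, start, and end.
print(ner("Angela Merkel visited the European Parliament in Brussels."))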

Training and evaluation data

  • Training Dataset: CoNLL-2003

  • Training Evaluation Metrics:

    Label         Precision  Recall  F1-Score  Support
    B-PER         0.98       0.98    0.98      11273
    I-PER         0.98       0.99    0.99      9323
    B-ORG         0.88       0.92    0.90      10447
    I-ORG         0.81       0.92    0.86      5137
    B-LOC         0.86       0.94    0.90      9621
    I-LOC         1.00       0.08    0.14      1267
    B-MISC        0.81       0.73    0.77      4793
    I-MISC        0.83       0.36    0.50      1329
    Micro Avg     0.90       0.90    0.90      53190
    Macro Avg     0.89       0.74    0.75      53190
    Weighted Avg  0.90       0.90    0.89      53190
  • Validation Evaluation Metrics:

    Label         Precision  Recall  F1-Score  Support
    B-PER         0.97       0.98    0.97      3018
    I-PER         0.98       0.98    0.98      2741
    B-ORG         0.86       0.91    0.88      2056
    I-ORG         0.77       0.81    0.79      900
    B-LOC         0.86       0.94    0.90      2618
    I-LOC         1.00       0.10    0.18      281
    B-MISC        0.77       0.74    0.76      1231
    I-MISC        0.77       0.34    0.48      390
    Micro Avg     0.90       0.89    0.89      13235
    Macro Avg     0.87       0.73    0.74      13235
    Weighted Avg  0.90       0.89    0.88      13235
  • Test Evaluation Metrics:

    Label         Precision  Recall  F1-Score  Support
    B-PER         0.96       0.95    0.96      2714
    I-PER         0.98       0.99    0.98      2487
    B-ORG         0.81       0.87    0.84      2588
    I-ORG         0.74       0.87    0.80      1050
    B-LOC         0.81       0.90    0.85      2121
    I-LOC         0.89       0.12    0.22      276
    B-MISC        0.75       0.67    0.71      996
    I-MISC        0.85       0.49    0.62      241
    Micro Avg     0.87       0.88    0.87      12473
    Macro Avg     0.85       0.73    0.75      12473
    Weighted Avg  0.87       0.88    0.86      12473
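
The per-label breakdowns above have the shape of scikit-learn's classification_report computed over per-token tags (excluding O). A minimal, hypothetical sketch of producing such a report; the toy y_true / y_pred lists below are placeholders, not the model's actual predictions:

from sklearn.metrics import classification_report

# Toy placeholders, only to show the report shape; in practice y_true and y_pred
# would be flat per-token tag lists for a whole split, with padding and special
# tokens filtered out.
y_true = ["B-PER", "I-PER", "O", "B-LOC", "O"]
y_pred = ["B-PER", "I-PER", "O", "B-ORG", "O"]

labels = ["B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
print(classification_report(y_true, y_pred, labels=labels, digits=2, zero_division=0))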

Training procedure

Training Hyperparameters

  • Optimizer: AdamWeightDecay

    • Learning Rate: 2e-05
    • Decay Schedule: PolynomialDecay
    • Warmup Steps: 0.1
    • Weight Decay Rate: 0.01
  • Training Precision: float32

Training results

Train Loss  Validation Loss  Epoch
0.1016      0.0254           0
0.0228      0.0180           1

Optimizer Details

from transformers import create_optimizer

batch_size = 32
num_train_epochs = 2
# tokenized_conll is the tokenized CoNLL-2003 DatasetDict (tokenization not shown here)
num_train_steps = (len(tokenized_conll["train"]) // batch_size) * num_train_epochs

# AdamWeightDecay with a polynomial-decay learning-rate schedule and weight decay of 0.01
optimizer, lr_schedule = create_optimizer(
    init_lr=2e-5,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
    num_warmup_steps=0.1,
)
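
For context (not taken from the original card), this optimizer would typically be plugged into the TensorFlow version of the model roughly as follows; the dataset preparation details are assumptions and shown only as comments:

from transformers import TFAutoModelForTokenClassification

# Hypothetical sketch: load the base model with one output per CoNLL-2003 tag
# (9 labels including O) and compile it with the optimizer created above.
model = TFAutoModelForTokenClassification.from_pretrained(
    "bert-base-uncased", num_labels=9
)
model.compile(optimizer=optimizer)  # no loss passed; Transformers TF models compute it internally

# tf_train_set / tf_validation_set would be tf.data.Dataset objects built from
# tokenized_conll, e.g. via model.prepare_tf_dataset(...); not shown here.
# model.fit(tf_train_set, validation_data=tf_validation_set, epochs=num_train_epochs)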

How to Use

Using a Pipeline

from transformers import pipeline

pipe = pipeline("token-classification", model="huseyincenik/conll_ner_with_bert")
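
A quick, hypothetical usage example (the sentence is illustrative only); with the default settings each token gets its own prediction dict with keys such as entity, score, word, start, and end:

# Hypothetical example sentence
print(pipe("George Washington lived in Virginia."))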

Loading the Model and Tokenizer Directly

from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("huseyincenik/conll_ner_with_bert")
model = AutoModelForTokenClassification.from_pretrained("huseyincenik/conll_ner_with_bert")
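
A minimal inference sketch with the directly loaded model (PyTorch shown here; the example sentence and variable names are illustrative, not from the original card):

import torch

text = "The European Commission met in Brussels."  # hypothetical example
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Map each token's highest-scoring class id back to its tag name
predicted_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
tags = [model.config.id2label[i.item()] for i in predicted_ids]
print(list(zip(tokens, tags)))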

Label Abbreviations

Abbreviation  Description
O             Outside of a named entity
B-MISC        Beginning of a miscellaneous entity right after another miscellaneous entity
I-MISC        Miscellaneous entity
B-PER         Beginning of a person’s name right after another person’s name
I-PER         Person’s name
B-ORG         Beginning of an organization right after another organization
I-ORG         Organization
B-LOC         Beginning of a location right after another location
I-LOC         Location

CoNLL-2003 English Dataset Statistics

This dataset was derived from the Reuters corpus, which consists of Reuters news stories. You can read more about how it was created in the CoNLL-2003 paper.

# of examples per entity type

Dataset  LOC    MISC   ORG    PER
Train    7140   3438   6321   6600
Dev      1837   922    1341   1842
Test     1668   702    1661   1617

# of articles/sentences/tokens per dataset

Dataset  Articles  Sentences  Tokens
Train    946       14,987     203,621
Dev      216       3,466      51,362
Test     231       3,684      46,435
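
A quick way to inspect these splits with the datasets library listed under Framework versions; this snippet is a sketch, not part of the original training code, and the exact counts may vary slightly with the dataset revision:

from datasets import load_dataset

# Load CoNLL-2003 from the Hugging Face Hub (recent datasets versions may ask
# for trust_remote_code=True for script-based datasets).
conll = load_dataset("conll2003")

# Sentences per split and the NER tag set used by this model
print({split: conll[split].num_rows for split in conll})
print(conll["train"].features["ner_tags"].feature.names)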

Framework versions

  • Transformers 4.45.0.dev0
  • TensorFlow 2.17.0
  • Datasets 2.21.0
  • Tokenizers 0.19.1