---

language:
- en
base_model:
- google-bert/bert-large-uncased
pipeline_tag: token-classification
---




# Model Card for Mountain NER Model

## Model Summary

This model is a fine-tuned Named Entity Recognition (NER) model specifically designed to identify mountain names in text. It is trained to detect and classify mountain entities using labeled data and state-of-the-art NER architectures. The model can handle both single-word and multi-word mountain names (e.g., "Kilimanjaro" or "Rocky Mountains").




## Intended Use

- **Task**: Named Entity Recognition (NER) for mountain name identification.
- **Input**: A text string containing sentences or paragraphs.
- **Output**: A list of tokens annotated with BIO labels:
  - `B-MOUNTAIN`: beginning of a mountain name.
  - `I-MOUNTAIN`: inside a mountain name.
  - `O`: outside of any mountain entity.
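For illustration, the BIO scheme tags a sentence containing the multi-word name "Rocky Mountains" like this (word-level, before subword tokenization; the sentence is an invented example):

```python
# Word-level BIO tags for a sentence with a multi-word mountain name.
tokens = ["The", "Rocky", "Mountains", "stretch", "across", "North", "America", "."]
labels = ["O", "B-MOUNTAIN", "I-MOUNTAIN", "O", "O", "O", "O", "O"]

for tok, lab in zip(tokens, labels):
    print(f"{tok}\t{lab}")
```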





## How to Use

You can load this model using the Hugging Face `transformers` library:

```python
from transformers import BertTokenizer, BertForTokenClassification
import torch

tokenizer = BertTokenizer.from_pretrained("your_username/your_model")
model = BertForTokenClassification.from_pretrained("your_username/your_model")

text = "Kilimanjaro is one of the most famous mountains."

# Tokenize the input and run inference without tracking gradients
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label id for each token, then map ids to label names
predictions = torch.argmax(outputs.logits, dim=-1)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze())
labels = [model.config.id2label[label] for label in predictions.squeeze().tolist()]

print(list(zip(tokens, labels)))
```
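The raw `(token, label)` pairs include WordPiece sub-tokens (prefixed with `##`) and special tokens like `[CLS]`. A small post-processing helper, sketched below (not part of the model itself), can merge them back into whole entity strings:

```python
def extract_mountains(tokens, labels):
    """Group B-/I-MOUNTAIN tokens into entity strings, merging WordPiece
    sub-tokens (prefixed with '##') back into whole words."""
    entities, current = [], []
    for tok, lab in zip(tokens, labels):
        if tok.startswith("##"):
            # Continuation of the previous word; glue it onto the last piece
            if current:
                current[-1] += tok[2:]
            continue
        if lab == "B-MOUNTAIN":
            if current:
                entities.append(" ".join(current))
            current = [tok]
        elif lab == "I-MOUNTAIN" and current:
            current.append(tok)
        else:
            if current:
                entities.append(" ".join(current))
                current = []
    if current:
        entities.append(" ".join(current))
    return entities


print(extract_mountains(
    ["[CLS]", "ki", "##lima", "##nja", "##ro", "is", "famous", "[SEP]"],
    ["O", "B-MOUNTAIN", "I-MOUNTAIN", "I-MOUNTAIN", "I-MOUNTAIN", "O", "O", "O"],
))  # → ['kilimanjaro']
```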


## Dataset

The dataset includes annotated examples of text with mountain names in BIO format:

- **Training Set**: 350 examples
- **Validation Set**: 75 examples
- **Test Set**: 75 examples

The dataset was created by combining known mountain names with sentences that contain them.
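That construction step might look roughly like the sketch below; the template sentences and function name are illustrative assumptions, not the actual data-generation script:

```python
def make_example(name, template):
    """Insert a mountain name into a template sentence and emit BIO tags.

    `name` may be multi-word (e.g. "Rocky Mountains"); its first token is
    tagged B-MOUNTAIN and the rest I-MOUNTAIN.
    """
    sentence = template.format(name)
    tokens = sentence.replace(".", " .").split()
    name_tokens = name.split()
    tags, i = [], 0
    while i < len(tokens):
        if tokens[i : i + len(name_tokens)] == name_tokens:
            tags.extend(["B-MOUNTAIN"] + ["I-MOUNTAIN"] * (len(name_tokens) - 1))
            i += len(name_tokens)
        else:
            tags.append("O")
            i += 1
    return tokens, tags


print(make_example("Rocky Mountains", "Hikers love {}."))
```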


## Limitations

- The model is specifically designed for mountain names and may not generalize to other named entities.
- Performance may degrade on noisy or informal text.
- Multi-word mountain names must be tokenized correctly for proper recognition.

## Model Sources

- **Repository:** [https://github.com/Yevheniia-Ilchenko/Bert_NER](https://github.com/Yevheniia-Ilchenko/Bert_NER)



## Training Details

The model was fine-tuned using the **BERT Base Uncased** architecture for token classification. Below are the training details:

- **Model Architecture**: BERT for Token Classification (`bert-base-uncased`).
- **Dataset**: Custom-labeled dataset in BIO format for mountain name recognition.
- **Hyperparameters**:
  - **Learning Rate**: `2e-4`
  - **Batch Size**: `16`
  - **Maximum Sequence Length**: `128`
  - **Number of Epochs**: `3`
- **Optimizer**: AdamW
- **Warmup Steps**: `500`
- **Weight Decay**: `0.01`
- **Evaluation Strategy**: Steps-based evaluation with automatic saving of the best model.
- **Training Arguments**:
  - `save_total_limit=3`: Limits the number of saved checkpoints.
  - `load_best_model_at_end=True`: Ensures the best model is used after training.
- **Training Performance**:
  - **Training Runtime**: `570.44 seconds`
  - **Training Samples per Second**: `1.841`
  - **Training Steps per Second**: `0.116`
  - **Final Training Loss**: `0.4017`
- **Evaluation Metrics**:
  - **Evaluation Loss**: `0.0839`
  - **Precision**: `97.11%`
  - **Recall**: `96.89%`
  - **F1 Score**: `96.91%`
  - **Evaluation Runtime**: `13.76 seconds`
  - **Samples per Second**: `5.449`
  - **Steps per Second**: `0.726`
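For reference, the hyperparameters above would map onto `transformers.TrainingArguments` roughly as shown below. This is a sketch, not the actual training script: the parameter names are the standard `TrainingArguments` keywords, values marked "assumed" are not stated in this card, and the maximum sequence length (128) is applied at tokenization time rather than here.

```python
# Keyword arguments mirroring the reported hyperparameters; intended to be
# passed as transformers.TrainingArguments(**training_kwargs).
training_kwargs = {
    "output_dir": "./results",         # assumed; not stated in the card
    "learning_rate": 2e-4,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,  # assumed equal to the train batch size
    "num_train_epochs": 3,
    "warmup_steps": 500,
    "weight_decay": 0.01,
    "evaluation_strategy": "steps",    # steps-based evaluation
    "save_strategy": "steps",          # must match eval strategy for best-model loading
    "save_total_limit": 3,             # keep at most 3 checkpoints
    "load_best_model_at_end": True,    # restore the best checkpoint after training
}

print(sorted(training_kwargs))
```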