---
language:
- en
base_model:
- google-bert/bert-large-uncased
pipeline_tag: token-classification
---
# Model Card for Mountain NER Model
**Model Summary**
This model is a fine-tuned Named Entity Recognition (NER) model that identifies mountain names in text. It is a BERT token classifier trained on BIO-labeled data, and it handles both single-word and multi-word mountain names (e.g., "Kilimanjaro" or "Rocky Mountains").
## Intended Use
- **Task**: Named Entity Recognition (NER) for mountain name identification.
- **Input**: A text string containing sentences or paragraphs.
- **Output**: A list of tokens annotated with labels:
  - `B-MOUNTAIN`: Beginning of a mountain name.
  - `I-MOUNTAIN`: Inside a mountain name.
  - `O`: Outside of any mountain entity.
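
To show how the three labels combine into entity spans, here is a minimal sketch that groups tagged tokens back into mountain names. The helper `bio_to_spans` and the sample tokens are illustrative assumptions and are not part of the model or its repository.

```python
from typing import List


def bio_to_spans(tokens: List[str], labels: List[str]) -> List[str]:
    """Group B-MOUNTAIN / I-MOUNTAIN tagged tokens into mountain-name strings."""
    spans: List[str] = []
    current: List[str] = []
    for token, label in zip(tokens, labels):
        if label == "B-MOUNTAIN":
            # A new name starts; close any name that is still open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif label == "I-MOUNTAIN" and current:
            # Continue the name that is currently open.
            current.append(token)
        else:
            # "O", or a stray I- tag with no preceding B- tag: close any open name.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


# Hypothetical word-level input, for illustration only.
tokens = ["The", "Rocky", "Mountains", "and", "Kilimanjaro", "are", "famous", "."]
labels = ["O", "B-MOUNTAIN", "I-MOUNTAIN", "O", "B-MOUNTAIN", "O", "O", "O"]
print(bio_to_spans(tokens, labels))  # ['Rocky Mountains', 'Kilimanjaro']
```

In this sketch an `I-MOUNTAIN` tag that does not follow a `B-MOUNTAIN` tag is dropped; a more lenient decoder could treat it as the start of a new name instead.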
## How to Use
You can load this model using the Hugging Face `transformers` library:
```python
from transformers import BertTokenizer, BertForTokenClassification
import torch

# Replace the placeholder with the actual model repository ID.
tokenizer = BertTokenizer.from_pretrained("your_username/your_model")
model = BertForTokenClassification.from_pretrained("your_username/your_model")

text = "The Kilimanjaro is one of the most famous mountains."
inputs = tokenizer(text, return_tensors="pt")

# Run inference without tracking gradients.
with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label for each token.
predictions = torch.argmax(outputs.logits, dim=-1)

# Tokens are WordPiece pieces and include the [CLS] and [SEP] markers.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze(0).tolist())
labels = [model.config.id2label[label] for label in predictions.squeeze(0).tolist()]
print(list(zip(tokens, labels)))
```
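
The snippet above prints one label per WordPiece token, so a long name may appear as several `##`-prefixed pieces. The following sketch merges the pieces back into whole words. It assumes that each word takes the label of its first piece, which this card does not confirm, and the sample tokenization is illustrative rather than actual tokenizer output.

```python
from typing import List, Tuple

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def merge_wordpieces(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge '##' continuation pieces into words, keeping the first piece's label."""
    words: List[Tuple[str, str]] = []
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue  # Skip BERT's special marker tokens.
        if token.startswith("##") and words:
            # Append the continuation piece to the previous word.
            word, first_label = words[-1]
            words[-1] = (word + token[2:], first_label)
        else:
            words.append((token, label))
    return words


# Illustrative pieces and labels; the real tokenizer may split the name differently.
tokens = ["[CLS]", "the", "kil", "##iman", "##jar", "##o", "is", "famous", "[SEP]"]
labels = ["O", "O", "B-MOUNTAIN", "I-MOUNTAIN", "I-MOUNTAIN", "I-MOUNTAIN", "O", "O", "O"]
print(merge_wordpieces(tokens, labels))
```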
## Dataset
The dataset includes annotated examples of text with mountain names in BIO format:
- **Training Set**: 350 examples
- **Validation Set**: 75 examples
- **Test Set**: 75 examples
The dataset was created by combining known mountain names with sentences containing them.
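
As an illustration of how mountain names can be combined with sentences to produce BIO labels, here is a minimal sketch. The `{MOUNTAIN}` placeholder, the template sentence, and the `make_example` helper are assumptions made for this example; the actual dataset construction code may differ.

```python
from typing import List, Tuple


def make_example(template: str, mountain: str) -> Tuple[List[str], List[str]]:
    """Fill a whitespace-tokenized template with a mountain name and emit BIO labels."""
    tokens: List[str] = []
    labels: List[str] = []
    for word in template.split():
        if word == "{MOUNTAIN}":
            parts = mountain.split()
            tokens.extend(parts)
            # The first word of the name gets B-, the remaining words get I-.
            labels.extend(["B-MOUNTAIN"] + ["I-MOUNTAIN"] * (len(parts) - 1))
        else:
            tokens.append(word)
            labels.append("O")
    return tokens, labels


# Hypothetical template and name, for illustration only.
tokens, labels = make_example("We hiked in the {MOUNTAIN} last summer .", "Rocky Mountains")
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```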
## Limitations
- The model is specifically designed for mountain names and may not generalize to other named entities.
- Performance may degrade on noisy or informal text.
- Multi-word mountain names must be tokenized correctly for proper recognition.
- **Repository:** <https://github.com/Yevheniia-Ilchenko/Bert_NER>
## Training Details
The model was fine-tuned using the **BERT Base Uncased** architecture for token classification. Below are the training details:
- **Model Architecture**: BERT for Token Classification (`bert-base-uncased`).
- **Dataset**: Custom-labeled dataset in BIO format for mountain name recognition.
- **Hyperparameters**:
  - **Learning Rate**: `2e-4`
  - **Batch Size**: `16`
  - **Maximum Sequence Length**: `128`
  - **Number of Epochs**: `3`
  - **Optimizer**: AdamW
  - **Warmup Steps**: `500`
  - **Weight Decay**: `0.01`
- **Evaluation Strategy**: Steps-based evaluation with automatic saving of the best model.
- **Training Arguments**:
  - `save_total_limit=3`: Limits the number of saved checkpoints.
  - `load_best_model_at_end=True`: Ensures the best model is used after training.
- **Training Performance**:
  - **Training Runtime**: `570.44 seconds`
  - **Training Samples per Second**: `1.841`
  - **Training Steps per Second**: `0.116`
  - **Final Training Loss**: `0.4017`
- **Evaluation Metrics**:
  - **Evaluation Loss**: `0.0839`
  - **Precision**: `97.11%`
  - **Recall**: `96.89%`
  - **F1 Score**: `96.91%`
  - **Evaluation Runtime**: `13.76 seconds`
  - **Samples per Second**: `5.449`
  - **Steps per Second**: `0.726`
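
The reported figures can be cross-checked with a little arithmetic. The sketch below estimates the number of optimizer steps from the dataset size, batch size, and epoch count, and compares it with the reported runtime and throughput. It assumes a single device, no gradient accumulation, and a linear warmup schedule, none of which is stated in this card, so the result is only a rough estimate.

```python
import math

# Values taken from this model card.
train_examples = 350
batch_size = 16
epochs = 3
warmup_steps = 500
peak_lr = 2e-4
train_runtime_s = 570.44
train_steps_per_s = 0.116

# Assumption: one device and no gradient accumulation.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Step count implied by the reported runtime and throughput.
reported_steps = round(train_runtime_s * train_steps_per_s)

# Assumption: linear warmup, so the learning rate scales with step / warmup_steps.
lr_at_last_step = peak_lr * min(1.0, total_steps / warmup_steps)

print(steps_per_epoch, total_steps, reported_steps)
print(f"{lr_at_last_step:.2e}")
```

Under these assumptions the run lasts about 66 steps, far fewer than the 500 warmup steps, so the learning rate would still be ramping up when training ends and would stay well below `2e-4`.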