IsmatS committed on
Commit
a3d058c
1 Parent(s): 6f0b1a2

Upload folder using huggingface_hub

Browse files
README.md CHANGED
@@ -1,3 +1,144 @@
1
- ---
2
- license: mit
3
- ---
1
+ # Azeri-Turkish-BERT-NER
2
+
3
+ ## Model Description
4
+
5
+ The **Azeri-Turkish-BERT-NER** model is a fine-tuned version of the `akdeniz27/bert-base-turkish-cased-ner` model for Named Entity Recognition (NER) in Azerbaijani and Turkish. It starts from a pre-trained Turkish BERT NER model and adapts it to Azerbaijani data, while aiming to preserve compatibility with Turkish entities.
6
+
7
+ The model identifies named entities and classifies them into categories such as persons, organizations, locations, and dates, which makes it suitable for information extraction and document processing on Azerbaijani and Turkish texts.
8
+
9
+ ## Model Details
10
+
11
+ - **Base Model**: `akdeniz27/bert-base-turkish-cased-ner` (from the Hugging Face Hub)
12
+ - **Task**: Named Entity Recognition (NER)
13
+ - **Languages**: Azerbaijani, Turkish
14
+ - **Fine-Tuned On**: Custom Azerbaijani NER dataset
15
+ - **Input Text Format**: Plain text at inference time; training examples are pre-tokenized word lists
16
+ - **Model Type**: BERT-based transformer for token classification
17
+
18
+ ## Training Details
19
+
20
+ The model was fine-tuned with the Hugging Face `transformers` and `datasets` libraries. The fine-tuning configuration is summarized below:
21
+
22
+ - **Tokenizer**: `AutoTokenizer` loaded from `akdeniz27/bert-base-turkish-cased-ner`
23
+ - **Max Sequence Length**: 128 tokens
24
+ - **Batch Size**: 128 (training and evaluation)
25
+ - **Learning Rate**: 2e-5
26
+ - **Number of Epochs**: Up to 10 (the metrics table below ends at epoch 9, consistent with early stopping)
27
+ - **Weight Decay**: 0.005
28
+ - **Early Stopping**: Patience of 5 evaluations (one per epoch), monitored on the F1 metric
29
+
30
+ ### Training Dataset
31
+
32
+ The training data is the Azerbaijani NER dataset [LocalDoc/azerbaijani-ner-dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset). Each example was preprocessed so that word-level NER tags line up with the subword tokens produced by the tokenizer, and the `train` split was divided 90/10 into training and validation sets.
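The alignment step can be sketched without loading a tokenizer. The snippet below is a minimal illustration, assuming word IDs in the shape returned by a fast tokenizer's `word_ids()` (one entry per subword, `None` for special and padding tokens); the function name `align_labels` and the sample inputs are hypothetical and not taken from the repository.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips


def align_labels(word_ids: List[Optional[int]], ner_tags: List[int]) -> List[int]:
    """Give each word's tag to its first subword and mask everything else."""
    labels = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens ([CLS], [SEP]) and padding carry no label
            labels.append(IGNORE_INDEX)
        elif word_id != previous:
            # First subword of a word: use the word-level tag if one exists
            labels.append(ner_tags[word_id] if word_id < len(ner_tags) else IGNORE_INDEX)
        else:
            # Later subwords of the same word are masked out
            labels.append(IGNORE_INDEX)
        previous = word_id
    return labels


# Hypothetical input: [CLS], word 0 split into two subwords, word 1, [SEP]
word_ids = [None, 0, 0, 1, None]
ner_tags = [1, 0]  # e.g. B-PERSON followed by O
aligned = align_labels(word_ids, ner_tags)
print(aligned)  # [-100, 1, -100, 0, -100]
```

Labeling only the first subword keeps one prediction per word, so the evaluation counts entities at word level rather than subword level.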
33
+
34
+ ### Label Categories
35
+
36
+ The model supports the following entity categories:
37
+ - **Person (B-PERSON, I-PERSON)**
38
+ - **Location (B-LOCATION, I-LOCATION)**
39
+ - **Organization (B-ORGANISATION, I-ORGANISATION)**
40
+ - **Date (B-DATE, I-DATE)**
41
+ - **Time (B-TIME, I-TIME)**
42
+ - **Money (B-MONEY, I-MONEY)**
43
+ - **Percentage (B-PERCENTAGE, I-PERCENTAGE)**
44
+ - **Facility (B-FACILITY, I-FACILITY)**
45
+ - **Product (B-PRODUCT, I-PRODUCT)**
46
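+ - Additional categories from the training label list, each with `B-` and `I-` tags: EVENT, ART, LAW, LANGUAGE, GPE, NORP, ORDINAL, CARDINAL, DISEASE, CONTACT, ADAGE, QUANTITY, MISCELLANEOUS, POSITION, and PROJECT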
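The `B-`/`I-` prefixes follow the BIO scheme: `B-` marks the first word of an entity, `I-` marks a continuation, and `O` marks words outside any entity. As a rough illustration of how such tags group into entities, here is a small sketch; the helper `bio_to_spans` is hypothetical and the tag sequence is made up for the example.

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence into (entity_type, start, end) word spans.

    `end` is exclusive. A stray I- tag with no matching B- is treated as
    the start of a new entity, which is one common convention.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if starts_new or tag == "O":
            # Close the entity that was open, if any
            if label is not None:
                spans.append((label, start, i))
            start, label = (i, tag[2:]) if starts_new else (None, None)
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


tags = ["B-PERSON", "I-PERSON", "O", "B-ORGANISATION", "I-ORGANISATION", "O"]
spans = bio_to_spans(tags)
print(spans)  # [('PERSON', 0, 2), ('ORGANISATION', 3, 5)]
```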
47
+
48
+ ### Training Metrics
49
+
50
+ | Epoch | Training Loss | Validation Loss | Precision | Recall | F1 |
51
+ |-------|---------------|-----------------|-----------|--------|-------|
52
+ | 1 | 0.433100 | 0.306711 | 0.739000 | 0.693282 | 0.715412 |
53
+ | 2 | 0.292700 | 0.275796 | 0.781565 | 0.688937 | 0.732334 |
54
+ | 3 | 0.250600 | 0.275115 | 0.758261 | 0.709425 | 0.733031 |
55
+ | 4 | 0.233700 | 0.273087 | 0.756184 | 0.716277 | 0.735689 |
56
+ | 5 | 0.214800 | 0.278477 | 0.756051 | 0.710996 | 0.732832 |
57
+ | 6 | 0.199200 | 0.286102 | 0.755068 | 0.717012 | 0.735548 |
58
+ | 7 | 0.192800 | 0.297157 | 0.742326 | 0.725802 | 0.733971 |
59
+ | 8 | 0.178900 | 0.304510 | 0.743206 | 0.723930 | 0.733442 |
60
+ | 9 | 0.171700 | 0.313845 | 0.743145 | 0.725535 | 0.734234 |
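The table stops at epoch 9 even though 10 epochs were configured. A simplified simulation of patience-based early stopping on the F1 column shows why. It assumes a strict "must beat the best score so far" rule with no improvement threshold, which is a simplification of what `EarlyStoppingCallback` does, and the helper name is hypothetical.

```python
def early_stopping_epoch(scores, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = None
    epochs_without_improvement = 0
    for epoch, score in enumerate(scores, start=1):
        if best is None or score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            return epoch
    return None


# F1 values copied from the Training Metrics table
f1_per_epoch = [0.715412, 0.732334, 0.733031, 0.735689, 0.732832,
                0.735548, 0.733971, 0.733442, 0.734234]

best_epoch = f1_per_epoch.index(max(f1_per_epoch)) + 1
stop_epoch = early_stopping_epoch(f1_per_epoch, patience=5)
print(best_epoch, stop_epoch)  # 4 9
```

Under these assumptions the best F1 is reached at epoch 4 and the five following epochs bring no improvement, so training ends after epoch 9. Validation loss also rises from epoch 5 onward, which points the same way.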
61
+
62
+ ### Category-Wise Evaluation Metrics
63
+
64
+ | Category | Precision | Recall | F1-Score | Support |
65
+ |---------------|-----------|--------|----------|---------|
66
+ | ART | 0.49 | 0.14 | 0.21 | 1988 |
67
+ | DATE | 0.49 | 0.48 | 0.49 | 844 |
68
+ | EVENT | 0.88 | 0.36 | 0.51 | 84 |
69
+ | FACILITY | 0.72 | 0.68 | 0.70 | 1146 |
70
+ | LAW | 0.57 | 0.64 | 0.60 | 1103 |
71
+ | LOCATION | 0.77 | 0.79 | 0.78 | 8806 |
72
+ | MONEY | 0.62 | 0.57 | 0.59 | 532 |
73
+ | ORGANISATION | 0.64 | 0.65 | 0.64 | 527 |
74
+ | PERCENTAGE | 0.77 | 0.83 | 0.80 | 3679 |
75
+ | PERSON | 0.87 | 0.81 | 0.84 | 6924 |
76
+ | PRODUCT | 0.82 | 0.80 | 0.81 | 2653 |
77
+ | TIME | 0.55 | 0.50 | 0.52 | 1634 |
78
+
79
+ - **Micro Average**: Precision: 0.76, Recall: 0.72, F1-Score: 0.74
80
+ - **Macro Average**: Precision: 0.68, Recall: 0.60, F1-Score: 0.62
81
+ - **Weighted Average**: Precision: 0.74, Recall: 0.72, F1-Score: 0.72
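The macro and weighted averages can be approximately reproduced from the per-category table. The sketch below uses the rounded two-decimal values from the table, so results can differ from the reported ones by about 0.01 (weighted precision, for instance, comes out near 0.745 here versus the reported 0.74). The variable and function names are illustrative only.

```python
# (precision, recall, f1, support) per category, copied from the table above
rows = {
    "ART": (0.49, 0.14, 0.21, 1988),
    "DATE": (0.49, 0.48, 0.49, 844),
    "EVENT": (0.88, 0.36, 0.51, 84),
    "FACILITY": (0.72, 0.68, 0.70, 1146),
    "LAW": (0.57, 0.64, 0.60, 1103),
    "LOCATION": (0.77, 0.79, 0.78, 8806),
    "MONEY": (0.62, 0.57, 0.59, 532),
    "ORGANISATION": (0.64, 0.65, 0.64, 527),
    "PERCENTAGE": (0.77, 0.83, 0.80, 3679),
    "PERSON": (0.87, 0.81, 0.84, 6924),
    "PRODUCT": (0.82, 0.80, 0.81, 2653),
    "TIME": (0.55, 0.50, 0.52, 1634),
}


def macro_average(column):
    """Unweighted mean over categories: every category counts equally."""
    return sum(row[column] for row in rows.values()) / len(rows)


def weighted_average(column):
    """Mean weighted by support: frequent categories dominate."""
    total_support = sum(row[3] for row in rows.values())
    return sum(row[column] * row[3] for row in rows.values()) / total_support


macro_f1 = macro_average(2)
weighted_f1 = weighted_average(2)
print(f"macro F1 = {macro_f1:.2f}, weighted F1 = {weighted_f1:.2f}")
```

The gap between the macro F1 (about 0.62) and the weighted F1 (about 0.72) reflects that the weakest categories, such as ART and EVENT, contribute less to the support-weighted score than LOCATION and PERSON do.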
82
+
83
+ ## Usage
84
+
85
+ ### Loading the Model
86
+
87
+ To use the model for NER tasks, you can load it using the Hugging Face `transformers` library:
88
+
89
+ ```python
90
+ from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
91
+
92
+ # Load the model and tokenizer
93
+ tokenizer = AutoTokenizer.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
94
+ model = AutoModelForTokenClassification.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
95
+
96
+ # Initialize the NER pipeline
97
+ ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
98
+
99
+ # Example text
100
+ text = "Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."
101
+
102
+ # Run NER
103
+ results = ner_pipeline(text)
104
+ print(results)
105
+ ```
106
+
107
+ ### Inputs and Outputs
108
+
109
+ - **Input**: Plain text in Azerbaijani or Turkish.
110
+ - **Output**: List of detected entities with entity types and character offsets.
111
+
112
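+ Example output (illustrative; because `config.json` stores generic `LABEL_i` names, the pipeline reports `entity_group` values such as `LABEL_1` until they are mapped back to the training label list):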
113
+ ```
114
+ [
115
+ {'entity_group': 'B-PERSON', 'word': 'Shahla', 'start': 0, 'end': 6, 'score': 0.98},
116
+ {'entity_group': 'B-ORGANISATION', 'word': 'Pasha Sığorta', 'start': 11, 'end': 24, 'score': 0.95}
117
+ ]
118
+ ```
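Since the uploaded `config.json` maps class indices to generic names (`LABEL_0` to `LABEL_48`), pipeline results may need to be translated back to entity types. The following is a minimal sketch of one way to do that, assuming the label order of the training script's `label_list`. The helper `readable_entities` and the sample results, including their scores, are hypothetical and not real model output.

```python
# Entity types in the order used by the training script's label_list
ENTITY_TYPES = [
    "PERSON", "LOCATION", "ORGANISATION", "DATE", "TIME", "MONEY",
    "PERCENTAGE", "FACILITY", "PRODUCT", "EVENT", "ART", "LAW",
    "LANGUAGE", "GPE", "NORP", "ORDINAL", "CARDINAL", "DISEASE",
    "CONTACT", "ADAGE", "QUANTITY", "MISCELLANEOUS", "POSITION", "PROJECT",
]
# Index 0 is "O"; every type then contributes a B- tag followed by an I- tag
LABEL_LIST = ["O"] + [f"{prefix}-{name}" for name in ENTITY_TYPES for prefix in ("B", "I")]


def readable_entities(results):
    """Replace LABEL_i groups with entity types and drop non-entities."""
    readable = []
    for item in results:
        group = item["entity_group"]
        tag = LABEL_LIST[int(group.split("_")[1])] if group.startswith("LABEL_") else group
        if tag == "O":
            continue
        entity_type = tag.split("-", 1)[1] if "-" in tag else tag
        readable.append({**item, "entity_group": entity_type})
    return readable


# Hypothetical pipeline results, for illustration only
sample_results = [
    {"entity_group": "LABEL_1", "word": "Shahla", "start": 0, "end": 6, "score": 0.98},
    {"entity_group": "LABEL_0", "word": "və", "start": 17, "end": 19, "score": 0.99},
    {"entity_group": "LABEL_5", "word": "Pasha Sığorta", "start": 20, "end": 33, "score": 0.95},
]
cleaned = readable_entities(sample_results)
print([entity["entity_group"] for entity in cleaned])  # ['PERSON', 'ORGANISATION']
```

An alternative that avoids post-processing would be to store real label names in the model configuration through the `id2label` and `label2id` mappings before saving, but that would change the published artifact and is not what this commit does.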
119
+
120
+ ### Evaluation Metrics
121
+
122
+ The model was evaluated with entity-level precision, recall, and F1-score computed by `seqeval`, as detailed in the Training Metrics and Category-Wise Evaluation Metrics sections.
123
+
124
+ ## Limitations
125
+
126
+ - The model may have limited performance on texts that diverge significantly from the training data distribution.
127
+ - Rare or unseen entities in Turkish and Azerbaijani may receive lower confidence scores, and per-category results are uneven: ART reaches an F1-score of 0.21, compared with 0.84 for PERSON.
128
+ - Further fine-tuning on larger and more diverse datasets may improve generalizability.
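One practical way to cope with low-confidence predictions is to filter pipeline results by score before using them downstream. The sketch below is only an illustration: the 0.8 threshold is an arbitrary assumption that would need tuning on held-out data, and the sample entities are made up.

```python
def filter_by_score(entities, min_score=0.8):
    """Keep only entities whose confidence reaches the threshold."""
    return [entity for entity in entities if entity["score"] >= min_score]


# Hypothetical pipeline results, for illustration only
entities = [
    {"entity_group": "PERSON", "word": "Shahla", "score": 0.98},
    {"entity_group": "ART", "word": "məlumat", "score": 0.41},
]
confident = filter_by_score(entities)
print([entity["word"] for entity in confident])  # ['Shahla']
```

Raising the threshold trades recall for precision, which may suit categories with low precision in the evaluation table.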
129
+
130
+ ## Model Card
131
+
132
+ A detailed model card with additional training details, dataset descriptions, and usage recommendations is available on the [Hugging Face model page](https://huggingface.co/IsmatS/Azeri-Turkish-BERT-NER).
133
+
134
+ ## Citation
135
+
136
+ If you use this model, please consider citing:
137
+ ```
138
+ @misc{azeri-turkish-bert-ner,
139
+ author = {Ismat Samadov},
140
+ title = {Azeri-Turkish-BERT-NER},
141
+ year = {2024},
142
+ howpublished = {Hugging Face repository},
143
+ }
144
+ ```
azeri-turkish-bert-ner.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
azeri-turkish-bert-ner.py ADDED
@@ -0,0 +1,271 @@
+ # -*- coding: utf-8 -*-
+ """Azeri-Turkish-BERT-NER.ipynb
+
+ Automatically generated by Colab.
+
+ Original file is located at
+     https://colab.research.google.com/drive/1_vQDhrFp16kCtjJB5mENIT6jl5kkb03o
+ """
+
+ # Colab shell command, commented out because it is not valid in a plain .py file;
+ # install the packages with pip before running this script.
+ # !pip install transformers datasets seqeval huggingface_hub
+
+ # Standard library and numerical imports
+ import os  # Provides functions for interacting with the operating system
+ import warnings  # Used to handle or suppress warnings
+ import numpy as np  # Essential for numerical operations and array manipulation
+ import torch  # PyTorch library for tensor computations and model handling
+ import ast  # Used for safe evaluation of strings to Python objects (e.g., parsing tokens)
+
+ # Hugging Face and Transformers imports
+ from datasets import load_dataset  # Loads datasets for model training and evaluation
+ from transformers import (
+     AutoTokenizer,  # Initializes a tokenizer from a pre-trained model
+     DataCollatorForTokenClassification,  # Handles padding and formatting of token classification data
+     TrainingArguments,  # Defines training parameters like batch size and learning rate
+     Trainer,  # High-level API for managing training and evaluation
+     AutoModelForTokenClassification,  # Loads a pre-trained model for token classification tasks
+     get_linear_schedule_with_warmup,  # Learning rate scheduler for gradual warm-up and linear decay
+     EarlyStoppingCallback  # Callback to stop training if validation performance plateaus
+ )
+
+ # Hugging Face Hub
+ from huggingface_hub import login  # Allows logging in to Hugging Face Hub to upload models
+
+ # seqeval metrics for NER evaluation
+ from seqeval.metrics import precision_score, recall_score, f1_score, classification_report
+ # Provides precision, recall, F1-score, and classification report for evaluating NER model performance
+
+ # Log in to Hugging Face Hub.
+ # The access token is read from the HF_TOKEN environment variable; never hard-code secrets in source files.
+ login(token=os.environ["HF_TOKEN"])
+
+ # Disable WandB (Weights & Biases) logging to avoid unwanted log outputs during training
+ os.environ["WANDB_DISABLED"] = "true"
+
+ # Suppress warning messages to keep output clean, especially during training and evaluation
+ warnings.filterwarnings("ignore")
+
+ # Load the Azerbaijani NER dataset from Hugging Face
+ dataset = load_dataset("LocalDoc/azerbaijani-ner-dataset")
+ print(dataset)  # Display dataset structure (e.g., train/validation splits)
+
+ # Preprocessing function to format tokens and NER tags correctly
+ def preprocess_example(example):
+     try:
+         # Convert string of tokens to a list and parse NER tags to integers
+         example["tokens"] = ast.literal_eval(example["tokens"])
+         example["ner_tags"] = list(map(int, ast.literal_eval(example["ner_tags"])))
+     except (ValueError, SyntaxError) as e:
+         # Log malformed examples and replace them with empty lists so the run continues
+         print(f"Skipping malformed example: {example['index']} due to error: {e}")
+         example["tokens"] = []
+         example["ner_tags"] = []
+     return example
+
+ # Apply preprocessing to each dataset entry, ensuring consistent formatting
+ dataset = dataset.map(preprocess_example)
+
+ # Initialize the tokenizer from the Turkish BERT NER checkpoint
+ # (an earlier experiment used xlm-roberta-large, kept below for reference)
+ # tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
+
+ tokenizer = AutoTokenizer.from_pretrained("akdeniz27/bert-base-turkish-cased-ner")
+
+ # Function to tokenize input and align labels with tokenized words
+ def tokenize_and_align_labels(example):
+     # Tokenize the sentence while preserving word boundaries for correct NER tag alignment
+     tokenized_inputs = tokenizer(
+         example["tokens"],  # List of words (tokens) in the sentence
+         truncation=True,  # Truncate sentences longer than max_length
+         is_split_into_words=True,  # Specify that input is a list of words
+         padding="max_length",  # Pad to maximum sequence length
+         max_length=128,  # Set the maximum sequence length to 128 tokens
+     )
+
+     labels = []  # List to store aligned NER labels
+     word_ids = tokenized_inputs.word_ids()  # Get word IDs for each token
+     previous_word_idx = None  # Initialize previous word index for tracking
+
+     # Loop through word indices to align NER tags with subword tokens
+     for word_idx in word_ids:
+         if word_idx is None:
+             labels.append(-100)  # Special and padding tokens get -100 (ignored in loss)
+         elif word_idx != previous_word_idx:
+             # First subword of a word: use that word's NER tag, or -100 if the tag list is too short
+             labels.append(example["ner_tags"][word_idx] if word_idx < len(example["ner_tags"]) else -100)
+         else:
+             labels.append(-100)  # Label remaining subword tokens with -100 to avoid redundant labels
+         previous_word_idx = word_idx  # Update previous word index
+
+     tokenized_inputs["labels"] = labels  # Add labels to tokenized inputs
+     return tokenized_inputs
+
+ # Apply tokenization and label alignment function to the dataset
+ tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=False)
+
+ # Create a 90-10 split of the dataset for training and validation
+ tokenized_datasets = tokenized_datasets["train"].train_test_split(test_size=0.1)
+ print(tokenized_datasets)  # Output structure of split datasets
+
+ # Define a list of entity labels for NER tagging with B- (beginning) and I- (inside) markers
+ label_list = [
+     "O",  # Outside of a named entity
+     "B-PERSON", "I-PERSON",  # Person name (e.g., "John" in "John Doe")
+     "B-LOCATION", "I-LOCATION",  # Geographical location (e.g., "Paris")
+     "B-ORGANISATION", "I-ORGANISATION",  # Organization name (e.g., "UNICEF")
+     "B-DATE", "I-DATE",  # Date entity (e.g., "2024-11-05")
+     "B-TIME", "I-TIME",  # Time (e.g., "12:00 PM")
+     "B-MONEY", "I-MONEY",  # Monetary values (e.g., "$20")
+     "B-PERCENTAGE", "I-PERCENTAGE",  # Percentage values (e.g., "20%")
+     "B-FACILITY", "I-FACILITY",  # Physical facilities (e.g., "Airport")
+     "B-PRODUCT", "I-PRODUCT",  # Product names (e.g., "iPhone")
+     "B-EVENT", "I-EVENT",  # Named events (e.g., "Olympics")
+     "B-ART", "I-ART",  # Works of art (e.g., "Mona Lisa")
+     "B-LAW", "I-LAW",  # Laws and legal documents (e.g., "Article 50")
+     "B-LANGUAGE", "I-LANGUAGE",  # Languages (e.g., "Azerbaijani")
+     "B-GPE", "I-GPE",  # Geopolitical entities (e.g., "Europe")
+     "B-NORP", "I-NORP",  # Nationalities, religious groups, political groups
+     "B-ORDINAL", "I-ORDINAL",  # Ordinal indicators (e.g., "first", "second")
+     "B-CARDINAL", "I-CARDINAL",  # Cardinal numbers (e.g., "three")
+     "B-DISEASE", "I-DISEASE",  # Diseases (e.g., "COVID-19")
+     "B-CONTACT", "I-CONTACT",  # Contact info (e.g., email or phone number)
+     "B-ADAGE", "I-ADAGE",  # Common sayings or adages
+     "B-QUANTITY", "I-QUANTITY",  # Quantities (e.g., "5 km")
+     "B-MISCELLANEOUS", "I-MISCELLANEOUS",  # Miscellaneous entities not fitting other categories
+     "B-POSITION", "I-POSITION",  # Job titles or positions (e.g., "CEO")
+     "B-PROJECT", "I-PROJECT"  # Project names (e.g., "Project Apollo")
+ ]
+
+ # Initialize a data collator to handle padding and formatting for token classification
+ data_collator = DataCollatorForTokenClassification(tokenizer)
+
+ # Load a pre-trained model for token classification, adapted for NER tasks
+ # model = AutoModelForTokenClassification.from_pretrained(
+ #     "xlm-roberta-large",  # Base model (multilingual XLM-RoBERTa) for NER
+ #     num_labels=len(label_list)  # Set the number of output labels to match NER categories
+ # )
+
+ model = AutoModelForTokenClassification.from_pretrained(
+     "akdeniz27/bert-base-turkish-cased-ner",
+     num_labels=len(label_list),  # Ensure this matches the number of labels for your NER task
+     ignore_mismatched_sizes=True  # Allow loading despite mismatched classifier layer size
+ )
+
+ # Define a function to compute evaluation metrics for the model's predictions
+ def compute_metrics(p):
+     predictions, labels = p  # Unpack predictions and true labels from the input
+
+     # Convert logits to predicted label indices by taking the argmax along the last axis
+     predictions = np.argmax(predictions, axis=2)
+
+     # Filter out ignored positions (-100) and convert indices to label names
+     true_labels = [[label_list[l] for l in label if l != -100] for label in labels]
+     true_predictions = [
+         [label_list[pred] for (pred, l) in zip(prediction, label) if l != -100]
+         for prediction, label in zip(predictions, labels)
+     ]
+
+     # Print a detailed classification report for each label category
+     print(classification_report(true_labels, true_predictions))
+
+     # Calculate and return key evaluation metrics
+     return {
+         # Precision measures the accuracy of predicted positive instances
+         # Important in NER to ensure entity predictions are correct and reduce false positives.
+         "precision": precision_score(true_labels, true_predictions),
+
+         # Recall measures the model's ability to capture all relevant entities
+         # Essential in NER to ensure the model captures all entities, reducing false negatives.
+         "recall": recall_score(true_labels, true_predictions),
+
+         # F1-score is the harmonic mean of precision and recall, balancing both metrics
+         # Useful in NER for providing an overall performance measure, especially when precision and recall are both important.
+         "f1": f1_score(true_labels, true_predictions),
+     }
+
+ # Set up training arguments for model training, defining essential training configurations
+ training_args = TrainingArguments(
+     output_dir="./results",  # Directory to save model checkpoints and final outputs
+     evaluation_strategy="epoch",  # Evaluate model on the validation set at the end of each epoch
+     save_strategy="epoch",  # Save model checkpoints at the end of each epoch
+     learning_rate=2e-5,  # Set a low learning rate to ensure stable training for fine-tuning
+     per_device_train_batch_size=128,  # Number of examples per batch during training, balancing speed and memory
+     per_device_eval_batch_size=128,  # Number of examples per batch during evaluation
+     num_train_epochs=10,  # Number of full training passes over the dataset
+     weight_decay=0.005,  # Regularization term to prevent overfitting by penalizing large weights
+     fp16=True,  # Use 16-bit floating point for faster and memory-efficient training
+     logging_dir='./logs',  # Directory to store training logs
+     save_total_limit=2,  # Keep only the 2 latest model checkpoints to save storage space
+     load_best_model_at_end=True,  # Load the best model based on metrics at the end of training
+     metric_for_best_model="f1",  # Use F1-score to determine the best model checkpoint
+     report_to="none"  # Disable reporting to external services (useful in local runs)
+ )
+
+ # Initialize the Trainer class to manage the training loop with all necessary components
+ trainer = Trainer(
+     model=model,  # The pre-trained model to be fine-tuned
+     args=training_args,  # Training configuration parameters defined in TrainingArguments
+     train_dataset=tokenized_datasets["train"],  # Tokenized training dataset
+     eval_dataset=tokenized_datasets["test"],  # Tokenized validation dataset
+     tokenizer=tokenizer,  # Tokenizer used for processing input text
+     data_collator=data_collator,  # Data collator for padding and batching during training
+     compute_metrics=compute_metrics,  # Function to calculate evaluation metrics like precision, recall, F1
+     callbacks=[EarlyStoppingCallback(early_stopping_patience=5)]  # Stop training early if F1 does not improve for 5 consecutive evaluations
+ )
+
+ # Begin the training process and capture the training metrics
+ training_metrics = trainer.train()
+
+ # Evaluate the model on the validation set after training
+ eval_results = trainer.evaluate()
+
+ # Print evaluation results, including precision, recall, and F1-score
+ print(eval_results)
+
+ # Define the directory where the trained model and tokenizer will be saved
+ save_directory = "./Azeri-Turkish-BERT-NER"
+
+ # Save the trained model to the specified directory
+ model.save_pretrained(save_directory)
+
+ # Save the tokenizer to the same directory for compatibility with the model
+ tokenizer.save_pretrained(save_directory)
+
+ from transformers import pipeline
+
+ # Load tokenizer and model
+ tokenizer = AutoTokenizer.from_pretrained(save_directory)
+ model = AutoModelForTokenClassification.from_pretrained(save_directory)
+
+ # Initialize the NER pipeline on the GPU if one is available, otherwise on the CPU
+ device = 0 if torch.cuda.is_available() else -1
+ nlp_ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple", device=device)
+
+ # Map the generic LABEL_i names stored in the model config back to the entity labels, skipping "O"
+ label_mapping = {f"LABEL_{i}": label for i, label in enumerate(label_list) if label != "O"}
+
+ def evaluate_model(test_texts, true_labels):
+     predictions = []
+     for i, text in enumerate(test_texts):
+         pred_entities = nlp_ner(text)
+         pred_labels = [label_mapping.get(entity["entity_group"], "O") for entity in pred_entities if entity["entity_group"] in label_mapping]
+         if len(pred_labels) != len(true_labels[i]):
+             print(f"Warning: Inconsistent number of entities in sample {i+1}. Adjusting predicted entities.")
+             pred_labels = pred_labels[:len(true_labels[i])]
+         predictions.append(pred_labels)
+     # Metrics are only computed when every sample has as many predictions as reference labels
+     if all(len(true) == len(pred) for true, pred in zip(true_labels, predictions)):
+         precision = precision_score(true_labels, predictions)
+         recall = recall_score(true_labels, predictions)
+         f1 = f1_score(true_labels, predictions)
+         print("Precision:", precision)
+         print("Recall:", recall)
+         print("F1-Score:", f1)
+         print(classification_report(true_labels, predictions))
+     else:
+         print("Error: Could not align all samples correctly for evaluation.")
+
+ test_texts = ["Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."]
+ true_labels = [["B-PERSON", "B-ORGANISATION"]]
+ evaluate_model(test_texts, true_labels)
config.json ADDED
@@ -0,0 +1,128 @@
1
+ {
2
+ "_name_or_path": "akdeniz27/bert-base-turkish-cased-ner",
3
+ "architectures": [
4
+ "BertForTokenClassification"
5
+ ],
6
+ "attention_probs_dropout_prob": 0.1,
7
+ "classifier_dropout": null,
8
+ "gradient_checkpointing": false,
9
+ "hidden_act": "gelu",
10
+ "hidden_dropout_prob": 0.1,
11
+ "hidden_size": 768,
12
+ "id2label": {
13
+ "0": "LABEL_0",
14
+ "1": "LABEL_1",
15
+ "2": "LABEL_2",
16
+ "3": "LABEL_3",
17
+ "4": "LABEL_4",
18
+ "5": "LABEL_5",
19
+ "6": "LABEL_6",
20
+ "7": "LABEL_7",
21
+ "8": "LABEL_8",
22
+ "9": "LABEL_9",
23
+ "10": "LABEL_10",
24
+ "11": "LABEL_11",
25
+ "12": "LABEL_12",
26
+ "13": "LABEL_13",
27
+ "14": "LABEL_14",
28
+ "15": "LABEL_15",
29
+ "16": "LABEL_16",
30
+ "17": "LABEL_17",
31
+ "18": "LABEL_18",
32
+ "19": "LABEL_19",
33
+ "20": "LABEL_20",
34
+ "21": "LABEL_21",
35
+ "22": "LABEL_22",
36
+ "23": "LABEL_23",
37
+ "24": "LABEL_24",
38
+ "25": "LABEL_25",
39
+ "26": "LABEL_26",
40
+ "27": "LABEL_27",
41
+ "28": "LABEL_28",
42
+ "29": "LABEL_29",
43
+ "30": "LABEL_30",
44
+ "31": "LABEL_31",
45
+ "32": "LABEL_32",
46
+ "33": "LABEL_33",
47
+ "34": "LABEL_34",
48
+ "35": "LABEL_35",
49
+ "36": "LABEL_36",
50
+ "37": "LABEL_37",
51
+ "38": "LABEL_38",
52
+ "39": "LABEL_39",
53
+ "40": "LABEL_40",
54
+ "41": "LABEL_41",
55
+ "42": "LABEL_42",
56
+ "43": "LABEL_43",
57
+ "44": "LABEL_44",
58
+ "45": "LABEL_45",
59
+ "46": "LABEL_46",
60
+ "47": "LABEL_47",
61
+ "48": "LABEL_48"
62
+ },
63
+ "initializer_range": 0.02,
64
+ "intermediate_size": 3072,
65
+ "label2id": {
66
+ "LABEL_0": 0,
67
+ "LABEL_1": 1,
68
+ "LABEL_10": 10,
69
+ "LABEL_11": 11,
70
+ "LABEL_12": 12,
71
+ "LABEL_13": 13,
72
+ "LABEL_14": 14,
73
+ "LABEL_15": 15,
74
+ "LABEL_16": 16,
75
+ "LABEL_17": 17,
76
+ "LABEL_18": 18,
77
+ "LABEL_19": 19,
78
+ "LABEL_2": 2,
79
+ "LABEL_20": 20,
80
+ "LABEL_21": 21,
81
+ "LABEL_22": 22,
82
+ "LABEL_23": 23,
83
+ "LABEL_24": 24,
84
+ "LABEL_25": 25,
85
+ "LABEL_26": 26,
86
+ "LABEL_27": 27,
87
+ "LABEL_28": 28,
88
+ "LABEL_29": 29,
89
+ "LABEL_3": 3,
90
+ "LABEL_30": 30,
91
+ "LABEL_31": 31,
92
+ "LABEL_32": 32,
93
+ "LABEL_33": 33,
94
+ "LABEL_34": 34,
95
+ "LABEL_35": 35,
96
+ "LABEL_36": 36,
97
+ "LABEL_37": 37,
98
+ "LABEL_38": 38,
99
+ "LABEL_39": 39,
100
+ "LABEL_4": 4,
101
+ "LABEL_40": 40,
102
+ "LABEL_41": 41,
103
+ "LABEL_42": 42,
104
+ "LABEL_43": 43,
105
+ "LABEL_44": 44,
106
+ "LABEL_45": 45,
107
+ "LABEL_46": 46,
108
+ "LABEL_47": 47,
109
+ "LABEL_48": 48,
110
+ "LABEL_5": 5,
111
+ "LABEL_6": 6,
112
+ "LABEL_7": 7,
113
+ "LABEL_8": 8,
114
+ "LABEL_9": 9
115
+ },
116
+ "layer_norm_eps": 1e-12,
117
+ "max_position_embeddings": 512,
118
+ "model_type": "bert",
119
+ "num_attention_heads": 12,
120
+ "num_hidden_layers": 12,
121
+ "pad_token_id": 0,
122
+ "position_embedding_type": "absolute",
123
+ "torch_dtype": "float32",
124
+ "transformers_version": "4.44.2",
125
+ "type_vocab_size": 2,
126
+ "use_cache": true,
127
+ "vocab_size": 32000
128
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:21202eaf782833dcd47b7af7a8cb1f81926e3432e9765063a2e540a3cf3da0d8
3
+ size 440281084
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
1
+ {
2
+ "cls_token": {
3
+ "content": "[CLS]",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "mask_token": {
10
+ "content": "[MASK]",
11
+ "lstrip": false,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "[PAD]",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "sep_token": {
24
+ "content": "[SEP]",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "unk_token": {
31
+ "content": "[UNK]",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ }
37
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,62 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "[PAD]",
5
+ "lstrip": false,
6
+ "normalized": false,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": true
10
+ },
11
+ "1": {
12
+ "content": "[UNK]",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "2": {
20
+ "content": "[CLS]",
21
+ "lstrip": false,
22
+ "normalized": false,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": true
26
+ },
27
+ "3": {
28
+ "content": "[SEP]",
29
+ "lstrip": false,
30
+ "normalized": false,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": true
34
+ },
35
+ "4": {
36
+ "content": "[MASK]",
37
+ "lstrip": false,
38
+ "normalized": false,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": true
42
+ }
43
+ },
44
+ "clean_up_tokenization_spaces": true,
45
+ "cls_token": "[CLS]",
46
+ "do_basic_tokenize": true,
47
+ "do_lower_case": false,
48
+ "mask_token": "[MASK]",
49
+ "max_len": 512,
50
+ "max_length": 512,
51
+ "model_max_length": 512,
52
+ "never_split": null,
53
+ "pad_token": "[PAD]",
54
+ "sep_token": "[SEP]",
55
+ "stride": 0,
56
+ "strip_accents": null,
57
+ "tokenize_chinese_chars": true,
58
+ "tokenizer_class": "BertTokenizer",
59
+ "truncation_side": "right",
60
+ "truncation_strategy": "longest_first",
61
+ "unk_token": "[UNK]"
62
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff