Upload folder using huggingface_hub

Files changed (9)

README.md +144 -3
azeri-turkish-bert-ner.ipynb +0 -0
azeri-turkish-bert-ner.py +271 -0
config.json +128 -0
model.safetensors +3 -0
special_tokens_map.json +37 -0
tokenizer.json +0 -0
tokenizer_config.json +62 -0
vocab.txt +0 -0

README.md CHANGED

@@ -1,3 +1,144 @@
----
-license: mit
----

+# Azeri-Turkish-BERT-NER
+## Model Description
+The **Azeri-Turkish-BERT-NER** model is a fine-tuned version of `akdeniz27/bert-base-turkish-cased-ner` for Named Entity Recognition (NER) in Azerbaijani and Turkish. It starts from a Turkish BERT NER model and adapts it to Azerbaijani data, with the aim of preserving compatibility with Turkish entities.
+The model classifies named entities into categories such as persons, organizations, locations, and dates, which makes it suitable for information extraction and text processing in Azerbaijani and Turkish.
+## Model Details
+- **Base Model**: `akdeniz27/bert-base-turkish-cased-ner` (from the Hugging Face Hub)
+- **Task**: Named Entity Recognition (NER)
+- **Languages**: Azerbaijani, Turkish
+- **Fine-Tuned On**: `LocalDoc/azerbaijani-ner-dataset` (Azerbaijani NER data)
+- **Input Text Format**: Plain text at inference time; training examples are pre-split word lists
+- **Model Type**: BERT-based transformer for token classification
+## Training Details
+The model was fine-tuned using the Hugging Face `transformers` library and `datasets`. Here is a brief summary of the fine-tuning configuration:
+- **Tokenizer**: `AutoTokenizer` loaded from `akdeniz27/bert-base-turkish-cased-ner`
+- **Max Sequence Length**: 128 tokens
+- **Batch Size**: 128 (training and evaluation)
+- **Learning Rate**: 2e-5
+- **Number of Epochs**: up to 10 (the logged run ends after epoch 9)
+- **Weight Decay**: 0.005
+- **Early Stopping**: patience of 5 evaluations (one per epoch), monitored on F1
+### Training Dataset
+The model was trained on [LocalDoc/azerbaijani-ner-dataset](https://huggingface.co/datasets/LocalDoc/azerbaijani-ner-dataset). Preprocessing parsed the string-encoded token and tag lists and aligned the word-level NER tags to subword tokens; 10% of the examples were held out for validation.
+### Label Categories
+The model supports the following entity categories:
+- **Person (B-PERSON, I-PERSON)**
+- **Location (B-LOCATION, I-LOCATION)**
+- **Organization (B-ORGANISATION, I-ORGANISATION)**
+- **Date (B-DATE, I-DATE)**
+- **Time (B-TIME, I-TIME)**
+- **Money (B-MONEY, I-MONEY)**
+- **Percentage (B-PERCENTAGE, I-PERCENTAGE)**
+- **Facility (B-FACILITY, I-FACILITY)**
+- **Product (B-PRODUCT, I-PRODUCT)**
+- ... (additional categories as specified in the training label list)
+### Training Metrics
+| Epoch | Training Loss | Validation Loss | Precision | Recall | F1    |
+|-------|---------------|-----------------|-----------|--------|-------|
+| 1     | 0.433100      | 0.306711        | 0.739000  | 0.693282 | 0.715412 |
+| 2     | 0.292700      | 0.275796        | 0.781565  | 0.688937 | 0.732334 |
+| 3     | 0.250600      | 0.275115        | 0.758261  | 0.709425 | 0.733031 |
+| 4     | 0.233700      | 0.273087        | 0.756184  | 0.716277 | 0.735689 |
+| 5     | 0.214800      | 0.278477        | 0.756051  | 0.710996 | 0.732832 |
+| 6     | 0.199200      | 0.286102        | 0.755068  | 0.717012 | 0.735548 |
+| 7     | 0.192800      | 0.297157        | 0.742326  | 0.725802 | 0.733971 |
+| 8     | 0.178900      | 0.304510        | 0.743206  | 0.723930 | 0.733442 |
+| 9     | 0.171700      | 0.313845        | 0.743145  | 0.725535 | 0.734234 |
+### Category-Wise Evaluation Metrics
+| Category      | Precision | Recall | F1-Score | Support |
+|---------------|-----------|--------|----------|---------|
+| ART           | 0.49      | 0.14   | 0.21     | 1988    |
+| DATE          | 0.49      | 0.48   | 0.49     | 844     |
+| EVENT         | 0.88      | 0.36   | 0.51     | 84      |
+| FACILITY      | 0.72      | 0.68   | 0.70     | 1146    |
+| LAW           | 0.57      | 0.64   | 0.60     | 1103    |
+| LOCATION      | 0.77      | 0.79   | 0.78     | 8806    |
+| MONEY         | 0.62      | 0.57   | 0.59     | 532     |
+| ORGANISATION  | 0.64      | 0.65   | 0.64     | 527     |
+| PERCENTAGE    | 0.77      | 0.83   | 0.80     | 3679    |
+| PERSON        | 0.87      | 0.81   | 0.84     | 6924    |
+| PRODUCT       | 0.82      | 0.80   | 0.81     | 2653    |
+| TIME          | 0.55      | 0.50   | 0.52     | 1634    |
+- **Micro Average**: Precision: 0.76, Recall: 0.72, F1-Score: 0.74
+- **Macro Average**: Precision: 0.68, Recall: 0.60, F1-Score: 0.62
+- **Weighted Average**: Precision: 0.74, Recall: 0.72, F1-Score: 0.72
+## Usage
+### Loading the Model
+To use the model for NER tasks, you can load it using the Hugging Face `transformers` library:
+```python
+from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
+# Load the model and tokenizer
+tokenizer = AutoTokenizer.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
+model = AutoModelForTokenClassification.from_pretrained("IsmatS/Azeri-Turkish-BERT-NER")
+# Initialize the NER pipeline
+ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
+# Example text
+text = "Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."
+# Run NER
+results = ner_pipeline(text)
+print(results)
+```
+### Inputs and Outputs
+- **Input**: Plain text in Azerbaijani or Turkish.
+- **Output**: List of detected entities with entity types and character offsets.
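+Example output (illustrative; with the `config.json` in this repository, `entity_group` is reported as a generic `LABEL_n` id rather than a label name):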
+```
+[
+  {'entity_group': 'B-PERSON', 'word': 'Shahla', 'start': 0, 'end': 6, 'score': 0.98},
+  {'entity_group': 'B-ORGANISATION', 'word': 'Pasha Sığorta', 'start': 20, 'end': 33, 'score': 0.95}
+]
+```
+### Evaluation Metrics
+The model was evaluated using precision, recall, and F1-score metrics as detailed in the training metrics section.
+## Limitations
+- Performance may degrade on texts that differ substantially from the training data distribution.
+- Quality varies by category: in the category-wise table above, F1 ranges from 0.21 (ART) to 0.84 (PERSON).
+- Rare or unseen entities in Turkish and Azerbaijani may receive lower confidence scores or be missed.
+- Further fine-tuning on larger and more diverse datasets may improve generalization.
+## Model Card
+A detailed model card with additional training details, dataset descriptions, and usage recommendations is available on the [Hugging Face model page](https://huggingface.co/IsmatS/Azeri-Turkish-BERT-NER).
+## Citation
+If you use this model, please consider citing:
+```
+@misc{azeri-turkish-bert-ner,
+  author = {Ismat Samadov},
+  title = {Azeri-Turkish-BERT-NER},
+  year = {2024},
+  howpublished = {Hugging Face repository},
+}
+```
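
The training-metrics table in the README stops at epoch 9 although 10 epochs were configured. That is consistent with the early-stopping setup (patience 5, monitored on F1): the best F1 is at epoch 4, and the next five evaluations do not beat it. The sketch below replays the F1 column through a simplified patience counter. `early_stop_epoch` is a hypothetical helper written for illustration and is not the `EarlyStoppingCallback` implementation.

```python
# F1 per epoch, copied from the README training-metrics table.
f1_by_epoch = [0.715412, 0.732334, 0.733031, 0.735689, 0.732832,
               0.735548, 0.733971, 0.733442, 0.734234]


def early_stop_epoch(scores, patience):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    stop_epoch is None if the patience budget is never exhausted.
    """
    best_score = float("-inf")
    best_epoch = None
    stale = 0  # consecutive evaluations without improvement
    for epoch, score in enumerate(scores, start=1):
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                return best_epoch, epoch
    return best_epoch, None


best_epoch, stop_epoch = early_stop_epoch(f1_by_epoch, patience=5)
print(best_epoch, stop_epoch)
```

Since the script sets `load_best_model_at_end=True`, the uploaded weights should correspond to the best-F1 checkpoint rather than the last epoch.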

azeri-turkish-bert-ner.ipynb ADDED

The diff for this file is too large to render. See raw diff

azeri-turkish-bert-ner.py ADDED

	@@ -0,0 +1,271 @@

+# -*- coding: utf-8 -*-
+"""Azeri-Turkish-BERT-NER.ipynb
+Automatically generated by Colab.
+Original file is located at
+    https://colab.research.google.com/drive/1_vQDhrFp16kCtjJB5mENIT6jl5kkb03o
+"""
+# Colab shell command (not valid Python); run it in a notebook cell or terminal:
+# !pip install transformers datasets seqeval huggingface_hub
+# Standard library imports
+import ast                # Safely parses string literals into Python objects (e.g., token lists)
+import os                 # Provides functions for interacting with the operating system
+import warnings           # Used to handle or suppress warnings
+# Third-party numerical libraries
+import numpy as np        # Numerical operations and array manipulation
+import torch              # PyTorch library for tensor computations and model handling
+# Hugging Face and Transformers imports
+from datasets import load_dataset                     # Loads datasets for model training and evaluation
+from transformers import (
+    AutoTokenizer,                                   # Initializes a tokenizer from a pre-trained model
+    DataCollatorForTokenClassification,              # Handles padding and formatting of token classification data
+    TrainingArguments,                               # Defines training parameters like batch size and learning rate
+    Trainer,                                         # High-level API for managing training and evaluation
+    AutoModelForTokenClassification,                 # Loads a pre-trained model for token classification tasks
+    get_linear_schedule_with_warmup,                 # Learning rate scheduler for gradual warm-up and linear decay
+    EarlyStoppingCallback                           # Callback to stop training if validation performance plateaus
+)
+# Hugging Face Hub
+from huggingface_hub import login                   # Allows logging in to Hugging Face Hub to upload models
+# seqeval metrics for NER evaluation
+from seqeval.metrics import precision_score, recall_score, f1_score, classification_report
+# Provides precision, recall, F1-score, and classification report for evaluating NER model performance
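+# Log in to Hugging Face Hub with a token read from the environment; never hard-code access tokens.
+# If HF_TOKEN is unset, login() falls back to an interactive prompt.
+login(token=os.environ.get("HF_TOKEN"))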
+# Disable WandB (Weights & Biases) logging to avoid unwanted log outputs during training
+os.environ["WANDB_DISABLED"] = "true"
+# Suppress warning messages to keep output clean, especially during training and evaluation
+warnings.filterwarnings("ignore")
+# Load the Azerbaijani NER dataset from Hugging Face
+dataset = load_dataset("LocalDoc/azerbaijani-ner-dataset")
+print(dataset)  # Display dataset structure (e.g., train/validation splits)
+# Preprocessing function to format tokens and NER tags correctly
+def preprocess_example(example):
+    try:
+        # Convert string of tokens to a list and parse NER tags to integers
+        example["tokens"] = ast.literal_eval(example["tokens"])
+        example["ner_tags"] = list(map(int, ast.literal_eval(example["ner_tags"])))
+    except (ValueError, SyntaxError) as e:
+        # Skip and log malformed examples, ensuring error resilience
+        print(f"Skipping malformed example: {example['index']} due to error: {e}")
+        example["tokens"] = []
+        example["ner_tags"] = []
+    return example
+# Apply preprocessing to each dataset entry, ensuring consistent formatting
+dataset = dataset.map(preprocess_example)
+# Initialize the tokenizer from the Turkish BERT NER checkpoint used as the base model
+# (alternative kept for reference: xlm-roberta-large)
+# tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
+tokenizer = AutoTokenizer.from_pretrained("akdeniz27/bert-base-turkish-cased-ner")
+# Function to tokenize input and align labels with tokenized words
+def tokenize_and_align_labels(example):
+    # Tokenize the sentence while preserving word boundaries for correct NER tag alignment
+    tokenized_inputs = tokenizer(
+        example["tokens"],            # List of words (tokens) in the sentence
+        truncation=True,               # Truncate sentences longer than max_length
+        is_split_into_words=True,      # Specify that input is a list of words
+        padding="max_length",          # Pad to maximum sequence length
+        max_length=128,                # Set the maximum sequence length to 128 tokens
+    )
+    labels = []                        # List to store aligned NER labels
+    word_ids = tokenized_inputs.word_ids()  # Get word IDs for each token
+    previous_word_idx = None           # Initialize previous word index for tracking
+    # Loop through word indices to align NER tags with subword tokens
+    for word_idx in word_ids:
+        if word_idx is None:
+            labels.append(-100)        # Special tokens ([CLS], [SEP], [PAD]) get -100 (ignored in loss)
+        elif word_idx != previous_word_idx:
+            # First subword of a word takes that word's NER tag (-100 if the tag list is too short)
+            labels.append(example["ner_tags"][word_idx] if word_idx < len(example["ner_tags"]) else -100)
+        else:
+            labels.append(-100)        # Remaining subwords of the same word are ignored in the loss
+        previous_word_idx = word_idx   # Update previous word index
+    tokenized_inputs["labels"] = labels  # Add labels to tokenized inputs
+    return tokenized_inputs
+# Apply tokenization and label alignment function to the dataset
+tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=False)
+# Create a 90-10 split of the dataset for training and validation
+tokenized_datasets = tokenized_datasets["train"].train_test_split(test_size=0.1)
+print(tokenized_datasets)  # Output structure of split datasets
+# Define a list of entity labels for NER tagging with B- (beginning) and I- (inside) markers
+label_list = [
+    "O",                  # Outside of a named entity
+    "B-PERSON", "I-PERSON",         # Person name (e.g., "John" in "John Doe")
+    "B-LOCATION", "I-LOCATION",     # Geographical location (e.g., "Paris")
+    "B-ORGANISATION", "I-ORGANISATION", # Organization name (e.g., "UNICEF")
+    "B-DATE", "I-DATE",             # Date entity (e.g., "2024-11-05")
+    "B-TIME", "I-TIME",             # Time (e.g., "12:00 PM")
+    "B-MONEY", "I-MONEY",           # Monetary values (e.g., "$20")
+    "B-PERCENTAGE", "I-PERCENTAGE", # Percentage values (e.g., "20%")
+    "B-FACILITY", "I-FACILITY",     # Physical facilities (e.g., "Airport")
+    "B-PRODUCT", "I-PRODUCT",       # Product names (e.g., "iPhone")
+    "B-EVENT", "I-EVENT",           # Named events (e.g., "Olympics")
+    "B-ART", "I-ART",               # Works of art (e.g., "Mona Lisa")
+    "B-LAW", "I-LAW",               # Laws and legal documents (e.g., "Article 50")
+    "B-LANGUAGE", "I-LANGUAGE",     # Languages (e.g., "Azerbaijani")
+    "B-GPE", "I-GPE",               # Geopolitical entities (e.g., "Europe")
+    "B-NORP", "I-NORP",             # Nationalities, religious groups, political groups
+    "B-ORDINAL", "I-ORDINAL",       # Ordinal indicators (e.g., "first", "second")
+    "B-CARDINAL", "I-CARDINAL",     # Cardinal numbers (e.g., "three")
+    "B-DISEASE", "I-DISEASE",       # Diseases (e.g., "COVID-19")
+    "B-CONTACT", "I-CONTACT",       # Contact info (e.g., email or phone number)
+    "B-ADAGE", "I-ADAGE",           # Common sayings or adages
+    "B-QUANTITY", "I-QUANTITY",     # Quantities (e.g., "5 km")
+    "B-MISCELLANEOUS", "I-MISCELLANEOUS", # Miscellaneous entities not fitting other categories
+    "B-POSITION", "I-POSITION",     # Job titles or positions (e.g., "CEO")
+    "B-PROJECT", "I-PROJECT"        # Project names (e.g., "Project Apollo")
+]
+# Initialize a data collator to handle padding and formatting for token classification
+data_collator = DataCollatorForTokenClassification(tokenizer)
+# Load a pre-trained model for token classification, adapted for NER tasks
+# model = AutoModelForTokenClassification.from_pretrained(
+#     "xlm-roberta-large",               # Base model (multilingual XLM-RoBERTa) for NER
+#     num_labels=len(label_list)        # Set the number of output labels to match NER categories
+# )
+model = AutoModelForTokenClassification.from_pretrained(
+    "akdeniz27/bert-base-turkish-cased-ner",
+    num_labels=len(label_list),  # Ensure this matches the number of labels for your NER task
+    ignore_mismatched_sizes=True  # Allow loading despite mismatched classifier layer size
+)
+# Define a function to compute evaluation metrics for the model's predictions
+def compute_metrics(p):
+    predictions, labels = p  # Unpack predictions and true labels from the input
+    # Convert logits to predicted label indices by taking the argmax along the last axis
+    predictions = np.argmax(predictions, axis=2)
+    # Filter out special padding labels (-100) and convert indices to label names
+    true_labels = [[label_list[l] for l in label if l != -100] for label in labels]
+    true_predictions = [
+        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
+        for prediction, label in zip(predictions, labels)
+    ]
+    # Print a detailed classification report for each label category
+    print(classification_report(true_labels, true_predictions))
+    # Calculate and return key evaluation metrics
+    return {
+        # Precision measures the accuracy of predicted positive instances
+        # Important in NER to ensure entity predictions are correct and reduce false positives.
+        "precision": precision_score(true_labels, true_predictions),
+        # Recall measures the model's ability to capture all relevant entities
+        # Essential in NER to ensure the model captures all entities, reducing false negatives.
+        "recall": recall_score(true_labels, true_predictions),
+        # F1-score is the harmonic mean of precision and recall, balancing both metrics
+        # Useful in NER for providing an overall performance measure, especially when precision and recall are both important.
+        "f1": f1_score(true_labels, true_predictions),
+    }
+# Set up training arguments for model training, defining essential training configurations
+training_args = TrainingArguments(
+    output_dir="./results",               # Directory to save model checkpoints and final outputs
+    evaluation_strategy="epoch",          # Evaluate model on the validation set at the end of each epoch
+    save_strategy="epoch",                # Save model checkpoints at the end of each epoch
+    learning_rate=2e-5,                   # Set a low learning rate to ensure stable training for fine-tuning
+    per_device_train_batch_size=128,       # Number of examples per batch during training, balancing speed and memory
+    per_device_eval_batch_size=128,        # Number of examples per batch during evaluation
+    num_train_epochs=10,                   # Number of full training passes over the dataset
+    weight_decay=0.005,                    # Regularization term to prevent overfitting by penalizing large weights
+    fp16=True,                            # Use 16-bit floating point for faster and memory-efficient training
+    logging_dir='./logs',                 # Directory to store training logs
+    save_total_limit=2,                   # Keep only the 2 latest model checkpoints to save storage space
+    load_best_model_at_end=True,          # Load the best model based on metrics at the end of training
+    metric_for_best_model="f1",           # Use F1-score to determine the best model checkpoint
+    report_to="none"                      # Disable reporting to external services (useful in local runs)
+)
+# Initialize the Trainer class to manage the training loop with all necessary components
+trainer = Trainer(
+    model=model,                         # The pre-trained model to be fine-tuned
+    args=training_args,                  # Training configuration parameters defined in TrainingArguments
+    train_dataset=tokenized_datasets["train"],  # Tokenized training dataset
+    eval_dataset=tokenized_datasets["test"],    # Tokenized validation dataset
+    tokenizer=tokenizer,                 # Tokenizer used for processing input text
+    data_collator=data_collator,         # Data collator for padding and batching during training
+    compute_metrics=compute_metrics,     # Function to calculate evaluation metrics like precision, recall, F1
+    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)] # Stop training early if F1 does not improve for 5 consecutive evaluations
+)
+# Begin the training process and capture the training metrics
+training_metrics = trainer.train()
+# Evaluate the model on the validation set after training
+eval_results = trainer.evaluate()
+# Print evaluation results, including precision, recall, and F1-score
+print(eval_results)
+# Define the directory where the trained model and tokenizer will be saved
+save_directory = "./Azeri-Turkish-BERT-NER"
+# Save the trained model to the specified directory
+model.save_pretrained(save_directory)
+# Save the tokenizer to the same directory for compatibility with the model
+tokenizer.save_pretrained(save_directory)
+from transformers import pipeline
+# Load tokenizer and model
+tokenizer = AutoTokenizer.from_pretrained(save_directory)
+model = AutoModelForTokenClassification.from_pretrained(save_directory)
+# Initialize the NER pipeline
+device = 0 if torch.cuda.is_available() else -1
+nlp_ner = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple", device=device)
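+# Map the generic pipeline labels (LABEL_1 ... LABEL_48) back to names; "O" (LABEL_0) is excluded
+label_mapping = {f"LABEL_{i}": label for i, label in enumerate(label_list) if label != "O"}
+# Rough sanity check: compares the ordered list of predicted entity labels with the expected
+# entity labels for each text. This is not a token-level evaluation like compute_metrics above.
+def evaluate_model(test_texts, true_labels):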
+    predictions = []
+    for i, text in enumerate(test_texts):
+        pred_entities = nlp_ner(text)
+        pred_labels = [label_mapping.get(entity["entity_group"], "O") for entity in pred_entities if entity["entity_group"] in label_mapping]
+        if len(pred_labels) != len(true_labels[i]):
+            print(f"Warning: Inconsistent number of entities in sample {i+1}. Adjusting predicted entities.")
+            pred_labels = pred_labels[:len(true_labels[i])]
+        predictions.append(pred_labels)
+    if all(len(true) == len(pred) for true, pred in zip(true_labels, predictions)):
+        precision = precision_score(true_labels, predictions)
+        recall = recall_score(true_labels, predictions)
+        f1 = f1_score(true_labels, predictions)
+        print("Precision:", precision)
+        print("Recall:", recall)
+        print("F1-Score:", f1)
+        print(classification_report(true_labels, predictions))
+    else:
+        print("Error: Could not align all samples correctly for evaluation.")
+test_texts = ["Shahla Khuduyeva və Pasha Sığorta şirkəti haqqında məlumat."]
+true_labels = [["B-PERSON", "B-ORGANISATION"]]
+evaluate_model(test_texts, true_labels)
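
The label-alignment rule in `tokenize_and_align_labels` can be illustrated without loading a tokenizer. The sketch below takes a hand-written `word_ids` list (an assumption standing in for `tokenized_inputs.word_ids()`) and applies the same rule: special tokens and continuation subwords get -100, and the first subword of each word gets that word's tag. `align_labels` is a hypothetical helper name.

```python
IGNORE_INDEX = -100  # label value skipped by the loss, as in the training script


def align_labels(word_ids, ner_tags):
    """Map word-level tags onto subword positions, labelling only the first subword of each word."""
    labels = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special or padding token: no word behind it.
            labels.append(IGNORE_INDEX)
        elif word_idx != previous_word_idx:
            # First subword of a new word; guard against a tag list shorter than the word list.
            labels.append(ner_tags[word_idx] if word_idx < len(ner_tags) else IGNORE_INDEX)
        else:
            # Continuation subword of the same word.
            labels.append(IGNORE_INDEX)
        previous_word_idx = word_idx
    return labels


# Hypothetical tokenization of three words: word 0 splits into two subwords, words 1 and 2 stay whole.
word_ids = [None, 0, 0, 1, 2, None, None]  # [CLS], w0, w0, w1, w2, [SEP], [PAD]
ner_tags = [1, 2, 0]                       # B-PERSON, I-PERSON, O in the script's label_list
print(align_labels(word_ids, ner_tags))    # [-100, 1, -100, 2, 0, -100, -100]
```

Labelling only the first subword keeps one prediction per word, which matches the word-level tags that `compute_metrics` passes to seqeval.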

config.json ADDED

	@@ -0,0 +1,128 @@

+{
+  "_name_or_path": "akdeniz27/bert-base-turkish-cased-ner",
+  "architectures": [
+    "BertForTokenClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "id2label": {
+    "0": "LABEL_0",
+    "1": "LABEL_1",
+    "2": "LABEL_2",
+    "3": "LABEL_3",
+    "4": "LABEL_4",
+    "5": "LABEL_5",
+    "6": "LABEL_6",
+    "7": "LABEL_7",
+    "8": "LABEL_8",
+    "9": "LABEL_9",
+    "10": "LABEL_10",
+    "11": "LABEL_11",
+    "12": "LABEL_12",
+    "13": "LABEL_13",
+    "14": "LABEL_14",
+    "15": "LABEL_15",
+    "16": "LABEL_16",
+    "17": "LABEL_17",
+    "18": "LABEL_18",
+    "19": "LABEL_19",
+    "20": "LABEL_20",
+    "21": "LABEL_21",
+    "22": "LABEL_22",
+    "23": "LABEL_23",
+    "24": "LABEL_24",
+    "25": "LABEL_25",
+    "26": "LABEL_26",
+    "27": "LABEL_27",
+    "28": "LABEL_28",
+    "29": "LABEL_29",
+    "30": "LABEL_30",
+    "31": "LABEL_31",
+    "32": "LABEL_32",
+    "33": "LABEL_33",
+    "34": "LABEL_34",
+    "35": "LABEL_35",
+    "36": "LABEL_36",
+    "37": "LABEL_37",
+    "38": "LABEL_38",
+    "39": "LABEL_39",
+    "40": "LABEL_40",
+    "41": "LABEL_41",
+    "42": "LABEL_42",
+    "43": "LABEL_43",
+    "44": "LABEL_44",
+    "45": "LABEL_45",
+    "46": "LABEL_46",
+    "47": "LABEL_47",
+    "48": "LABEL_48"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "label2id": {
+    "LABEL_0": 0,
+    "LABEL_1": 1,
+    "LABEL_10": 10,
+    "LABEL_11": 11,
+    "LABEL_12": 12,
+    "LABEL_13": 13,
+    "LABEL_14": 14,
+    "LABEL_15": 15,
+    "LABEL_16": 16,
+    "LABEL_17": 17,
+    "LABEL_18": 18,
+    "LABEL_19": 19,
+    "LABEL_2": 2,
+    "LABEL_20": 20,
+    "LABEL_21": 21,
+    "LABEL_22": 22,
+    "LABEL_23": 23,
+    "LABEL_24": 24,
+    "LABEL_25": 25,
+    "LABEL_26": 26,
+    "LABEL_27": 27,
+    "LABEL_28": 28,
+    "LABEL_29": 29,
+    "LABEL_3": 3,
+    "LABEL_30": 30,
+    "LABEL_31": 31,
+    "LABEL_32": 32,
+    "LABEL_33": 33,
+    "LABEL_34": 34,
+    "LABEL_35": 35,
+    "LABEL_36": 36,
+    "LABEL_37": 37,
+    "LABEL_38": 38,
+    "LABEL_39": 39,
+    "LABEL_4": 4,
+    "LABEL_40": 40,
+    "LABEL_41": 41,
+    "LABEL_42": 42,
+    "LABEL_43": 43,
+    "LABEL_44": 44,
+    "LABEL_45": 45,
+    "LABEL_46": 46,
+    "LABEL_47": 47,
+    "LABEL_48": 48,
+    "LABEL_5": 5,
+    "LABEL_6": 6,
+    "LABEL_7": 7,
+    "LABEL_8": 8,
+    "LABEL_9": 9
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "torch_dtype": "float32",
+  "transformers_version": "4.44.2",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 32000
+}
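
The saved `config.json` carries generic `LABEL_0` ... `LABEL_48` names because the training script passes only `num_labels` when loading the model. One possible remedy, sketched here under the assumption that label ids follow the order of `label_list` in the training script, is to build explicit `id2label` and `label2id` mappings. `ENTITY_TYPES` and `readable` are illustrative names.

```python
# Entity types in the same order as label_list in azeri-turkish-bert-ner.py.
ENTITY_TYPES = [
    "PERSON", "LOCATION", "ORGANISATION", "DATE", "TIME", "MONEY",
    "PERCENTAGE", "FACILITY", "PRODUCT", "EVENT", "ART", "LAW",
    "LANGUAGE", "GPE", "NORP", "ORDINAL", "CARDINAL", "DISEASE",
    "CONTACT", "ADAGE", "QUANTITY", "MISCELLANEOUS", "POSITION", "PROJECT",
]

# "O" first, then a B-/I- pair per entity type: 1 + 2 * 24 = 49 labels.
label_list = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]

id2label = dict(enumerate(label_list))
label2id = {label: idx for idx, label in id2label.items()}


def readable(generic_name):
    """Translate a generic name such as 'LABEL_1' into its label_list entry."""
    return id2label[int(generic_name.split("_", 1)[1])]


print(len(label_list), readable("LABEL_1"), readable("LABEL_48"))
```

Passing `id2label=id2label, label2id=label2id` to `AutoModelForTokenClassification.from_pretrained(...)` before training should write these names into `config.json`, so that the pipeline reports entity names directly.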

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:21202eaf782833dcd47b7af7a8cb1f81926e3432e9765063a2e540a3cf3da0d8
+size 440281084
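
`model.safetensors` is stored through Git LFS, so the diff shows only a three-line pointer. As a rough plausibility check, and assuming all weights are stored as float32 (which `torch_dtype` in `config.json` suggests), the byte size can be converted into an approximate parameter count. `parse_lfs_pointer` is a hypothetical helper.

```python
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:21202eaf782833dcd47b7af7a8cb1f81926e3432e9765063a2e540a3cf3da0d8
size 440281084
"""


def parse_lfs_pointer(text):
    """Split each 'key value' line of a Git LFS pointer file into a dict."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line.strip())
    fields["size"] = int(fields["size"])
    return fields


pointer = parse_lfs_pointer(POINTER_TEXT)
BYTES_PER_FLOAT32 = 4
# Slight overestimate: the file also contains a small safetensors header.
approx_params = pointer["size"] // BYTES_PER_FLOAT32
print(f"{approx_params / 1e6:.1f}M parameters (approx.)")
```

A figure of roughly 110M parameters is in line with a BERT-base-sized encoder (12 layers, hidden size 768, as listed in `config.json`).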

special_tokens_map.json ADDED

	@@ -0,0 +1,37 @@

+{
+  "cls_token": {
+    "content": "[CLS]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "[MASK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "[PAD]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "[SEP]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "[UNK]",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,62 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "4": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": false,
+  "mask_token": "[MASK]",
+  "max_len": 512,
+  "max_length": 512,
+  "model_max_length": 512,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "stride": 0,
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "truncation_side": "right",
+  "truncation_strategy": "longest_first",
+  "unk_token": "[UNK]"
+}

vocab.txt ADDED

The diff for this file is too large to render. See raw diff