Model Description

This model is a fine-tuned version of google-bert/bert-base-cased specifically adapted for Named Entity Recognition (NER) of common and scientific plant names. It utilizes full fine-tuning of the base model for this task. The goal is to identify spans of text corresponding to plant names and classify them as either common (PLANT_COMMON) or scientific (PLANT_SCI) according to the IOB2 tagging scheme.

Developed by: [Your Name/Organization - Fill this in]
Model type: BERT (bert-base-cased) fine-tuned for Token Classification (NER) using full fine-tuning
Language(s): Primarily English (based on bert-base-cased and likely training data)
License: Apache 2.0, inherited from the base model (bert-base-cased) unless otherwise specified
Fine-tuned from model: google-bert/bert-base-cased

Intended Uses & Limitations

Intended Use

This model is intended for identifying and classifying mentions of plant names (common and scientific) within English text. Potential applications include:

Extracting plant names from botanical texts, research papers, or gardening articles
Structuring information about plant mentions in databases
Assisting in indexing or searching documents based on contained plant names
Preprocessing text for downstream tasks that require knowledge of plant entities

Limitations

Domain Specificity: The model's performance is likely best on text similar to its training data (generated templates about plants). Performance may degrade on significantly different domains (e.g., highly informal text, complex biological pathway descriptions unless similar data was included)
IOB2 Scheme: The model strictly adheres to the IOB2 tagging scheme (B-TAG, I-TAG, O). It identifies the beginning (B-) and inside (I-) tokens of a named entity span
Specific Tags: Trained only to recognize PLANT_COMMON and PLANT_SCI. It will tag all other tokens as O (Outside). It cannot identify other entity types (e.g., locations, people, chemicals) unless it is retrained with labels for them
Ambiguity: May struggle with ambiguous terms where a word could be a plant name in one context but not another (e.g., "Rose" as a name vs. a flower)
Novel Names: Performance on plant names not seen during training (or very different from those seen) may be lower
Context Dependency: Like most NER models, its accuracy depends heavily on the surrounding context. Short, isolated mentions might be harder to classify correctly
Case Sensitivity: Based on bert-base-cased, the model is case-sensitive, which might be beneficial for distinguishing scientific names but could affect common names written inconsistently
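
To make the IOB2 output concrete, the following minimal sketch groups word-level tags into entity spans. The function name `iob2_to_spans` and the example tags are illustrative assumptions, not part of the released model code. Treating a stray I- tag (one with no matching B- before it) as the start of a new span is a choice made for this sketch only.

```python
from typing import List, Tuple

def iob2_to_spans(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group word-level IOB2 tags into (entity_text, entity_type) pairs."""
    spans: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_type = None
    for word, tag in zip(words, tags):
        prefix, _, ent_type = tag.partition("-")
        # An I- tag of the same type extends the span that is currently open.
        if prefix == "I" and ent_type == current_type:
            current_words.append(word)
            continue
        # Any other tag closes the open span, if there is one.
        if current_words:
            spans.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        # B- opens a new span; a stray I- is also treated as an opener (sketch choice).
        if prefix in ("B", "I"):
            current_words, current_type = [word], ent_type
    if current_words:
        spans.append((" ".join(current_words), current_type))
    return spans

# Hand-written example tags, not actual model output.
words = ["The", "Pineapple", "Guava", "(", "Feijoa", "sellowiana", ")"]
tags = ["O", "B-PLANT_COMMON", "I-PLANT_COMMON", "O", "B-PLANT_SCI", "I-PLANT_SCI", "O"]
spans = iob2_to_spans(words, tags)
print(spans)
```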

How to Use (with Transformers)

from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# --- Configuration ---
# --- *** Point this to the directory containing the saved model *** ---
MODEL_PATH = "/kaggle/working/bert_ner_full_finetune_plant_output/best_model"
# --- ************************************************************** ---
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForTokenClassification.from_pretrained(MODEL_PATH)
model.to(DEVICE)
model.eval()

print("Model loaded and ready for inference.")

# --- Inference Example ---
text = "The Pineapple Guava (Feijoa sellowiana) is different from Ananas comosus."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(DEVICE)

with torch.no_grad():
    logits = model(**inputs).logits

predictions = torch.argmax(logits, dim=2)
predicted_token_class_ids = predictions[0].cpu().numpy()

# Map IDs back to labels, keeping one label per word (its first sub-token).
# The tokenizer splits words on punctuation as well as whitespace, so words are
# recovered from the tokenizer's own character spans rather than text.split().
word_ids = inputs.word_ids() # Only available with fast tokenizers

aligned = [] # (word, label) pairs
previous_word_idx = None
for i, word_idx in enumerate(word_ids):
    if word_idx is None:
        continue # Skip special tokens ([CLS], [SEP], [PAD])
    if word_idx != previous_word_idx: # Only take first token of each word
        span = inputs.word_to_chars(word_idx) # Character span of this word in `text`
        label_id = int(predicted_token_class_ids[i])
        label_str = model.config.id2label.get(label_id, "O")
        aligned.append((text[span.start:span.end], label_str))
    previous_word_idx = word_idx

print("Text:", text)
print("Predicted Labels (one per word):")
for word, label in aligned:
    if label != "O":
        print(f"- {word}: {label}")

Using the ONNX Model

If you've exported the model to ONNX format, you can use it as follows:

import onnxruntime as ort
import numpy as np
import os
from transformers import AutoTokenizer, AutoConfig

# --- *** Point this to the directory containing the ONNX model *** ---
ONNX_MODEL_DIR = "/kaggle/working/bert_ner_onnx"
# --- ************************************************************** ---

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(ONNX_MODEL_DIR)

# Load ONNX model and create session
model_path = os.path.join(ONNX_MODEL_DIR, "model.onnx")
ort_session = ort.InferenceSession(model_path, providers=['CPUExecutionProvider']) # Or ['CUDAExecutionProvider'] if available

# Load id_to_label map (needed for decoding)
config = AutoConfig.from_pretrained(ONNX_MODEL_DIR)
id_to_label = config.id2label

# --- Inference Example ---
text = "The Pineapple Guava (Feijoa sellowiana) is different from Ananas comosus."
inputs = tokenizer(text, return_tensors="np") # Use numpy for ONNX runtime

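# Prepare inputs for ONNX session (pass only the inputs the exported graph declares)
input_names = {node.name for node in ort_session.get_inputs()}
ort_inputs = {k: v for k, v in inputs.items() if k in input_names}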

# Run inference
ort_outputs = ort_session.run(None, ort_inputs)
logits = ort_outputs[0] # Usually the first output

predictions = np.argmax(logits, axis=-1)
predicted_token_class_ids = predictions[0]

# Map IDs back to labels
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Text:", text)
print("Predicted Labels (per token):")
for token, label_id in zip(tokens, predicted_token_class_ids):
    if token not in [tokenizer.cls_token, tokenizer.sep_token, tokenizer.pad_token]:
        print(f"- {token}: {id_to_label.get(label_id, 'O')}")
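
The ONNX example prints one label per WordPiece token, so a single plant name can appear as several `##`-prefixed pieces. The sketch below shows one way to rejoin those pieces into words and keep the label of the first piece, which mirrors the first-sub-token convention used in training. The helper name `merge_wordpieces`, the token split, and the labels are hypothetical; special tokens are assumed to have been filtered out beforehand.

```python
from typing import List, Tuple

def merge_wordpieces(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Rejoin WordPiece tokens into words, keeping the first piece's label."""
    merged: List[Tuple[str, str]] = []
    for token, label in zip(tokens, labels):
        if token.startswith("##") and merged:
            # Continuation piece: append its text to the previous word, keep that word's label.
            word, first_label = merged[-1]
            merged[-1] = (word + token[2:], first_label)
        else:
            merged.append((token, label))
    return merged

# Hypothetical token split and labels, for illustration only.
tokens = ["Fe", "##ij", "##oa", "sell", "##owiana", "is"]
labels = ["B-PLANT_SCI", "I-PLANT_SCI", "I-PLANT_SCI", "I-PLANT_SCI", "I-PLANT_SCI", "O"]
merged_words = merge_wordpieces(tokens, labels)
print(merged_words)
```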

Training Data

The model was fine-tuned on a dataset generated from templates focusing on common and scientific plant names. The data format is CoNLL style (one token and tag per line, separated by TAB, with empty lines between sentences).
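
As an illustration of the format described above, here is a minimal reader for such CoNLL-style data. The function name `parse_conll` and the sample lines are assumptions made for this sketch; the actual training script and data are not shown in this card.

```python
from typing import Iterable, List, Tuple

Sentence = Tuple[List[str], List[str]]  # (tokens, tags)

def parse_conll(lines: Iterable[str]) -> List[Sentence]:
    """Parse 'token<TAB>tag' lines into (tokens, tags) pairs, one per sentence."""
    sentences: List[Sentence] = []
    tokens: List[str] = []
    tags: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line marks a sentence boundary.
            if tokens:
                sentences.append((tokens, tags))
                tokens, tags = [], []
            continue
        token, tag = line.split("\t")
        tokens.append(token)
        tags.append(tag)
    if tokens:  # The input may not end with a blank line.
        sentences.append((tokens, tags))
    return sentences

# Made-up sample in the described format.
sample = "Ananas\tB-PLANT_SCI\ncomosus\tI-PLANT_SCI\n\nIt\tO\ngrows\tO\n"
sentences = parse_conll(sample.splitlines())
print(sentences)
```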

Data Split: 90% Training, 10% Validation (using sklearn.model_selection.train_test_split with random_state=42).

Training Procedure

Preprocessing

Tokenizer: BertTokenizerFast from google-bert/bert-base-cased
Padding: Padded/truncated to max_length=128
Label Alignment: Standard IOB2 scheme. Labels are aligned to the first token of each word. Special tokens and subsequent subword tokens are assigned the ignore_index (-100)
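
The label alignment step can be sketched as follows. It takes the per-token word indices that a fast tokenizer returns from `word_ids()` and assigns each word's label to its first sub-token only. The function name `align_labels`, the `label2id` ordering, and the example token split are assumptions for illustration; the real mapping is stored in the model's config.

```python
from typing import Dict, List, Optional

IGNORE_INDEX = -100  # Positions with this label are skipped by the loss function

def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[str],
    label2id: Dict[str, int],
) -> List[int]:
    """Assign each word's label to its first sub-token; ignore everything else."""
    aligned: List[int] = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # Special token ([CLS], [SEP], [PAD])
        elif word_id != previous:
            aligned.append(label2id[word_labels[word_id]])  # First sub-token of a word
        else:
            aligned.append(IGNORE_INDEX)  # Subsequent sub-token of the same word
        previous = word_id
    return aligned

# Assumed label order, for illustration only.
label2id = {"O": 0, "B-PLANT_COMMON": 1, "I-PLANT_COMMON": 2, "B-PLANT_SCI": 3, "I-PLANT_SCI": 4}
# Hypothetical word_ids for: [CLS] An ##anas com ##osus [SEP]
word_ids = [None, 0, 0, 1, 1, None]
word_labels = ["B-PLANT_SCI", "I-PLANT_SCI"]
aligned = align_labels(word_ids, word_labels, label2id)
print(aligned)
```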

Training

Framework: PyTorch with transformers
Environment: GPU (Kaggle P100/T4/V100)
Precision: Float32 (FP16 disabled)
Optimizer: AdamW
Learning Rate: 2e-5 with linear warmup and decay
Batch Size: 8 per device with 4 gradient accumulation steps (effective batch size of 32 on a single GPU)
Epochs: Up to 3 epochs with early stopping (patience=2 based on validation F1)
Weight Decay: 0.01
Gradient Clipping: Max norm = 1.0 (the Trainer default, max_grad_norm=1.0)
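
The batch size and learning rate settings above can be worked through in a few lines. The sketch computes the effective batch size and a learning rate that ramps up linearly from zero and then decays linearly back to zero. The step counts (`total_steps`, `warmup_steps`) are hypothetical, since the card does not state the number of warmup steps or the dataset size, and the helper names are illustrative.

```python
def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of examples contributing to each optimizer update."""
    return per_device * accumulation_steps * num_devices

def linear_warmup_decay_lr(step: int, total_steps: int, warmup_steps: int, base_lr: float = 2e-5) -> float:
    """Learning rate at a given optimizer step under linear warmup then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

batch = effective_batch_size(per_device=8, accumulation_steps=4)  # Single GPU assumed

# Hypothetical schedule length, chosen only to make the numbers easy to read.
total_steps, warmup_steps = 1000, 100
lrs = [linear_warmup_decay_lr(s, total_steps, warmup_steps) for s in (0, 50, 100, 550, 1000)]
print(batch, lrs)
```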

Evaluation Results

Evaluation was performed using the seqeval library with the IOB2 scheme and strict matching. The primary metric tracked was the micro-averaged F1 score.
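
To show what strict, entity-level matching means, here is a simplified pure-Python stand-in for the metric. It is not the seqeval implementation: a predicted span counts as correct only if its boundaries and type both match a gold span exactly, and I- tags that do not continue an open span are ignored in this sketch (seqeval's handling of ill-formed sequences may differ). The function names and example tags are illustrative.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, int, str]  # (sentence index, start, end exclusive, entity type)

def extract_spans(sentences: List[List[str]]) -> Set[Span]:
    """Collect entity spans from IOB2 tag sequences; a span must start with B-."""
    spans: Set[Span] = set()
    for s_idx, tags in enumerate(sentences):
        start, ent_type = None, None
        for i, tag in enumerate(tags + ["O"]):  # Sentinel closes a trailing span
            prefix, _, cur_type = tag.partition("-")
            if prefix == "I" and start is not None and cur_type == ent_type:
                continue  # Still inside the open span
            if start is not None:
                spans.add((s_idx, start, i, ent_type))
                start, ent_type = None, None
            if prefix == "B":
                start, ent_type = i, cur_type
    return spans

def micro_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exactly matching spans."""
    gold_spans, pred_spans = extract_spans(gold), extract_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up tags: the second predicted entity is cut one token short, so it does not count.
gold = [["O", "B-PLANT_COMMON", "I-PLANT_COMMON", "O", "B-PLANT_SCI", "I-PLANT_SCI"]]
pred = [["O", "B-PLANT_COMMON", "I-PLANT_COMMON", "O", "B-PLANT_SCI", "O"]]
score = micro_f1(gold, pred)
print(score)
```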

Environmental Impact

Hardware: Trained on GPU (likely Nvidia P100)
Compute: [Estimate training time if known, e.g., Approx. X hours on a single GPU]. Carbon emissions can be estimated using tools like the Machine Learning Impact calculator if compute details are known.

Disclaimer

This model is fine-tuned from a base model and inherits its capabilities and biases. Performance depends heavily on the similarity between the target text and the training data. Always evaluate thoroughly for your specific use case.

Dudeman523/googe-bert-based-cased-NER-plants