BAT Token Classifier

The BAT Token Classifier is a token-level classification model designed to detect and mask bias-sensitive information in legal documents. It leverages a BERT-base-uncased backbone and has been trained on a custom legal dataset annotated for bias-aware token labeling.

Model Overview

Architecture: BERT-based token classifier
Purpose: Detect tokens in legal text that may indicate bias and perform classification at the token level
Training Dataset: Annotated legal text with token-level labels for bias
Tokenizer: BERT-base-uncased (fast tokenizer)
Framework: Hugging Face Transformers & PyTorch

Repository Structure

hf_bat_token/
├─ model.safetensors               # Trained model weights
├─ tokenizer.json                  # Tokenizer configuration
├─ tokenizer_config.json           # Tokenizer config for HF
├─ vocab.txt                       # Tokenizer vocabulary
├─ config.json                     # Model architecture configuration
├─ special_tokens_map.json         # Special token mapping
├─ training_args.bin               # Hugging Face Trainer arguments
├─ README.md                       # This file
└─ LICENSE                         # License (optional)

Installation

Install the required packages:

pip install transformers datasets torch
pip install huggingface_hub

Usage

Load the model and tokenizer from Hugging Face Hub:

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Replace 'username/BAT-Token-Classifier' with your HF repo path
tokenizer = AutoTokenizer.from_pretrained("username/BAT-Token-Classifier")
model = AutoModelForTokenClassification.from_pretrained("username/BAT-Token-Classifier")
model.eval()

# Example input
text = "The judge ruled in favor of the plaintiff despite previous biases."
tokens = tokenizer(text, return_tensors="pt", is_split_into_words=False)
outputs = model(**tokens)

# Predicted token labels
predictions = torch.argmax(outputs.logits, dim=-1)
print(predictions)

Training Details

Model: bert-base-uncased
Batch Size: 2 (gradient accumulation used to simulate larger batch)
Epochs: 4
Learning Rate: 3e-5
Weight Decay: 0.01
Precision: FP16 enabled for faster training
Data Collator: DataCollatorForTokenClassification from Hugging Face

The training script train_bat_token_classifier.py handles:

Loading annotated JSON data from data/token_labels/
Encoding tokens using BERT tokenizer
Mapping token labels to IDs
Training with Hugging Face Trainer API
Saving model and tokenizer in outputs/bat_token_classifier/

Evaluation

The model supports evaluation using standard metrics for token classification (Precision, Recall, F1).
Validation split is 15% of the dataset.
Checkpoints are saved every 1000 steps, allowing resumption of training if interrupted.

Downloads last month: 1

Safetensors

Model size

0.1B params

Tensor type

F32

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support