BAT Token Classifier

The BAT Token Classifier is a token-level classification model designed to detect and mask bias-sensitive information in legal documents. It leverages a BERT-base-uncased backbone and has been trained on a custom legal dataset annotated for bias-aware token labeling.


Model Overview

  • Architecture: BERT-based token classifier
  • Purpose: Detect tokens in legal text that may indicate bias and perform classification at the token level
  • Training Dataset: Annotated legal text with token-level labels for bias
  • Tokenizer: BERT-base-uncased (fast tokenizer)
  • Framework: Hugging Face Transformers & PyTorch

Repository Structure

hf_bat_token/
├─ model.safetensors               # Trained model weights
├─ tokenizer.json                  # Tokenizer configuration
├─ tokenizer_config.json           # Tokenizer config for HF
├─ vocab.txt                       # Tokenizer vocabulary
├─ config.json                     # Model architecture configuration
├─ special_tokens_map.json         # Special token mapping
├─ training_args.bin               # Hugging Face Trainer arguments
├─ README.md                       # This file
└─ LICENSE                         # License (optional)

Installation

Install the required packages:

pip install transformers datasets torch
pip install huggingface_hub

Usage

Load the model and tokenizer from Hugging Face Hub:

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Replace 'username/BAT-Token-Classifier' with your HF repo path
tokenizer = AutoTokenizer.from_pretrained("username/BAT-Token-Classifier")
model = AutoModelForTokenClassification.from_pretrained("username/BAT-Token-Classifier")
model.eval()

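# Example input
text = "The judge ruled in favor of the plaintiff despite previous biases."
inputs = tokenizer(text, return_tensors="pt")

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    outputs = model(**inputs)

# Predicted label ID per token, mapped to the label names stored in config.json
predictions = torch.argmax(outputs.logits, dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predictions.tolist()):
    print(token, model.config.id2label[label_id])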
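
The BERT tokenizer splits rare words into word pieces (continuation pieces start with "##"), so the predictions above are per piece rather than per word. The following is a minimal sketch of one way to collapse them back into words, keeping the label of each word's first piece. The helper name merge_wordpieces, the sample tokens, and the label names "O" and "BIAS" are hypothetical; the model's real label names come from its config.json.

```python
def merge_wordpieces(tokens, labels, special=("[CLS]", "[SEP]", "[PAD]")):
    """Collapse BERT word pieces into whole words, keeping the first piece's label."""
    words, word_labels = [], []
    for token, label in zip(tokens, labels):
        if token in special:
            continue  # special tokens carry no meaningful label
        if token.startswith("##") and words:
            words[-1] += token[2:]  # continuation piece: append to the previous word
        else:
            words.append(token)
            word_labels.append(label)
    return list(zip(words, word_labels))


# Hypothetical tokens and labels standing in for real model output
tokens = ["[CLS]", "the", "plaintiff", "'", "s", "bias", "##es", "[SEP]"]
labels = ["O", "O", "O", "O", "O", "BIAS", "BIAS", "O"]

merged = merge_wordpieces(tokens, labels)
print(merged)
```

Taking the first piece's label is only one possible convention; a majority vote over the pieces of a word would work as well.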

Training Details

  • Model: bert-base-uncased
  • Batch Size: 2 (gradient accumulation used to simulate larger batch)
  • Epochs: 4
  • Learning Rate: 3e-5
  • Weight Decay: 0.01
  • Precision: FP16 enabled for faster training
  • Data Collator: DataCollatorForTokenClassification from Hugging Face
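
As a rough sketch, the settings above could be expressed as keyword arguments for Hugging Face's TrainingArguments. The number of gradient accumulation steps is not stated in this README, so the value 8 below is an assumption, as are the names TRAINING_KWARGS, effective_batch_size, and build_training_args. The transformers import sits inside the function so that the arithmetic can be checked without the library installed.

```python
# Hyperparameters from this README; values marked "assumed" are not documented.
TRAINING_KWARGS = {
    "per_device_train_batch_size": 2,
    "gradient_accumulation_steps": 8,  # assumed: the actual value is not stated
    "num_train_epochs": 4,
    "learning_rate": 3e-5,
    "weight_decay": 0.01,
    "fp16": True,                      # mixed precision, needs a CUDA GPU
    "save_strategy": "steps",
    "save_steps": 1000,                # checkpoint interval from the Evaluation section
}


def effective_batch_size(kwargs, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


def build_training_args(output_dir="outputs/bat_token_classifier/"):
    """Build TrainingArguments from the settings above (requires transformers)."""
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)


print(effective_batch_size(TRAINING_KWARGS))
```

With the assumed 8 accumulation steps on a single device, each optimizer step sees 2 × 8 = 16 examples.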

The training script train_bat_token_classifier.py handles:

  1. Loading annotated JSON data from data/token_labels/
  2. Encoding tokens using BERT tokenizer
  3. Mapping token labels to IDs
  4. Training with Hugging Face Trainer API
  5. Saving model and tokenizer in outputs/bat_token_classifier/
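
Steps 2 and 3 usually involve aligning word-level labels with BERT's word pieces. The sketch below shows one common convention, which may differ from what train_bat_token_classifier.py actually does: only the first piece of each word receives the label ID, and every other position gets -100, the value that PyTorch's cross-entropy loss ignores and that DataCollatorForTokenClassification uses for label padding by default. The function name align_labels and the label set {"O", "BIAS"} are hypothetical. The word_ids list mimics what a fast tokenizer's word_ids() method returns.

```python
IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss


def align_labels(word_ids, word_labels, label2id):
    """Map word-level labels onto word pieces, labeling only each word's first piece."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special token such as [CLS] or [SEP]
        elif word_id != previous:
            aligned.append(label2id[word_labels[word_id]])  # first piece of a word
        else:
            aligned.append(IGNORE_INDEX)  # continuation piece of the same word
        previous = word_id
    return aligned


label2id = {"O": 0, "BIAS": 1}        # hypothetical label set
word_labels = ["O", "BIAS", "O"]      # one label per input word
word_ids = [None, 0, 1, 1, 2, None]   # word index per piece; None marks special tokens

aligned = align_labels(word_ids, word_labels, label2id)
print(aligned)  # [-100, 0, 1, -100, 0, -100]
```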

Evaluation

  • The model can be evaluated with the standard token classification metrics: precision, recall, and F1.
  • Validation split is 15% of the dataset.
  • Checkpoints are saved every 1000 steps, allowing resumption of training if interrupted.
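
For reference, here is a minimal sketch of token-level precision, recall, and F1 for a single positive label. The function name token_prf, the label name "BIAS", and the sample sequences are hypothetical, and the project's own evaluation code may compute these metrics differently (for example at the span level).

```python
def token_prf(gold, predicted, positive="BIAS"):
    """Token-level precision, recall, and F1 for one positive label."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)  # true positives
    fp = sum(g != positive and p == positive for g, p in pairs)  # false positives
    fn = sum(g == positive and p != positive for g, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical gold and predicted labels for five tokens
gold = ["O", "BIAS", "BIAS", "O", "BIAS"]
predicted = ["O", "BIAS", "O", "BIAS", "BIAS"]

precision, recall, f1 = token_prf(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

In this sample there are two true positives, one false positive, and one false negative, so all three metrics come out to 2/3.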