BAT Token Classifier
The BAT Token Classifier is a token-level classification model designed to detect and mask bias-sensitive information in legal documents. It leverages a BERT-base-uncased backbone and has been trained on a custom legal dataset annotated for bias-aware token labeling.
Model Overview
- Architecture: BERT-based token classifier
- Purpose: Detect tokens in legal text that may indicate bias and perform classification at the token level
- Training Dataset: Annotated legal text with token-level labels for bias
- Tokenizer: BERT-base-uncased (fast tokenizer)
- Framework: Hugging Face Transformers & PyTorch
Repository Structure
hf_bat_token/
ββ model.safetensors # Trained model weights
ββ tokenizer.json # Tokenizer configuration
ββ tokenizer_config.json # Tokenizer config for HF
ββ vocab.txt # Tokenizer vocabulary
ββ config.json # Model architecture configuration
ββ special_tokens_map.json # Special token mapping
ββ training_args.bin # Hugging Face Trainer arguments
ββ README.md # This file
ββ LICENSE # License (optional)
Installation
Install the required packages:
pip install transformers datasets torch
pip install huggingface_hub
Usage
Load the model and tokenizer from Hugging Face Hub:
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch
# Replace 'username/BAT-Token-Classifier' with your HF repo path
tokenizer = AutoTokenizer.from_pretrained("username/BAT-Token-Classifier")
model = AutoModelForTokenClassification.from_pretrained("username/BAT-Token-Classifier")
model.eval()
# Example input
text = "The judge ruled in favor of the plaintiff despite previous biases."
tokens = tokenizer(text, return_tensors="pt", is_split_into_words=False)
outputs = model(**tokens)
# Predicted token labels
predictions = torch.argmax(outputs.logits, dim=-1)
print(predictions)
Training Details
- Model:
bert-base-uncased - Batch Size: 2 (gradient accumulation used to simulate larger batch)
- Epochs: 4
- Learning Rate: 3e-5
- Weight Decay: 0.01
- Precision: FP16 enabled for faster training
- Data Collator:
DataCollatorForTokenClassificationfrom Hugging Face
The training script train_bat_token_classifier.py handles:
- Loading annotated JSON data from
data/token_labels/ - Encoding tokens using BERT tokenizer
- Mapping token labels to IDs
- Training with Hugging Face
TrainerAPI - Saving model and tokenizer in
outputs/bat_token_classifier/
Evaluation
- The model supports evaluation using standard metrics for token classification (Precision, Recall, F1).
- Validation split is 15% of the dataset.
- Checkpoints are saved every 1000 steps, allowing resumption of training if interrupted.
- Downloads last month
- 1
Inference Providers NEW
This model isn't deployed by any Inference Provider. π Ask for provider support