---
language: en
license: mit
library_name: transformers
base_model: marcmaxmeister/bert-base-uncased-ntee-classifier-v7
tags:
- text-classification
- nonprofit
- ntee
- cause-area
- multilabel-classification
- bert
datasets:
- givingtuesday/ntee-training-data
metrics:
- f1
model-index:
- name: marcmaxmeister/bert-base-uncased-ntee-classifier-v7
  results:
  - task:
      type: text-classification
      name: Multi-Label Text Classification
---

# NTEE Cause Area Classifier

This model classifies nonprofit organization mission statements and activity 
descriptions into one or more NTEE (National Taxonomy of Exempt Entities) major 
category codes. It was developed by GivingTuesday for the Gates Foundation 
nonprofit sector mapping project.

## Model Details

- **Base model:** `marcmaxmeister/bert-base-uncased-ntee-classifier-v7`
- **Version:** `7`
- **Problem type:** Multi-label classification
- **Number of labels:** 28
- **Input:** Nonprofit mission statement + activity description (concatenated)
- **Output:** Up to three NTEE major codes: a primary code, plus optional secondary and tertiary codes

## Label Space

The model predicts across 28 NTEE major codes, including two custom splits:

| Code | Category |
|------|----------|
| A | Arts & Culture |
| B | Education (K-12, other) |
| BB | Universities & Colleges *(custom split from B)* |
| C | Environment |
| D | Animal-Related |
| E | Health Clinics & Services |
| EE | Hospitals *(custom split from E)* |
| F | Mental Health |
| G | Voluntary Health Associations |
| H | Medical Research |
| I | Crime & Legal |
| J | Employment |
| K | Food & Agriculture |
| L | Housing & Shelter |
| M | Public Safety |
| N | Recreation & Sports |
| O | Youth Development |
| P | Human Services |
| Q | International Affairs |
| R | Civil Rights & Advocacy |
| S | Community Improvement |
| T | Philanthropy & Foundations |
| U | Science & Technology |
| V | Social Science |
| W | Public Benefit (General) |
| X | Religion |
| Y | Mutual Benefit |
| Z | Unknown/Unclassified |

## Training Data

- **Training rows:** 28584
- **Validation rows:** 15840
- **Source:** Combination of real IRS Form 990/990-EZ filings and 
synthetically generated examples produced using Claude (Anthropic)
- **Label encoding:** Multi-hot binary vectors of length 28, derived from 
NTEE primary, secondary, and tertiary codes per organization
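
The multi-hot label encoding described above can be sketched as follows. This is a minimal illustration, not the project's preprocessing code: the label order (`LABELS`) is assumed to follow the Label Space table, and `encode_multi_hot` is a hypothetical helper name. The index order the model actually uses is whatever its config's `id2label` defines.

```python
# Assumed label order: the 28 codes as listed in the Label Space table.
# The real index order is defined by the model config (id2label), not here.
LABELS = ["A", "B", "BB", "C", "D", "E", "EE", "F", "G", "H", "I", "J", "K",
          "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X",
          "Y", "Z"]
LABEL_TO_INDEX = {code: i for i, code in enumerate(LABELS)}


def encode_multi_hot(primary, secondary=None, tertiary=None):
    """Return a length-28 list of 0.0/1.0 with a 1.0 at each assigned code."""
    vector = [0.0] * len(LABELS)
    for code in (primary, secondary, tertiary):
        if code is not None:
            vector[LABEL_TO_INDEX[code]] = 1.0
    return vector


# A university (BB) whose secondary code is Medical Research (H).
example = encode_multi_hot("BB", "H")
print(len(example), sum(example))
```

Floats are used rather than integers because multi-label training in PyTorch typically compares sigmoid outputs against float targets.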

## Training Configuration

- **Epochs:** 6
- **Learning rate:** 1e-05
- **Batch size (train):** 16
- **Weight decay:** 0.01
- **Mixed precision:** fp16
- **Framework:** Hugging Face Transformers + PyTorch
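
For readers who want to reproduce a similar setup, the settings above map onto Hugging Face `TrainingArguments` roughly as shown below. This is a sketch under assumptions: the card does not say whether the batch size of 16 is per device, so `per_device_train_batch_size` is a guess, and `output_dir` and the function name `build_training_args` are hypothetical. Settings not listed in the card, such as the evaluation schedule, warmup, and gradient accumulation, are left at library defaults here.

```python
# Hyperparameters copied from the Training Configuration list.
HYPERPARAMS = {
    "num_train_epochs": 6,
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 16,  # assumption: the listed 16 is per device
    "weight_decay": 0.01,
    "fp16": True,  # mixed precision; needs a GPU that supports fp16
}


def build_training_args(output_dir="ntee-classifier-out"):
    """Build TrainingArguments; output_dir is a hypothetical path."""
    # Imported lazily so the hyperparameter dict can be inspected
    # without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)
```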

## Evaluation Results

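| Epoch | Eval F1 | Eval loss | Eval runtime (s) |
|-------|---------|-----------|------------------|
| 0 | 0.4048 | 1.6717 | 51.5719 |
| 1 | 0.4992 | 1.4850 | 51.4108 |
| 2 | 0.5705 | 1.4298 | 51.4209 |
| 3 | 0.5923 | 1.3915 | 51.4547 |
| 4 | 0.6051 | 1.3683 | 51.4933 |
| 5 | 0.6083 | 1.3679 | 51.5154 |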

## Final Results

- **Train runtime:** 2181.9363 s
- **Train samples per second:** 78.602
- **Train steps per second:** 1.226
- **Total FLOs:** 2.1819078880526336e+16
- **Train loss:** 1.472620153997333
- **Epoch:** 5.989927252378288
- **Step:** 2676
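
As a quick consistency check, the final training figures can be cross-checked against the training set size. The arithmetic below uses only numbers reported in this card. It suggests roughly 64 samples per optimizer step, about four times the listed batch size of 16. That would be consistent with gradient accumulation or multiple devices, but the card does not state which, so treat that reading as an assumption.

```python
# Figures copied from the final training summary and the Training Data section.
train_runtime_s = 2181.9363
samples_per_second = 78.602
total_steps = 2676
training_rows = 28584

# Total samples processed over the whole run.
samples_seen = samples_per_second * train_runtime_s
# How many passes over the training set that corresponds to.
nominal_epochs = samples_seen / training_rows
# Average number of samples consumed per optimizer step.
samples_per_step = samples_seen / total_steps

print(f"samples seen:     {samples_seen:,.0f}")
print(f"nominal epochs:   {nominal_epochs:.2f}")
print(f"samples per step: {samples_per_step:.1f}")
```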

## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "givingtuesday/ntee-cause-area-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

def predict_ntee(mission, activities, threshold=0.4, max_labels=3):
    # The model expects the mission statement and activity description concatenated.
    text = f"{mission} {activities}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Multi-label head: an independent sigmoid per label rather than a softmax.
    probs = torch.sigmoid(logits[0]).cpu().numpy()
    # Look labels up by index so each probability is paired with the right code.
    labels = [model.config.id2label[i] for i in range(len(probs))]
    ranked = sorted(zip(labels, (float(p) for p in probs)), key=lambda x: -x[1])
    selected = [pair for pair in ranked if pair[1] >= threshold][:max_labels]
    if not selected:
        # Nothing cleared the threshold: fall back to the single best label.
        selected = [ranked[0]]
    return {
        "primary":   selected[0],
        "secondary": selected[1] if len(selected) > 1 else None,
        "tertiary":  selected[2] if len(selected) > 2 else None,
    }
```

## Intended Use

- Classifying IRS Form 990 and 990-EZ filers by mission area
- Nonprofit sector research and analysis
- Philanthropic portfolio mapping

## Limitations

- Trained primarily on English-language mission statements
- Performance is lower on categories with fewer training examples 
(e.g. Mutual Benefit, Social Science, Unknown/Unclassified)
- The BB/EE custom splits (universities vs. general education; hospitals 
vs. general health) are the hardest boundaries for the model to learn
- Not suitable for classifying organizations outside the U.S. nonprofit sector
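
Because the BB and EE splits are custom and the hardest boundaries for the model, downstream users who need standard NTEE major codes may want to fold them back into B and E. The sketch below shows one way to do that. It is an optional post-processing step that this card does not prescribe, and `to_standard_ntee` is a hypothetical helper name.

```python
# Custom split codes from the Label Space table and their standard parents.
CUSTOM_TO_STANDARD = {"BB": "B", "EE": "E"}


def to_standard_ntee(codes):
    """Map BB->B and EE->E, dropping duplicates while keeping rank order."""
    standard = []
    for code in codes:
        parent = CUSTOM_TO_STANDARD.get(code, code)
        if parent not in standard:
            standard.append(parent)
    return standard


# Ranked predictions where both the custom split and its parent appear.
print(to_standard_ntee(["BB", "B", "H"]))
```

Folding after prediction keeps the model's ranking intact, so the first surviving code can still be treated as the primary one.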

## Citation

```bibtex
@misc{givingtuesday2025ntee,
author    = {GivingTuesday},
title     = {NTEE Cause Area Classifier for IRS 990 Data},
year      = {2026},
publisher = {GivingTuesday, HuggingFace},
url       = {https://huggingface.co/marcmaxmeister/bert-base-uncased-ntee-classifier-v7}
}
```
