---
license: apache-2.0
tags:
  - generated_from_trainer
datasets:
  - eoir_privacy
metrics:
  - accuracy
  - f1
model-index:
  - name: distilbert-base-uncased-finetuned-eoir_privacy
    results:
      - task:
          name: Text Classification
          type: text-classification
        dataset:
          name: eoir_privacy
          type: eoir_privacy
          args: all
        metrics:
          - name: Accuracy
            type: accuracy
            value: 0.9052835051546392
          - name: F1
            type: f1
            value: 0.8088426527958388
---

# distilbert-base-uncased-finetuned-eoir_privacy

This model is a fine-tuned version of [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) on the `eoir_privacy` dataset. It achieves the following results on the evaluation set:

- Loss: 0.3681
- Accuracy: 0.9053
- F1: 0.8088

## Model description

The model predicts whether names in a text should be replaced with pseudonyms. The input should be a paragraph with names already masked; the model then outputs whether a pseudonym should be used, i.e., whether the EOIR courts would not allow such private or sensitive information to become public unmasked.
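
A minimal usage sketch is shown below. The repository id is an assumption (substitute the actual Hub id of this checkpoint), and the meaning of the positive label should be verified against `model.config.id2label`:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Assumed Hub id -- replace with the actual repository id of this checkpoint.
model_id = "pile-of-law/distilbert-base-uncased-finetuned-eoir_privacy"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Input: a paragraph with names already masked.
text = "[MASK] appealed the immigration judge's decision, citing fear of persecution."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# Assumption: the positive class means "use a pseudonym"; check model.config.id2label.
print(model.config.id2label[logits.argmax(dim=-1).item()])
```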

## Intended uses & limitations

This model implements a minimal privacy standard and will likely not generalize to out-of-distribution data.

## Training and evaluation data

We train on the EOIR Privacy dataset and evaluate further using sensitivity analyses.
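
For reference, a sketch of loading the data with the `datasets` library, assuming the dataset is hosted on the Hub as `pile-of-law/eoir_privacy` with the `all` configuration named in the metadata above:

```python
from datasets import load_dataset

# "all" mirrors the `args: all` configuration from the model-index metadata.
dataset = load_dataset("pile-of-law/eoir_privacy", "all")

print(dataset)               # available splits and their sizes
print(dataset["train"][0])   # one example: a masked paragraph plus its pseudonym label
```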

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- learning_rate: 2e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5
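
These settings map onto `transformers.TrainingArguments` roughly as follows. This is a sketch, not the exact training script; `output_dir` and `evaluation_strategy` are assumptions:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="distilbert-base-uncased-finetuned-eoir_privacy",  # assumed
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    num_train_epochs=5,
    lr_scheduler_type="linear",
    # Adam with betas=(0.9, 0.999) and epsilon=1e-8 is the Trainer default,
    # so no explicit optimizer arguments are needed here.
    evaluation_strategy="epoch",  # assumption: matches the per-epoch results below
)
```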

### Training results

| Training Loss | Epoch | Step | Validation Loss | Accuracy | F1     |
|:-------------:|:-----:|:----:|:---------------:|:--------:|:------:|
| No log        | 1.0   | 395  | 0.3053          | 0.8789   | 0.7432 |
| 0.3562        | 2.0   | 790  | 0.2857          | 0.8976   | 0.7883 |
| 0.2217        | 3.0   | 1185 | 0.3358          | 0.8905   | 0.7550 |
| 0.1509        | 4.0   | 1580 | 0.3505          | 0.9040   | 0.8077 |
| 0.1509        | 5.0   | 1975 | 0.3681          | 0.9053   | 0.8088 |

### Framework versions

- Transformers 4.18.0
- Pytorch 1.11.0+cu113
- Datasets 2.1.0
- Tokenizers 0.12.1

## Citation

```bibtex
@misc{hendersonkrass2022pileoflaw,
  url = {https://arxiv.org/abs/2207.00220},
  author = {Henderson*, Peter and Krass*, Mark S. and Zheng, Lucia and Guha, Neel and Manning, Christopher D. and Jurafsky, Dan and Ho, Daniel E.},
  title = {Pile of Law: Learning Responsible Data Filtering from the Law and a 256GB Open-Source Legal Dataset},
  publisher = {arXiv},
  year = {2022}
}
```