---
language:
- en
widget:
- text: theft 3
- text: forgery
- text: unlawful possession short-barreled shotgun
- text: criminal trespass 2nd degree
- text: eluding a police vehicle
- text: upcs synthetic narcotic
license: apache-2.0
---
# ROTA
## Rapid Offense Text Autocoder
[![HuggingFace Models](https://img.shields.io/badge/%F0%9F%A4%97%20models-2021.05.18.15-blue)](https://huggingface.co/rti-international/rota)
[![HuggingFace Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20spaces-2021.05.18.15-blue)](https://huggingface.co/spaces/rti-international/rota-app)
[![GitHub Model Release](https://img.shields.io/github/v/release/RTIInternational/rota?logo=github)](https://github.com/RTIInternational/rota)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4770492.svg)](https://doi.org/10.5281/zenodo.4770492)
ROTA Application hosted on Hugging Face Spaces: https://huggingface.co/spaces/rti-international/rota-app
Criminal justice research often requires converting free-text offense descriptions into overall charge categories to aid analysis. For example, the free-text offense of "eluding a police vehicle" would be coded to a charge category of "Obstruction - Law Enforcement". Because free-text offense descriptions aren't standardized and often need to be categorized in large volumes, coding them by hand is a slow, time-intensive process for researchers. ROTA is a machine learning model that converts offense text into offense codes.
Currently ROTA predicts the *Charge Category* of a given offense text. A *charge category* is one of the headings for offense codes in the [2009 NCRP Codebook: Appendix F](https://www.icpsr.umich.edu/web/NACJD/studies/30799/datadocumentation#).
The model was trained on [publicly available data](https://web.archive.org/web/20201021001250/https://www.icpsr.umich.edu/web/pages/NACJD/guides/ncrp.html) from a crosswalk containing offenses from all 50 states combined with three additional hand-labeled offense text datasets.
<details>
<summary>Charge Category Example</summary>
<img src="https://i.ibb.co/xLsrzmV/charge-category-example.png" width="500">
</details>
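As a rough illustration of how a prediction could be obtained, the sketch below wraps a Hugging Face `transformers` text-classification pipeline. The repository id is taken from the badge link above; the helper names (`load_classifier`, `top_category`) are hypothetical, and it is an assumption that the returned labels match the charge category names used in this card. The sketch passes text through unchanged and does not claim the pipeline applies any preprocessing on its own.

```python
from typing import Callable, Dict, List

# Repository id as it appears in the Hugging Face badge link of this card.
MODEL_ID = "rti-international/rota"


def load_classifier():
    """Build a text-classification pipeline (downloads the model on first use)."""
    # Imported lazily so the rest of the module works without transformers installed.
    from transformers import pipeline

    return pipeline("text-classification", model=MODEL_ID)


def top_category(
    classifier: Callable[[List[str]], List[Dict]], texts: List[str]
) -> List[Dict]:
    """Pair each input text with the classifier's top label and its score."""
    results = classifier(texts)
    return [
        {"text": text, "category": result["label"], "score": result["score"]}
        for text, result in zip(texts, results)
    ]
```

With network access, `top_category(load_classifier(), ["eluding a police vehicle"])` would return one dictionary per input text. Any callable that returns `label`/`score` dictionaries can stand in for the pipeline during testing.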
### Data Preprocessing
The input text is standardized through a series of preprocessing steps. The text is first passed through a sequence of 500+ case-insensitive regular expressions that identify common misspellings and abbreviations and expand them into fuller, correctly spelled English text. Some data-specific prefixes and suffixes are then removed; for example, some states included a statute as part of the text. Finally, punctuation (excluding dollar signs) is removed, repeated spaces between words are collapsed, and the text is lowercased.
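The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the actual ROTA preprocessing: the two expansion rules and the statute-prefix pattern are hypothetical stand-ins for the 500+ real patterns, and replacing punctuation with a space (rather than deleting it outright) is an assumption.

```python
import re

# Hypothetical expansion rules; the real pipeline uses 500+ patterns not shown here.
EXPANSIONS = [
    (re.compile(r"\bposs\b", re.IGNORECASE), "possession"),
    (re.compile(r"\bveh\b", re.IGNORECASE), "vehicle"),
]

# Hypothetical statute prefix such as "13A-7-2 " at the start of the text.
STATUTE_PREFIX = re.compile(r"^\s*\d+[a-z]?(?:-\d+)+\s+", re.IGNORECASE)

# Any character that is not a letter, digit, whitespace, or dollar sign.
PUNCTUATION = re.compile(r"[^\w\s$]|_")
MULTISPACE = re.compile(r"\s+")


def preprocess(text: str) -> str:
    """Expand abbreviations, drop a statute prefix, strip punctuation, lowercase."""
    for pattern, replacement in EXPANSIONS:
        text = pattern.sub(replacement, text)
    text = STATUTE_PREFIX.sub("", text)
    # Assumption: punctuation becomes a space so adjacent words do not merge.
    text = PUNCTUATION.sub(" ", text)
    text = MULTISPACE.sub(" ", text).strip()
    return text.lower()
```

For instance, `preprocess("13A-7-2  Poss. of   Stolen VEH, over $500")` yields `"possession of stolen vehicle over $500"`, keeping the dollar sign while dropping the period and comma.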
## Cross-Validation Performance
This model was evaluated using 3-fold cross-validation. Except where noted, the numbers presented below are the mean value across the 3 folds.
The model in this repository is trained on all available data, so it has seen more examples than any of the per-fold models. Because of this, you can typically expect production performance to be better than the numbers presented below, although by how much cannot be measured.
### Overall Metrics
| Metric | Value |
| -------- | ----- |
| Accuracy | 0.934 |
| MCC | 0.931 |
| Metric | precision | recall | f1-score |
| --------- | --------- | ------ | -------- |
| macro avg | 0.811 | 0.786 | 0.794 |
*Note*: Each value is the mean of the per-fold values, so *macro avg* is the macro average over all categories computed within each fold and then averaged across the 3 folds.
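To make the order of averaging concrete, the following sketch computes a macro-averaged precision inside each fold and then takes the mean across folds. The function names and the toy labels are illustrative only, and scoring a never-predicted class as 0 is an assumption rather than a documented detail of the evaluation.

```python
from statistics import mean


def macro_precision(y_true, y_pred):
    """Unweighted mean of per-class precision within a single fold."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        # True labels of every observation the model assigned to this class.
        assigned = [t for t, p in zip(y_true, y_pred) if p == label]
        # Assumption: a class that is never predicted contributes a precision of 0.
        correct = sum(t == label for t in assigned)
        per_class.append(correct / len(assigned) if assigned else 0.0)
    return mean(per_class)


def mean_over_folds(folds):
    """Average the per-fold macro precision across all folds."""
    return mean(macro_precision(y_true, y_pred) for y_true, y_pred in folds)
```

Because every class counts equally within a fold, rare categories with poor scores pull the macro average well below the overall accuracy.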
### Per-Category Metrics
| Category | precision | recall | f1-score | support |
| ------------------------------------------------------ | --------- | ------ | -------- | ------- |
| AGGRAVATED ASSAULT | 0.954 | 0.954 | 0.954 | 4085 |
| ARMED ROBBERY | 0.961 | 0.955 | 0.958 | 1021 |
| ARSON | 0.946 | 0.954 | 0.95 | 344 |
| ASSAULTING PUBLIC OFFICER | 0.914 | 0.905 | 0.909 | 588 |
| AUTO THEFT | 0.962 | 0.962 | 0.962 | 1660 |
| BLACKMAIL/EXTORTION/INTIMIDATION | 0.872 | 0.871 | 0.872 | 627 |
| BRIBERY AND CONFLICT OF INTEREST | 0.784 | 0.796 | 0.79 | 216 |
| BURGLARY | 0.979 | 0.981 | 0.98 | 2214 |
| CHILD ABUSE | 0.805 | 0.78 | 0.792 | 139 |
| COCAINE OR CRACK VIOLATION OFFENSE UNSPECIFIED | 0.827 | 0.815 | 0.821 | 47 |
| COMMERCIALIZED VICE | 0.818 | 0.788 | 0.802 | 666 |
| CONTEMPT OF COURT | 0.982 | 0.987 | 0.984 | 2952 |
| CONTRIBUTING TO DELINQUENCY OF A MINOR | 0.544 | 0.333 | 0.392 | 50 |
| CONTROLLED SUBSTANCE - OFFENSE UNSPECIFIED | 0.864 | 0.791 | 0.826 | 280 |
| COUNTERFEITING (FEDERAL ONLY) | 0 | 0 | 0 | 2 |
| DESTRUCTION OF PROPERTY | 0.97 | 0.968 | 0.969 | 2560 |
| DRIVING UNDER INFLUENCE - DRUGS | 0.567 | 0.603 | 0.581 | 34 |
| DRIVING UNDER THE INFLUENCE | 0.951 | 0.946 | 0.949 | 2195 |
| DRIVING WHILE INTOXICATED | 0.986 | 0.981 | 0.984 | 2391 |
| DRUG OFFENSES - VIOLATION/DRUG UNSPECIFIED | 0.903 | 0.911 | 0.907 | 3100 |
| DRUNKENNESS/VAGRANCY/DISORDERLY CONDUCT | 0.856 | 0.861 | 0.858 | 380 |
| EMBEZZLEMENT | 0.865 | 0.759 | 0.809 | 100 |
| EMBEZZLEMENT (FEDERAL ONLY) | 0 | 0 | 0 | 1 |
| ESCAPE FROM CUSTODY | 0.988 | 0.991 | 0.989 | 4035 |
| FAMILY RELATED OFFENSES | 0.739 | 0.773 | 0.755 | 442 |
| FELONY - UNSPECIFIED | 0.692 | 0.735 | 0.712 | 122 |
| FLIGHT TO AVOID PROSECUTION | 0.46 | 0.407 | 0.425 | 38 |
| FORCIBLE SODOMY | 0.82 | 0.8 | 0.809 | 76 |
| FORGERY (FEDERAL ONLY) | 0 | 0 | 0 | 2 |
| FORGERY/FRAUD | 0.911 | 0.928 | 0.919 | 4687 |
| FRAUD (FEDERAL ONLY) | 0 | 0 | 0 | 2 |
| GRAND LARCENY - THEFT OVER $200 | 0.957 | 0.973 | 0.965 | 2412 |
| HABITUAL OFFENDER | 0.742 | 0.627 | 0.679 | 53 |
| HEROIN VIOLATION - OFFENSE UNSPECIFIED | 0.879 | 0.811 | 0.843 | 24 |
| HIT AND RUN DRIVING | 0.922 | 0.94 | 0.931 | 303 |
| HIT/RUN DRIVING - PROPERTY DAMAGE | 0.929 | 0.918 | 0.923 | 362 |
| IMMIGRATION VIOLATIONS | 0.84 | 0.609 | 0.697 | 19 |
| INVASION OF PRIVACY | 0.927 | 0.923 | 0.925 | 1235 |
| JUVENILE OFFENSES | 0.928 | 0.866 | 0.895 | 144 |
| KIDNAPPING | 0.937 | 0.93 | 0.933 | 553 |
| LARCENY/THEFT - VALUE UNKNOWN | 0.955 | 0.945 | 0.95 | 3175 |
| LEWD ACT WITH CHILDREN | 0.775 | 0.85 | 0.811 | 596 |
| LIQUOR LAW VIOLATIONS | 0.741 | 0.768 | 0.755 | 214 |
| MANSLAUGHTER - NON-VEHICULAR | 0.626 | 0.802 | 0.701 | 139 |
| MANSLAUGHTER - VEHICULAR | 0.79 | 0.853 | 0.819 | 117 |
| MARIJUANA/HASHISH VIOLATION - OFFENSE UNSPECIFIED | 0.741 | 0.662 | 0.699 | 62 |
| MISDEMEANOR UNSPECIFIED | 0.63 | 0.243 | 0.347 | 57 |
| MORALS/DECENCY - OFFENSE | 0.774 | 0.764 | 0.769 | 412 |
| MURDER | 0.965 | 0.915 | 0.939 | 621 |
| OBSTRUCTION - LAW ENFORCEMENT | 0.939 | 0.947 | 0.943 | 4220 |
| OFFENSES AGAINST COURTS, LEGISLATURES, AND COMMISSIONS | 0.881 | 0.895 | 0.888 | 1965 |
| PAROLE VIOLATION | 0.97 | 0.953 | 0.962 | 946 |
| PETTY LARCENY - THEFT UNDER $200 | 0.965 | 0.761 | 0.85 | 139 |
| POSSESSION/USE - COCAINE OR CRACK | 0.893 | 0.928 | 0.908 | 68 |
| POSSESSION/USE - DRUG UNSPECIFIED | 0.624 | 0.535 | 0.572 | 189 |
| POSSESSION/USE - HEROIN | 0.884 | 0.852 | 0.866 | 25 |
| POSSESSION/USE - MARIJUANA/HASHISH | 0.977 | 0.97 | 0.973 | 556 |
| POSSESSION/USE - OTHER CONTROLLED SUBSTANCES | 0.975 | 0.965 | 0.97 | 3271 |
| PROBATION VIOLATION | 0.963 | 0.953 | 0.958 | 1158 |
| PROPERTY OFFENSES - OTHER | 0.901 | 0.87 | 0.885 | 446 |
| PUBLIC ORDER OFFENSES - OTHER | 0.7 | 0.721 | 0.71 | 1871 |
| RACKETEERING/EXTORTION (FEDERAL ONLY) | 0 | 0 | 0 | 2 |
| RAPE - FORCE | 0.842 | 0.873 | 0.857 | 641 |
| RAPE - STATUTORY - NO FORCE | 0.707 | 0.55 | 0.611 | 140 |
| REGULATORY OFFENSES (FEDERAL ONLY) | 0.847 | 0.567 | 0.674 | 70 |
| RIOTING | 0.784 | 0.605 | 0.68 | 119 |
| SEXUAL ASSAULT - OTHER | 0.836 | 0.836 | 0.836 | 971 |
| SIMPLE ASSAULT | 0.976 | 0.967 | 0.972 | 4577 |
| STOLEN PROPERTY - RECEIVING | 0.959 | 0.957 | 0.958 | 1193 |
| STOLEN PROPERTY - TRAFFICKING | 0.902 | 0.888 | 0.895 | 491 |
| TAX LAW (FEDERAL ONLY) | 0.373 | 0.233 | 0.286 | 30 |
| TRAFFIC OFFENSES - MINOR | 0.974 | 0.977 | 0.976 | 8699 |
| TRAFFICKING - COCAINE OR CRACK | 0.896 | 0.951 | 0.922 | 185 |
| TRAFFICKING - DRUG UNSPECIFIED | 0.709 | 0.795 | 0.749 | 516 |
| TRAFFICKING - HEROIN | 0.871 | 0.92 | 0.894 | 54 |
| TRAFFICKING - OTHER CONTROLLED SUBSTANCES | 0.963 | 0.954 | 0.959 | 2832 |
| TRAFFICKING MARIJUANA/HASHISH | 0.921 | 0.943 | 0.932 | 255 |
| TRESPASSING | 0.974 | 0.98 | 0.977 | 1916 |
| UNARMED ROBBERY | 0.941 | 0.939 | 0.94 | 377 |
| UNAUTHORIZED USE OF VEHICLE | 0.94 | 0.908 | 0.924 | 304 |
| UNSPECIFIED HOMICIDE | 0.61 | 0.554 | 0.577 | 60 |
| VIOLENT OFFENSES - OTHER | 0.827 | 0.817 | 0.822 | 606 |
| VOLUNTARY/NONNEGLIGENT MANSLAUGHTER | 0.619 | 0.513 | 0.542 | 54 |
| WEAPON OFFENSE | 0.943 | 0.949 | 0.946 | 2466 |
*Note: `support` is the average number of observations of each class in a fold's held-out set, so the total number of observations per class is roughly 3x `support`.*
### Using Confidence Scores
If we interpret the classification probability as a confidence score, we can use it to filter out predictions that the model is less confident about. We applied this process in 3-fold cross-validation. The numbers presented below show what share of the predictions is retained at a confidence score cutoff of `p`, along with the overall accuracy and MCC computed as if the model were evaluated only on that subset of confident predictions.
| cutoff | fraction retained | mcc   | acc   |
| ------ | ----------------- | ----- | ----- |
| 0.85   | 0.952             | 0.96  | 0.961 |
| 0.9    | 0.943             | 0.964 | 0.965 |
| 0.95   | 0.928             | 0.97  | 0.971 |
| 0.975  | 0.912             | 0.975 | 0.976 |
| 0.99   | 0.886             | 0.982 | 0.983 |
| 0.999  | 0.733             | 0.995 | 0.996 |
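The filtering described in this section can be sketched as follows. The function name and toy data are hypothetical, and treating a score equal to the cutoff as retained (`>=`) is an assumption. Only accuracy is computed here; MCC would be calculated on the same retained subset.

```python
def evaluate_at_cutoff(y_true, y_pred, confidences, cutoff):
    """Return (share of predictions retained, accuracy on the retained subset)."""
    # Keep only predictions whose confidence meets the cutoff (assumed inclusive).
    kept = [
        (truth, pred)
        for truth, pred, conf in zip(y_true, y_pred, confidences)
        if conf >= cutoff
    ]
    retained = len(kept) / len(y_true)
    # Accuracy is undefined when nothing is retained.
    accuracy = sum(t == p for t, p in kept) / len(kept) if kept else float("nan")
    return retained, accuracy
```

Raising the cutoff trades coverage for accuracy: fewer predictions are kept, and the ones that are dropped can be routed to manual review.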