---
language:
- en
widget:
- text: theft 3
- text: forgery
- text: unlawful possession short-barreled shotgun
- text: criminal trespass 2nd degree
- text: eluding a police vehicle
- text: upcs synthetic narcotic
license: apache-2.0
---

# ROTA
## Rapid Offense Text Autocoder

[![HuggingFace Models](https://img.shields.io/badge/%F0%9F%A4%97%20models-2021.05.18.15-blue)](https://huggingface.co/rti-international/rota)
[![HuggingFace Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20spaces-2021.05.18.15-blue)](https://huggingface.co/spaces/rti-international/rota-app)
[![GitHub Model Release](https://img.shields.io/github/v/release/RTIInternational/rota?logo=github)](https://github.com/RTIInternational/rota)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4770492.svg)](https://doi.org/10.5281/zenodo.4770492)

ROTA Application hosted on Hugging Face Spaces: https://huggingface.co/spaces/rti-international/rota-app 

Criminal justice research often requires converting free-text offense descriptions into overall charge categories to aid analysis. For example, the free-text offense "eluding a police vehicle" would be coded to the charge category "Obstruction - Law Enforcement". Because free-text offense descriptions are not standardized and often must be categorized in large volumes, this coding is typically a manual, time-intensive process for researchers. ROTA is a machine learning model that converts offense text into offense codes.

Currently ROTA predicts the *Charge Category* of a given offense text. A *charge category* is one of the headings for offense codes in the [2009 NCRP Codebook: Appendix F](https://www.icpsr.umich.edu/web/NACJD/studies/30799/datadocumentation#).

The model was trained on [publicly available data](https://web.archive.org/web/20201021001250/https://www.icpsr.umich.edu/web/pages/NACJD/guides/ncrp.html) from a crosswalk containing offenses from all 50 states combined with three additional hand-labeled offense text datasets.

<details>
    <summary>Charge Category Example</summary>
    <img src="https://i.ibb.co/xLsrzmV/charge-category-example.png" width="500">
</details>

### Data Preprocessing

The input text is standardized through a series of preprocessing steps. The text is first passed through a sequence of 500+ case-insensitive regular expressions that identify common misspellings and abbreviations and expand them into fuller, correctly spelled English. Some data-specific prefixes and suffixes are then removed (for example, some states included a statute as part of the text). Finally, punctuation (excluding dollar signs) is removed, runs of multiple spaces between words are collapsed, and the text is lowercased.
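The steps above can be sketched in a few lines of Python. This is only an illustration, not the preprocessing code shipped with ROTA: the two expansion patterns and the statute-prefix pattern are hypothetical stand-ins for the 500+ expressions and data-specific rules, which are not listed in this document.

```python
import re

# Hypothetical stand-ins for the 500+ case-insensitive expansion patterns.
EXPANSIONS = [
    (re.compile(r"\bposs\b", re.IGNORECASE), "possession"),
    (re.compile(r"\b2nd\b", re.IGNORECASE), "second"),
]

# Hypothetical data-specific prefix: a leading dotted statute number.
STATUTE_PREFIX = re.compile(r"^\s*\d+(?:\.\d+)+\s*[-:]?\s*")


def preprocess(text: str) -> str:
    """Standardize one offense description (illustrative sketch)."""
    # 1. Expand abbreviations and fix common misspellings.
    for pattern, replacement in EXPANSIONS:
        text = pattern.sub(replacement, text)
    # 2. Strip data-specific prefixes such as an embedded statute number.
    text = STATUTE_PREFIX.sub("", text)
    # 3. Replace punctuation with spaces, keeping dollar signs.
    text = re.sub(r"[^\w\s$]|_", " ", text)
    # 4. Collapse repeated whitespace and lowercase.
    text = re.sub(r"\s+", " ", text).strip()
    return text.lower()


print(preprocess("  Criminal Trespass, 2nd Degree "))
print(preprocess("13.4.2 - THEFT OVER $200"))
```

Keeping the dollar sign matters because some charge categories depend on a monetary threshold, such as "GRAND LARCENY - THEFT OVER $200".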

## Cross-Validation Performance

This model was evaluated using 3-fold cross validation. Except where noted, numbers presented below are the mean value across the 3 folds. 

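The model in this repository is trained on all available data, so it saw more training examples than any single cross-validation model. On similar data, you can therefore typically expect its performance to be somewhat better than the numbers presented below, although the size of that improvement cannot be measured.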

### Overall Metrics

| Metric   | Value |
| -------- | ----- |
| Accuracy | 0.934 |
| MCC      | 0.931 |

 

| Metric    | precision | recall | f1-score |
| --------- | --------- | ------ | -------- |
| macro avg | 0.811     | 0.786  | 0.794    |


*Note*: Each value is averaged across folds, so *macro avg* is the mean, over the 3 folds, of each fold's macro average across all categories.
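As a minimal sketch of that averaging order, the snippet below computes a macro average within each fold first and only then averages across folds. The function names and the per-class scores are made up for illustration and are not taken from the ROTA evaluation.

```python
from statistics import mean


def macro_average(per_class_scores: dict) -> float:
    """Unweighted mean of one metric over all classes in a single fold."""
    return mean(per_class_scores.values())


def mean_macro_average(folds: list) -> float:
    """Mean across folds of each fold's macro average."""
    return mean(macro_average(fold) for fold in folds)


# Illustrative per-class F1 scores for three folds (not real ROTA results).
folds = [
    {"ARSON": 0.9, "RIOTING": 0.5},
    {"ARSON": 0.8, "RIOTING": 0.6},
    {"ARSON": 1.0, "RIOTING": 0.4},
]
print(round(mean_macro_average(folds), 3))
```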

### Per-Category Metrics

| Category                                               | precision | recall | f1-score | support |
| ------------------------------------------------------ | --------- | ------ | -------- | ------- |
| AGGRAVATED ASSAULT                                     | 0.954     | 0.954  | 0.954    | 4085    |
| ARMED ROBBERY                                          | 0.961     | 0.955  | 0.958    | 1021    |
| ARSON                                                  | 0.946     | 0.954  | 0.95     | 344     |
| ASSAULTING PUBLIC OFFICER                              | 0.914     | 0.905  | 0.909    | 588     |
| AUTO THEFT                                             | 0.962     | 0.962  | 0.962    | 1660    |
| BLACKMAIL/EXTORTION/INTIMIDATION                       | 0.872     | 0.871  | 0.872    | 627     |
| BRIBERY AND CONFLICT OF INTEREST                       | 0.784     | 0.796  | 0.79     | 216     |
| BURGLARY                                               | 0.979     | 0.981  | 0.98     | 2214    |
| CHILD ABUSE                                            | 0.805     | 0.78   | 0.792    | 139     |
| COCAINE OR CRACK VIOLATION OFFENSE UNSPECIFIED         | 0.827     | 0.815  | 0.821    | 47      |
| COMMERCIALIZED VICE                                    | 0.818     | 0.788  | 0.802    | 666     |
| CONTEMPT OF COURT                                      | 0.982     | 0.987  | 0.984    | 2952    |
| CONTRIBUTING TO DELINQUENCY OF A MINOR                 | 0.544     | 0.333  | 0.392    | 50      |
| CONTROLLED SUBSTANCE - OFFENSE UNSPECIFIED             | 0.864     | 0.791  | 0.826    | 280     |
| COUNTERFEITING (FEDERAL ONLY)                          | 0         | 0      | 0        | 2       |
| DESTRUCTION OF PROPERTY                                | 0.97      | 0.968  | 0.969    | 2560    |
| DRIVING UNDER INFLUENCE - DRUGS                        | 0.567     | 0.603  | 0.581    | 34      |
| DRIVING UNDER THE INFLUENCE                            | 0.951     | 0.946  | 0.949    | 2195    |
| DRIVING WHILE INTOXICATED                              | 0.986     | 0.981  | 0.984    | 2391    |
| DRUG OFFENSES - VIOLATION/DRUG UNSPECIFIED             | 0.903     | 0.911  | 0.907    | 3100    |
| DRUNKENNESS/VAGRANCY/DISORDERLY CONDUCT                | 0.856     | 0.861  | 0.858    | 380     |
| EMBEZZLEMENT                                           | 0.865     | 0.759  | 0.809    | 100     |
| EMBEZZLEMENT (FEDERAL ONLY)                            | 0         | 0      | 0        | 1       |
| ESCAPE FROM CUSTODY                                    | 0.988     | 0.991  | 0.989    | 4035    |
| FAMILY RELATED OFFENSES                                | 0.739     | 0.773  | 0.755    | 442     |
| FELONY - UNSPECIFIED                                   | 0.692     | 0.735  | 0.712    | 122     |
| FLIGHT TO AVOID PROSECUTION                            | 0.46      | 0.407  | 0.425    | 38      |
| FORCIBLE SODOMY                                        | 0.82      | 0.8    | 0.809    | 76      |
| FORGERY (FEDERAL ONLY)                                 | 0         | 0      | 0        | 2       |
| FORGERY/FRAUD                                          | 0.911     | 0.928  | 0.919    | 4687    |
| FRAUD (FEDERAL ONLY)                                   | 0         | 0      | 0        | 2       |
| GRAND LARCENY - THEFT OVER $200                        | 0.957     | 0.973  | 0.965    | 2412    |
| HABITUAL OFFENDER                                      | 0.742     | 0.627  | 0.679    | 53      |
| HEROIN VIOLATION - OFFENSE UNSPECIFIED                 | 0.879     | 0.811  | 0.843    | 24      |
| HIT AND RUN DRIVING                                    | 0.922     | 0.94   | 0.931    | 303     |
| HIT/RUN DRIVING - PROPERTY DAMAGE                      | 0.929     | 0.918  | 0.923    | 362     |
| IMMIGRATION VIOLATIONS                                 | 0.84      | 0.609  | 0.697    | 19      |
| INVASION OF PRIVACY                                    | 0.927     | 0.923  | 0.925    | 1235    |
| JUVENILE OFFENSES                                      | 0.928     | 0.866  | 0.895    | 144     |
| KIDNAPPING                                             | 0.937     | 0.93   | 0.933    | 553     |
| LARCENY/THEFT - VALUE UNKNOWN                          | 0.955     | 0.945  | 0.95     | 3175    |
| LEWD ACT WITH CHILDREN                                 | 0.775     | 0.85   | 0.811    | 596     |
| LIQUOR LAW VIOLATIONS                                  | 0.741     | 0.768  | 0.755    | 214     |
| MANSLAUGHTER - NON-VEHICULAR                           | 0.626     | 0.802  | 0.701    | 139     |
| MANSLAUGHTER - VEHICULAR                               | 0.79      | 0.853  | 0.819    | 117     |
| MARIJUANA/HASHISH VIOLATION - OFFENSE UNSPECIFIED      | 0.741     | 0.662  | 0.699    | 62      |
| MISDEMEANOR UNSPECIFIED                                | 0.63      | 0.243  | 0.347    | 57      |
| MORALS/DECENCY - OFFENSE                               | 0.774     | 0.764  | 0.769    | 412     |
| MURDER                                                 | 0.965     | 0.915  | 0.939    | 621     |
| OBSTRUCTION - LAW ENFORCEMENT                          | 0.939     | 0.947  | 0.943    | 4220    |
| OFFENSES AGAINST COURTS, LEGISLATURES, AND COMMISSIONS | 0.881     | 0.895  | 0.888    | 1965    |
| PAROLE VIOLATION                                       | 0.97      | 0.953  | 0.962    | 946     |
| PETTY LARCENY - THEFT UNDER $200                       | 0.965     | 0.761  | 0.85     | 139     |
| POSSESSION/USE - COCAINE OR CRACK                      | 0.893     | 0.928  | 0.908    | 68      |
| POSSESSION/USE - DRUG UNSPECIFIED                      | 0.624     | 0.535  | 0.572    | 189     |
| POSSESSION/USE - HEROIN                                | 0.884     | 0.852  | 0.866    | 25      |
| POSSESSION/USE - MARIJUANA/HASHISH                     | 0.977     | 0.97   | 0.973    | 556     |
| POSSESSION/USE - OTHER CONTROLLED SUBSTANCES           | 0.975     | 0.965  | 0.97     | 3271    |
| PROBATION VIOLATION                                    | 0.963     | 0.953  | 0.958    | 1158    |
| PROPERTY OFFENSES - OTHER                              | 0.901     | 0.87   | 0.885    | 446     |
| PUBLIC ORDER OFFENSES - OTHER                          | 0.7       | 0.721  | 0.71     | 1871    |
| RACKETEERING/EXTORTION (FEDERAL ONLY)                  | 0         | 0      | 0        | 2       |
| RAPE - FORCE                                           | 0.842     | 0.873  | 0.857    | 641     |
| RAPE - STATUTORY - NO FORCE                            | 0.707     | 0.55   | 0.611    | 140     |
| REGULATORY OFFENSES (FEDERAL ONLY)                     | 0.847     | 0.567  | 0.674    | 70      |
| RIOTING                                                | 0.784     | 0.605  | 0.68     | 119     |
| SEXUAL ASSAULT - OTHER                                 | 0.836     | 0.836  | 0.836    | 971     |
| SIMPLE ASSAULT                                         | 0.976     | 0.967  | 0.972    | 4577    |
| STOLEN PROPERTY - RECEIVING                            | 0.959     | 0.957  | 0.958    | 1193    |
| STOLEN PROPERTY - TRAFFICKING                          | 0.902     | 0.888  | 0.895    | 491     |
| TAX LAW (FEDERAL ONLY)                                 | 0.373     | 0.233  | 0.286    | 30      |
| TRAFFIC OFFENSES - MINOR                               | 0.974     | 0.977  | 0.976    | 8699    |
| TRAFFICKING - COCAINE OR CRACK                         | 0.896     | 0.951  | 0.922    | 185     |
| TRAFFICKING - DRUG UNSPECIFIED                         | 0.709     | 0.795  | 0.749    | 516     |
| TRAFFICKING - HEROIN                                   | 0.871     | 0.92   | 0.894    | 54      |
| TRAFFICKING - OTHER CONTROLLED SUBSTANCES              | 0.963     | 0.954  | 0.959    | 2832    |
| TRAFFICKING MARIJUANA/HASHISH                          | 0.921     | 0.943  | 0.932    | 255     |
| TRESPASSING                                            | 0.974     | 0.98   | 0.977    | 1916    |
| UNARMED ROBBERY                                        | 0.941     | 0.939  | 0.94     | 377     |
| UNAUTHORIZED USE OF VEHICLE                            | 0.94      | 0.908  | 0.924    | 304     |
| UNSPECIFIED HOMICIDE                                   | 0.61      | 0.554  | 0.577    | 60      |
| VIOLENT OFFENSES - OTHER                               | 0.827     | 0.817  | 0.822    | 606     |
| VOLUNTARY/NONNEGLIGENT MANSLAUGHTER                    | 0.619     | 0.513  | 0.542    | 54      |
| WEAPON OFFENSE                                         | 0.943     | 0.949  | 0.946    | 2466    |

*Note: `support` is the average number of observations per fold on which predictions were evaluated, so the total number of observations per class is roughly 3x `support`.*

### Using Confidence Scores

If we interpret the classification probability as a confidence score, we can use it to filter out predictions the model is less confident about. We applied this process in 3-fold cross validation. The numbers below show how much of the prediction data is retained at a given confidence score cutoff `p`, along with the overall accuracy and MCC computed as if the model were evaluated only on that subset of confident predictions.

| cutoff | fraction retained | MCC   | accuracy |
| ------ | ----------------- | ----- | -------- |
| 0.85   | 0.952             | 0.96  | 0.961    |
| 0.9    | 0.943             | 0.964 | 0.965    |
| 0.95   | 0.928             | 0.97  | 0.971    |
| 0.975  | 0.912             | 0.975 | 0.976    |
| 0.99   | 0.886             | 0.982 | 0.983    |
| 0.999  | 0.733             | 0.995 | 0.996    |
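The filtering described in this section can be sketched as follows. This is an illustrative helper, not the evaluation code used for the table: the function name is hypothetical, the comparison is assumed to be inclusive (`>=`) because the source does not say, and MCC is omitted for brevity.

```python
def evaluate_at_cutoff(confidences, predictions, labels, cutoff):
    """Return (share of predictions retained, accuracy on the retained subset)."""
    # Keep only predictions whose confidence meets the cutoff
    # (inclusive comparison is an assumption).
    kept = [
        (pred, label)
        for conf, pred, label in zip(confidences, predictions, labels)
        if conf >= cutoff
    ]
    retained = len(kept) / len(confidences)
    # Accuracy is undefined when nothing is retained.
    if not kept:
        return retained, float("nan")
    accuracy = sum(pred == label for pred, label in kept) / len(kept)
    return retained, accuracy


# Made-up predictions for illustration only.
confidences = [0.99, 0.60, 0.95, 0.80]
predictions = ["ARSON", "MURDER", "ARSON", "RIOTING"]
labels = ["ARSON", "ARSON", "ARSON", "RIOTING"]

print(evaluate_at_cutoff(confidences, predictions, labels, cutoff=0.85))
```

Raising the cutoff trades coverage for accuracy. Predictions that fall below it can be routed to manual review instead of being accepted automatically.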