---
language:
- en
widget:
- text: theft 3
- text: forgery
- text: unlawful possession short-barreled shotgun
- text: criminal trespass 2nd degree
- text: eluding a police vehicle
- text: upcs synthetic narcotic
license: apache-2.0
---

# ROTA
## Rapid Offense Text Autocoder

[![HuggingFace Models](https://img.shields.io/badge/%F0%9F%A4%97%20models-2021.05.18.15-blue)](https://huggingface.co/rti-international/rota)
[![HuggingFace Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20spaces-2021.05.18.15-blue)](https://huggingface.co/spaces/rti-international/rota-app)
[![GitHub Model Release](https://img.shields.io/github/v/release/RTIInternational/rota?logo=github)](https://github.com/RTIInternational/rota)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.4770492.svg)](https://doi.org/10.5281/zenodo.4770492)

ROTA Application hosted on Hugging Face Spaces: https://huggingface.co/spaces/rti-international/rota-app

Criminal justice research often requires converting free-text offense descriptions into overall charge categories to aid analysis. For example, the free-text offense of "eluding a police vehicle" would be coded to a charge category of "Obstruction - Law Enforcement". Because free-text offense descriptions aren't standardized and often need to be categorized in large volumes, this can be a manual and time-intensive process for researchers. ROTA is a machine learning model for converting offense text into offense codes.

Currently, ROTA predicts the *Charge Category* of a given offense text. A *charge category* is one of the headings for offense codes in the [2009 NCRP Codebook: Appendix F](https://www.icpsr.umich.edu/web/NACJD/studies/30799/datadocumentation#).

The model was trained on [publicly available data](https://web.archive.org/web/20201021001250/https://www.icpsr.umich.edu/web/pages/NACJD/guides/ncrp.html) from a crosswalk containing offenses from all 50 states, combined with three additional hand-labeled offense text datasets.
Charge Category Example
### Data Preprocessing

The input text is standardized through a series of preprocessing steps. The text is first passed through a sequence of 500+ case-insensitive regular expressions that identify common misspellings and abbreviations and expand the text into fuller, correct English. Some data-specific prefixes and suffixes are then removed from the text (e.g., some states included a statute as part of the text). Finally, punctuation (excluding dollar signs) is removed from the input, extra spaces between words are removed, and the text is lowercased.

## Cross-Validation Performance

This model was evaluated using 3-fold cross validation. Except where noted, numbers presented below are the mean value across the 3 folds.

The model in this repository is trained on all available data. Because of this, you can typically expect production performance to be (unknowably) better than the numbers presented below.

### Overall Metrics

| Metric | Value |
| -------- | ----- |
| Accuracy | 0.934 |
| MCC | 0.931 |

| Metric | precision | recall | f1-score |
| --------- | --------- | ------ | -------- |
| macro avg | 0.811 | 0.786 | 0.794 |

*Note*: These values are averaged across folds, so *macro avg* is the mean over the 3 folds of each fold's macro average across all categories.
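The fold-averaging described in the note above can be illustrated with a minimal sketch. It uses only the Python standard library and toy labels; the function names `macro_f1` and `mean_over_folds` are hypothetical and this is not ROTA's actual evaluation code, only one plausible way to compute a macro average per fold and then average across folds.

```python
from statistics import mean


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def mean_over_folds(folds, metric):
    """Average a metric computed independently on each (y_true, y_pred) fold."""
    return mean(metric(y_true, y_pred) for y_true, y_pred in folds)


# Toy data standing in for three held-out folds (assumption: not real ROTA data).
toy_folds = [
    (["THEFT", "THEFT", "ARSON"], ["THEFT", "ARSON", "ARSON"]),
    (["THEFT", "ARSON", "ARSON"], ["THEFT", "ARSON", "ARSON"]),
    (["ARSON", "THEFT", "THEFT"], ["THEFT", "THEFT", "THEFT"]),
]

fold_scores = [macro_f1(t, p) for t, p in toy_folds]
reported = mean_over_folds(toy_folds, macro_f1)
print(fold_scores, reported)
```

Because a macro average weights every category equally regardless of its support, rare categories with low scores pull it down, which likely explains why the macro-averaged F1 is noticeably lower than the overall accuracy.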
### Per-Category Metrics

| Category | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| AGGRAVATED ASSAULT | 0.954 | 0.954 | 0.954 | 4085 |
| ARMED ROBBERY | 0.961 | 0.955 | 0.958 | 1021 |
| ARSON | 0.946 | 0.954 | 0.95 | 344 |
| ASSAULTING PUBLIC OFFICER | 0.914 | 0.905 | 0.909 | 588 |
| AUTO THEFT | 0.962 | 0.962 | 0.962 | 1660 |
| BLACKMAIL/EXTORTION/INTIMIDATION | 0.872 | 0.871 | 0.872 | 627 |
| BRIBERY AND CONFLICT OF INTEREST | 0.784 | 0.796 | 0.79 | 216 |
| BURGLARY | 0.979 | 0.981 | 0.98 | 2214 |
| CHILD ABUSE | 0.805 | 0.78 | 0.792 | 139 |
| COCAINE OR CRACK VIOLATION OFFENSE UNSPECIFIED | 0.827 | 0.815 | 0.821 | 47 |
| COMMERCIALIZED VICE | 0.818 | 0.788 | 0.802 | 666 |
| CONTEMPT OF COURT | 0.982 | 0.987 | 0.984 | 2952 |
| CONTRIBUTING TO DELINQUENCY OF A MINOR | 0.544 | 0.333 | 0.392 | 50 |
| CONTROLLED SUBSTANCE - OFFENSE UNSPECIFIED | 0.864 | 0.791 | 0.826 | 280 |
| COUNTERFEITING (FEDERAL ONLY) | 0 | 0 | 0 | 2 |
| DESTRUCTION OF PROPERTY | 0.97 | 0.968 | 0.969 | 2560 |
| DRIVING UNDER INFLUENCE - DRUGS | 0.567 | 0.603 | 0.581 | 34 |
| DRIVING UNDER THE INFLUENCE | 0.951 | 0.946 | 0.949 | 2195 |
| DRIVING WHILE INTOXICATED | 0.986 | 0.981 | 0.984 | 2391 |
| DRUG OFFENSES - VIOLATION/DRUG UNSPECIFIED | 0.903 | 0.911 | 0.907 | 3100 |
| DRUNKENNESS/VAGRANCY/DISORDERLY CONDUCT | 0.856 | 0.861 | 0.858 | 380 |
| EMBEZZLEMENT | 0.865 | 0.759 | 0.809 | 100 |
| EMBEZZLEMENT (FEDERAL ONLY) | 0 | 0 | 0 | 1 |
| ESCAPE FROM CUSTODY | 0.988 | 0.991 | 0.989 | 4035 |
| FAMILY RELATED OFFENSES | 0.739 | 0.773 | 0.755 | 442 |
| FELONY - UNSPECIFIED | 0.692 | 0.735 | 0.712 | 122 |
| FLIGHT TO AVOID PROSECUTION | 0.46 | 0.407 | 0.425 | 38 |
| FORCIBLE SODOMY | 0.82 | 0.8 | 0.809 | 76 |
| FORGERY (FEDERAL ONLY) | 0 | 0 | 0 | 2 |
| FORGERY/FRAUD | 0.911 | 0.928 | 0.919 | 4687 |
| FRAUD (FEDERAL ONLY) | 0 | 0 | 0 | 2 |
| GRAND LARCENY - THEFT OVER $200 | 0.957 | 0.973 | 0.965 | 2412 |
| HABITUAL OFFENDER | 0.742 | 0.627 | 0.679 | 53 |
| HEROIN VIOLATION - OFFENSE UNSPECIFIED | 0.879 | 0.811 | 0.843 | 24 |
| HIT AND RUN DRIVING | 0.922 | 0.94 | 0.931 | 303 |
| HIT/RUN DRIVING - PROPERTY DAMAGE | 0.929 | 0.918 | 0.923 | 362 |
| IMMIGRATION VIOLATIONS | 0.84 | 0.609 | 0.697 | 19 |
| INVASION OF PRIVACY | 0.927 | 0.923 | 0.925 | 1235 |
| JUVENILE OFFENSES | 0.928 | 0.866 | 0.895 | 144 |
| KIDNAPPING | 0.937 | 0.93 | 0.933 | 553 |
| LARCENY/THEFT - VALUE UNKNOWN | 0.955 | 0.945 | 0.95 | 3175 |
| LEWD ACT WITH CHILDREN | 0.775 | 0.85 | 0.811 | 596 |
| LIQUOR LAW VIOLATIONS | 0.741 | 0.768 | 0.755 | 214 |
| MANSLAUGHTER - NON-VEHICULAR | 0.626 | 0.802 | 0.701 | 139 |
| MANSLAUGHTER - VEHICULAR | 0.79 | 0.853 | 0.819 | 117 |
| MARIJUANA/HASHISH VIOLATION - OFFENSE UNSPECIFIED | 0.741 | 0.662 | 0.699 | 62 |
| MISDEMEANOR UNSPECIFIED | 0.63 | 0.243 | 0.347 | 57 |
| MORALS/DECENCY - OFFENSE | 0.774 | 0.764 | 0.769 | 412 |
| MURDER | 0.965 | 0.915 | 0.939 | 621 |
| OBSTRUCTION - LAW ENFORCEMENT | 0.939 | 0.947 | 0.943 | 4220 |
| OFFENSES AGAINST COURTS, LEGISLATURES, AND COMMISSIONS | 0.881 | 0.895 | 0.888 | 1965 |
| PAROLE VIOLATION | 0.97 | 0.953 | 0.962 | 946 |
| PETTY LARCENY - THEFT UNDER $200 | 0.965 | 0.761 | 0.85 | 139 |
| POSSESSION/USE - COCAINE OR CRACK | 0.893 | 0.928 | 0.908 | 68 |
| POSSESSION/USE - DRUG UNSPECIFIED | 0.624 | 0.535 | 0.572 | 189 |
| POSSESSION/USE - HEROIN | 0.884 | 0.852 | 0.866 | 25 |
| POSSESSION/USE - MARIJUANA/HASHISH | 0.977 | 0.97 | 0.973 | 556 |
| POSSESSION/USE - OTHER CONTROLLED SUBSTANCES | 0.975 | 0.965 | 0.97 | 3271 |
| PROBATION VIOLATION | 0.963 | 0.953 | 0.958 | 1158 |
| PROPERTY OFFENSES - OTHER | 0.901 | 0.87 | 0.885 | 446 |
| PUBLIC ORDER OFFENSES - OTHER | 0.7 | 0.721 | 0.71 | 1871 |
| RACKETEERING/EXTORTION (FEDERAL ONLY) | 0 | 0 | 0 | 2 |
| RAPE - FORCE | 0.842 | 0.873 | 0.857 | 641 |
| RAPE - STATUTORY - NO FORCE | 0.707 | 0.55 | 0.611 | 140 |
| REGULATORY OFFENSES (FEDERAL ONLY) | 0.847 | 0.567 | 0.674 | 70 |
| RIOTING | 0.784 | 0.605 | 0.68 | 119 |
| SEXUAL ASSAULT - OTHER | 0.836 | 0.836 | 0.836 | 971 |
| SIMPLE ASSAULT | 0.976 | 0.967 | 0.972 | 4577 |
| STOLEN PROPERTY - RECEIVING | 0.959 | 0.957 | 0.958 | 1193 |
| STOLEN PROPERTY - TRAFFICKING | 0.902 | 0.888 | 0.895 | 491 |
| TAX LAW (FEDERAL ONLY) | 0.373 | 0.233 | 0.286 | 30 |
| TRAFFIC OFFENSES - MINOR | 0.974 | 0.977 | 0.976 | 8699 |
| TRAFFICKING - COCAINE OR CRACK | 0.896 | 0.951 | 0.922 | 185 |
| TRAFFICKING - DRUG UNSPECIFIED | 0.709 | 0.795 | 0.749 | 516 |
| TRAFFICKING - HEROIN | 0.871 | 0.92 | 0.894 | 54 |
| TRAFFICKING - OTHER CONTROLLED SUBSTANCES | 0.963 | 0.954 | 0.959 | 2832 |
| TRAFFICKING MARIJUANA/HASHISH | 0.921 | 0.943 | 0.932 | 255 |
| TRESPASSING | 0.974 | 0.98 | 0.977 | 1916 |
| UNARMED ROBBERY | 0.941 | 0.939 | 0.94 | 377 |
| UNAUTHORIZED USE OF VEHICLE | 0.94 | 0.908 | 0.924 | 304 |
| UNSPECIFIED HOMICIDE | 0.61 | 0.554 | 0.577 | 60 |
| VIOLENT OFFENSES - OTHER | 0.827 | 0.817 | 0.822 | 606 |
| VOLUNTARY/NONNEGLIGENT MANSLAUGHTER | 0.619 | 0.513 | 0.542 | 54 |
| WEAPON OFFENSE | 0.943 | 0.949 | 0.946 | 2466 |

*Note: `support` is the average number of observations evaluated per fold, so the total number of observations per class is roughly 3x `support`.*

### Using Confidence Scores

If we interpret the classification probability as a confidence score, we can use it to filter out predictions that the model is less confident about. We applied this process in 3-fold cross validation. The numbers presented below indicate how much of the prediction data is retained given a confidence score cutoff of `p`. We present the overall accuracy and MCC metrics as if the model were evaluated only on this subset of confident predictions.

| Cutoff (`p`) | Proportion retained | MCC | Accuracy |
| ------ | ---------------- | ----- | ----- |
| 0.85 | 0.952 | 0.96 | 0.961 |
| 0.9 | 0.943 | 0.964 | 0.965 |
| 0.95 | 0.928 | 0.97 | 0.971 |
| 0.975 | 0.912 | 0.975 | 0.976 |
| 0.99 | 0.886 | 0.982 | 0.983 |
| 0.999 | 0.733 | 0.995 | 0.996 |
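
The cutoff procedure behind this table can be sketched as follows. This is a minimal illustration on toy records, not the code used to produce the numbers above; the function name `evaluate_at_cutoff`, the record layout, and the choice to keep predictions whose confidence is greater than or equal to the cutoff are all assumptions.

```python
def evaluate_at_cutoff(records, cutoff):
    """Return (proportion retained, accuracy on retained) for one cutoff.

    records: list of (true_label, predicted_label, confidence) tuples, where
    confidence is the predicted class probability.
    Assumption: a prediction is kept when confidence >= cutoff.
    """
    kept = [(t, p) for t, p, conf in records if conf >= cutoff]
    retained = len(kept) / len(records)
    # Accuracy is undefined when nothing passes the cutoff.
    accuracy = sum(t == p for t, p in kept) / len(kept) if kept else None
    return retained, accuracy


# Toy predictions (hypothetical values, not real ROTA output).
toy_records = [
    ("BURGLARY", "BURGLARY", 0.999),
    ("ARSON", "ARSON", 0.97),
    ("MURDER", "KIDNAPPING", 0.91),
    ("TRESPASSING", "TRESPASSING", 0.88),
    ("RIOTING", "TRESPASSING", 0.60),
]

for cutoff in (0.85, 0.9, 0.95):
    retained, accuracy = evaluate_at_cutoff(toy_records, cutoff)
    print(cutoff, retained, accuracy)
```

Raising the cutoff trades coverage for reliability: fewer predictions are kept, and the ones below the cutoff would need manual review or another fallback.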