

MLRC

MLRC (Medical, Legal, Regulatory, and Compliance) teams take weeks, and sometimes months, to review any consumer-facing content submitted by marketing agencies, e.g., website text, Facebook ads, Instagram posts, and TV ads. Content can be text, audio, images, or video. This review process, which involves tens of people across medical, legal, regulatory, and compliance roles, slows the release of ad campaigns and website content to consumers. With thousands of content jobs to review each month, the backlog also cuts into the time reviewers have for their actual day jobs, and the volume of review jobs keeps growing while pressure mounts to speed up reviews.

Inabia-AI

Inabia AI will reduce the review time from weeks to days by front-loading the review onto text content creators (e.g., marketing agencies) through a Grammarly-like web UI that performs four levels of review, similar to what MLRC reviewers conduct on the actual content.

Level-1-Review (Detection)

Find the location of problem sentences and clauses in the submitted text, i.e., error detection.

Fine-tuned BERT-large on MLRC dataset

This custom BERT-large model was fine-tuned on a balanced MLRC dataset.
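
Below is a minimal sketch of how Level-1 detection could use this checkpoint, assuming it is loaded with a two-class sequence-classification head; the label mapping (1 = problem sentence), the example sentences, and the 0.5 threshold are illustrative assumptions, not part of the released model:

# Sketch of Level-1 detection: score sentences one at a time and flag likely problems.
# Assumes a two-class sequence-classification head on top of this checkpoint;
# label 1 = "problem sentence" is an illustrative assumption.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained('Inabia-AI/bert-large-uncased-mlrc')
model = BertForSequenceClassification.from_pretrained('Inabia-AI/bert-large-uncased-mlrc', num_labels=2)
model.eval()

sentences = [
    "Our product relieves headaches in minutes.",
    "Results may vary; consult your doctor before use.",
]

with torch.no_grad():
    for sentence in sentences:
        inputs = tokenizer(sentence, return_tensors='pt', truncation=True)
        probs = torch.softmax(model(**inputs).logits, dim=-1)
        problem_prob = probs[0, 1].item()
        # Sentences whose "problem" probability crosses the threshold would be
        # surfaced to the content creator for rewording.
        if problem_prob > 0.5:
            print(f"FLAGGED ({problem_prob:.2f}): {sentence}")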

Model description

BERT is a transformers model pretrained on a large corpus of English data in a self-supervised fashion. This means it was pretrained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts. More precisely, it was pretrained with two objectives:

  • Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence (a small illustration follows this list).
  • Next sentence prediction (NSP): the model concatenates two masked sentences as inputs during pretraining. Sometimes they correspond to sentences that were next to each other in the original text, sometimes not. The model then has to predict whether the two sentences were following each other or not.
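
For illustration of the MLM objective only, the fill-mask pipeline can be run against the public bert-large-uncased checkpoint (this fine-tuned model is not guaranteed to expose an MLM head):

# Illustration of the MLM pretraining objective using the public
# bert-large-uncased checkpoint, not this fine-tuned model.
from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-large-uncased')
# The model predicts fillers for [MASK] from both left and right context.
print(unmasker("This product is intended to [MASK] symptoms of the common cold."))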

This way, the model learns an inner representation of the English language that can then be used to extract features useful for downstream tasks: if you have a dataset of labeled sentences, for instance, you can train a standard classifier using the features produced by the BERT model as inputs.
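
As a minimal sketch of that feature-extraction workflow (the example sentences, labels, and scikit-learn classifier below are placeholders, not the MLRC training setup):

# Sketch: use the pooled [CLS] features as inputs to a standard classifier.
# The sentences, labels, and logistic-regression head are placeholders.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('Inabia-AI/bert-large-uncased-mlrc')
model = BertModel.from_pretrained('Inabia-AI/bert-large-uncased-mlrc')
model.eval()

texts = ["Cures colds overnight.", "Ask your doctor if this product is right for you."]
labels = [1, 0]  # 1 = problem sentence, 0 = acceptable (illustrative labels)

with torch.no_grad():
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    features = model(**encoded).pooler_output.numpy()  # shape: (n_sentences, 1024)

clf = LogisticRegression().fit(features, labels)
print(clf.predict(features))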

The detailed release history can be found in the google-research/bert README on GitHub.

Model                                   #params  Language
bert-base-uncased                       110M     English
bert-large-uncased                      340M     English
bert-base-cased                         110M     English
bert-large-cased                        340M     English
bert-base-chinese                       110M     Chinese
bert-base-multilingual-cased            110M     Multiple
bert-large-uncased-whole-word-masking   340M     English
bert-large-cased-whole-word-masking     340M     English

Here is how to use this model to get the features of a given text in PyTorch:

from transformers import BertTokenizer, BertModel

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = BertTokenizer.from_pretrained('Inabia-AI/bert-large-uncased-mlrc')
model = BertModel.from_pretrained('Inabia-AI/bert-large-uncased-mlrc')

text = "Replace me by any text you'd like."
# Tokenize to PyTorch tensors and run a forward pass
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
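
For reference, output.last_hidden_state has shape (batch_size, sequence_length, 1024) for this BERT-large architecture and output.pooler_output has shape (batch_size, 1024); either can serve as features for a downstream classifier.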

and in TensorFlow:

from transformers import BertTokenizer, TFBertModel

# Load the tokenizer and TensorFlow model weights from the Hugging Face Hub
tokenizer = BertTokenizer.from_pretrained('Inabia-AI/bert-large-uncased-mlrc')
model = TFBertModel.from_pretrained('Inabia-AI/bert-large-uncased-mlrc')

text = "Replace me by any text you'd like."
# Tokenize to TensorFlow tensors and run a forward pass
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)

Evaluation results

When fine-tuned on downstream tasks (text classification), this model achieves the following results:

Training dataset  TP:TN  # of TPs  # of TNs  Precision  Recall  F1
MLRC              1:1    200       200       64%        55%     50%