
false-positives-scancode-bert-base-uncased-L8-1

Intended Use

This model is intended for sentence classification, used for results analysis in scancode-results-analyzer.

scancode-results-analyzer helps detect faulty scans in scancode-toolkit by using statistics, NLP modeling, and other tools to make Scancode better.

How to use

Refer to the Quickstart section in the scancode-results-analyzer documentation for installation and getting started.

Then, in the NLPModelsPredict class, the function predict_basic_false_positive uses this classifier to predict whether sentences are valid license tags or false positives.
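
For quick experimentation outside scancode-results-analyzer, here is a minimal standalone sketch of loading this checkpoint with transformers. It assumes the checkpoint loads with AutoModelForSequenceClassification; the label-to-class mapping and the example sentence are assumptions, so check them against the scancode-results-analyzer code before relying on the output.

```python
# Minimal sketch, not the scancode-results-analyzer implementation.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "ayansinha/false-positives-scancode-bert-base-uncased-L8-1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

def classify(sentence: str) -> dict:
    # The model was fine-tuned with a sentence length of 8 tokens,
    # so truncate/pad inputs to that length.
    inputs = tokenizer(
        sentence,
        truncation=True,
        padding="max_length",
        max_length=8,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze()
    label = int(probs.argmax())
    # Which index means "false positive" vs. "license tag" is an
    # assumption here; verify against NLPModelsPredict.
    return {"label": label, "confidence": float(probs[label])}

print(classify("license: mit"))
```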

Limitations and bias

As this model is a fine-tuned version of bert-base-uncased, it inherits the same biases. However, since the fine-tuning task is a very narrow domain (license tags vs. false positives) where those biases have no bearing, it is safe to assume they do not apply here.

Training and Fine-Tuning Data

The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books, and on English Wikipedia (excluding lists, tables, and headers).

This bert-base-uncased model was then fine-tuned on Scancode rule texts for sentence classification, where the two classes are:

- License Tags 
- False Positives of License Tags

Training procedure

For the fine-tuning procedure and training, refer to the scancode-results-analyzer code.

In the NLPModelsTrain class, the function prepare_input_data_false_positive prepares the training data.

In the NLPModelsTrain class, the function train_basic_false_positive_classifier fine-tunes this classifier; a sketch of an equivalent setup appears after the hyperparameter list below.

  1. Model - BertBaseUncased (weights: 0.5 GB)
  2. Sentence length - 8 tokens
  3. Labels - 2 (false positive / license tag)
  4. Fine-tuning - 4-6 epochs at learning rate 2e-5 (about 6 seconds per epoch on an RTX 2060)

Note: The classes aren't balanced.
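
For illustration, here is a hedged sketch of an equivalent fine-tuning setup using the hyperparameters listed above. The actual implementation is train_basic_false_positive_classifier in scancode-results-analyzer; the batch size and the placeholder dataset below are assumptions, standing in for prepare_input_data_false_positive.

```python
# Sketch of an equivalent fine-tuning setup; not the original code.
import torch
from torch.utils.data import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # license tag vs. false positive
)

class RuleTextDataset(Dataset):
    def __init__(self, texts, labels):
        # Sentence length 8, matching the fine-tuning configuration above.
        self.enc = tokenizer(
            texts, truncation=True, padding="max_length", max_length=8
        )
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

# Placeholder data; the real training data comes from Scancode rule texts.
train_ds = RuleTextDataset(
    ["license: mit", "see the accompanying file for details"], [0, 1]
)

args = TrainingArguments(
    output_dir="out",
    num_train_epochs=4,              # 4-6 epochs per the list above
    learning_rate=2e-5,
    per_device_train_batch_size=32,  # batch size is an assumption
)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```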

Eval results

  • Accuracy on the training data (90% of the split): 0.99 (± 0.005)
  • Accuracy on the validation data (10% of the split): 0.96 (± 0.015)

Misclassified examples tend to have lower confidence scores, so applying a threshold on the confidence score makes this almost a perfect classifier, as the classification task is comparatively easy.
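
As an illustration, here is a minimal sketch of such confidence thresholding. It is not taken from scancode-results-analyzer, and the threshold value is an assumption to be tuned on validation data.

```python
# Sketch of confidence thresholding; threshold value is illustrative.
import torch

THRESHOLD = 0.9  # assumption; tune on validation data

def predict_with_threshold(logits: torch.Tensor):
    probs = torch.softmax(logits, dim=-1)
    confidence, label = probs.max(dim=-1)
    # Low-confidence predictions are flagged rather than trusted,
    # since errors tend to have lower confidence scores.
    uncertain = confidence < THRESHOLD
    return label, confidence, uncertain
```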

Results are stable, in the sense that the fine-tuning accuracy is reliably reached on every run. More training epochs cause overfitting (the training loss keeps decreasing while the validation loss increases), although the accuracies remain very stable even when overfitting.

