---
language: en
tags:
- license
- sentence-classification
- scancode
- license-compliance
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- scancode-rules
version: 1.0
---

# `false-positives-scancode-bert-base-uncased-L8-1`

## Intended Use

This model is intended to be used for sentence classification, which is used for results
analysis in [`scancode-results-analyzer`](https://github.com/nexB/scancode-results-analyzer).

`scancode-results-analyzer` helps detect faulty scans in [`scancode-toolkit`](https://github.com/nexB/scancode-toolkit) by using statistics and NLP modeling, among other tools,
to make ScanCode better.

#### How to use

Refer to the [quickstart](https://github.com/nexB/scancode-results-analyzer#quickstart---local-machine) section of the `scancode-results-analyzer` documentation for installation and getting started.

- [Link to Code](https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py)

In the `NLPModelsPredict` class, the function `predict_basic_false_positive` uses this classifier to
predict sentences as either valid license tags or false positives.
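
Outside of `scancode-results-analyzer`, the model can also be queried directly through `transformers`. The sketch below is illustrative only: the hub repository id and the label order are assumptions, not details confirmed by this card.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Hypothetical hub id; replace with the actual repository id for this model.
MODEL_ID = "false-positives-scancode-bert-base-uncased-L8-1"

tokenizer = BertTokenizer.from_pretrained(MODEL_ID)
model = BertForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

# Sentence length 8 matches the fine-tuning setup described below.
inputs = tokenizer("license: apache-2.0", truncation=True, max_length=8,
                   padding="max_length", return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
print(probs)  # per-class confidence scores; the label order is an assumption
```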

#### Limitations and bias

As this model is a fine-tuned version of the [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) model,
it inherits the same biases. However, since the task it is fine-tuned for is a very
specific domain (license tags vs. false positives) that does not involve those biases,
it is reasonable to assume they do not apply here.

## Training and Fine-Tuning Data

The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books, and on English Wikipedia (excluding lists, tables and headers).

This `bert-base-uncased` model was then fine-tuned on ScanCode rule texts for
sentence classification, where the two classes are

- License Tags
- False Positives of License Tags

## Training procedure

For the fine-tuning procedure and training, refer to the `scancode-results-analyzer` code.

- [Link to Code](https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py)

In the `NLPModelsTrain` class, the function `prepare_input_data_false_positive` prepares the
training data.

In the `NLPModelsTrain` class, the function `train_basic_false_positive_classifier` fine-tunes
this classifier.

1. Model - [BertBaseUncased](https://huggingface.co/bert-base-uncased) (weights: 0.5 GB)
2. Sentence Length - 8
3. Labels - 2 (False Positive/License Tag)
4. Fine-tuned for 4-6 epochs with a learning rate of 2e-5 (about 6 seconds per epoch on an RTX 2060)

Note: The classes aren't balanced. A minimal sketch of this setup follows.
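
The sketch below mirrors the hyperparameters listed above (sentence length 8, 2 labels, learning rate 2e-5) but is not the `NLPModelsTrain` implementation; the in-memory example data, label order, and batch size are illustrative assumptions.

```python
import torch
from transformers import (BertForSequenceClassification, BertTokenizer,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

# Placeholder examples; the real data comes from ScanCode rule texts via
# prepare_input_data_false_positive in scancode-results-analyzer.
texts = ["license: apache-2.0", "see the accompanying file for details"]
labels = [1, 0]  # assumed order: 1 = license tag, 0 = false positive

# Sentence length 8, as in the hyperparameter list above.
enc = tokenizer(texts, truncation=True, max_length=8, padding="max_length")

class RuleDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

args = TrainingArguments(
    output_dir="false-positives-scancode",  # hypothetical output path
    num_train_epochs=5,                     # 4-6 epochs per the list above
    learning_rate=2e-5,
    per_device_train_batch_size=16,         # batch size is an assumption
)
Trainer(model=model, args=args, train_dataset=RuleDataset(enc, labels)).train()
```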

## Eval results

- Accuracy on the training data (90% split): 0.99 (± 0.005)
- Accuracy on the validation data (10% split): 0.96 (± 0.015)

The misclassified examples have lower confidence scores, so applying a threshold on the
confidence scores makes it an almost perfect classifier, as the classification task is
comparatively easy.
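
A sketch of this thresholding idea is shown below; the 0.95 cutoff is an arbitrary illustration, not a value tuned or reported for this model.

```python
import torch

def classify_with_threshold(logits: torch.Tensor, threshold: float = 0.95):
    """Return the predicted class index, or None when the confidence is
    below the threshold (i.e. the prediction is left for manual review)."""
    probs = torch.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)
    return pred.item() if conf.item() >= threshold else None
```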

Results are stable, in the sense that fine-tuning accuracy is easily reached on every
run. Training for more epochs makes the model overfit, i.e. the training loss keeps
decreasing while the validation loss increases, even though the accuracies remain very
stable even when overfitting.