---
language: en
tags:
- license
- sentence-classification
- scancode
- license-compliance
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- scancode-rules
version: 1.0
---

# `false-positives-scancode-bert-base-uncased-L8-1`

## Intended Use

This model is intended to be used for sentence classification, which is used for results
analysis in [`scancode-results-analyzer`](https://github.com/nexB/scancode-results-analyzer).

`scancode-results-analyzer` helps detect faulty scans in [`scancode-toolkit`](https://github.com/nexB/scancode-toolkit) by using statistics and NLP modeling, among other tools,
to make ScanCode better.

#### How to use

Refer to the [quickstart](https://github.com/nexB/scancode-results-analyzer#quickstart---local-machine) section of the `scancode-results-analyzer` documentation for installation and getting started.

- [Link to Code](https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py)

In the `NLPModelsPredict` class, the function `predict_basic_false_positive` uses this classifier to
predict sentences as either valid license tags or false positives.
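
Outside of `scancode-results-analyzer`, the model can also be queried directly through `transformers`. The sketch below is illustrative only: the hub repository id and the label order are assumptions, not details confirmed by this card.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

# Hypothetical hub id; replace with the actual repository id for this model.
MODEL_ID = "false-positives-scancode-bert-base-uncased-L8-1"

tokenizer = BertTokenizer.from_pretrained(MODEL_ID)
model = BertForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

# Sentence length 8 matches the fine-tuning setup described below.
inputs = tokenizer("license: apache-2.0", truncation=True, max_length=8,
                   padding="max_length", return_tensors="pt")
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)
print(probs)  # per-class confidence scores; the label order is an assumption
```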

#### Limitations and bias

As this model is a fine-tuned version of the [`bert-base-uncased`](https://huggingface.co/bert-base-uncased) model,
it inherits the same biases. However, since the task it is fine-tuned for is a very
specific domain (license tags vs. false positives) that does not involve those biases,
it is reasonable to assume they do not apply here.

## Training and Fine-Tuning Data

The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books, and on English Wikipedia (excluding lists, tables and headers).

This `bert-base-uncased` model was then fine-tuned on ScanCode rule texts for
sentence classification, where the two classes are

- License Tags
- False Positives of License Tags

## Training procedure

For the fine-tuning procedure and training, refer to the `scancode-results-analyzer` code.

- [Link to Code](https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py)

In the `NLPModelsTrain` class, the function `prepare_input_data_false_positive` prepares the
training data.

In the `NLPModelsTrain` class, the function `train_basic_false_positive_classifier` fine-tunes
this classifier.

1. Model - [BertBaseUncased](https://huggingface.co/bert-base-uncased) (weights: 0.5 GB)
2. Sentence Length - 8
3. Labels - 2 (False Positive/License Tag)
4. Fine-tuned for 4-6 epochs with a learning rate of 2e-5 (about 6 seconds per epoch on an RTX 2060)

Note: The classes aren't balanced. A minimal sketch of this setup follows.
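
The sketch below mirrors the hyperparameters listed above (sentence length 8, 2 labels, learning rate 2e-5) but is not the `NLPModelsTrain` implementation; the in-memory example data, label order, and batch size are illustrative assumptions.

```python
import torch
from transformers import (BertForSequenceClassification, BertTokenizer,
                          Trainer, TrainingArguments)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

# Placeholder examples; the real data comes from ScanCode rule texts via
# prepare_input_data_false_positive in scancode-results-analyzer.
texts = ["license: apache-2.0", "see the accompanying file for details"]
labels = [1, 0]  # assumed order: 1 = license tag, 0 = false positive

# Sentence length 8, as in the hyperparameter list above.
enc = tokenizer(texts, truncation=True, max_length=8, padding="max_length")

class RuleDataset(torch.utils.data.Dataset):
    """Wraps tokenized texts and labels for the Trainer."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

args = TrainingArguments(
    output_dir="false-positives-scancode",  # hypothetical output path
    num_train_epochs=5,                     # 4-6 epochs per the list above
    learning_rate=2e-5,
    per_device_train_batch_size=16,         # batch size is an assumption
)
Trainer(model=model, args=args, train_dataset=RuleDataset(enc, labels)).train()
```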

## Eval results

- Accuracy on the training data (90% split): 0.99 (± 0.005)
- Accuracy on the validation data (10% split): 0.96 (± 0.015)

The misclassified examples have lower confidence scores, so applying a threshold on the
confidence scores makes it an almost perfect classifier, as the classification task is
comparatively easy.
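
A sketch of this thresholding idea is shown below; the 0.95 cutoff is an arbitrary illustration, not a value tuned or reported for this model.

```python
import torch

def classify_with_threshold(logits: torch.Tensor, threshold: float = 0.95):
    """Return the predicted class index, or None when the confidence is
    below the threshold (i.e. the prediction is left for manual review)."""
    probs = torch.softmax(logits, dim=-1)
    conf, pred = probs.max(dim=-1)
    return pred.item() if conf.item() >= threshold else None
```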

Results are stable, in the sense that fine-tuning accuracy is easily reached on every
run. Training for more epochs makes the model overfit, i.e. the training loss keeps
decreasing while the validation loss increases, even though the accuracies remain very
stable even when overfitting.