system HF staff committed on
Commit
46ae5c6
1 Parent(s): 63017b4

Update README.md

Files changed (1)
  1. README.md +82 -0
README.md ADDED
@@ -0,0 +1,82 @@
---
language: en
tags:
- license
- sentence-classification
- scancode
- license-compliance
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- scancode-rules
version: 1.0
---

# `lic-class-scancode-bert-base-cased-L32-1`

## Intended Use

This model performs sentence classification for results analysis in
[`scancode-results-analyzer`](https://github.com/nexB/scancode-results-analyzer).

`scancode-results-analyzer` helps detect faulty scans in [`scancode-toolkit`](https://github.com/nexB/scancode-toolkit) by using statistics and NLP modeling, among other tools,
to make Scancode better.

## How to Use

Refer to the [quickstart](https://github.com/nexB/scancode-results-analyzer#quickstart---local-machine) section of the `scancode-results-analyzer` documentation for installation and getting started.

- [Link to Code](https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py)

In the `NLPModelsPredict` class, the function `predict_basic_lic_class` uses this classifier to
predict sentences as either valid license tags or false positives.

## Limitations and Bias

This model is a fine-tuned version of the [`bert-base-cased`](https://huggingface.co/bert-base-cased) model,
so it inherits that model's biases. However, the fine-tuning task is very specific
(classifying license text/notice/tag/reference) and does not touch the subjects those
biases concern, so they are unlikely to apply here.

## Training and Fine-Tuning Data

The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books, and on English Wikipedia (excluding lists, tables and headers).

This `bert-base-cased` model was then fine-tuned on Scancode rule texts for
sentence classification, with the following four classes:

- License Text
- License Notice
- License Tag
- License Reference

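A four-class classifier of this kind typically emits one raw score (logit) per class for each sentence. The following is a minimal sketch of turning such scores into a class name, not the project's actual code: the label order in `LABELS` and the helper names `softmax` and `predict_label` are assumptions made for illustration, and the real index-to-label mapping is defined by the fine-tuning code.

```python
import math

# Hypothetical label order; the real index-to-label mapping is defined by the
# fine-tuning code and may differ.
LABELS = ["License Text", "License Notice", "License Tag", "License Reference"]


def softmax(logits):
    """Convert raw scores to probabilities in a numerically stable way."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return the most likely label and its probability for one sentence."""
    if len(logits) != len(LABELS):
        raise ValueError("expected one score per class")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Made-up scores for a single sentence; index 1 has the highest score.
label, confidence = predict_label([0.2, 3.1, -1.0, 0.5])
print(label, round(confidence, 2))
```

Returning the probability next to the label makes it possible to flag low-confidence predictions for manual review.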
## Training Procedure

For the fine-tuning procedure and training, refer to the `scancode-results-analyzer` code.

- [Link to Code](https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py)

In the `NLPModelsTrain` class, the function `prepare_input_data_false_positive` prepares the
training data.

In the `NLPModelsTrain` class, the function `train_basic_false_positive_classifier` fine-tunes
this classifier.

1. Model - [BertBaseCased](https://huggingface.co/bert-base-cased) (Weights 0.5 GB)
2. Sentence Length - 32
3. Labels - 4 (License Text/Notice/Tag/Reference)
4. Fine-Tuning - 4 epochs with learning rate 2e-5 (60 secs each on an RTX 2060)

Note: The classes aren't balanced.

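A fixed sentence length of 32 means every tokenized input has to be cut or padded to exactly 32 token ids. The sketch below shows one simple way to do that together with the matching attention mask. It is an illustration only: `pad_or_truncate`, the padding id `0`, and the sample token ids are assumptions, and a real tokenizer usually also keeps its special end-of-sequence token when truncating.

```python
MAX_LEN = 32   # sentence length used for this model
PAD_ID = 0     # assumed padding token id; check the tokenizer's vocabulary


def pad_or_truncate(token_ids, max_len=MAX_LEN, pad_id=PAD_ID):
    """Return (input_ids, attention_mask), both exactly max_len long."""
    kept = list(token_ids[:max_len])
    mask = [1] * len(kept)            # 1 marks a real token
    padding = max_len - len(kept)
    return kept + [pad_id] * padding, mask + [0] * padding  # 0 marks padding


# Made-up token ids: one short input and one that is longer than 32 tokens.
short_ids, short_mask = pad_or_truncate([101, 1188, 1110, 102])
long_ids, long_mask = pad_or_truncate(range(100, 150))
print(len(short_ids), sum(short_mask))  # 32 4
print(len(long_ids), sum(long_mask))    # 32 32
```

The attention mask lets the model ignore padded positions, so short license tags and longer notices can share one batch.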
## Eval Results

- Accuracy on the training data (90%): 0.98 (± 0.01)
- Accuracy on the validation data (10%): 0.84 (± 0.01)

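The figures above rest on a 90%/10% split and plain accuracy. As a rough sketch of how such numbers can be computed, the snippet below shuffles the examples, splits them, and scores predictions against expected labels. How the split was actually made (for example, whether it was stratified by class) is not stated here, so `split_90_10`, the seed, and the sample labels are assumptions.

```python
import random


def split_90_10(examples, seed=0):
    """Shuffle a copy of the examples and split them 90% / 10%."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10     # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, expected):
    """Fraction of predictions that match the expected labels."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)


train, validation = split_90_10(range(50))
print(len(train), len(validation))  # 45 5

# Made-up predictions: three of the four match the expected labels.
score = accuracy(["tag", "text", "notice", "tag"], ["tag", "text", "tag", "tag"])
print(score)  # 0.75
```

Because the classes are not balanced, per-class accuracy or a confusion matrix would give a fuller picture than a single overall number.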
## Further Work

1. Applying Splitting/Aggregation Strategies
2. Data Augmentation according to Validation Errors
3. Bigger/Better Suited Models