---
language: en
tags:
- license
- sentence-classification
- scancode
- license-compliance
license: apache-2.0
datasets:
- bookcorpus
- wikipedia
- scancode-rules
version: 1.0
---

# `lic-class-scancode-bert-base-cased-L32-1`

## Intended Use

This model is intended for sentence classification, used for results
analysis in [`scancode-results-analyzer`](https://github.com/nexB/scancode-results-analyzer).

`scancode-results-analyzer` helps detect faulty scans in [`scancode-toolkit`](https://github.com/nexB/scancode-toolkit) by using statistics and NLP modeling, among other tools,
to make ScanCode better.

## How to Use

Refer to the [quickstart](https://github.com/nexB/scancode-results-analyzer#quickstart---local-machine) section of the `scancode-results-analyzer` documentation for installation and getting-started instructions.

- [Link to Code](https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py)

In the `NLPModelsPredict` class, the function `predict_basic_lic_class` uses this classifier to
predict whether sentences are valid license tags or false positives.
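
Below is a minimal, illustrative inference sketch (not the `scancode-results-analyzer` API). It assumes the fine-tuned weights are available locally or on the Hugging Face Hub under a placeholder `MODEL_ID`, and truncates input to the card's stated sentence length of 32:

```python
# Illustrative only: MODEL_ID is a placeholder for wherever the fine-tuned
# weights live (a local path or a Hugging Face Hub model ID).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "lic-class-scancode-bert-base-cased-L32-1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

sentence = "Licensed under the Apache License, Version 2.0"
# Truncate to the model's fine-tuned sentence length of 32 tokens.
inputs = tokenizer(sentence, truncation=True, max_length=32, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

predicted_class = logits.argmax(dim=-1).item()
print(predicted_class)  # index into the four license classes listed below
```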

## Limitations and Bias

As this model is a fine-tuned version of the [`bert-base-cased`](https://huggingface.co/bert-base-cased) model,
it carries the same biases. However, since the task it is fine-tuned for is very specific
(classifying license text/notice/tag/reference) and does not involve those biases,
it is reasonable to assume they largely do not apply here.

## Training and Fine-Tuning Data

The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books, and on English Wikipedia (excluding lists, tables, and headers).

This `bert-base-cased` model was then fine-tuned on ScanCode rule texts, specifically
for sentence classification, where the four classes are (see the label-mapping sketch below):

- License Text
- License Notice
- License Tag
- License Reference
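
A hypothetical label mapping for these four classes, useful when wiring the model into `transformers` config objects; the actual index order used during fine-tuning is defined in `scancode-results-analyzer` and is not stated in this card:

```python
# Hypothetical mapping; the real index order comes from scancode-results-analyzer.
ID2LABEL = {
    0: "License Text",
    1: "License Notice",
    2: "License Tag",
    3: "License Reference",
}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}
```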

## Training Procedure

For the fine-tuning procedure and training details, refer to the `scancode-results-analyzer` code.

- [Link to Code](https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py)

In the `NLPModelsTrain` class, the function `prepare_input_data_false_positive` prepares the
training data, and the function `train_basic_false_positive_classifier` fine-tunes this classifier.

1. Model - [BertBaseCased](https://huggingface.co/bert-base-cased) (weights: 0.5 GB)
2. Sentence length - 32
3. Labels - 4 (License Text/Notice/Tag/Reference)
4. Training - 4 epochs of fine-tuning at learning rate 2e-5 (about 60 seconds per epoch on an RTX 2060)

Note: The classes aren't balanced.
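
The training code itself lives in `scancode-results-analyzer`; as a rough sketch of how the hyperparameters above translate into a Hugging Face `Trainer` setup (the toy dataset, batch size, and output directory here are placeholder assumptions; the real data preparation is done by `prepare_input_data_false_positive`):

```python
# Minimal fine-tuning sketch mirroring the card's hyperparameters:
# bert-base-cased, 4 labels, max length 32, lr 2e-5, 4 epochs.
import torch
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=4
)

# Toy stand-ins for the ScanCode rule texts and their class labels.
texts = [
    "Licensed under the Apache License, Version 2.0.",
    "SPDX-License-Identifier: MIT",
]
labels = [1, 2]  # placeholder class indices

enc = tokenizer(texts, truncation=True, padding="max_length", max_length=32)

class RuleDataset(torch.utils.data.Dataset):
    """Wraps tokenized rule texts and integer labels for the Trainer."""

    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

args = TrainingArguments(
    output_dir="lic-class-bert",    # placeholder
    learning_rate=2e-5,
    num_train_epochs=4,
    per_device_train_batch_size=8,  # assumption; batch size is not stated in the card
)

trainer = Trainer(model=model, args=args, train_dataset=RuleDataset(enc, labels))
trainer.train()
```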

## Eval Results

- Accuracy on the training split (90% of the data)  : 0.98 (± 0.01)
- Accuracy on the validation split (10% of the data): 0.84 (± 0.01)

## Further Work

1. Applying Splitting/Aggregation Strategies
2. Data Augmentation According to Validation Errors
3. Bigger/Better-Suited Models