lic-class-scancode-bert-base-cased-L32-1

Intended Use

This model is intended to be used for Sentence Classification which is used for results analysis in scancode-results-analyzer.

scancode-results-analyzer helps detect faulty scans in scancode-toolkit by using statistics and nlp modeling, among other tools, to make Scancode better.

How to Use

Refer quickstart section in scancode-results-analyzer documentation, for installing and getting started.

Then in NLPModelsPredict class, function predict_basic_lic_class uses this classifier to predict sentances as either valid license tags or false positives.

Limitations and Bias

As this model is a fine-tuned version of the bert-base-cased model, it has the same biases, but as the task it is fine-tuned to is a very specific task (license text/notice/tag/referance) without those intended biases, it's safe to assume those don't apply at all here.

Training and Fine-Tuning Data

The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books and English Wikipedia (excluding lists, tables and headers).

Then this bert-base-cased model was fine-tuned on Scancode Rule texts, specifically trained in the context of sentence classification, where the four classes are

- License Text
- License Notice
- License Tag
- License Referance

Training Procedure

For fine-tuning procedure and training, refer scancode-results-analyzer code.

In NLPModelsTrain class, function prepare_input_data_false_positive prepares the training data.

In NLPModelsTrain class, function train_basic_false_positive_classifier fine-tunes this classifier.

  1. Model - BertBaseCased (Weights 0.5 GB)
  2. Sentence Length - 32
  3. Labels - 4 (License Text/Notice/Tag/Referance)
  4. After 4 Epochs of Fine-Tuning with learning rate 2e-5 (60 secs each on an RTX 2060)

Note: The classes aren't balanced.

Eval Results

  • Accuracy on the training data (90%) : 0.98 (+- 0.01)
  • Accuracy on the validation data (10%) : 0.84 (+- 0.01)

Further Work

  1. Apllying Splitting/Aggregation Strategies
  2. Data Augmentation according to Vaalidation Errors
  3. Bigger/Better Suited Models
Downloads last month
11
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Datasets used to train ayansinha/lic-class-scancode-bert-base-cased-L32-1