ayansinha
/

lic-class-scancode-bert-base-cased-L32-1

	---
	language: en
	tags:
	- license
	- sentence-classification
	- scancode
	- license-compliance
	license: apache-2.0
	datasets:
	- bookcorpus
	- wikipedia
	- scancode-rules
	version: 1.0
	---

	# `lic-class-scancode-bert-base-cased-L32-1`

	## Intended Use

	This model is intended for sentence classification, which is used for results
	analysis in [`scancode-results-analyzer`](https://github.com/nexB/scancode-results-analyzer).

	`scancode-results-analyzer` helps detect faulty scans in [`scancode-toolkit`](https://github.com/nexB/scancode-toolkit) by using statistics and NLP modeling, among other tools,
	to make Scancode better.

	## How to Use

	Refer to the [quickstart](https://github.com/nexB/scancode-results-analyzer#quickstart---local-machine) section of the `scancode-results-analyzer` documentation for installation and getting started.

	- [Link to Code](https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py)

	Then, in the `NLPModelsPredict` class, the function `predict_basic_lic_class` uses this classifier to
	predict sentences as either valid license tags or false positives.

	## Limitations and Bias

	As this model is a fine-tuned version of the [`bert-base-cased`](https://huggingface.co/bert-base-cased) model,
	it inherits the same biases. However, the task it is fine-tuned for is very specific
	(classifying license text/notice/tag/reference) and does not touch the subject matter
	where those biases show up, so they are unlikely to be relevant here.

	## Training and Fine-Tuning Data

	The BERT model was pretrained on BookCorpus, a dataset consisting of 11,038 unpublished books, and on English Wikipedia (excluding lists, tables and headers).

	This `bert-base-cased` model was then fine-tuned on Scancode Rule texts for
	sentence classification, with the following four classes:

	- License Text
	- License Notice
	- License Tag
	- License Reference
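
A classifier over these four classes typically emits one score per class, and the predicted class is the one with the highest probability. The following is a minimal sketch of that last step, not the code used in `scancode-results-analyzer`: the label order, the helper names `softmax` and `top_label`, and the example scores are all assumptions made for illustration.

```python
import math

# Hypothetical label order: the model card lists these four classes, but the
# index each one maps to in the real classifier is an assumption here.
LABELS = ["License Text", "License Notice", "License Tag", "License Reference"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits):
    """Return the most likely class name and its probability."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} scores, got {len(logits)}")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Made-up scores for a single sentence, purely for illustration.
label, prob = top_label([0.2, 2.9, -1.0, 0.4])
print(label, round(prob, 2))
```

The real mapping from class index to class name should be taken from the training code linked in this card rather than from this sketch.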

	## Training Procedure

	For the fine-tuning and training procedure, refer to the `scancode-results-analyzer` code.

	- [Link to Code](https://github.com/nexB/scancode-results-analyzer/blob/master/src/results_analyze/nlp_models.py)

	In the `NLPModelsTrain` class, the function `prepare_input_data_false_positive` prepares the
	training data.

	In the same class, the function `train_basic_false_positive_classifier` fine-tunes
	this classifier.

	1. Model - [BertBaseCased](https://huggingface.co/bert-base-cased) (weights: 0.5 GB)
	2. Sentence Length - 32
	3. Labels - 4 (License Text/Notice/Tag/Reference)
	4. Fine-Tuning - 4 epochs with learning rate 2e-5 (about 60 seconds per epoch on an RTX 2060)

	Note: The classes aren't balanced.
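
The settings listed above can be collected in one place, as in the sketch below. It is an illustration only: `FineTuneConfig`, its field names and `fit_to_length` are hypothetical and do not come from `scancode-results-analyzer`. It also assumes that "Sentence Length - 32" means inputs are cut or padded to 32 tokens and that the padding id is 0. A real tokenizer would additionally insert special tokens, which this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Values copied from the model card; the field names are hypothetical."""

    base_model: str = "bert-base-cased"
    max_len: int = 32  # "Sentence Length - 32", assumed to be counted in tokens
    num_labels: int = 4
    epochs: int = 4
    learning_rate: float = 2e-5


def fit_to_length(token_ids, max_len, pad_id=0):
    """Truncate or right-pad a list of token ids to exactly max_len items."""
    clipped = list(token_ids[:max_len])
    return clipped + [pad_id] * (max_len - len(clipped))


config = FineTuneConfig()
short = fit_to_length([101, 7, 8, 102], config.max_len)  # made-up token ids
long = fit_to_length(list(range(50)), config.max_len)
print(len(short), len(long))
```

Keeping every input at the same fixed length is what lets sentences be batched together during fine-tuning. The cost is that anything beyond the limit is dropped.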

	## Eval Results

	- Accuracy on the training split (90% of the data): 0.98 (± 0.01)
	- Accuracy on the validation split (10% of the data): 0.84 (± 0.01)
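
For reference, a 90/10 split and an accuracy score of the kind reported above can be computed as in the sketch below. The shuffling, the seed and the helper names `split_90_10` and `accuracy` are assumptions. The card does not say how the split was drawn or how the ± 0.01 spread was obtained, so this sketch does not try to reproduce either.

```python
import random


def split_90_10(examples, seed=0):
    """Shuffle the examples and split them 90% / 10% (assumed random split)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 9 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


train, val = split_90_10(range(100))
print(len(train), len(val))
print(accuracy([0, 1, 2, 3], [0, 1, 2, 0]))  # made-up labels and predictions
```

Because the classes are not balanced, a single accuracy figure can hide weak performance on the rarer classes, so per-class counts are worth checking alongside it.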

	## Further Work

	1. Applying Splitting/Aggregation Strategies
	2. Data Augmentation according to Validation Errors
	3. Bigger/Better Suited Models