|
--- |
|
license: mit |
|
language: |
|
- en |
|
--- |
|
|
|
# Model Card for Pebblo Classifier
|
|
|
This model card describes the Pebblo Classifier, a text classification model developed by DAXA.AI. It categorizes agreement documents commonly found in organizational settings and was trained on 21 distinct labels.
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
The Pebblo Classifier is a BERT-based model, fine-tuned from distilbert-base-uncased and targeted at RAG (Retrieval-Augmented Generation) applications. It classifies text into categories such as "BOARD_MEETING_AGREEMENT" and "CONSULTING_AGREEMENT," streamlining document classification workflows.
|
|
|
- **Developed by:** DAXA.AI |
|
- **Funded by:** Open Source |
|
- **Model type:** Classification model |
|
- **Language(s) (NLP):** English |
|
- **License:** MIT |
|
- **Finetuned from model:** distilbert-base-uncased |
|
|
|
### Model Sources |
|
|
|
- **Repository:** [https://huggingface.co/daxa-ai/pebblo-classifier](https://huggingface.co/daxa-ai/pebblo-classifier)
|
- **Demo:** [https://huggingface.co/spaces/daxa-ai/Daxa-Classifier](https://huggingface.co/spaces/daxa-ai/Daxa-Classifier) |
|
|
|
## Uses |
|
|
|
### Intended Use |
|
|
|
The model is designed for direct application in document classification, capable of immediate deployment without additional fine-tuning. |
|
|
|
### Recommendations |
|
|
|
End users should be aware of the potential biases and limitations inherent in the model and take them into account when interpreting its predictions.
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
```python
# Import necessary libraries
import joblib
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("daxa-ai/pebblo-classifier")
model = AutoModelForSequenceClassification.from_pretrained("daxa-ai/pebblo-classifier")

# Example text
text = "Please enter your text here."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)

# Apply softmax to the logits to obtain class probabilities
probabilities = torch.nn.functional.softmax(output.logits, dim=-1)

# Get the index of the predicted label
predicted_label = torch.argmax(probabilities, dim=-1)

# Download and cache the label encoder file from the model repository
label_encoder_path = hf_hub_download(
    repo_id="daxa-ai/pebblo-classifier",
    filename="label_encoder.joblib",
)

# Load the label encoder and decode the predicted label index
label_encoder = joblib.load(label_encoder_path)
decoded_label = label_encoder.inverse_transform(predicted_label.numpy())

print(decoded_label)
```
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The training dataset consists of 141,055 entries spanning 21 unique labels. The labels cover various document types, with instances distributed across three target text sizes (128 ± x, 256 ± x, and 512 ± x words, where x varies up to 20).
|
Here are the labels along with their respective counts in the dataset: |
|
|
|
| Agreement Type | Instances | |
|
| ------------------------------------- | --------- | |
|
| BOARD_MEETING_AGREEMENT | 4,206 | |
|
| CONSULTING_AGREEMENT | 2,965 | |
|
| CUSTOMER_LIST_AGREEMENT | 8,966 | |
|
| DISTRIBUTION_PARTNER_AGREEMENT | 5,144 | |
|
| EMPLOYEE_AGREEMENT | 3,876 | |
|
| ENTERPRISE_AGREEMENT | 4,213 | |
|
| ENTERPRISE_LICENSE_AGREEMENT | 8,999 | |
|
| EXECUTIVE_SEVERANCE_AGREEMENT | 8,996 | |
|
| FINANCIAL_REPORT_AGREEMENT | 11,384 | |
|
| HARMFUL_ADVICE | 1,887 | |
|
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT | 6,982 | |
|
| LOAN_AND_SECURITY_AGREEMENT | 8,957 | |
|
| MEDICAL_ADVICE | 3,847 | |
|
| MERGER_AGREEMENT | 7,704 | |
|
| NDA_AGREEMENT | 5,221 | |
|
| NORMAL_TEXT | 8,994 | |
|
| PATENT_APPLICATION_FILLINGS_AGREEMENT | 8,802 | |
|
| PRICE_LIST_AGREEMENT | 8,906 | |
|
| SETTLEMENT_AGREEMENT | 3,737 | |
|
| SEXUAL_CONTENT | 8,957 | |
|
| SEXUAL_INCIDENT_REPORT | 8,321 | |
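
The label distribution above is moderately imbalanced. As a rough sketch of the spread (counts copied from the table; only the extreme classes are shown):

```python
# Selected training-set label counts, copied from the table above
counts = {
    "FINANCIAL_REPORT_AGREEMENT": 11384,  # most frequent label
    "HARMFUL_ADVICE": 1887,               # least frequent label
    "CONSULTING_AGREEMENT": 2965,
    "BOARD_MEETING_AGREEMENT": 4206,
}

# Ratio between the most and least frequent classes
imbalance = max(counts.values()) / min(counts.values())
print(round(imbalance, 1))  # ~6.0 (FINANCIAL_REPORT_AGREEMENT vs HARMFUL_ADVICE)
```

A roughly 6:1 ratio between the largest and smallest classes is worth keeping in mind when reading the per-label metrics below.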
|
|
|
## Evaluation |
|
|
|
### Testing Data & Metrics |
|
|
|
#### Testing Data |
|
|
|
Evaluation was performed on a dataset of 86,281 entries, generated with a sampling temperature between 1 and 1.25 to introduce randomness.
|
Here are the labels along with their respective counts in the dataset: |
|
|
|
| Agreement Type | Instances | |
|
| ------------------------------------- | --------- | |
|
| BOARD_MEETING_AGREEMENT | 3,975 | |
|
| CONSULTING_AGREEMENT | 1,430 | |
|
| CUSTOMER_LIST_AGREEMENT | 4,488 | |
|
| DISTRIBUTION_PARTNER_AGREEMENT | 6,696 | |
|
| EMPLOYEE_AGREEMENT | 1,310 | |
|
| ENTERPRISE_AGREEMENT | 1,501 | |
|
| ENTERPRISE_LICENSE_AGREEMENT | 7,967 | |
|
| EXECUTIVE_SEVERANCE_AGREEMENT | 4,795 | |
|
| FINANCIAL_REPORT_AGREEMENT | 4,686 | |
|
| HARMFUL_ADVICE | 361 | |
|
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT | 3,740 | |
|
| LOAN_AND_SECURITY_AGREEMENT | 5,833 | |
|
| MEDICAL_ADVICE | 643 | |
|
| MERGER_AGREEMENT | 6,557 | |
|
| NDA_AGREEMENT | 1,352 | |
|
| NORMAL_TEXT | 5,811 | |
|
| PATENT_APPLICATION_FILLINGS_AGREEMENT | 5,608 | |
|
| PRICE_LIST_AGREEMENT | 5,044 | |
|
| SETTLEMENT_AGREEMENT | 5,377 | |
|
| SEXUAL_CONTENT | 4,356 | |
|
| SEXUAL_INCIDENT_REPORT | 4,750 | |
|
|
|
#### Metrics |
|
|
|
| Agreement Type | precision | recall | f1-score | support | |
|
| ------------------------------------- | --------- | ------ | -------- | ------- | |
|
| BOARD_MEETING_AGREEMENT | 0.92 | 0.95 | 0.93 | 3,975 | |
|
| CONSULTING_AGREEMENT | 0.81 | 0.85 | 0.83 | 1,430 | |
|
| CUSTOMER_LIST_AGREEMENT | 0.90 | 0.88 | 0.89 | 4,488 | |
|
| DISTRIBUTION_PARTNER_AGREEMENT | 0.73 | 0.63 | 0.68 | 6,696 | |
|
| EMPLOYEE_AGREEMENT | 0.85 | 0.84 | 0.85 | 1,310 | |
|
| ENTERPRISE_AGREEMENT | 0.18 | 0.70 | 0.29 | 1,501 | |
|
| ENTERPRISE_LICENSE_AGREEMENT | 0.92 | 0.78 | 0.84 | 7,967 | |
|
| EXECUTIVE_SEVERANCE_AGREEMENT | 0.97 | 0.88 | 0.92 | 4,795 | |
|
| FINANCIAL_REPORT_AGREEMENT | 0.93 | 0.99 | 0.96 | 4,686 | |
|
| HARMFUL_ADVICE | 0.92 | 0.94 | 0.93 | 361 | |
|
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT | 0.94 | 0.98 | 0.96 | 3,740 | |
|
| LOAN_AND_SECURITY_AGREEMENT | 0.93 | 0.97 | 0.95 | 5,833 | |
|
| MEDICAL_ADVICE | 0.93 | 1.00 | 0.96 | 643 | |
|
| MERGER_AGREEMENT | 0.93 | 0.45 | 0.61 | 6,557 | |
|
| NDA_AGREEMENT | 0.68 | 0.91 | 0.78 | 1,352 | |
|
| NORMAL_TEXT | 0.95 | 0.94 | 0.95 | 5,811 | |
|
| PATENT_APPLICATION_FILLINGS_AGREEMENT | 0.96 | 0.99 | 0.98 | 5,608 | |
|
| PRICE_LIST_AGREEMENT | 0.76 | 0.79 | 0.77 | 5,044 | |
|
| SETTLEMENT_AGREEMENT | 0.76 | 0.78 | 0.77 | 5,377 | |
|
| SEXUAL_CONTENT | 0.92 | 0.97 | 0.94 | 4,356 | |
|
| SEXUAL_INCIDENT_REPORT | 0.99 | 0.94 | 0.96 | 4,750 | |
|
| accuracy | | | 0.84 | 86,280 | |
|
| macro avg | 0.85 | 0.86 | 0.84 | 86,280 | |
|
| weighted avg | 0.88 | 0.84 | 0.85 | 86,280 | |
|
|
|
#### Results |
|
|
|
The model’s performance is summarized by the precision, recall, and f1-score metrics detailed above for all 21 labels. On the test data, the model achieved an accuracy of 0.8424, a weighted precision of 0.8794, and a weighted recall of 0.8424. The weighted f1-score, which averages each label’s f1 (the harmonic mean of that label’s precision and recall), stands at 0.8505.
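
Each per-label f1-score in the table above is the harmonic mean of that label's precision and recall; a quick check against two rows of the table:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# MERGER_AGREEMENT: precision 0.93, recall 0.45
print(round(f1(0.93, 0.45), 2))  # 0.61, matching the table

# ENTERPRISE_AGREEMENT: precision 0.18, recall 0.70
print(round(f1(0.18, 0.70), 2))  # 0.29, matching the table
```

The MERGER_AGREEMENT row illustrates why f1 matters: high precision (0.93) with low recall (0.45) still yields a mediocre f1 of 0.61, because the harmonic mean is dominated by the weaker of the two.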
|
|
|
The evaluation loss, which measures the discrepancy between the model’s predictions and the actual values, is 0.6815. Lower loss values indicate better model performance. |
|
|
|
During evaluation, the model processed approximately 97.684 samples per second (about 0.764 evaluation steps per second) over a total runtime of 883.2545 seconds.
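
The throughput figures above are mutually consistent, as a short arithmetic check shows. Note that the implied per-step batch size is an inference from these numbers, not a value stated anywhere in this card:

```python
# Throughput figures reported above
runtime_s = 883.2545
samples_per_s = 97.684
steps_per_s = 0.764

# Total samples implied by runtime * throughput (~86,280,
# matching the support column of the metrics table)
total_samples = runtime_s * samples_per_s

# Implied per-step (batch) size -- roughly 128; this is an
# inference from the reported rates, not a documented setting
implied_batch = samples_per_s / steps_per_s

print(round(total_samples), round(implied_batch))  # 86280 128
```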
|
|