
Model Card for Pebblo Classifier

This model card describes the Pebblo Classifier, a text classification model developed by DAXA.AI. It categorizes agreement documents commonly found within organizations and was trained on 20 distinct labels.

Model Details

Model Description

The Pebblo Classifier is a DistilBERT-based model, fine-tuned from distilbert-base-uncased, targeting RAG (Retrieval-Augmented Generation) applications. It classifies text into categories such as "BOARD_MEETING_AGREEMENT" and "CONSULTING_AGREEMENT", streamlining document classification.

  • Developed by: DAXA.AI
  • Funded by: Open Source
  • Model type: Classification model
  • Language(s) (NLP): English
  • License: MIT
  • Finetuned from model: distilbert-base-uncased

Model Sources

Uses

Intended Use

The model is designed for direct application in document classification, capable of immediate deployment without additional fine-tuning.

Recommendations

Users should be aware of the model's potential biases and limitations, and should take them into account before relying on its predictions.

How to Get Started with the Model

Use the code below to get started with the model.

# Import necessary libraries
import joblib
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hugging Face model repository and the label encoder file stored in it
REPO_NAME = "daxa-ai/pebblo-classifier"
LABEL_ENCODER_FILE = "label_encoder.joblib"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(REPO_NAME)
model = AutoModelForSequenceClassification.from_pretrained(REPO_NAME)
model.eval()

# Example text; truncation keeps the input within the model's maximum length
text = "Please enter your text here."
encoded_input = tokenizer(text, return_tensors="pt", truncation=True)

# Run inference without tracking gradients
with torch.no_grad():
    output = model(**encoded_input)

# Apply softmax to the logits
probabilities = torch.nn.functional.softmax(output.logits, dim=-1)

# Get the predicted label index
predicted_label = torch.argmax(probabilities, dim=-1)

# Download and cache the label encoder file
# (hf_hub_download replaces cached_download, which was deprecated and later removed from huggingface_hub)
filename = hf_hub_download(repo_id=REPO_NAME, filename=LABEL_ENCODER_FILE)

# Load the label encoder
label_encoder = joblib.load(filename)

# Decode the predicted label index into its label name
decoded_label = label_encoder.inverse_transform(predicted_label.numpy())

print(decoded_label)
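
The snippet above reports only the single most likely label. When a document could plausibly belong to several agreement types, it can help to look at the top few candidates together with their probabilities. The following is a minimal sketch in pure Python, not part of the Pebblo code base: the helper names `softmax` and `top_k_labels` are hypothetical, and the logits in the demo are made-up values for illustration only.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top_k_labels(logits, labels, k=3):
    """Return the k most probable (label, probability) pairs, best first.

    Hypothetical helper for illustration; `labels` must be in the same
    order as the model's output logits.
    """
    probabilities = softmax(logits)
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up logits for three of the card's labels, for demonstration only
demo_labels = ["BOARD_MEETING_AGREEMENT", "CONSULTING_AGREEMENT", "NDA_AGREEMENT"]
demo_logits = [0.1, 2.0, -1.0]

for label, probability in top_k_labels(demo_logits, demo_labels, k=2):
    print(f"{label}: {probability:.3f}")
```

With the real model, the inputs would likely be `output.logits[0].tolist()` and `list(label_encoder.classes_)`; the latter assumes the stored encoder is a scikit-learn `LabelEncoder`, which the card does not state explicitly.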

Training Details

Training Data

The training dataset consists of 131,771 entries with 20 unique labels. The labels span various document types, with instances distributed across three text sizes: roughly 128, 256, and 512 words, each varying by up to about 20 words. The labels and their respective counts in the dataset are:

Agreement Type Instances
BOARD_MEETING_AGREEMENT 4,225
CONSULTING_AGREEMENT 2,965
CUSTOMER_LIST_AGREEMENT 9,000
DISTRIBUTION_PARTNER_AGREEMENT 5,162
EMPLOYEE_AGREEMENT 3,921
ENTERPRISE_AGREEMENT 4,217
ENTERPRISE_LICENSE_AGREEMENT 9,000
EXECUTIVE_SEVERANCE_AGREEMENT 9,000
FINANCIAL_REPORT_AGREEMENT 8,381
HARMFUL_ADVICE 2,025
INTERNAL_PRODUCT_ROADMAP_AGREEMENT 7,037
LOAN_AND_SECURITY_AGREEMENT 9,000
MEDICAL_ADVICE 2,359
MERGER_AGREEMENT 7,706
NDA_AGREEMENT 5,229
NORMAL_TEXT 9,000
PATENT_APPLICATION_FILLINGS_AGREEMENT 9,000
PRICE_LIST_AGREEMENT 9,000
SETTLEMENT_AGREEMENT 3,754
SEXUAL_HARRASSMENT 8,321
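
Because the training texts are roughly 128, 256, or 512 words long, inputs that are much longer may be better handled by splitting them into windows of similar size and classifying each window separately. This is an assumption rather than a documented recommendation; note also that distilbert-base-uncased accepts at most 512 tokens, which is usually fewer than 512 words. The helper `chunk_words` below is a hypothetical sketch, and its default window size and overlap are illustrative choices.

```python
def chunk_words(text, size=256, overlap=32):
    """Split text into windows of at most `size` words.

    Consecutive windows share `overlap` words so that content near a
    boundary appears in both. Hypothetical helper for illustration.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")

    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


# Demo with a synthetic 600-word document
document = " ".join(f"word{i}" for i in range(600))
windows = chunk_words(document, size=256, overlap=0)
print(len(windows), [len(window.split()) for window in windows])
```

Each window can then be passed through the tokenizer and model as in the getting-started snippet, and the per-window predictions combined, for example by majority vote.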

Evaluation

Testing Data & Metrics

Testing Data

Evaluation was performed on a dataset of 82,917 entries with a temperature range of 1-1.25 for randomness. Here are the labels along with their respective counts in the dataset:

Agreement Type Instances
BOARD_MEETING_AGREEMENT 4,335
CONSULTING_AGREEMENT 1,533
CUSTOMER_LIST_AGREEMENT 4,995
DISTRIBUTION_PARTNER_AGREEMENT 7,231
EMPLOYEE_AGREEMENT 1,433
ENTERPRISE_AGREEMENT 1,616
ENTERPRISE_LICENSE_AGREEMENT 8,574
EXECUTIVE_SEVERANCE_AGREEMENT 5,177
FINANCIAL_REPORT_AGREEMENT 4,264
HARMFUL_ADVICE 474
INTERNAL_PRODUCT_ROADMAP_AGREEMENT 4,116
LOAN_AND_SECURITY_AGREEMENT 6,354
MEDICAL_ADVICE 289
MERGER_AGREEMENT 7,079
NDA_AGREEMENT 1,452
NORMAL_TEXT 8,335
PATENT_APPLICATION_FILLINGS_AGREEMENT 6,177
PRICE_LIST_AGREEMENT 5,453
SETTLEMENT_AGREEMENT 5,806
SEXUAL_HARRASSMENT 4,750

Metrics

Agreement Type                          precision  recall  f1-score  support
BOARD_MEETING_AGREEMENT                 0.96       0.94    0.95      4335
CONSULTING_AGREEMENT                    0.77       0.89    0.83      1533
CUSTOMER_LIST_AGREEMENT                 0.84       0.87    0.85      4995
DISTRIBUTION_PARTNER_AGREEMENT          0.71       0.64    0.67      7231
EMPLOYEE_AGREEMENT                      0.78       0.90    0.83      1433
ENTERPRISE_AGREEMENT                    0.19       0.72    0.30      1616
ENTERPRISE_LICENSE_AGREEMENT            0.92       0.78    0.84      8574
EXECUTIVE_SEVERANCE_AGREEMENT           0.96       0.85    0.90      5177
FINANCIAL_REPORT_AGREEMENT              0.92       0.98    0.95      4264
HARMFUL_ADVICE                          0.82       0.92    0.87      474
INTERNAL_PRODUCT_ROADMAP_AGREEMENT      0.94       0.97    0.96      4116
LOAN_AND_SECURITY_AGREEMENT             0.92       0.96    0.94      6354
MEDICAL_ADVICE                          0.76       1.00    0.86      289
MERGER_AGREEMENT                        0.90       0.55    0.68      7079
NDA_AGREEMENT                           0.62       0.89    0.74      1452
NORMAL_TEXT                             0.99       0.99    0.99      6049
PATENT_APPLICATION_FILLINGS_AGREEMENT   0.95       0.99    0.97      6177
PRICE_LIST_AGREEMENT                    0.81       0.75    0.78      5453
SETTLEMENT_AGREEMENT                    0.83       0.73    0.78      5806
SEXUAL_HARRASSMENT                      0.98       0.93    0.96      4750
accuracy                                                   0.84      87157
macro avg                               0.83       0.86    0.83      87157
weighted avg                            0.87       0.84    0.85      87157
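
To make the relationship between the columns concrete, the sketch below recomputes the F1-score from the precision and recall reported for a few rows of the table. The function name `f1_score` is chosen here for illustration. Because the table values are rounded to two decimals, a recomputed F1 can differ from the reported one by about 0.01.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported f1-score) copied from the metrics table
rows = {
    "DISTRIBUTION_PARTNER_AGREEMENT": (0.71, 0.64, 0.67),
    "ENTERPRISE_AGREEMENT": (0.19, 0.72, 0.30),
    "MERGER_AGREEMENT": (0.90, 0.55, 0.68),
}

for label, (precision, recall, reported) in rows.items():
    computed = f1_score(precision, recall)
    print(f"{label}: computed {computed:.4f}, reported {reported:.2f}")
```

The ENTERPRISE_AGREEMENT row shows why F1 is worth reading alongside recall: the model finds 72% of those documents, but only 19% of its ENTERPRISE_AGREEMENT predictions are correct, so the F1-score is low.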

Results

Per-label precision, recall, and F1-score for all 20 labels are listed in the table above. On the test data, the model achieved an accuracy of 0.8376, with a weighted-average precision of 0.8744, a weighted-average recall of 0.8376, and a weighted-average F1-score of 0.8478. For each label, the F1-score is the harmonic mean of that label's precision and recall; the weighted average combines the per-label scores in proportion to each label's support.

The evaluation loss, which measures the discrepancy between the model’s predictions and the actual values, is 0.5616. Lower loss values indicate better model performance.

During evaluation, the model processed approximately 101.886 samples per second and approximately 0.796 evaluation steps per second, with a total runtime of 855.4327 seconds.
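
As a quick sanity check, the throughput figures can be multiplied out to recover the size of the evaluation run. The sketch below uses only the three numbers reported above. The implied batch size is an inference from those numbers, not something the card states.

```python
# Figures reported in the evaluation summary
samples_per_second = 101.886
steps_per_second = 0.796
runtime_seconds = 855.4327

# Total work done over the run
total_samples = samples_per_second * runtime_seconds
total_steps = steps_per_second * runtime_seconds

# Samples per step; an inferred value, not stated in the model card
implied_batch_size = total_samples / total_steps

print(f"samples evaluated:  {total_samples:.0f}")
print(f"evaluation steps:   {total_steps:.0f}")
print(f"implied batch size: {implied_batch_size:.0f}")
```

The recovered sample count agrees with the total support of 87157 in the metrics table.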
