|
--- |
|
license: mit |
|
language: |
|
- en |
|
--- |
|
|
|
# Model Card for Pebblo Classifier
|
|
|
This model card describes the Pebblo Classifier, a text classification model developed by DAXA.AI. It categorizes agreement documents commonly found in organizational settings and was trained on 21 distinct labels.
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
The Pebblo Classifier is a BERT-based model, fine-tuned from distilbert-base-uncased and targeted at RAG (Retrieval-Augmented Generation) applications. It classifies text into categories such as "BOARD_MEETING_AGREEMENT" and "CONSULTING_AGREEMENT," streamlining document classification workflows.
|
|
|
- **Developed by:** DAXA.AI |
|
- **Funded by:** Open Source |
|
- **Model type:** Classification model |
|
- **Language(s) (NLP):** English |
|
- **License:** MIT |
|
- **Finetuned from model:** distilbert-base-uncased |
|
|
|
### Model Sources |
|
|
|
- **Repository:** [https://huggingface.co/daxa-ai/pebblo-classifier](https://huggingface.co/daxa-ai/pebblo-classifier)
|
- **Demo:** [https://huggingface.co/spaces/daxa-ai/Daxa-Classifier](https://huggingface.co/spaces/daxa-ai/Daxa-Classifier) |
|
|
|
## Uses |
|
|
|
### Intended Use |
|
|
|
The model is designed for direct application in document classification, capable of immediate deployment without additional fine-tuning. |
|
|
|
### Recommendations |
|
|
|
End users should be aware of the potential biases and limitations inherent in the model and take them into account when interpreting its predictions.
|
|
|
## How to Get Started with the Model |
|
|
|
Use the code below to get started with the model. |
|
|
|
```python
# Import necessary libraries
import joblib
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("daxa-ai/pebblo-classifier")
model = AutoModelForSequenceClassification.from_pretrained("daxa-ai/pebblo-classifier")

# Example text
text = "Please enter your text here."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)

# Apply softmax to the logits to obtain class probabilities
probabilities = torch.nn.functional.softmax(output.logits, dim=-1)

# Get the index of the predicted label
predicted_label = torch.argmax(probabilities, dim=-1)

# Download and cache the label encoder file from the model repository
label_encoder_path = hf_hub_download(
    repo_id="daxa-ai/pebblo-classifier",
    filename="label_encoder.joblib",
)

# Load the label encoder and decode the predicted label index
label_encoder = joblib.load(label_encoder_path)
decoded_label = label_encoder.inverse_transform(predicted_label.numpy())

print(decoded_label)
```
|
|
|
## Training Details |
|
|
|
### Training Data |
|
|
|
The training dataset consists of 141,055 entries spanning 21 unique labels. The labels cover various document types, with instances distributed across three target text sizes (128 ± x, 256 ± x, and 512 ± x words, where x varies up to 20).
|
Here are the labels along with their respective counts in the dataset: |
|
|
|
| Agreement Type | Instances | |
|
| ------------------------------------- | --------- | |
|
| BOARD_MEETING_AGREEMENT | 4,206 | |
|
| CONSULTING_AGREEMENT | 2,965 | |
|
| CUSTOMER_LIST_AGREEMENT | 8,966 | |
|
| DISTRIBUTION_PARTNER_AGREEMENT | 5,144 | |
|
| EMPLOYEE_AGREEMENT | 3,876 | |
|
| ENTERPRISE_AGREEMENT | 4,213 | |
|
| ENTERPRISE_LICENSE_AGREEMENT | 8,999 | |
|
| EXECUTIVE_SEVERANCE_AGREEMENT | 8,996 | |
|
| FINANCIAL_REPORT_AGREEMENT | 11,384 | |
|
| HARMFUL_ADVICE | 1,887 | |
|
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT | 6,982 | |
|
| LOAN_AND_SECURITY_AGREEMENT | 8,957 | |
|
| MEDICAL_ADVICE | 3,847 | |
|
| MERGER_AGREEMENT | 7,704 | |
|
| NDA_AGREEMENT | 5,221 | |
|
| NORMAL_TEXT | 8,994 | |
|
| PATENT_APPLICATION_FILLINGS_AGREEMENT | 8,802 | |
|
| PRICE_LIST_AGREEMENT | 8,906 | |
|
| SETTLEMENT_AGREEMENT | 3,737 | |
|
| SEXUAL_CONTENT | 8,957 | |
|
| SEXUAL_INCIDENT_REPORT | 8,321 | |
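
The label distribution above is moderately imbalanced. As a rough sketch of the spread (counts copied from the table; only the extreme classes are shown):

```python
# Selected training-set label counts, copied from the table above
counts = {
    "FINANCIAL_REPORT_AGREEMENT": 11384,  # most frequent label
    "HARMFUL_ADVICE": 1887,               # least frequent label
    "CONSULTING_AGREEMENT": 2965,
    "BOARD_MEETING_AGREEMENT": 4206,
}

# Ratio between the most and least frequent classes
imbalance = max(counts.values()) / min(counts.values())
print(round(imbalance, 1))  # ~6.0 (FINANCIAL_REPORT_AGREEMENT vs HARMFUL_ADVICE)
```

A roughly 6:1 ratio between the largest and smallest classes is worth keeping in mind when reading the per-label metrics below.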
|
|
|
## Evaluation |
|
|
|
### Testing Data & Metrics |
|
|
|
#### Testing Data |
|
|
|
Evaluation was performed on a dataset of 86,281 entries, generated with a sampling temperature between 1 and 1.25 to introduce randomness.
|
Here are the labels along with their respective counts in the dataset: |
|
|
|
| Agreement Type | Instances | |
|
| ------------------------------------- | --------- | |
|
| BOARD_MEETING_AGREEMENT | 3,975 | |
|
| CONSULTING_AGREEMENT | 1,430 | |
|
| CUSTOMER_LIST_AGREEMENT | 4,488 | |
|
| DISTRIBUTION_PARTNER_AGREEMENT | 6,696 | |
|
| EMPLOYEE_AGREEMENT | 1,310 | |
|
| ENTERPRISE_AGREEMENT | 1,501 | |
|
| ENTERPRISE_LICENSE_AGREEMENT | 7,967 | |
|
| EXECUTIVE_SEVERANCE_AGREEMENT | 4,795 | |
|
| FINANCIAL_REPORT_AGREEMENT | 4,686 | |
|
| HARMFUL_ADVICE | 361 | |
|
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT | 3,740 | |
|
| LOAN_AND_SECURITY_AGREEMENT | 5,833 | |
|
| MEDICAL_ADVICE | 643 | |
|
| MERGER_AGREEMENT | 6,557 | |
|
| NDA_AGREEMENT | 1,352 | |
|
| NORMAL_TEXT | 5,811 | |
|
| PATENT_APPLICATION_FILLINGS_AGREEMENT | 5,608 | |
|
| PRICE_LIST_AGREEMENT | 5,044 | |
|
| SETTLEMENT_AGREEMENT | 5,377 | |
|
| SEXUAL_CONTENT | 4,356 | |
|
| SEXUAL_INCIDENT_REPORT | 4,750 | |
|
|
|
#### Metrics |
|
|
|
| Agreement Type | precision | recall | f1-score | support | |
|
| ------------------------------------- | --------- | ------ | -------- | ------- | |
|
| BOARD_MEETING_AGREEMENT | 0.92 | 0.95 | 0.93 | 3,975 | |
|
| CONSULTING_AGREEMENT | 0.81 | 0.85 | 0.83 | 1,430 | |
|
| CUSTOMER_LIST_AGREEMENT | 0.90 | 0.88 | 0.89 | 4,488 | |
|
| DISTRIBUTION_PARTNER_AGREEMENT | 0.73 | 0.63 | 0.68 | 6,696 | |
|
| EMPLOYEE_AGREEMENT | 0.85 | 0.84 | 0.85 | 1,310 | |
|
| ENTERPRISE_AGREEMENT | 0.18 | 0.70 | 0.29 | 1,501 | |
|
| ENTERPRISE_LICENSE_AGREEMENT | 0.92 | 0.78 | 0.84 | 7,967 | |
|
| EXECUTIVE_SEVERANCE_AGREEMENT | 0.97 | 0.88 | 0.92 | 4,795 | |
|
| FINANCIAL_REPORT_AGREEMENT | 0.93 | 0.99 | 0.96 | 4,686 | |
|
| HARMFUL_ADVICE | 0.92 | 0.94 | 0.93 | 361 | |
|
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT | 0.94 | 0.98 | 0.96 | 3,740 | |
|
| LOAN_AND_SECURITY_AGREEMENT | 0.93 | 0.97 | 0.95 | 5,833 | |
|
| MEDICAL_ADVICE | 0.93 | 1.00 | 0.96 | 643 | |
|
| MERGER_AGREEMENT | 0.93 | 0.45 | 0.61 | 6,557 | |
|
| NDA_AGREEMENT | 0.68 | 0.91 | 0.78 | 1,352 | |
|
| NORMAL_TEXT | 0.95 | 0.94 | 0.95 | 5,811 | |
|
| PATENT_APPLICATION_FILLINGS_AGREEMENT | 0.96 | 0.99 | 0.98 | 5,608 | |
|
| PRICE_LIST_AGREEMENT | 0.76 | 0.79 | 0.77 | 5,044 | |
|
| SETTLEMENT_AGREEMENT | 0.76 | 0.78 | 0.77 | 5,377 | |
|
| SEXUAL_CONTENT | 0.92 | 0.97 | 0.94 | 4,356 | |
|
| SEXUAL_INCIDENT_REPORT | 0.99 | 0.94 | 0.96 | 4,750 | |
|
| accuracy | | | 0.84 | 86,280 | |
|
| macro avg | 0.85 | 0.86 | 0.84 | 86,280 | |
|
| weighted avg | 0.88 | 0.84 | 0.85 | 86,280 | |
|
|
|
#### Results |
|
|
|
The model’s performance is summarized by the precision, recall, and f1-score metrics detailed above for all 21 labels. On the test data, the model achieved an accuracy of 0.8424, a weighted precision of 0.8794, and a weighted recall of 0.8424. The weighted f1-score, which averages each label’s f1 (the harmonic mean of that label’s precision and recall), stands at 0.8505.
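
Each per-label f1-score in the table above is the harmonic mean of that label's precision and recall; a quick check against two rows of the table:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# MERGER_AGREEMENT: precision 0.93, recall 0.45
print(round(f1(0.93, 0.45), 2))  # 0.61, matching the table

# ENTERPRISE_AGREEMENT: precision 0.18, recall 0.70
print(round(f1(0.18, 0.70), 2))  # 0.29, matching the table
```

The MERGER_AGREEMENT row illustrates why f1 matters: high precision (0.93) with low recall (0.45) still yields a mediocre f1 of 0.61, because the harmonic mean is dominated by the weaker of the two.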
|
|
|
The evaluation loss, which measures the discrepancy between the model’s predictions and the actual values, is 0.6815. Lower loss values indicate better model performance. |
|
|
|
During evaluation, the model processed approximately 97.684 samples per second (about 0.764 evaluation steps per second) over a total runtime of 883.2545 seconds.
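
The throughput figures above are mutually consistent, as a short arithmetic check shows. Note that the implied per-step batch size is an inference from these numbers, not a value stated anywhere in this card:

```python
# Throughput figures reported above
runtime_s = 883.2545
samples_per_s = 97.684
steps_per_s = 0.764

# Total samples implied by runtime * throughput (~86,280,
# matching the support column of the metrics table)
total_samples = runtime_s * samples_per_s

# Implied per-step (batch) size -- roughly 128; this is an
# inference from the reported rates, not a documented setting
implied_batch = samples_per_s / steps_per_s

print(round(total_samples), round(implied_batch))  # 86280 128
```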
|
|