---
license: mit
language:
- en
---

# Model Card for Pebblo Classifier

This model card describes the Pebblo Classifier, a machine-learning model for text classification. Developed by DAXA.AI, it categorizes agreement documents commonly found in organizations and was trained on 20 distinct labels.

## Model Details

### Model Description

The Pebblo Classifier is a DistilBERT-based model, fine-tuned from distilbert-base-uncased and targeted at RAG (Retrieval-Augmented Generation) applications. It classifies text into categories such as "BOARD_MEETING_AGREEMENT" and "CONSULTING_AGREEMENT", streamlining document classification workflows.

- **Developed by:** DAXA.AI
- **Funded by:** Open Source
- **Model type:** Classification model
- **Language(s) (NLP):** English
- **License:** MIT
- **Finetuned from model:** distilbert-base-uncased

### Model Sources

- **Repository:** [https://huggingface.co/daxa-ai/pebblo-classifier](https://huggingface.co/daxa-ai/pebblo-classifier)
- **Demo:** [https://huggingface.co/spaces/daxa-ai/Daxa-Classifier](https://huggingface.co/spaces/daxa-ai/Daxa-Classifier)

## Uses

### Intended Use

The model is designed for direct application in document classification, capable of immediate deployment without additional fine-tuning.

### Recommendations

End users should be aware of the model's potential biases and limitations and account for them when interpreting its predictions.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
# Import necessary libraries
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import joblib
from huggingface_hub import hf_hub_download

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("daxa-ai/pebblo-classifier")
model = AutoModelForSequenceClassification.from_pretrained("daxa-ai/pebblo-classifier")

# Example text
text = "Please enter your text here."
encoded_input = tokenizer(text, return_tensors='pt')
with torch.no_grad():
    output = model(**encoded_input)

# Apply softmax to the logits
probabilities = torch.nn.functional.softmax(output.logits, dim=-1)

# Get the predicted label
predicted_label = torch.argmax(probabilities, dim=-1)

# Hugging Face model repository that also hosts the fitted label encoder
REPO_NAME = "daxa-ai/pebblo-classifier"

# Path to the label encoder file in the repository
LABEL_ENCODER_FILE = "label encoder.joblib"

# Download and cache the label encoder file
# (hf_hub_download replaces the deprecated hf_hub_url/cached_download pair)
label_encoder_path = hf_hub_download(repo_id=REPO_NAME, filename=LABEL_ENCODER_FILE)

# Load the label encoder
label_encoder = joblib.load(label_encoder_path)

# Decode the predicted label
decoded_label = label_encoder.inverse_transform(predicted_label.numpy())

print(decoded_label)

```
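
As a lighter-weight alternative, the same model can be run through the `transformers` pipeline API. This is a sketch rather than part of the official quick start: the pipeline reports whatever `id2label` mapping is stored in the model config, which may be generic (e.g. `LABEL_0`); in that case, decode the predicted index with the label encoder as shown above.

```python
# A compact alternative using the transformers text-classification pipeline.
from transformers import pipeline

classifier = pipeline("text-classification", model="daxa-ai/pebblo-classifier")

# Illustrative input; any agreement-like text works here
result = classifier("This consulting agreement is entered into by and between...")
print(result)  # e.g. [{'label': '...', 'score': 0.97}]
```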

## Training Details

### Training Data

The training dataset consists of 131,771 entries spanning 20 unique labels. Instances are distributed across three text sizes (128 ± x, 256 ± x, and 512 ± x words, where x varies up to 20); because the largest training texts approach the model's 512-token input limit, longer documents should be truncated or chunked at inference time (see the sketch after the table below).
Here are the labels along with their respective counts in the dataset:

| Agreement Type                          | Instances |
| --------------------------------------- | --------- |
| BOARD_MEETING_AGREEMENT                 | 4,225     |
| CONSULTING_AGREEMENT                    | 2,965     |
| CUSTOMER_LIST_AGREEMENT                 | 9,000     |
| DISTRIBUTION_PARTNER_AGREEMENT          | 8,339     |
| EMPLOYEE_AGREEMENT                      | 3,921     |
| ENTERPRISE_AGREEMENT                    | 3,820     |
| ENTERPRISE_LICENSE_AGREEMENT            | 9,000     |
| EXECUTIVE_SEVERANCE_AGREEMENT           | 9,000     |
| FINANCIAL_REPORT_AGREEMENT              | 8,381     |
| HARMFUL_ADVICE                          | 2,025     |
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT      | 7,037     |
| LOAN_AND_SECURITY_AGREEMENT             | 9,000     |
| MEDICAL_ADVICE                          | 2,359     |
| MERGER_AGREEMENT                        | 7,706     |
| NDA_AGREEMENT                           | 2,966     |
| NORMAL_TEXT                             | 6,742     |
| PATENT_APPLICATION_FILLINGS_AGREEMENT   | 9,000     |
| PRICE_LIST_AGREEMENT                    | 9,000     |
| SETTLEMENT_AGREEMENT                    | 9,000     |
| SEXUAL_HARRASSMENT                      | 8,321     |
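
Because the largest training texts approach DistilBERT's 512-token sequence limit, documents longer than that must be truncated or chunked before inference. The sketch below is an illustration, not part of the original card: it splits an over-long document into overlapping windows and averages the per-chunk probabilities, and the 64-token stride is an assumed, illustrative value.

```python
# Sketch: classify a document longer than the model's 512-token limit by
# splitting it into overlapping chunks and averaging per-chunk probabilities.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("daxa-ai/pebblo-classifier")
model = AutoModelForSequenceClassification.from_pretrained("daxa-ai/pebblo-classifier")

long_text = "..."  # placeholder for a document longer than 512 tokens

chunks = tokenizer(
    long_text,
    truncation=True,
    max_length=512,
    stride=64,                       # overlap between consecutive chunks (assumed value)
    return_overflowing_tokens=True,  # emit every chunk, not just the first
    padding=True,
    return_tensors="pt",
)
chunks.pop("overflow_to_sample_mapping", None)  # bookkeeping field, not a model input

with torch.no_grad():
    logits = model(**chunks).logits

# Average probabilities across chunks, then take the most likely class index
probabilities = torch.nn.functional.softmax(logits, dim=-1).mean(dim=0)
predicted_label_id = probabilities.argmax().item()
```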



## Evaluation

### Testing Data & Metrics

#### Testing Data
Evaluation was performed on a dataset of 82,917 entries, generated with a sampling temperature between 1 and 1.25 to introduce randomness.
Here are the labels along with their respective counts in the dataset:

| Agreement Type                          | Instances |
| --------------------------------------- | --------- |
| BOARD_MEETING_AGREEMENT                 | 4,335     |
| CONSULTING_AGREEMENT                    | 1,533     |
| CUSTOMER_LIST_AGREEMENT                 | 4,995     |
| DISTRIBUTION_PARTNER_AGREEMENT          | 7,231     |
| EMPLOYEE_AGREEMENT                      | 1,433     |
| ENTERPRISE_AGREEMENT                    | 1,616     |
| ENTERPRISE_LICENSE_AGREEMENT            | 8,574     |
| EXECUTIVE_SEVERANCE_AGREEMENT           | 5,177     |
| FINANCIAL_REPORT_AGREEMENT              | 4,264     |
| HARMFUL_ADVICE                          | 474       |
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT      | 4,116     |
| LOAN_AND_SECURITY_AGREEMENT             | 6,354     |
| MEDICAL_ADVICE                          | 289       |
| MERGER_AGREEMENT                        | 7,079     |
| NDA_AGREEMENT                           | 1,452     |
| NORMAL_TEXT                             | 1,808     |
| PATENT_APPLICATION_FILLINGS_AGREEMENT   | 6,177     |
| PRICE_LIST_AGREEMENT                    | 5,453     |
| SETTLEMENT_AGREEMENT                    | 5,806     |
| SEXUAL_HARRASSMENT                      | 4,750     |



#### Metrics

| Agreement Type                              | precision | recall | f1-score | support |
| ------------------------------------------- | --------- | ------ | -------- | ------- |
| BOARD_MEETING_AGREEMENT                     | 0.93      | 0.95   | 0.94     | 4335    |
| CONSULTING_AGREEMENT                        | 0.72      | 0.98   | 0.84     | 1593    |
| CUSTOMER_LIST_AGREEMENT                     | 0.64      | 0.82   | 0.72     | 4335    |
| DISTRIBUTION_PARTNER_AGREEMENT              | 0.83      | 0.47   | 0.61     | 7231    |
| EMPLOYEE_AGREEMENT                          | 0.78      | 0.92   | 0.85     | 1333    |
| ENTERPRISE_AGREEMENT                        | 0.29      | 0.40   | 0.34     | 1616    |
| ENTERPRISE_LICENSE_AGREEMENT                | 0.88      | 0.79   | 0.83     | 5574    |
| EXECUTIVE_SEVERANCE_AGREEMENT               | 0.92      | 0.85   | 0.89     | 8177    |
| FINANCIAL_REPORT_AGREEMENT                  | 0.89      | 0.98   | 0.93     | 4264    |
| HARMFUL_ADVICE                              | 0.79      | 0.95   | 0.86     | 474     |
| INTERNAL_PRODUCT_ROADMAP_AGREEMENT          | 0.91      | 0.98   | 0.94     | 4116    |
| LOAN_AND_SECURITY_AGREEMENT                 | 0.77      | 0.98   | 0.86     | 6354    |
| MEDICAL_ADVICE                              | 0.81      | 0.99   | 0.89     | 289     |
| MERGER_AGREEMENT                            | 0.89      | 0.77   | 0.83     | 7279    |
| NDA_AGREEMENT                               | 0.70      | 0.57   | 0.62     | 1452    |
| NORMAL_TEXT                                 | 0.79      | 0.97   | 0.87     | 1888    |
| PATENT_APPLICATION_FILLINGS_AGREEMENT       | 0.95      | 0.99   | 0.97     | 6177    |
| PRICE_LIST_AGREEMENT                        | 0.60      | 0.75   | 0.67     | 5565    |
| SETTLEMENT_AGREEMENT                        | 0.82      | 0.54   | 0.65     | 5843    |
| SEXUAL_HARRASSMENT                          | 0.97      | 0.94   | 0.95     | 440     |
|                                             |           |        |          |         |
| accuracy                                    |           |        | 0.79     | 82916   |
| macro avg                                   | 0.79      | 0.83   | 0.80     | 82916   |
| weighted avg                                | 0.83      | 0.81   | 0.81     | 82916   |


#### Results

The model's performance is summarized by per-label precision, recall, and f1-score across all 20 labels. Overall accuracy on the test set is 0.79, with a macro-averaged f1-score of 0.80 and a weighted-average f1-score of 0.81.
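
For reference, a comparable per-label report can be regenerated with scikit-learn. This is a minimal sketch, not the original evaluation script: `texts` and `true_labels` are hypothetical placeholders for a held-out test set, and `tokenizer`, `model`, and `label_encoder` are the objects loaded in the quick-start example above.

```python
# Sketch: reproduce a per-label precision/recall/f1 report on a held-out set.
# `texts` and `true_labels` are hypothetical placeholders for your test data.
import torch
from sklearn.metrics import classification_report

predicted_ids = []
for text in texts:
    encoded = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**encoded).logits
    predicted_ids.append(logits.argmax(dim=-1).item())

# Map class indices back to agreement-type names before scoring
predicted_labels = label_encoder.inverse_transform(predicted_ids)
print(classification_report(true_labels, predicted_labels))
```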