Update config.json with consistent label names

#4
Files changed (2)
  1. README.md +181 -0
  2. config.json +39 -38
README.md CHANGED
@@ -1,3 +1,184 @@
  ---
  license: mit
+ language:
+ - en
  ---
+
+ # Model Card for Pebblo Classifier
+
+ This model card describes the Pebblo Classifier, a text classification model developed by DAXA.AI. It categorizes agreement documents found within organizations and was trained on 20 distinct labels.
+
+ ## Model Details
+
+ ### Model Description
+
+ The Pebblo Classifier is a DistilBERT-based model, fine-tuned from distilbert-base-uncased, targeting RAG (Retrieval-Augmented Generation) applications. It classifies text into categories such as "BOARD_MEETING_AGREEMENT" and "CONSULTING_AGREEMENT", streamlining document classification.
+
+ - **Developed by:** DAXA.AI
+ - **Funded by:** Open Source
+ - **Model type:** Classification model
+ - **Language(s) (NLP):** English
+ - **License:** MIT
+ - **Finetuned from model:** distilbert-base-uncased
+
+ ### Model Sources
+
+ - **Repository:** [https://huggingface.co/daxa-ai/pebblo-classifier](https://huggingface.co/daxa-ai/pebblo-classifier?text=I+like+you.+I+love+you)
+ - **Demo:** [https://huggingface.co/spaces/daxa-ai/Daxa-Classifier](https://huggingface.co/spaces/daxa-ai/Daxa-Classifier)
+
+ ## Uses
+
+ ### Intended Use
+
+ The model is designed for direct use in document classification and can be deployed without additional fine-tuning.
+
+ ### Recommendations
+
+ End users should be aware of the model's potential biases and limitations, such as the uneven per-label scores reported under Evaluation, before relying on its predictions.
+
+ ## How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ ```python
+ # Import necessary libraries
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+ import joblib
+ from huggingface_hub import hf_hub_download
+
+ # Load the tokenizer and model
+ tokenizer = AutoTokenizer.from_pretrained("daxa-ai/pebblo-classifier")
+ model = AutoModelForSequenceClassification.from_pretrained("daxa-ai/pebblo-classifier")
+
+ # Example text (truncated to the model's maximum input length if too long)
+ text = "Please enter your text here."
+ encoded_input = tokenizer(text, return_tensors='pt', truncation=True)
+ output = model(**encoded_input)
+
+ # Apply softmax to the logits
+ probabilities = torch.nn.functional.softmax(output.logits, dim=-1)
+
+ # Get the index of the most probable label
+ predicted_label = torch.argmax(probabilities, dim=-1)
+
+ # Name of the Hugging Face model repository
+ REPO_NAME = "daxa-ai/pebblo-classifier"
+
+ # Path to the label encoder file in the repository
+ LABEL_ENCODER_FILE = "label encoder.joblib"
+
+ # Download and cache the label encoder file
+ # (hf_hub_download replaces the deprecated hf_hub_url + cached_download pair)
+ filename = hf_hub_download(repo_id=REPO_NAME, filename=LABEL_ENCODER_FILE)
+
+ # Load the label encoder
+ label_encoder = joblib.load(filename)
+
+ # Decode the predicted label
+ decoded_label = label_encoder.inverse_transform(predicted_label.numpy())
+
+ print(decoded_label)
+
+ ```
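The softmax and argmax steps in the snippet above can be sketched without any model download. The following is a minimal illustration in plain Python, not the model's own code: the logits are made up, `ID2LABEL` is a three-entry excerpt of the `id2label` mapping this PR adds to config.json, and `softmax` and `predict_label` are hypothetical helper names. Whether the `id2label` ordering matches the label encoder file is an assumption the source does not confirm.

```python
import math

# Excerpt of the id2label mapping added to config.json in this PR.
ID2LABEL = {
    0: "BOARD_MEETING_AGREEMENT",
    1: "CONSULTING_AGREEMENT",
    2: "CUSTOMER_LIST_AGREEMENT",
}


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits, id2label):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return id2label[best], probs[best]


# Hypothetical logits for the three-label excerpt above.
label, confidence = predict_label([0.2, 2.5, -1.0], ID2LABEL)
print(label, round(confidence, 3))
```

With a loaded transformers model, the same lookup is available through `model.config.id2label`, which is keyed by integer class index.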
+
+ ## Training Details
+
+ ### Training Data
+
+ The training dataset consists of 131,771 entries, with 20 unique labels. The labels span various document types, with instances distributed across three text sizes (128 ± x, 256 ± x, and 512 ± x words; x varies within 20).
+ Here are the labels along with their respective counts in the dataset:
+
+ | Agreement Type | Instances |
+ | --------------------------------------- | --------- |
+ | BOARD_MEETING_AGREEMENT | 4,225 |
+ | CONSULTING_AGREEMENT | 2,965 |
+ | CUSTOMER_LIST_AGREEMENT | 9,000 |
+ | DISTRIBUTION_PARTNER_AGREEMENT | 8,339 |
+ | EMPLOYEE_AGREEMENT | 3,921 |
+ | ENTERPRISE_AGREEMENT | 3,820 |
+ | ENTERPRISE_LICENSE_AGREEMENT | 9,000 |
+ | EXECUTIVE_SEVERANCE_AGREEMENT | 9,000 |
+ | FINANCIAL_REPORT_AGREEMENT | 8,381 |
+ | HARMFUL_ADVICE | 2,025 |
+ | INTERNAL_PRODUCT_ROADMAP_AGREEMENT | 7,037 |
+ | LOAN_AND_SECURITY_AGREEMENT | 9,000 |
+ | MEDICAL_ADVICE | 2,359 |
+ | MERGER_AGREEMENT | 7,706 |
+ | NDA_AGREEMENT | 2,966 |
+ | NORMAL_TEXT | 6,742 |
+ | PATENT_APPLICATION_FILLINGS_AGREEMENT | 9,000 |
+ | PRICE_LIST_AGREEMENT | 9,000 |
+ | SETTLEMENT_AGREEMENT | 9,000 |
+ | SEXUAL_HARRASSMENT | 8,321 |
+
+
+
+ ## Evaluation
+
+ ### Testing Data & Metrics
+
+ #### Testing Data
+ Evaluation was performed on a dataset of 82,917 entries with a temperature range of 1-1.25 for randomness.
+ Here are the labels along with their respective counts in the dataset:
+
+ | Agreement Type | Instances |
+ | --------------------------------------- | --------- |
+ | BOARD_MEETING_AGREEMENT | 4,335 |
+ | CONSULTING_AGREEMENT | 1,533 |
+ | CUSTOMER_LIST_AGREEMENT | 4,995 |
+ | DISTRIBUTION_PARTNER_AGREEMENT | 7,231 |
+ | EMPLOYEE_AGREEMENT | 1,433 |
+ | ENTERPRISE_AGREEMENT | 1,616 |
+ | ENTERPRISE_LICENSE_AGREEMENT | 8,574 |
+ | EXECUTIVE_SEVERANCE_AGREEMENT | 5,177 |
+ | FINANCIAL_REPORT_AGREEMENT | 4,264 |
+ | HARMFUL_ADVICE | 474 |
+ | INTERNAL_PRODUCT_ROADMAP_AGREEMENT | 4,116 |
+ | LOAN_AND_SECURITY_AGREEMENT | 6,354 |
+ | MEDICAL_ADVICE | 289 |
+ | MERGER_AGREEMENT | 7,079 |
+ | NDA_AGREEMENT | 1,452 |
+ | NORMAL_TEXT | 1,808 |
+ | PATENT_APPLICATION_FILLINGS_AGREEMENT | 6,177 |
+ | PRICE_LIST_AGREEMENT | 5,453 |
+ | SETTLEMENT_AGREEMENT | 5,806 |
+ | SEXUAL_HARRASSMENT | 4,750 |
+
+
+
+ #### Metrics
+
+ | Agreement Type | precision | recall | f1-score | support |
+ | ------------------------------------------- | --------- | ------ | -------- | ------- |
+ | BOARD_MEETING_AGREEMENT | 0.93 | 0.95 | 0.94 | 4335 |
+ | CONSULTING_AGREEMENT | 0.72 | 0.98 | 0.84 | 1593 |
+ | CUSTOMER_LIST_AGREEMENT | 0.64 | 0.82 | 0.72 | 4335 |
+ | DISTRIBUTION_PARTNER_AGREEMENT | 0.83 | 0.47 | 0.61 | 7231 |
+ | EMPLOYEE_AGREEMENT | 0.78 | 0.92 | 0.85 | 1333 |
+ | ENTERPRISE_AGREEMENT | 0.29 | 0.40 | 0.34 | 1616 |
+ | ENTERPRISE_LICENSE_AGREEMENT | 0.88 | 0.79 | 0.83 | 5574 |
+ | EXECUTIVE_SERVICE_AGREEMENT | 0.92 | 0.85 | 0.89 | 8177 |
+ | FINANCIAL_REPORT_AGREEMENT | 0.89 | 0.98 | 0.93 | 4264 |
+ | HARMFUL_ADVICE | 0.79 | 0.95 | 0.86 | 474 |
+ | INTERNAL_PRODUCT_ROADMAP_AGREEMENT | 0.91 | 0.98 | 0.94 | 4116 |
+ | LOAN_AND_SECURITY_AGREEMENT | 0.77 | 0.98 | 0.86 | 6354 |
+ | MEDICAL_ADVICE | 0.81 | 0.99 | 0.89 | 289 |
+ | MERGER_AGREEMENT | 0.89 | 0.77 | 0.83 | 7279 |
+ | NDA_AGREEMENT | 0.70 | 0.57 | 0.62 | 1452 |
+ | NORMAL_TEXT | 0.79 | 0.97 | 0.87 | 1888 |
+ | PATENT_APPLICATION_FILLINGS_AGREEMENT | 0.95 | 0.99 | 0.97 | 6177 |
+ | PRICE_LIST_AGREEMENT | 0.60 | 0.75 | 0.67 | 5565 |
+ | SETTLEMENT_AGREEMENT | 0.82 | 0.54 | 0.65 | 5843 |
+ | SEXUAL_HARASSMENT | 0.97 | 0.94 | 0.95 | 440 |
+ | | | | | |
+ | accuracy | | | 0.79 | 82916 |
+ | macro avg | 0.79 | 0.83 | 0.80 | 82916 |
+ | weighted avg | 0.83 | 0.81 | 0.81 | 82916 |
+
+
+ #### Results
+
+ The table above reports precision, recall, and f1-score for each of the 20 labels. Accuracy on the full test set is 0.79. The macro-averaged f1-score is 0.80 (precision 0.79, recall 0.83), and the support-weighted f1-score is 0.81 (precision 0.83, recall 0.81).
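To make the difference between the macro and weighted averages concrete, here is a minimal sketch that recomputes both from the per-label f1-scores and supports in the Metrics table. The helper names `macro_average` and `weighted_average` are illustrative, not part of any library. Because the table values are rounded to two decimals, the recomputed averages may differ slightly from the summary rows.

```python
# (label, f1-score, support) rows copied from the Metrics table.
ROWS = [
    ("BOARD_MEETING_AGREEMENT", 0.94, 4335),
    ("CONSULTING_AGREEMENT", 0.84, 1593),
    ("CUSTOMER_LIST_AGREEMENT", 0.72, 4335),
    ("DISTRIBUTION_PARTNER_AGREEMENT", 0.61, 7231),
    ("EMPLOYEE_AGREEMENT", 0.85, 1333),
    ("ENTERPRISE_AGREEMENT", 0.34, 1616),
    ("ENTERPRISE_LICENSE_AGREEMENT", 0.83, 5574),
    ("EXECUTIVE_SERVICE_AGREEMENT", 0.89, 8177),
    ("FINANCIAL_REPORT_AGREEMENT", 0.93, 4264),
    ("HARMFUL_ADVICE", 0.86, 474),
    ("INTERNAL_PRODUCT_ROADMAP_AGREEMENT", 0.94, 4116),
    ("LOAN_AND_SECURITY_AGREEMENT", 0.86, 6354),
    ("MEDICAL_ADVICE", 0.89, 289),
    ("MERGER_AGREEMENT", 0.83, 7279),
    ("NDA_AGREEMENT", 0.62, 1452),
    ("NORMAL_TEXT", 0.87, 1888),
    ("PATENT_APPLICATION_FILLINGS_AGREEMENT", 0.97, 6177),
    ("PRICE_LIST_AGREEMENT", 0.67, 5565),
    ("SETTLEMENT_AGREEMENT", 0.65, 5843),
    ("SEXUAL_HARASSMENT", 0.95, 440),
]


def macro_average(rows):
    """Unweighted mean: every label counts equally, however rare."""
    return sum(score for _, score, _ in rows) / len(rows)


def weighted_average(rows):
    """Mean weighted by support: frequent labels dominate the result."""
    total_support = sum(support for _, _, support in rows)
    return sum(score * support for _, score, support in rows) / total_support


macro_f1 = macro_average(ROWS)
weighted_f1 = weighted_average(ROWS)
print(f"macro f1:    {macro_f1:.2f}")
print(f"weighted f1: {weighted_f1:.2f}")
```

The macro average is the more sensitive of the two to weak small classes such as ENTERPRISE_AGREEMENT, while the weighted average tracks performance on the labels that occur most often.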
config.json CHANGED
@@ -9,53 +9,53 @@
  "dropout": 0.1,
  "hidden_dim": 3072,
  "id2label": {
- "0": "Board meeting",
- "1": "Consulting Agreement",
- "2": "Customer List",
- "3": "Distribution/Partner Agreement",
- "4": "Enterprise License Agreement",
- "5": "Executive Severance Agreement",
- "6": "Financial Report",
+ "0": "BOARD_MEETING_AGREEMENT",
+ "1": "CONSULTING_AGREEMENT",
+ "2": "CUSTOMER_LIST_AGREEMENT",
+ "3": "DISTRIBUTION_PARTNER_AGREEMENT",
+ "4": "ENTERPRISE_LICENSE_AGREEMENT",
+ "5": "EXECUTIVE_SEVERANCE_AGREEMENT",
+ "6": "FINANCIAL_REPORT_AGREEMENT",
  "7": "HARMFUL_ADVICE",
- "8": "Internal Use Only",
- "9": "Loan and security Agreement",
+ "8": "INTERNAL_USE_ONLY_AGREEMENT",
+ "9": "LOAN_AND_SECURITY_AGREEMENT",
  "10": "MEDICAL_ADVICE",
- "11": "Merger Agreement",
- "12": "NDA",
+ "11": "MERGER_AGREEMENT",
+ "12": "NDA_AGREEMENT",
  "13": "NORMAL_TEXT",
- "14": "Patent Application Fillings",
- "15": "Price list",
- "16": "Secret Sauce",
- "17": "Security Breach",
- "18": "Settlement Agreement",
- "19": "Sexual Harrassment",
- "20": "employee agreement",
- "21": "enterprise agreement"
+ "14": "PATENT_APPLICATION_FILLINGS_AGREEMENT",
+ "15": "PRICE_LIST_AGREEMENT",
+ "16": "SECRET_SAUCE_AGREEMENT",
+ "17": "SECURITY_BREACH_AGREEMENT",
+ "18": "SETTLEMENT_AGREEMENT",
+ "19": "SEXUAL_HARRASSMENT_AGREEMENT",
+ "20": "EMPLOYEE_AGREEMENT",
+ "21": "ENTERPRISE_AGREEMENT"
  },
  "initializer_range": 0.02,
  "label2id": {
- "Board meeting": 0,
- "Consulting Agreement": 1,
+ "BOARD_MEETING_AGREEMENT": 0,
+ "CONSULTING_AGREEMENT": 1,
  "MEDICAL_ADVICE": 10,
- "Merger Agreement": 11,
- "NDA": 12,
+ "MERGER_AGREEMENT": 11,
+ "NDA_AGREEMENT": 12,
  "NORMAL_TEXT": 13,
- "Patent Application Fillings": 14,
- "Price list": 15,
- "Secret Sauce": 16,
- "Security Breach": 17,
- "Settlement Agreement": 18,
- "Sexual Harrassment": 19,
- "Customer List": 2,
- "employee agreement": 20,
- "enterprise agreement": 21,
- "Distribution/Partner Agreement": 3,
- "Enterprise License Agreement": 4,
- "Executive Severance Agreement": 5,
- "Financial Report": 6,
+ "PATENT_APPLICATION_FILLINGS_AGREEMENT": 14,
+ "PRICE_LIST_AGREEMENT": 15,
+ "SECRET_SAUCE_AGREEMENT": 16,
+ "SECURITY_BREACH_AGREEMENT": 17,
+ "SETTLEMENT_AGREEMENT": 18,
+ "SEXUAL_HARRASSMENT_AGREEMENT": 19,
+ "CUSTOMER_LIST_AGREEMENT": 2,
+ "EMPLOYEE_AGREEMENT": 20,
+ "ENTERPRISE_AGREEMENT": 21,
+ "DISTRIBUTION_PARTNER_AGREEMENT": 3,
+ "ENTERPRISE_LICENSE_AGREEMENT": 4,
+ "EXECUTIVE_SEVERANCE_AGREEMENT": 5,
+ "FINANCIAL_REPORT_AGREEMENT": 6,
  "HARMFUL_ADVICE": 7,
- "Internal Use Only": 8,
- "Loan and security Agreement": 9
+ "INTERNAL_USE_ONLY_AGREEMENT": 8,
+ "LOAN_AND_SECURITY_AGREEMENT": 9
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
@@ -70,3 +70,4 @@
  "transformers_version": "4.36.2",
  "vocab_size": 30522
  }
+
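Since this change renames every entry in both `id2label` and `label2id`, a quick check that the two maps remain exact inverses can catch a missed or mistyped rename. The sketch below is one way to do that with the standard library only. `CONFIG_EXCERPT` holds four entries taken from the updated config.json, and `find_label_map_mismatches` is a hypothetical helper name, not part of transformers.

```python
import json

# Four entries taken from the updated config.json; the real file has 22.
CONFIG_EXCERPT = """
{
  "id2label": {
    "7": "HARMFUL_ADVICE",
    "10": "MEDICAL_ADVICE",
    "12": "NDA_AGREEMENT",
    "13": "NORMAL_TEXT"
  },
  "label2id": {
    "HARMFUL_ADVICE": 7,
    "MEDICAL_ADVICE": 10,
    "NDA_AGREEMENT": 12,
    "NORMAL_TEXT": 13
  }
}
"""


def find_label_map_mismatches(config):
    """Return (id, label) pairs where id2label and label2id disagree."""
    # JSON object keys are strings, so convert ids to int before comparing.
    id2label = {int(key): label for key, label in config["id2label"].items()}
    label2id = config["label2id"]

    mismatches = []
    for idx, label in id2label.items():
        if label2id.get(label) != idx:
            mismatches.append((idx, label))
    for label, idx in label2id.items():
        if id2label.get(idx) != label and (idx, label) not in mismatches:
            mismatches.append((idx, label))
    return mismatches


mismatches = find_label_map_mismatches(json.loads(CONFIG_EXCERPT))
print("consistent" if not mismatches else f"mismatches: {mismatches}")
```

To check the full file, replace the excerpt with the parsed contents of config.json.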