Topic Classifier v2 Added

feat: Push updated Topic Classifier model with eval_loss 0.0233, eval_accuracy 0.9908, eval_f1 0.9908, CORPORATE_DOCUMENTS precision 1.00, FINANCIAL precision 0.95, HARMFUL precision 0.95, MEDICAL precision 0.99, accuracy 0.99, macro avg F1 0.97, weighted avg F1 0.99, support 4565 samples

Files changed (8)

README.md +128 -0
config.json +36 -0
label_encoder.joblib +3 -0
pytorch_model.bin +3 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +55 -0
vocab.txt +0 -0

README.md ADDED

	@@ -0,0 +1,128 @@

+# Topic Classifier
+This repository contains the Topic Classifier model developed by DAXA.AI. The Topic Classifier is a machine learning model that categorizes text documents into four classes: corporate documents, financial texts, harmful content, and medical documents.
+## Model Details
+### Model Description
+The Topic Classifier is a DistilBERT-based model, fine-tuned from `distilbert-base-uncased`. It assigns each input text one of four topic labels: `CORPORATE_DOCUMENTS`, `FINANCIAL`, `HARMFUL`, or `MEDICAL`. It is intended for business use cases across multiple sectors that need text sorted into these categories.
+- **Developed by:** DAXA.AI
+- **Funded by:** Open Source
+- **Model type:** Text classification
+- **Language(s):** English
+- **License:** MIT
+- **Fine-tuned from:** `distilbert-base-uncased`
+### Model Sources
+- **Repository:** [https://huggingface.co/daxa-ai/Topic-Classifier-2](https://huggingface.co/daxa-ai/Topic-Classifier-2)
+- **Demo:** [https://huggingface.co/spaces/daxa-ai/Topic-Classifier-2](https://huggingface.co/spaces/daxa-ai/Topic-Classifier-2)
+## Usage
+### How to Get Started with the Model
+To use the Topic Classifier in your Python project, you can follow the steps below:
+```python
+# Import necessary libraries
+import joblib
+import torch
+from huggingface_hub import hf_hub_download
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+# Hugging Face repository that hosts the model and the label encoder
+REPO_NAME = "daxa-ai/topic-classifier"
+# Path to the label encoder file in the repository
+LABEL_ENCODER_FILE = "label_encoder.joblib"
+
+# Load the tokenizer and model
+tokenizer = AutoTokenizer.from_pretrained(REPO_NAME)
+model = AutoModelForSequenceClassification.from_pretrained(REPO_NAME)
+
+# Example text; truncate inputs longer than the model's 512-token limit
+text = "Please enter your text here."
+encoded_input = tokenizer(text, return_tensors="pt", truncation=True)
+
+# Run inference without tracking gradients
+with torch.no_grad():
+    output = model(**encoded_input)
+
+# Apply softmax to the logits
+probabilities = torch.nn.functional.softmax(output.logits, dim=-1)
+# Get the index of the predicted label
+predicted_label = torch.argmax(probabilities, dim=-1)
+
+# Download and cache the label encoder file;
+# hf_hub_download resolves the file URL and returns the local cached path
+filename = hf_hub_download(repo_id=REPO_NAME, filename=LABEL_ENCODER_FILE)
+# Load the label encoder
+label_encoder = joblib.load(filename)
+# Decode the predicted label index into its class name
+decoded_label = label_encoder.inverse_transform(predicted_label.numpy())
+print(decoded_label)
+```
+## Training Details
+### Training Data
+The training dataset consists of 29,286 entries, categorized into four distinct labels. The distribution of these labels is presented below:
+| Document Type       | Instances |
+| ------------------- | --------- |
+| CORPORATE_DOCUMENTS | 17,649    |
+| FINANCIAL           | 3,385     |
+| HARMFUL             | 2,388     |
+| MEDICAL             | 5,864     |
+### Evaluation
+#### Testing Data & Metrics
+The model was evaluated on a dataset consisting of 4,565 entries. The distribution of labels in the evaluation set is shown below:
+| Document Type       | Instances |
+| ------------------- | --------- |
+| CORPORATE_DOCUMENTS | 3,051     |
+| FINANCIAL           | 409       |
+| HARMFUL             | 246       |
+| MEDICAL             | 859       |
+The evaluation metrics include precision, recall, and F1-score, calculated for each label:
+| Document Type       | Precision | Recall | F1-Score | Support |
+| ------------------- | --------- | ------ | -------- | ------- |
+| CORPORATE_DOCUMENTS | 1.00      | 1.00   | 1.00     | 3,051   |
+| FINANCIAL           | 0.95      | 0.96   | 0.96     | 409     |
+| HARMFUL             | 0.95      | 0.95   | 0.95     | 246     |
+| MEDICAL             | 0.99      | 1.00   | 0.99     | 859     |
+| Accuracy            |           |        | 0.99     | 4,565   |
+| Macro Avg           | 0.97      | 0.98   | 0.97     | 4,565   |
+| Weighted Avg        | 0.99      | 0.99   | 0.99     | 4,565   |
+#### Test Data Evaluation Results
+The model's evaluation results are as follows:
+- **Evaluation Loss:** 0.0233
+- **Accuracy:** 0.9908
+- **Precision:** 0.9909
+- **Recall:** 0.9908
+- **F1-Score:** 0.9908
+- **Evaluation Runtime:** 30.1149 seconds
+- **Evaluation Samples Per Second:** 151.586
+- **Evaluation Steps Per Second:** 2.391
+## Conclusion
+The Topic Classifier reaches 0.99 accuracy and a weighted-average F1-score of 0.99 on the 4,565-sample evaluation set, which makes it a reliable model for categorizing corporate documents, financial content, harmful content, and medical texts on data that resembles this set. FINANCIAL and HARMFUL are the weakest classes (F1-scores of 0.96 and 0.95) and also the smallest. Evaluation ran at roughly 152 samples per second.
+For more information or to try the model yourself, check out the public space [here](https://huggingface.co/spaces/daxa-ai/Topic-Classifier-2).
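
The figures reported in the README can be cross-checked against each other with a few lines of arithmetic. The sketch below is not part of the commit: it copies the per-class F1-scores, supports, and runtime figures from the README and recomputes the macro average, the weighted average, and the throughput. All variable names are illustrative.

```python
# Per-class (F1-score, support) pairs copied from the README's evaluation table.
report = {
    "CORPORATE_DOCUMENTS": (1.00, 3051),
    "FINANCIAL": (0.96, 409),
    "HARMFUL": (0.95, 246),
    "MEDICAL": (0.99, 859),
}

total_support = sum(support for _, support in report.values())

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in report.values()) / total_support

# Runtime figures copied from the "Test Data Evaluation Results" list.
eval_runtime_s = 30.1149
steps_per_second = 2.391

samples_per_second = total_support / eval_runtime_s
eval_steps = round(steps_per_second * eval_runtime_s)

print(f"support={total_support}")
print(f"macro F1={macro_f1:.3f}, weighted F1={weighted_f1:.3f}")
print(f"samples/s={samples_per_second:.1f}, eval steps={eval_steps}")
```

Because the inputs are the rounded per-class values, the recomputed macro average can differ in the last digit from the reported 0.97, which was presumably derived from unrounded scores. The recovered step count would be consistent with an evaluation batch size of 64, although the README does not state the batch size.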

config.json ADDED

	@@ -0,0 +1,36 @@

+{
+  "_name_or_path": "distilbert-base-uncased",
+  "activation": "gelu",
+  "architectures": [
+    "DistilBertForSequenceClassification"
+  ],
+  "attention_dropout": 0.1,
+  "dim": 768,
+  "dropout": 0.1,
+  "hidden_dim": 3072,
+  "id2label": {
+    "0": "CORPORATE_DOCUMENTS",
+    "1": "FINANCIAL",
+    "2": "HARMFUL",
+    "3": "MEDICAL"
+  },
+  "initializer_range": 0.02,
+  "label2id": {
+    "CORPORATE_DOCUMENTS": 0,
+    "FINANCIAL": 1,
+    "HARMFUL": 2,
+    "MEDICAL": 3
+  },
+  "max_position_embeddings": 512,
+  "model_type": "distilbert",
+  "n_heads": 12,
+  "n_layers": 6,
+  "pad_token_id": 0,
+  "qa_dropout": 0.1,
+  "seq_classif_dropout": 0.2,
+  "sinusoidal_pos_embds": false,
+  "tie_weights_": true,
+  "torch_dtype": "float32",
+  "transformers_version": "4.45.1",
+  "vocab_size": 30522
+}
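
Since `config.json` carries its own `id2label` mapping, a predicted class index can be turned into a label name without downloading `label_encoder.joblib`, assuming the encoder uses the same index order (the commit does not confirm this). The sketch below hard-codes the mapping from the config and uses made-up logits; `softmax`, `decode`, and `example_logits` are hypothetical names for illustration.

```python
import math

# Mapping copied from the "id2label" field of config.json.
ID2LABEL = {
    0: "CORPORATE_DOCUMENTS",
    1: "FINANCIAL",
    2: "HARMFUL",
    3: "MEDICAL",
}


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Hypothetical logits for a single input, invented for illustration.
example_logits = [0.3, 4.1, -1.2, 0.8]
label, confidence = decode(example_logits)
print(label, round(confidence, 3))
```

With a model loaded through `transformers`, the same lookup is available as `model.config.id2label[index]`.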

label_encoder.joblib ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ecc34413f18d00dd522f2996ce202a485c39fc1e0def340590a6469914332400
+size 582
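
The three lines above are a Git LFS pointer, not the file itself: they record the spec version, the SHA-256 digest of the real content, and its size in bytes. As a minimal sketch, the pointer can be parsed as space-separated key/value lines; `parse_lfs_pointer` is a hypothetical helper, not part of any library.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its fields."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<hash algorithm>:<hex digest>".
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer content copied from the label_encoder.joblib diff.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:ecc34413f18d00dd522f2996ce202a485c39fc1e0def340590a6469914332400
size 582
"""

pointer = parse_lfs_pointer(POINTER)
print(pointer["hash_algorithm"], pointer["size_bytes"])
```

After downloading the real file, its SHA-256 digest and byte length can be compared against these fields to confirm the download is intact.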

pytorch_model.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:01349bd229a507099512340ff61bf05d9a05fc96556d78f49f9338025ff60fa7
+size 267860714

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,55 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "do_lower_case": true,
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "DistilBertTokenizer",
+  "unk_token": "[UNK]"
+}
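
The special-token IDs and `model_max_length` in this config determine how a single input sequence is laid out before it reaches the model. The sketch below imitates that layout in plain Python for illustration only; in practice the tokenizer does this itself. `build_input` and the sample token IDs are hypothetical.

```python
# Special-token IDs and length limit copied from tokenizer_config.json.
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102
MODEL_MAX_LENGTH = 512


def build_input(token_ids, pad_to=None):
    """Lay out WordPiece IDs as [CLS] ... [SEP], truncated to the model limit."""
    # Reserve two positions for [CLS] and [SEP].
    body = token_ids[: MODEL_MAX_LENGTH - 2]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    # Optionally right-pad with [PAD]; padded positions get mask 0.
    if pad_to is not None and pad_to > len(input_ids):
        padding = pad_to - len(input_ids)
        input_ids += [PAD_ID] * padding
        attention_mask += [0] * padding
    return input_ids, attention_mask


# Hypothetical WordPiece IDs, invented for illustration.
ids, mask = build_input([2023, 2003, 1037, 3231], pad_to=8)
long_ids, _ = build_input(list(range(1000, 1600)))
print(ids, mask, len(long_ids))
```

Inputs longer than 510 content tokens lose their tail under this scheme, which matters for long documents: only the first part of the text influences the predicted topic.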

vocab.txt ADDED

The diff for this file is too large to render. See raw diff