---
license: mit
language:
- en
base_model:
- distilbert/distilbert-base-uncased
pipeline_tag: text-classification
---

# Topic Classifier

This repository contains the Topic Classifier model developed by DAXA.AI. The Topic Classifier is a machine learning model designed to categorize text documents across various domains, such as corporate documents, financial texts, harmful content, and medical documents.

## Model Details

### Model Description

The Topic Classifier is a BERT-based model fine-tuned from `distilbert-base-uncased`. It assigns text to one of four topics: "CORPORATE_DOCUMENTS," "FINANCIAL," "HARMFUL," and "MEDICAL," making it suitable for routing and filtering documents across a range of business use cases.

- **Developed by:** DAXA.AI
- **Funded by:** Open Source
- **Model type:** Text classification
- **Language(s):** English
- **License:** MIT
- **Fine-tuned from:** `distilbert-base-uncased`

### Model Sources

- **Repository:** [https://huggingface.co/daxa-ai/Topic-Classifier-2](https://huggingface.co/daxa-ai/Topic-Classifier-2)
- **Demo:** [https://huggingface.co/spaces/daxa-ai/Topic-Classifier-2](https://huggingface.co/spaces/daxa-ai/Topic-Classifier-2)

## Usage

### How to Get Started with the Model

To use the Topic Classifier in your Python project, you can follow the steps below:

```python
# Import necessary libraries
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import joblib
from huggingface_hub import hf_hub_download

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("daxa-ai/topic-classifier")
model = AutoModelForSequenceClassification.from_pretrained("daxa-ai/topic-classifier")

# Example text
text = "Please enter your text here."
encoded_input = tokenizer(text, return_tensors='pt')

# Run the model without tracking gradients (inference only)
with torch.no_grad():
    output = model(**encoded_input)

# Apply softmax to the logits to obtain class probabilities
probabilities = torch.nn.functional.softmax(output.logits, dim=-1)

# Get the index of the predicted label
predicted_label = torch.argmax(probabilities, dim=-1)

# Hugging Face model repository and label encoder file
REPO_NAME = "daxa-ai/topic-classifier"
LABEL_ENCODER_FILE = "label_encoder.joblib"

# Download and cache the label encoder file from the repository
filename = hf_hub_download(repo_id=REPO_NAME, filename=LABEL_ENCODER_FILE)

# Load the label encoder and decode the predicted index into a label name
label_encoder = joblib.load(filename)
decoded_label = label_encoder.inverse_transform(predicted_label.numpy())

print(decoded_label)
```
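
Running this script should print a one-element array containing the decoded label name, for example `['CORPORATE_DOCUMENTS']` (illustrative output; the actual label depends on your input text).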

## Training Details

### Training Data

The training dataset consists of 29,286 entries, categorized into four distinct labels. The distribution of these labels is presented below:

| Document Type       | Instances |
| ------------------- | --------- |
| CORPORATE_DOCUMENTS | 17,649    |
| FINANCIAL           | 3,385     |
| HARMFUL             | 2,388     |
| MEDICAL             | 5,864     |
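
The corpus is noticeably skewed toward CORPORATE_DOCUMENTS. The card does not say whether any rebalancing was applied during training; if you fine-tune on similarly skewed data, one common option is inverse-frequency class weighting in the loss. A minimal sketch using the counts from the table above (the weighting scheme is an assumption, not the authors' documented setup):

```python
import torch

# Label counts from the training-data table above
counts = {
    "CORPORATE_DOCUMENTS": 17_649,
    "FINANCIAL": 3_385,
    "HARMFUL": 2_388,
    "MEDICAL": 5_864,
}

total = sum(counts.values())
num_labels = len(counts)

# "Balanced" inverse-frequency weights: n_samples / (n_classes * n_count)
weights = torch.tensor([total / (num_labels * c) for c in counts.values()])

# Weighted loss for fine-tuning (hypothetical; the authors' actual
# training configuration is not documented in this card)
loss_fn = torch.nn.CrossEntropyLoss(weight=weights)
```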

### Evaluation

#### Testing Data & Metrics

The model was evaluated on a dataset consisting of 4,565 entries. The distribution of labels in the evaluation set is shown below:

| Document Type       | Instances |
| ------------------- | --------- |
| CORPORATE_DOCUMENTS | 3,051     |
| FINANCIAL           | 409       |
| HARMFUL             | 246       |
| MEDICAL             | 859       |

The evaluation metrics include precision, recall, and F1-score, calculated for each label:

| Document Type       | Precision | Recall | F1-Score | Support |
| ------------------- | --------- | ------ | -------- | ------- |
| CORPORATE_DOCUMENTS | 1.00      | 1.00   | 1.00     | 3,051   |
| FINANCIAL           | 0.95      | 0.96   | 0.96     | 409     |
| HARMFUL             | 0.95      | 0.95   | 0.95     | 246     |
| MEDICAL             | 0.99      | 1.00   | 0.99     | 859     |
| Accuracy            |           |        | 0.99     | 4,565   |
| Macro Avg           | 0.97      | 0.98   | 0.97     | 4,565   |
| Weighted Avg        | 0.99      | 0.99   | 0.99     | 4,565   |
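
The table above follows the layout of scikit-learn's `classification_report`. A minimal sketch of how such a report can be generated, assuming `y_true` and `y_pred` hold the decoded gold and predicted labels for the evaluation set (the short lists below are placeholders):

```python
from sklearn.metrics import classification_report

# Placeholder labels standing in for the 4,565 evaluation entries
y_true = ["CORPORATE_DOCUMENTS", "FINANCIAL", "MEDICAL", "HARMFUL"]
y_pred = ["CORPORATE_DOCUMENTS", "FINANCIAL", "MEDICAL", "HARMFUL"]

# Prints per-label precision/recall/F1 plus accuracy, macro and
# weighted averages, matching the table format above
print(classification_report(y_true, y_pred, digits=2))
```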

#### Test Data Evaluation Results

The model's evaluation results are as follows:

- **Evaluation Loss:** 0.0233
- **Accuracy:** 0.9908
- **Precision:** 0.9909
- **Recall:** 0.9908
- **F1-Score:** 0.9908
- **Evaluation Runtime:** 30.1149 seconds
- **Evaluation Samples Per Second:** 151.586
- **Evaluation Steps Per Second:** 2.391
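
The loss, runtime, and throughput figures above match the dictionary returned by the Hugging Face `Trainer.evaluate()` method, while the accuracy/precision/recall/F1 values look like weighted averages supplied by a custom metrics function. A hedged sketch of a `compute_metrics` callback that would report values in this style (the authors' actual evaluation script is not included in this card):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(eval_pred):
    """Weighted-average metrics in the style of the figures reported above."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```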

#### Inference Code

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline


def model_fn(model_dir):
    """Load the tokenizer and model from the given directory or repository id."""
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForSequenceClassification.from_pretrained(model_dir)
    return model, tokenizer


def predict_fn(data, model_and_tokenizer):
    """Classify `data['inputs']` and return scores for all labels."""
    # Unpack the model and tokenizer
    model, tokenizer = model_and_tokenizer

    # `top_k=None` returns scores for every label
    bert_pipe = pipeline("text-classification", model=model, tokenizer=tokenizer,
                         truncation=True, max_length=512, top_k=None)

    # Truncate the input to the first 512 tokens before passing it on
    tokens = tokenizer.encode(data['inputs'], add_special_tokens=False,
                              max_length=512, truncation=True)
    input_data = tokenizer.decode(tokens)
    return bert_pipe(input_data)
```
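
The `model_fn`/`predict_fn` pair appears to follow the handler convention used by SageMaker-style inference toolkits, so the same code can also be exercised locally. A hypothetical invocation (the input sentence is illustrative):

```python
# Load the handlers defined above with the public repository id
model_and_tokenizer = model_fn("daxa-ai/topic-classifier")

# Returns scores for all four classes thanks to `top_k=None`
result = predict_fn(
    {"inputs": "The patient was prescribed 50 mg of atenolol."},
    model_and_tokenizer,
)
print(result)
```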

## Conclusion

The Topic Classifier achieves high accuracy, precision, recall, and F1-score on its evaluation set, making it a reliable model for categorizing text across corporate documents, financial content, harmful content, and medical texts. It can be deployed directly through the `transformers` pipeline shown above and, per the evaluation run reported here, processes roughly 150 samples per second.

For more information or to try the model yourself, check out the public space [here](https://huggingface.co/spaces/daxa-ai/Topic-Classifier-2).