PavanDeepak/Topic_Classification

BERT-based Text Classification Model

This model is a fine-tuned version of the bert-base-uncased model, specifically adapted for text classification across a diverse set of categories. The model has been trained on a dataset collected from multiple sources, including the News Category Dataset on Kaggle and various other websites.

The model classifies text into the following 12 categories:

Food
Videogames & Shows
Kids and fun
Homestyle
Travel
Health
Charity
Electronics & Technology
Sports
Cultural & Music
Education
Convenience

The model achieves an accuracy of 0.721459, an F1 score of 0.659451, a precision of 0.707620, and a recall of 0.635155.
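The source does not say how the four scores were averaged across classes. As a minimal sketch, assuming single-label predictions and macro averaging (both assumptions, not confirmed by this model card), the metrics could be computed as follows. The labels and predictions are made-up toy data, and `macro_scores` and `accuracy` are hypothetical helper names.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_scores(y_true, y_pred, labels):
    """Return macro-averaged (precision, recall, f1) over the given labels."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    # Macro averaging: every class counts equally, regardless of its size.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data (hypothetical), using three of the category names for illustration.
labels = ["Food", "Travel", "Sports"]
y_true = ["Food", "Food", "Travel", "Sports", "Sports", "Travel"]
y_pred = ["Food", "Travel", "Travel", "Sports", "Food", "Travel"]

acc = accuracy(y_true, y_pred)
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred, labels)
print(f"accuracy={acc:.4f} precision={macro_p:.4f} "
      f"recall={macro_r:.4f} f1={macro_f1:.4f}")
```

With macro averaging, F1 is averaged per class, so it need not equal the harmonic mean of the averaged precision and recall. The reported figures show the same pattern: the harmonic mean of 0.707620 and 0.635155 is roughly 0.669, not 0.659451.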

Model Architecture

The model uses the BertForSequenceClassification architecture. It has been fine-tuned on the dataset described above, with the following key configuration parameters:

Hidden size: 768
Number of attention heads: 12
Number of hidden layers: 12
Max position embeddings: 512
Type vocab size: 2
Vocab size: 30522

The model uses the GELU activation function in its hidden layers and applies dropout with a probability of 0.1 to the attention probabilities to prevent overfitting.
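To make these numbers concrete, here is a minimal sketch that records the configuration in a plain dataclass and derives two quantities from it. `BertEncoderConfig` is a hypothetical container, not a library class; its field names merely mirror the usual Hugging Face `BertConfig` attribute names. `num_labels = 12` is inferred from the category list, and the `gelu` function assumes the exact erf-based form of GELU.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BertEncoderConfig:
    """Hypothetical container for the values listed in this model card."""
    hidden_size: int = 768
    num_attention_heads: int = 12
    num_hidden_layers: int = 12
    max_position_embeddings: int = 512
    type_vocab_size: int = 2
    vocab_size: int = 30522
    hidden_act: str = "gelu"
    attention_probs_dropout_prob: float = 0.1
    num_labels: int = 12  # assumption: one output per listed category

    @property
    def head_dim(self) -> int:
        """Width of each attention head (hidden size split evenly across heads)."""
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads

    def embedding_table_parameters(self) -> int:
        """Weights in the word, position, and token-type embedding tables only."""
        rows = self.vocab_size + self.max_position_embeddings + self.type_vocab_size
        return rows * self.hidden_size


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


config = BertEncoderConfig()
print("head dimension:", config.head_dim)
print("embedding table parameters:", config.embedding_table_parameters())
print("gelu(1.0):", round(gelu(1.0), 4))
```

Each of the 12 attention heads therefore works on a 64-dimensional slice of the 768-dimensional hidden state, and inputs longer than 512 tokens have to be truncated or split before they reach the model.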

Example

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from scipy.special import expit

MODEL = "PavanDeepak/Topic_Classification"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
class_mapping = model.config.id2label  # maps class index -> category name

text = "I love chicken manchuria"
tokens = tokenizer(text, return_tensors="pt")
output = model(**tokens)

# Take the logits for the first (and only) input: one score per category.
scores = output.logits[0].detach().numpy()
# Apply a sigmoid to each logit independently and threshold at 0.5,
# so more than one category can be reported for the same text.
scores = expit(scores)
predictions = (scores >= 0.5) * 1

# Print every category whose score reaches the threshold.
for i in range(len(predictions)):
    if predictions[i]:
        print(class_mapping[i])

Output:

Food
Videogames & Shows
Homestyle
Travel
Health
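The post-processing step in the example (a sigmoid on each logit, then a 0.5 threshold) can be tried without downloading the model. The sketch below uses only the standard library. The logits and the three-entry `id2label` mapping are hypothetical stand-ins for `output.logits[0]` and `model.config.id2label`, and `decode_multilabel` and `decode_top1` are illustrative helper names, not part of any library.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_multilabel(logits, id2label, threshold=0.5):
    """Return every label whose sigmoid score reaches the threshold."""
    return [
        id2label[i]
        for i, logit in enumerate(logits)
        if sigmoid(logit) >= threshold
    ]


def decode_top1(logits, id2label):
    """Return only the single highest-scoring label."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[best]


# Hypothetical values for illustration; real ones come from the model.
id2label = {0: "Food", 1: "Travel", 2: "Sports"}
logits = [2.0, 0.3, -1.5]

multi = decode_multilabel(logits, id2label)
top = decode_top1(logits, id2label)
print("multi-label:", multi)
print("top-1:", top)
```

Thresholding each score independently is why the example prints several categories for one sentence. If a single category per text is wanted, taking the highest-scoring label as in `decode_top1`, or raising the threshold, are two possible alternatives.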