---
language:
- en
tags:
- bert
- text-classification
- advertisements
license: apache-2.0
datasets:
- custom
---

## Kaleemullah/bert-base-uncased-ad-nonad-classifier

### Model Description

This model is a fine-tuned version of `bert-base-uncased`, specifically tailored for distinguishing between advertising (ad) and non-advertising (non-ad) text content. It is designed to understand the nuances and language patterns that differentiate promotional content from other types of text.

### Intended Use

- **Primary Use Case:** Text classification, specifically identifying whether a given piece of text is an advertisement or not.
- **Out-of-Scope Use Cases:** This model is not intended for understanding context beyond the binary classification of ads vs. non-ads. It should not be used for complex natural language understanding tasks like sentiment analysis, question-answering, etc.

### Training Data

The model was trained on a balanced dataset consisting of 40,000 examples, with 20,000 ads and 20,000 non-ads. Each text entry was preprocessed and tokenized using the BERT tokenizer.

### Training Procedure

- **Preprocessing:** Text entries were tokenized using `BertTokenizer` with a maximum length of 512 tokens.
- **Fine-Tuning:** The model was fine-tuned on the preprocessed data for 3 epochs using the Hugging Face `transformers` Trainer API.
- **Evaluation Metrics:** The model's performance was evaluated based on accuracy, precision, recall, and F1-score.
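
For readers unfamiliar with the four evaluation metrics named above, the following is a minimal sketch of how they can be computed from true and predicted labels. It is not the author's evaluation code: the helper name `binary_metrics` and the toy labels are hypothetical, and the sketch assumes label `1` means "Ad" and `0` means "Non-Ad".

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels.

    Hypothetical helper for illustration; assumes `positive` marks the "Ad" class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # ads predicted as ads
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # non-ads predicted as ads
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # ads predicted as non-ads
    tn = sum(1 for t, p in pairs if t != positive and p != positive)  # non-ads predicted as non-ads

    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only (not the model's test data).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision answers "of the texts flagged as ads, how many really were ads", while recall answers "of the real ads, how many were flagged"; F1 is their harmonic mean.
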
### Performance

The model achieved the following metrics on the test dataset:

- Accuracy: 99.71%
- Precision: 99.76%
- Recall: 99.67%
- F1-score: 99.72%

Note: This model currently overfits on one non-ad category and will be updated soon.

### How to Use

```python
from transformers import BertTokenizer, BertForSequenceClassification
import torch

model_name = "Kaleemullah/bert-base-uncased-ad-nonad-classifier"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name)

def predict(text):
    # Tokenize a single string, truncating to the 512-token maximum
    inputs = tokenizer(text, padding=True, truncation=True, max_length=512, return_tensors="pt")
    # Inference only, so gradient tracking is not needed
    with torch.no_grad():
        outputs = model(**inputs)
    # Pick the highest-scoring class; label 1 is "Ad", anything else is "Non-Ad"
    prediction = torch.argmax(outputs.logits, dim=-1).item()
    return "Ad" if prediction == 1 else "Non-Ad"

# Example
print(predict("Your example text here"))
```