Kaleemullah/paraphrase-mpnet-ad-classifier

The model "Kaleemullah/paraphrase-mpnet-ad-classifier" is an adaptation of the SetFit model, an architecture designed for text classification tasks. Its distinctive feature is its use of few-shot learning techniques, which lets it perform well with limited training data. The architecture and training process consist of two stages:

Fine-tuning of a Sentence Transformer: At the core of the model is a Sentence Transformer, fine-tuned with a contrastive learning approach. This step improves the transformer's ability to produce semantically rich sentence embeddings, which is critical for accurate classification in downstream tasks.
Training of a Classification Head: After fine-tuning, a classification head is trained on the embeddings produced by the fine-tuned Sentence Transformer. Its role is to map those embeddings to specific classes.
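The contrastive stage relies on sentence pairs built from a small number of labeled examples. The sketch below shows one simplified way such pairs could be formed: same-class pairs receive a target of 1.0 and cross-class pairs a target of 0.0. It is an illustration only. The function name `make_contrastive_pairs`, the toy texts, and the label convention (1 = ad, 0 = non-ad) are assumptions, and SetFit's own pair sampling differs in detail.

```python
from itertools import combinations


def make_contrastive_pairs(texts, labels):
    """Build (text_a, text_b, target) triples from labeled examples.

    target is 1.0 when both texts share a label (positive pair)
    and 0.0 otherwise (negative pair).
    """
    if len(texts) != len(labels):
        raise ValueError("texts and labels must have the same length")
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(zip(texts, labels), 2):
        target = 1.0 if label_a == label_b else 0.0
        pairs.append((text_a, text_b, target))
    return pairs


# Toy examples; hypothetical label convention: 1 = ad, 0 = non-ad.
texts = [
    "Buy one, get one free this weekend only!",
    "Limited offer: 50% off all running shoes.",
    "The council met on Tuesday to discuss the budget.",
    "I went hiking with my brother yesterday.",
]
labels = [1, 1, 0, 0]

pairs = make_contrastive_pairs(texts, labels)
for text_a, text_b, target in pairs:
    print(f"{target:.1f} | {text_a} | {text_b}")
```

Even four labeled texts yield six training pairs, which is why a pair-based objective can make good use of a small labeled set.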

This two-stage design suits text classification scenarios where labeled data is limited. It keeps training efficient and makes the model applicable to real-world settings in which large labeled datasets are not readily available.

Training Data

The training dataset for this text classification model is designed to differentiate between ad and non-ad texts. It comprises a balanced set of 2000 ad instances and 2000 non-ad instances. The dataset is composed as follows:

Ads Data

Web-Scraped Ads: The primary source of ad data involves scraping textual advertisements from various websites. These ads represent a diverse range of internet advertising styles and formats.
Data Augmentation via ChatGPT-3.5 Turbo: To enhance the dataset, original scraped ads were used as a basis for generating additional ad content. This was accomplished using OpenAI's ChatGPT-3.5 Turbo, which created new ads by mimicking the structure and features of the scraped ads. This method ensured the inclusion of varied and realistic ad content in the training data.
E-commerce Listings: Listings from various e-commerce platforms were also included in the ad dataset. These listings are treated as ads because they are promotional in nature and showcase products directly.

Non-Ads Data

Open Source Datasets: The non-ads portion of the dataset encompasses a variety of text sources to provide a contrast to ad-like text structures. This includes:

Tweets from various open-source collections.
Articles from BBC News.
Narratives and stories from open-source human story databases.
Conversations derived from human interaction datasets.
Transcriptions of YouTube videos, encompassing a wide range of topics and speaking styles.

Data Balance and Diversity

The dataset maintains an equal balance between ad and non-ad texts, with 2000 instances in each category. This balance ensures that the model is not biased towards any particular class during training. Additionally, the diversity in the data sources for both ads and non-ads contributes to the robustness of the model, enabling it to effectively distinguish between the two categories in varied contexts.
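A balance check like the one described can be sketched in a few lines. The example below rebuilds only the label counts stated above (2000 per class) and computes each class's share. The function name `class_balance` and the string labels "ad" and "non-ad" are assumptions for illustration, not taken from the actual dataset files.

```python
from collections import Counter


def class_balance(labels):
    """Return the fraction of examples belonging to each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Label counts as stated in this card; the label strings are hypothetical.
labels = ["ad"] * 2000 + ["non-ad"] * 2000

balance = class_balance(labels)
print(balance)
```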

Usage

To use this model for inference, first install the SetFit library:

python -m pip install setfit

You can then run inference as follows:

from setfit import SetFitModel

# Download the model from the Hugging Face Hub
model = SetFitModel.from_pretrained("Kaleemullah/paraphrase-mpnet-ad-classifier")
# Run inference
preds = model(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])
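Once predictions are available, a common next step is to group the input texts by predicted label. The sketch below does this with a helper that accepts any callable returning one prediction per text, so the loaded model could be passed in its place. The helper name `group_by_prediction` and the keyword-based `keyword_stub` predictor are hypothetical stand-ins used so the example runs without downloading the model. The stub's labels (1 = ad, 0 = non-ad) are an assumption and say nothing about the label values the real model returns.

```python
def group_by_prediction(texts, predict):
    """Group texts by the label that `predict` assigns to each one."""
    preds = predict(texts)
    groups = {}
    for text, pred in zip(texts, preds):
        # Tensor or NumPy scalars expose .item(); plain Python values do not.
        key = pred.item() if hasattr(pred, "item") else pred
        groups.setdefault(key, []).append(text)
    return groups


def keyword_stub(texts):
    """Hypothetical stand-in predictor: 1 if a sales keyword appears, else 0."""
    keywords = ("sale", "buy")
    return [1 if any(k in t.lower() for k in keywords) else 0 for t in texts]


texts = [
    "Huge sale: buy two shirts, get one free!",
    "We walked along the river after lunch.",
]

groups = group_by_prediction(texts, keyword_stub)
print(groups)
```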

Evaluation Results

The model's performance was evaluated on unseen data to assess its reliability in real-world applications. The key metric is outlined below:

Accuracy: The model achieved an accuracy of 99.25% on the validation set, meaning it classified nearly all validation texts correctly as ads or non-ads.
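Accuracy here is the fraction of validation texts whose predicted label matches the true label. The sketch below computes it on a small toy example. The labels shown are made up for illustration and are not the validation data behind the reported figure.

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Toy labels only (hypothetical convention: 1 = ad, 0 = non-ad).
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

score = accuracy(y_true, y_pred)
print(f"Accuracy: {score:.2%}")
```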

Limitations and Bias

While the model exhibits high accuracy, it is important to acknowledge its limitations and potential areas of bias:

Data Diversity: The model's performance is contingent on the diversity of the training data. It may not perform as effectively on ad or non-ad texts that significantly deviate from the styles and formats present in the training dataset.
Contextual Limitations: The model might struggle with texts where the distinction between ads and non-ads is subtle or context-dependent.
Language and Cultural Bias: Since the training data predominantly consists of English texts, the model may not generalize well to ads and non-ads in other languages or cultural contexts.

Ethical Considerations

Misuse Potential: There is a potential for misuse in applications where the distinction between promotional content and genuine information is critical, such as in news dissemination or educational content.
Privacy Concerns: Care must be taken to ensure that the model is not used to classify sensitive or private texts without proper authorization or consent.

@article{Kaleemullah2023ParaphraseMPNetAdClassifier,
  author = {Kaleem Ullah Qasim},
  title = {Paraphrase MPNet for Ad Classification},
  year = {2023},
  organization = {Southwest Jiaotong University},
  model_url = {https://huggingface.co/Kaleemullah/paraphrase-mpnet-ad-classifier},
}