NemoCurator Content Type Classifier DeBERTa

Model Overview

This is a text classification model that assigns each document to one of 11 content types (types of speech) based on its text. The classes are:

Product/Company/Organization/Personal Websites: Informational pages about companies, products, or individuals.
Explanatory Articles: Detailed, informative articles that aim to explain concepts or topics.
News: News articles covering current events, updates, and factual reporting.
Blogs: Personal or opinion-based entries typically found on blogging platforms.
MISC: Miscellaneous content that doesn’t fit neatly into the other categories.
Boilerplate Content: Standardized text used frequently across documents, often repetitive or generic.
Analytical Exposition: Analytical or argumentative pieces with in-depth discussion and evaluation.
Online Comments: Short, often informal comments typically found on social media or forums.
Reviews: Content sharing opinions or assessments about products, services, or experiences.
Books and Literature: Excerpts or full texts from books, literary works, or similar long-form writing.
Conversational: Informal, dialogue-like text that mimics a conversational tone.

License

This model is released under the NVIDIA Open Model License Agreement.

Model Architecture

The model architecture is DeBERTa V3 Base.
The context length is 1024 tokens.
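
Documents longer than the context length have to be truncated or split before inference. The following is a minimal sketch of one possible splitting approach, assuming the token IDs are already available as a plain list. The helper name `split_into_windows` and the `stride` overlap are hypothetical and are not part of the model or of NeMo Curator.

```python
from typing import List

MAX_TOKENS = 1024  # context length stated in this model card


def split_into_windows(token_ids: List[int], max_len: int = MAX_TOKENS, stride: int = 128) -> List[List[int]]:
    """Split a token-ID list into overlapping windows of at most max_len tokens.

    Hypothetical helper: `stride` is the number of tokens shared by adjacent windows.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")
    if len(token_ids) <= max_len:
        return [list(token_ids)]
    step = max_len - stride
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the last window already reaches the end of the document
    return windows


windows = split_into_windows(list(range(2500)))
print([len(w) for w in windows])
```

If the tokenizer adds special tokens to each window, `max_len` would need to be reduced accordingly so the final sequence still fits within 1024 tokens.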

How to Use in NVIDIA NeMo Curator

NeMo Curator improves generative AI model accuracy by processing text, image, and video data at scale for training and customization. It also provides pre-built pipelines for generating synthetic data to customize and evaluate generative AI systems.

The inference code for this model is available through the NeMo Curator GitHub repository. Check out this example notebook to get started.

How to Use in Transformers

To use the content type classifier, use the following code:

import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer, AutoConfig
from huggingface_hub import PyTorchModelHubMixin

class CustomModel(nn.Module, PyTorchModelHubMixin):
    def __init__(self, config):
        super().__init__()
        # Pretrained backbone, dropout, and a linear classification head
        self.model = AutoModel.from_pretrained(config["base_model"])
        self.dropout = nn.Dropout(config["fc_dropout"])
        self.fc = nn.Linear(self.model.config.hidden_size, len(config["id2label"]))

    def forward(self, input_ids, attention_mask):
        # Per-token hidden states: (batch, seq_len, hidden_size)
        features = self.model(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        dropped = self.dropout(features)
        outputs = self.fc(dropped)
        # Classify from the first token's logits and return class probabilities
        return torch.softmax(outputs[:, 0, :], dim=1)

# Setup configuration and model
config = AutoConfig.from_pretrained("nvidia/content-type-classifier-deberta")
tokenizer = AutoTokenizer.from_pretrained("nvidia/content-type-classifier-deberta")
model = CustomModel.from_pretrained("nvidia/content-type-classifier-deberta")
model.eval()

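# Prepare and process inputs (truncate to the 1024-token context length)
text_samples = ["Hi, great video! I am now a subscriber."]
inputs = tokenizer(text_samples, return_tensors="pt", padding="longest", truncation=True, max_length=1024)
# Disable gradient tracking for inference
with torch.no_grad():
    outputs = model(inputs["input_ids"], inputs["attention_mask"])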

# Predict and display results
predicted_classes = torch.argmax(outputs, dim=1)
predicted_types = [config.id2label[class_idx] for class_idx in predicted_classes.tolist()]
print(predicted_types)
# ['Online Comments']
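
Because `forward` returns softmax probabilities, the scores can also be ranked instead of keeping only the argmax. The following is a minimal sketch, assuming one row of probabilities has been converted to a plain list (for example with `outputs[0].tolist()`). The `top_k_labels` helper, the shortened `example_id2label` mapping, and the probability values are illustrative and not taken from the model.

```python
from typing import Dict, List, Tuple


def top_k_labels(probs: List[float], id2label: Dict[int, str], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most probable (label, probability) pairs, highest first."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return [(id2label[idx], p) for idx, p in ranked[:k]]


# Illustrative values only; real rows have 11 entries, one per class.
example_id2label = {0: "News", 1: "Blogs", 2: "Online Comments"}
example_probs = [0.05, 0.15, 0.80]
print(top_k_labels(example_probs, example_id2label, k=2))
```

Looking at the runner-up class can help when two content types are close, for instance Blogs and Analytical Exposition.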

Input & Output

Input

Input Type: Text
Input Format: String
Input Parameters: 1D
Other Properties Related to Input: Token Limit of 1024 tokens

Output

Output Type: Text Classification
Output Format: String
Output Parameters: 1D
Other Properties Related to Output: None

The model takes one or several paragraphs of text as input. Example input:

Brent awarded for leading collaborative efforts and leading SIA International Relations Committee.

Mar 20, 2018

The Security Industry Association (SIA) will recognize Richard Brent, CEO, Louroe Electronics with the prestigious 2017 SIA Chairman's Award for his work to support leading the SIA International Relations Committee and supporting key government relations initiatives.

With his service on the SIA Board of Directors and as Chair of the SIA International Relations Committee, Brent has forged relationships between SIA and agencies like the U.S. Commercial Service. A longtime advocate for government engagement generally and exports specifically, Brent's efforts resulted in the publication of the SIA Export Assistance Guide last year as a tool to assist SIA member companies exploring export opportunities or expanding their participation in trade.

SIA Chairman Denis Hébert will present the SIA Chairman's Award to Brent at The Advance, SIA's annual membership meeting, scheduled to occur on Tuesday, April 10, 2018, at ISC West.

"As the leader of an American manufacturing company, I have seen great business opportunities in foreign sales," said Brent. "Through SIA, I have been pleased to extend my knowledge and experience to other companies that can benefit from exporting. And that is the power of SIA: To bring together distinct companies to share expertise across vertical markets in a collaborative fashion. I'm pleased to contribute, and I thank the Chairman for his recognition."

"As a member of the SIA Board of Directors, Richard Brent is consistently engaged on a variety of issues of importance to the security industry, particularly related to export assistance programs that will help SIA members to grow their businesses," said Hébert. "His contributions in all areas of SIA programming have been formidable, but we owe him a particular debt in sharing his experiences in exporting. Thank you for your leadership, Richard."

Hébert will present SIA award recipients, including the SIA Chairman's Award, SIA Committee Chair of the Year Award and Sandy Jones Volunteer of the Year Award, at The Advance, held during ISC West in Rooms 505/506 of the Sands Expo in Las Vegas, Nevada, on Tuesday, April 10, 10:30-11:30 a.m. Find more info and register at https://www.securityindustry.org/advance.

The Advance is co-located with ISC West, produced by ISC Security Events. Security professionals can register to attend the ISC West trade show and conference, which runs April 10-13, at http://www.iscwest.com.

The model outputs one of the 11 content type classes as the prediction for each input sample. Example output:

News

Software Integration

Runtime Engine: Python 3.10 and NeMo Curator
Supported Hardware Microarchitecture Compatibility: NVIDIA GPU, Volta™ or higher (compute capability 7.0+), CUDA 12 (or above)
Preferred/Supported Operating System(s): Ubuntu 22.04/20.04

Training, Testing, and Evaluation Dataset

Training Data

Link: Jigsaw Toxic Comments, Jigsaw Unintended Biases Dataset, Toxigen Dataset, Common Crawl, Wikipedia
Data collection method by dataset
- Downloaded
Labeling method by dataset
- Human
Properties:
- 25,000 Common Crawl samples were labeled by an external vendor. Each sample was labeled by three annotators.
- The model was trained on the 19,604 samples for which at least two annotators agreed on the label.
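
The agreement filter described above can be illustrated with a plain majority vote. This is a sketch only, assuming each sample carries exactly three annotator labels. The `majority_label` name and the example labels are hypothetical, and the vendor's actual adjudication procedure is not described in this card.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(labels: Sequence[str], min_votes: int = 2) -> Optional[str]:
    """Return the label chosen by at least min_votes annotators, or None."""
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes >= min_votes else None


print(majority_label(["News", "News", "Blogs"]))              # two of three agree
print(majority_label(["News", "Blogs", "Reviews"]))           # no agreement
print(majority_label(["News", "News", "News"], min_votes=3))  # unanimous
```

Setting `min_votes=3` corresponds to the stricter subset on which all three annotators agreed.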

Label distribution:

Category	Count
Product/Company/Organization/Personal Websites	5227
Blogs	4930
News	2933
Explanatory Articles	2457
Analytical Exposition	1508
Online Comments	982
Reviews	512
Boilerplate Content	475
MISC	267
Books and Literature	164
Conversational	149
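
The counts above show a strong class imbalance. The short sketch below computes each class's share of the training samples and the ratio between the largest and smallest class. The counts are copied from the table; the variable names are illustrative.

```python
label_counts = {
    "Product/Company/Organization/Personal Websites": 5227,
    "Blogs": 4930,
    "News": 2933,
    "Explanatory Articles": 2457,
    "Analytical Exposition": 1508,
    "Online Comments": 982,
    "Reviews": 512,
    "Boilerplate Content": 475,
    "MISC": 267,
    "Books and Literature": 164,
    "Conversational": 149,
}

total = sum(label_counts.values())
shares = {label: count / total for label, count in label_counts.items()}
for label, share in shares.items():
    print(f"{label}: {share:.1%}")

# Ratio between the most and least frequent class
imbalance_ratio = max(label_counts.values()) / min(label_counts.values())
print(f"total={total}, largest/smallest={imbalance_ratio:.1f}")
```

The two largest classes account for roughly half of the samples, which is worth keeping in mind when reading the per-class scores in the evaluation section.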

Evaluation

Metric: PR-AUC
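
PR-AUC is the area under the precision-recall curve, computed per class in a one-vs-rest fashion. The card does not state which estimator was used. The sketch below shows average precision, one common estimator, on toy data; the function name and the data are illustrative, and tied scores are not treated specially.

```python
from typing import Sequence


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision for one class treated as the positive label (one-vs-rest)."""
    # Rank samples from the highest to the lowest predicted score
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    positives_seen = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            positives_seen += 1
            # Precision at the rank where this positive sample is retrieved
            precision_sum += positives_seen / rank
    if positives_seen == 0:
        raise ValueError("y_true contains no positive samples")
    return precision_sum / positives_seen


# Toy data: 1 marks samples whose true class is the class being scored.
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(round(average_precision(y_true, scores), 4))
```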

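Cross-validation PR-AUC on the 19,604 samples:

Category	PR-AUC
Product/Company/Organization/Personal Websites	0.697
Explanatory Articles	0.668
News	0.859
Blogs	0.872
MISC	0.593
Boilerplate Content	0.383
Analytical Exposition	0.371
Online Comments	0.753
Reviews	0.612
Books and Literature	0.462
Conversational	0.541

Average PR-AUC: 0.6192. Accuracy: 0.6805.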

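Cross-validation PR-AUC on the subset of 7,738 samples on which all three annotators agreed:

Category	PR-AUC
Product/Company/Organization/Personal Websites	0.869
Explanatory Articles	0.854
News	0.964
Blogs	0.964
MISC	0.876
Boilerplate Content	0.558
Analytical Exposition	0.334
Online Comments	0.893
Reviews	0.825
Books and Literature	0.780
Conversational	0.793

Average PR-AUC: 0.7917. Accuracy: 0.8444.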
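
The reported average values are consistent with an unweighted (macro) mean of the per-class scores, which can be checked directly. The sketch below recomputes the mean from the values listed above; small differences in the last digit are expected because the per-class scores are rounded to three decimals. The variable names are illustrative.

```python
from statistics import mean

# Per-class PR-AUC values, in the order listed in this section
pr_auc_majority = [0.697, 0.668, 0.859, 0.872, 0.593, 0.383, 0.371, 0.753, 0.612, 0.462, 0.541]
pr_auc_unanimous = [0.869, 0.854, 0.964, 0.964, 0.876, 0.558, 0.334, 0.893, 0.825, 0.780, 0.793]

macro_majority = mean(pr_auc_majority)
macro_unanimous = mean(pr_auc_unanimous)
print(f"{macro_majority:.4f} (reported 0.6192)")
print(f"{macro_unanimous:.4f} (reported 0.7917)")
```

A macro mean gives every class equal weight, so low scores on rare or hard classes such as Analytical Exposition and Boilerplate Content pull the average down as much as a frequent class would.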

Inference

Engine: PyTorch
Test Hardware: V100

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.
