license: apache-2.0

Model Overview

This is a text classification model that assigns each document to one of 26 domain classes:

'Adult', 'Arts_and_Entertainment', 'Autos_and_Vehicles', 'Beauty_and_Fitness', 'Books_and_Literature', 'Business_and_Industrial', 'Computers_and_Electronics', 'Finance', 'Food_and_Drink', 'Games', 'Health', 'Hobbies_and_Leisure', 'Home_and_Garden', 'Internet_and_Telecom', 'Jobs_and_Education', 'Law_and_Government', 'News', 'Online_Communities', 'People_and_Society', 'Pets_and_Animals', 'Real_Estate', 'Science', 'Sensitive_Subjects', 'Shopping', 'Sports', 'Travel_and_Transportation'

Model Architecture

The model architecture is DeBERTa V3 Base. The context length is 512 tokens.

Training (details)

Training data:

Training steps:

  • Train a first model on Wikipedia data
  • Randomly sample 1 million samples from Common Crawl and label them using the Google Cloud API
  • Predict these 1 million samples using the first model
  • Keep the roughly 500k samples on which Google’s labels and the first model’s predictions agree
  • Split these 500k samples 80%/20%. Train the final model on the 80%, and evaluate on the 20%
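The agreement filter and 80%/20% split described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name `agreement_split`, the field names `api_label` and `model_label`, the shuffling seed, and the toy samples are all hypothetical.

```python
import random

def agreement_split(samples, train_fraction=0.8, seed=0):
    """Keep samples whose two labels agree, then split them into train/eval sets.

    `api_label` and `model_label` are hypothetical field names standing in for
    the Google Cloud API label and the first model's prediction.
    """
    agreed = [s for s in samples if s["api_label"] == s["model_label"]]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(agreed)
    cut = int(len(agreed) * train_fraction)
    return agreed[:cut], agreed[cut:]

# Toy data invented for illustration only.
samples = [
    {"text": "Stocks rallied after the rate decision.", "api_label": "Finance", "model_label": "Finance"},
    {"text": "Whisk the eggs before adding flour.", "api_label": "Food_and_Drink", "model_label": "Food_and_Drink"},
    {"text": "The striker scored twice in the final.", "api_label": "Sports", "model_label": "Sports"},
    {"text": "New GPU drivers improve frame rates.", "api_label": "Computers_and_Electronics", "model_label": "Games"},
    {"text": "How to house-train a puppy.", "api_label": "Pets_and_Animals", "model_label": "Pets_and_Animals"},
    {"text": "Tips for repainting a garden fence.", "api_label": "Home_and_Garden", "model_label": "Home_and_Garden"},
]

train_set, eval_set = agreement_split(samples)
print(len(train_set), len(eval_set))
```

Samples on which the two labelers disagree are dropped entirely, so the final model only ever sees labels that both sources support.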

How To Use This Model

Input

The model takes one or several paragraphs of text as input.

Example input: q Directions

  1. Mix 2 flours and baking powder together
  2. Mix water and egg in a separate bowl. Add dry to wet little by little
  3. Heat frying pan on medium
  4. Pour batter into pan and then put blueberries on top before flipping
  5. Top with desired toppings!

Output

The model outputs one of the 26 domain classes as the predicted domain for each input sample.

Example output: Food_and_Drink
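To make the output step concrete, the sketch below turns a vector of 26 class scores into a single domain label by taking the highest-scoring class. It assumes the class indices follow the order of the list in the Model Overview, which this README does not confirm. The names `DOMAIN_LABELS`, `softmax`, and `predict_domain` are hypothetical, and the logits are invented rather than real model output.

```python
import math

# Assumed index order: the order in which the classes are listed in the overview.
DOMAIN_LABELS = [
    "Adult", "Arts_and_Entertainment", "Autos_and_Vehicles", "Beauty_and_Fitness",
    "Books_and_Literature", "Business_and_Industrial", "Computers_and_Electronics",
    "Finance", "Food_and_Drink", "Games", "Health", "Hobbies_and_Leisure",
    "Home_and_Garden", "Internet_and_Telecom", "Jobs_and_Education",
    "Law_and_Government", "News", "Online_Communities", "People_and_Society",
    "Pets_and_Animals", "Real_Estate", "Science", "Sensitive_Subjects",
    "Shopping", "Sports", "Travel_and_Transportation",
]

def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_domain(logits):
    """Return the highest-probability label and its probability."""
    if len(logits) != len(DOMAIN_LABELS):
        raise ValueError("expected one score per domain class")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return DOMAIN_LABELS[best], probs[best]

# Invented logits: every class scores 0 except Food_and_Drink.
logits = [0.0] * len(DOMAIN_LABELS)
logits[DOMAIN_LABELS.index("Food_and_Drink")] = 5.0

label, confidence = predict_domain(logits)
print(label)
```

When running the real model, check the label mapping shipped with the checkpoint rather than relying on the order assumed here.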

Evaluation Benchmarks

Accuracy on 500 human-annotated samples

  • Google API: 77.5%
  • Our model: 77.9%
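Accuracy here is the fraction of samples whose predicted domain matches the human annotation. A minimal sketch, assuming exact string match on the label; the function name `accuracy` and the toy labels are hypothetical and are not the evaluation data.

```python
def accuracy(predicted, reference):
    """Fraction of positions where the predicted label equals the reference label."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)

# Toy labels invented for illustration only.
human = ["Finance", "Sports", "Health", "News"]
model = ["Finance", "Sports", "Science", "News"]

score = accuracy(model, human)
print(f"{score:.1%}")
```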

PR-AUC score on the evaluation set of 105k samples

  • 0.9873
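PR-AUC summarizes the precision-recall curve in a single number. The README does not say which estimator was used or how the 26 classes were averaged, so the sketch below is only an assumption: it computes average precision, one common estimate of PR-AUC, for a single class treated as positive against all others. The function name `average_precision` and the scores are hypothetical, and ties between scores are not handled specially.

```python
def average_precision(scores, is_positive):
    """Average precision for one class: mean of precision at each positive's rank."""
    if len(scores) != len(is_positive):
        raise ValueError("scores and labels must have equal length")
    total_positives = sum(1 for flag in is_positive if flag)
    if total_positives == 0:
        raise ValueError("at least one positive sample is required")

    # Rank samples from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, index in enumerate(order, start=1):
        if is_positive[index]:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_positives

# Invented scores for one class; True marks samples that truly belong to it.
scores = [0.9, 0.8, 0.7, 0.6]
is_positive = [True, False, True, False]

ap = average_precision(scores, is_positive)
print(round(ap, 4))
```

A multi-class figure would then come from combining the per-class values, for example by averaging them, which is again an assumption rather than the documented procedure.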

References

  • https://arxiv.org/abs/2111.09543
  • https://github.com/microsoft/DeBERTa

License

Use of this model is covered by the Apache License 2.0. By downloading the public release version of the model, you accept the terms and conditions of the Apache License 2.0.