subreddit-predictor / README.md
metadata
language: en
license: apache-2.0
datasets: daspartho/subreddit-posts

An NLP model that predicts the subreddit a post belongs to, based on the post's title.
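As a rough illustration of how such a title classifier might be queried, here is a minimal sketch using the `transformers` text-classification pipeline. The Hub ID `daspartho/subreddit-predictor` is an assumption inferred from the repository name, and `predict_subreddit` and `format_predictions` are hypothetical helper names, not part of the model.

```python
def format_predictions(predictions):
    """Turn pipeline-style output (dicts with 'label' and 'score') into readable lines."""
    # Sort by score so the most likely subreddit comes first.
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [f"{p['label']}: {p['score']:.2f}" for p in ranked]


def predict_subreddit(title, model_id="daspartho/subreddit-predictor", top_k=3):
    """Return the top_k predicted labels for a post title (downloads the model)."""
    # Imported lazily so format_predictions works without transformers installed.
    from transformers import pipeline

    # ASSUMPTION: model_id is a guessed Hub ID, not confirmed by this README.
    classifier = pipeline("text-classification", model=model_id)
    return format_predictions(classifier(title, top_k=top_k))
```

Calling `predict_subreddit("TIL that honey never spoils")` would download the weights on first use and return up to three formatted label and score strings. The exact label strings depend on how the model was exported.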

Training

DistilBERT is fine-tuned on subreddit-posts, a dataset of titles of the top 1000 posts from the top 250 subreddits.

For the steps used to build the model, check out the model notebook in the GitHub repo or open it in Colab.

Limitations and bias

  • Because the model is trained on the top 250 subreddits (for reference), it can only categorise posts within those subreddits.
  • Some subreddits have a specific format for their post titles, like r/todayilearned, where the title starts with "TIL", so the model becomes biased towards "TIL" --> r/todayilearned. This bias could be reduced by cleaning the dataset of these specific terms.
  • In some subreddits, like r/gifs, the title of the post doesn't matter much, so the model performs poorly on them.
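
The cleaning step suggested for format-specific terms could look something like the sketch below. It is only an illustration: `strip_format_markers` and `FORMAT_MARKERS` are hypothetical names, and "TIL" is the only marker taken from the text above. A real marker list would have to be compiled per subreddit.

```python
import re

# ASSUMPTION: illustrative marker list; only "TIL" is mentioned in this README.
FORMAT_MARKERS = ["TIL"]


def strip_format_markers(title, markers=FORMAT_MARKERS):
    """Remove a leading format marker (e.g. "TIL") from a post title."""
    alternatives = "|".join(re.escape(m) for m in markers)
    # Match the marker only as a whole word at the start of the title,
    # plus any separator characters that follow it.
    pattern = re.compile(rf"^\s*(?:{alternatives})\b[\s:,\-]*", re.IGNORECASE)
    return pattern.sub("", title, count=1).strip()


print(strip_format_markers("TIL: Octopuses have three hearts"))
```

Applying a function like this to every title before fine-tuning would stop the model from relying on the marker alone. The same function would then need to be applied to titles at inference time.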