---
license: cc-by-sa-4.0
language:
- en
metrics:
- accuracy
pipeline_tag: text-classification
tags:
- sports
datasets:
- Chrisneverdie/OnlySports_Dataset
base_model: Snowflake/snowflake-arctic-embed-xs
---
# Sports Text Classifier
## Overview
This Sports Text Classifier is a core component of the OnlySports Dataset creation pipeline. It identifies and extracts sports-related documents from a large corpus of web content.
## Model Architecture
- Base model: [Snowflake-arctic-embed-xs](https://huggingface.co/Snowflake/snowflake-arctic-embed-xs)
- Additional layer: Binary classification layer
- Training: 10 epochs with a learning rate of 3e-4
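
As a rough illustration of the architecture listed above, the binary classification layer can be pictured as a small linear map applied to the encoder's pooled embedding. The sketch below is dependency-free and is not the authors' implementation: the embedding width of 384, the two-logit output, the random initialization, and the names `BinaryHead`, `EMBED_DIM`, `EPOCHS`, and `LEARNING_RATE` are assumptions made for illustration. A real setup would use a deep learning framework and train the head together with the base model.

```python
import math
import random
from typing import List, Sequence

EMBED_DIM = 384        # assumption: pooled embedding width of the xs encoder
EPOCHS = 10            # from the model card
LEARNING_RATE = 3e-4   # from the model card


class BinaryHead:
    """Linear layer mapping one pooled embedding to two logits.

    Index 0 is treated as "non-sports" and index 1 as "sports";
    this label order is an assumption of the sketch.
    """

    def __init__(self, dim: int = EMBED_DIM, seed: int = 0) -> None:
        rng = random.Random(seed)
        scale = 1.0 / math.sqrt(dim)
        # Two weight rows, one per class, with small random values.
        self.weights: List[List[float]] = [
            [rng.uniform(-scale, scale) for _ in range(dim)] for _ in range(2)
        ]
        self.bias: List[float] = [0.0, 0.0]

    def logits(self, embedding: Sequence[float]) -> List[float]:
        if len(embedding) != len(self.weights[0]):
            raise ValueError("embedding has the wrong dimension")
        return [
            sum(w * x for w, x in zip(row, embedding)) + b
            for row, b in zip(self.weights, self.bias)
        ]

    def predict_proba(self, embedding: Sequence[float]) -> float:
        """Probability of the "sports" class.

        A softmax over two logits equals a sigmoid of their difference;
        the two branches keep the exponential from overflowing.
        """
        neg, pos = self.logits(embedding)
        diff = pos - neg
        if diff >= 0:
            return 1.0 / (1.0 + math.exp(-diff))
        e = math.exp(diff)
        return e / (1.0 + e)
```

With an untrained head the output is close to 0.5 for any input; the probabilities only become meaningful after training on labeled embeddings.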
## Performance
The results below show the classifier's accuracy in distinguishing between sports and non-sports documents:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/656590bd40440ddcc051ade7/hK_a183i2_H5AfUF6ZXd6.png)
## Training Data
The classifier was trained on roughly 100k labeled documents covering both sports and non-sports content:
- 64k samples from seven prestigious sports websites
- 36k non-sports text documents classified using GPT-3.5
## Usage
This classifier is primarily used in the creation of the OnlySports Dataset, presented in this [paper](https://arxiv.org/abs/2409.00286). It can be applied to filter large text corpora for sports-related content with high accuracy.
## Integration
The classifier is integrated into a MapReduce architecture for efficient processing of large-scale datasets. It's used in conjunction with URL keyword filtering to create a comprehensive sports text dataset.
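
The combination of URL keyword filtering and classification inside a MapReduce-style job might look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the pipeline's actual code: the keyword list, the `Doc` shape, the function names, and the rule that a document is kept when either filter accepts it are all hypothetical, and `is_sports` stands in for a call to the real classifier.

```python
from itertools import chain
from typing import Callable, Dict, Iterable, List, Sequence

# Hypothetical keyword list; the pipeline's real keywords are not given here.
URL_KEYWORDS = ("nba", "nfl", "soccer", "tennis")

Doc = Dict[str, str]  # assumed shape: {"url": ..., "text": ...}


def url_matches(url: str, keywords: Sequence[str] = URL_KEYWORDS) -> bool:
    """Return True if any keyword appears in the lowercased URL."""
    lowered = url.lower()
    return any(keyword in lowered for keyword in keywords)


def map_shard(shard: Iterable[Doc], is_sports: Callable[[str], bool]) -> List[Doc]:
    """Map step: filter one shard independently of all other shards.

    Assumption: a document is kept when the URL filter or the
    classifier accepts it.
    """
    return [
        doc for doc in shard
        if url_matches(doc["url"]) or is_sports(doc["text"])
    ]


def reduce_shards(mapped: Iterable[List[Doc]]) -> List[Doc]:
    """Reduce step: concatenate the per-shard results into one dataset."""
    return list(chain.from_iterable(mapped))
```

Because each shard is filtered independently, the map step can run in parallel across workers, and the reduce step only has to merge the surviving documents.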
## Related Projects
This classifier is part of the larger OnlySports collection, which includes:
- [OnlySports Dataset](https://huggingface.co/collections/Chrisneverdie/onlysports-66b3e5cf595eb81220cc27a6)
- [OnlySportsLM](https://huggingface.co/Chrisneverdie/OnlySportsLM_196M)
For more information, check our [paper](https://arxiv.org/abs/2409.00286) or email zc2404@nyu.edu.