Commit bfb080b · Joshua Lochner
1 Parent(s): 0e18e8c
Update README.md
README.md CHANGED
@@ -11,4 +11,104 @@ pinned: true
 # SponsorBlock-ML
 Automatically detect in-video YouTube sponsorships, self/unpaid promotions, and interaction reminders. The model was trained using the [SponsorBlock](https://sponsor.ajay.app/) [database](https://sponsor.ajay.app/database), used under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).

-Check out the demo application by visiting [https://xenova.github.io/sponsorblock-ml/](https://xenova.github.io/sponsorblock-ml/).
+Check out the demo application by visiting [https://xenova.github.io/sponsorblock-ml/](https://xenova.github.io/sponsorblock-ml/). You can also run it locally using `make run`.
+
+---
+
+
+## Predicting
+
+1. Download the repository:
+```bash
+git clone https://github.com/xenova/sponsorblock-ml.git
+cd sponsorblock-ml
+```
+
+2. Run predictions:
+- Predict for a single video using the `--video_id` argument. For example:
+```bash
+python src/predict.py --video_id zo_uoFI1WXM
+```
+
+- Predict for multiple videos using the `--video_ids` argument. For example:
+```bash
+python src/predict.py --video_ids IgF3OX8nT0w ao2Jfm35XeE
+```
+
+- Predict for a whole channel using the `--channel_id` argument. For example:
+
+```bash
+python src/predict.py --channel_id UCHnyfMqiRRG1u-2MsSQLbXA
+```
+
+Note that on the first run, the program will download the necessary models (which may take some time).
+
+
+---
+
+## Evaluating
+
+### Measuring Accuracy
+The evaluation script is primarily used to measure the accuracy (and other metrics) of the model (defaults to [Xenova/sponsorblock-small](https://huggingface.co/Xenova/sponsorblock-small)).
+```bash
+python src/evaluate.py
+```
+In addition to the calculated metrics, missing and incorrect segments are output, allowing for improvements to be made to the database:
+- Missing segments: Segments which were predicted by the model, but are not in the database.
+- Incorrect segments: Segments which are in the database, but which the model did not predict (meaning that the model thinks those segments are incorrect).
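The distinction between missing and incorrect segments can be illustrated with a small sketch. This is not the logic of `src/evaluate.py`: the `(start, end)` tuple representation, the function names, and the rule that any time overlap counts as a match are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation: (start_seconds, end_seconds)
Segment = Tuple[float, float]


def overlaps(a: Segment, b: Segment) -> bool:
    """Return True if the two time ranges share any duration."""
    return a[0] < b[1] and b[0] < a[1]


def compare_segments(
    predicted: List[Segment], database: List[Segment]
) -> Tuple[List[Segment], List[Segment]]:
    """Split disagreements into 'missing' and 'incorrect' segments."""
    # Missing: predicted by the model, but no database segment overlaps it.
    missing = [p for p in predicted if not any(overlaps(p, d) for d in database)]
    # Incorrect: present in the database, but no prediction overlaps it.
    incorrect = [d for d in database if not any(overlaps(d, p) for p in predicted)]
    return missing, incorrect


predicted = [(10.0, 40.0), (300.0, 330.0)]
database = [(12.0, 38.0), (600.0, 620.0)]
missing, incorrect = compare_segments(predicted, database)
print("Missing:", missing)      # candidates to add to the database
print("Incorrect:", incorrect)  # candidates to review in the database
```

Under these assumptions, the first prediction matches a database entry, the second prediction is reported as missing, and the unmatched database entry is reported as incorrect.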
+
+### Moderation
+The evaluation script can also be used to moderate parts of the database. To moderate the whole database, first run:
+```bash
+python src/preprocess.py --do_process_database --processed_database whole_database.json --min_votes -1 --min_views 0 --min_date 01/01/2000 --max_date 01/01/9999 --keep_duplicate_segments
+```
+
+followed by
+```bash
+python src/evaluate.py --processed_file data/whole_database.json
+```
+
+The `--video_ids` and `--channel_id` arguments can also be used here. Remember to bring your database and processed database file up to date before running evaluations.
+
+---
+
+## Training
+### Preprocessing
+
+1. Download the SponsorBlock database:
+```bash
+python src/preprocess.py --update_database
+```
+
+2. Preprocess the database and generate training, testing and validation data:
+
+```bash
+python src/preprocess.py --do_transcribe --do_create --do_generate --do_split --model_name_or_path Xenova/sponsorblock-small
+```
+
+
+1. `--do_transcribe` - Downloads and parses the transcripts from YouTube.
+2. `--do_create` - Processes the database (removing unwanted and duplicate segments) and creates the labelled dataset.
+3. `--do_generate` - Extracts positive (sponsors, unpaid/self-promos and interaction reminders) and negative (normal video content) text segments from the downloaded transcripts and labelled segment data, and creates large lists of input and target texts.
+4. `--do_split` - Splits the generated positive and negative segments into training, validation and testing sets (according to the specified ratios).
+
+Each of the above steps can be run independently (as separate commands, e.g. `python src/preprocess.py --do_transcribe`), but they should be performed in order.
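When the steps are run as separate commands, a small wrapper can enforce the required order. The sketch below is only an illustration: the helper names `build_commands` and `run_steps` are hypothetical, and it assumes each step needs no flags beyond the ones listed above.

```python
import subprocess
import sys
from typing import List

# Step flags in the order in which they must be performed.
STEPS = ["--do_transcribe", "--do_create", "--do_generate", "--do_split"]


def build_commands(steps: List[str], script: str = "src/preprocess.py") -> List[List[str]]:
    """Return one command per requested step, sorted into the required order."""
    unknown = [s for s in steps if s not in STEPS]
    if unknown:
        raise ValueError(f"Unknown step(s): {unknown}")
    ordered = sorted(set(steps), key=STEPS.index)
    return [[sys.executable, script, step] for step in ordered]


def run_steps(steps: List[str], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in build_commands(steps):
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later steps never run on top of a failed earlier step.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Requested out of order on purpose; the commands are printed in order.
    run_steps(["--do_split", "--do_transcribe"], dry_run=True)
```

With `dry_run=True` nothing is executed, which makes it easy to check the order before committing to a long transcription run.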
+
+For more advanced preprocessing options, run `python src/preprocess.py --help`.
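As a rough illustration of what splitting by ratio means for the `--do_split` step, here is a minimal sketch. The function name, the 0.8/0.1/0.1 ratios, and the seeded shuffle are assumptions chosen for this example only; the defaults actually used by `src/preprocess.py` may differ and are listed in its `--help` output.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    items: Sequence[T],
    train_ratio: float = 0.8,  # hypothetical ratio, for illustration only
    valid_ratio: float = 0.1,  # hypothetical ratio, for illustration only
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle items and cut them into training, validation and testing lists."""
    if train_ratio < 0 or valid_ratio < 0 or train_ratio + valid_ratio > 1:
        raise ValueError("Ratios must be non-negative and sum to at most 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_ratio)
    n_valid = int(len(shuffled) * valid_ratio)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # everything that remains
    return train, valid, test


segments = [f"segment-{i}" for i in range(20)]
train, valid, test = split_dataset(segments)
print(len(train), len(valid), len(test))
```

Giving the remainder to the testing set means that no item is dropped when the ratios do not divide the dataset evenly.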
+
+### Transformer
+The transformer is used to extract relevant segments from the transcript and apply a preliminary classification to the extracted text. To start fine-tuning from the current checkpoint, run:
+
+```bash
+python src/train.py --model_name_or_path Xenova/sponsorblock-small
+```
+
+If you wish to fine-tune an original transformer model, use one of the supported models (*t5-small*, *t5-base*, *t5-large*, *t5-3b*, *t5-11b*, *google/t5-v1_1-small*, *google/t5-v1_1-base*, *google/t5-v1_1-large*, *google/t5-v1_1-xl*, *google/t5-v1_1-xxl*) as the `--model_name_or_path`. For more information, check out the relevant documentation ([t5](https://huggingface.co/docs/transformers/model_doc/t5) or [t5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)).
+
+
+
+### Classifier
+The classifier is used to add probabilities to the category predictions. Train the classifier using:
+```bash
+python src/train.py --do_train_classifier --skip_train_transformer
+```