Joshua Lochner committed on
Commit bfb080b · 1 Parent(s): 0e18e8c

Update README.md

  1. README.md +101 -1
README.md CHANGED
@@ -11,4 +11,104 @@ pinned: true
  # SponsorBlock-ML
  Automatically detect in-video YouTube sponsorships, self/unpaid promotions, and interaction reminders. The model was trained using the [SponsorBlock](https://sponsor.ajay.app/) [database](https://sponsor.ajay.app/database), used under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).
 
- Check out the demo application by visiting [https://xenova.github.io/sponsorblock-ml/](https://xenova.github.io/sponsorblock-ml/).
+ Check out the demo application by visiting [https://xenova.github.io/sponsorblock-ml/](https://xenova.github.io/sponsorblock-ml/). You can also run it locally using `make run`.
+
+ ---
+
+
+ ## Predicting
+
+ 1. Download the repository:
+     ```bash
+     git clone https://github.com/xenova/sponsorblock-ml.git
+     cd sponsorblock-ml
+     ```
+
+ 2. Run predictions:
+     - Predict for a single video using the `--video_id` argument. For example:
+       ```bash
+       python src/predict.py --video_id zo_uoFI1WXM
+       ```
+
+     - Predict for multiple videos using the `--video_ids` argument. For example:
+       ```bash
+       python src/predict.py --video_ids IgF3OX8nT0w ao2Jfm35XeE
+       ```
+
+     - Predict for a whole channel using the `--channel_id` argument. For example:
+
+       ```bash
+       python src/predict.py --channel_id UCHnyfMqiRRG1u-2MsSQLbXA
+       ```
+
+ Note that on the first run, the program will download the necessary models (which may take some time).
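
If you want to drive the prediction script from your own Python code, the following is a minimal sketch. It only assumes the `src/predict.py` script and the `--video_ids` argument shown above; the helper names `build_predict_command` and `run_predictions` are hypothetical and are not part of the repository.

```python
import subprocess
import sys


def build_predict_command(video_ids, script="src/predict.py"):
    """Build the argument list for one predict.py call covering several videos."""
    video_ids = list(video_ids)
    if not video_ids:
        raise ValueError("at least one video ID is required")
    # sys.executable keeps the call inside the current Python environment.
    return [sys.executable, script, "--video_ids", *video_ids]


def run_predictions(video_ids):
    """Run the prediction script (hypothetical wrapper); raise on a non-zero exit code."""
    subprocess.run(build_predict_command(video_ids), check=True)
```

Calling `run_predictions(["IgF3OX8nT0w", "ao2Jfm35XeE"])` from the repository root should be equivalent to the multi-video command above. Passing the arguments as a list avoids shell quoting issues.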
+
+
+ ---
+
+ ## Evaluating
+
+ ### Measuring Accuracy
+ The evaluation script is primarily used to measure the accuracy (and other metrics) of the model (defaults to [Xenova/sponsorblock-small](https://huggingface.co/Xenova/sponsorblock-small)).
+ ```bash
+ python src/evaluate.py
+ ```
+ In addition to the calculated metrics, missing and incorrect segments are output, allowing for improvements to be made to the database:
+ - Missing segments: Segments which were predicted by the model, but are not in the database.
+ - Incorrect segments: Segments which are in the database, but the model did not predict (meaning that the model thinks those segments are incorrect).
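
The two definitions above can be illustrated with a small sketch. It assumes segments are `(start, end)` tuples in seconds and that "matching" means any time overlap; both the data layout and the overlap rule are assumptions for illustration, and the actual evaluation script may use a different criterion.

```python
def overlaps(a, b):
    """Return True if two (start, end) intervals, in seconds, share any time."""
    return a[0] < b[1] and b[0] < a[1]


def compare_segments(predicted, database):
    """Return (missing, incorrect) segment lists under the overlap assumption.

    missing:   predicted by the model, with no overlapping database segment
    incorrect: present in the database, with no overlapping prediction
    """
    missing = [p for p in predicted if not any(overlaps(p, d) for d in database)]
    incorrect = [d for d in database if not any(overlaps(d, p) for p in predicted)]
    return missing, incorrect
```

For example, with predictions `[(10, 20), (100, 120)]` and database entries `[(12, 22), (300, 330)]`, the segment `(100, 120)` is reported as missing and `(300, 330)` as incorrect.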
+
+ ### Moderation
+ The evaluation script can also be used to moderate parts of the database. To moderate the whole database, first run:
+ ```bash
+ python src/preprocess.py --do_process_database --processed_database whole_database.json --min_votes -1 --min_views 0 --min_date 01/01/2000 --max_date 01/01/9999 --keep_duplicate_segments
+ ```
+
+ followed by:
+ ```bash
+ python src/evaluate.py --processed_file data/whole_database.json
+ ```
+
+ The `--video_ids` and `--channel_id` arguments can also be used here. Remember to bring your database and processed database file up to date before running evaluations.
+
+ ---
+
+ ## Training
+ ### Preprocessing
+
+ 1. Download the SponsorBlock database:
+     ```bash
+     python src/preprocess.py --update_database
+     ```
+
+ 2. Preprocess the database and generate training, testing and validation data:
+
+     ```bash
+     python src/preprocess.py --do_transcribe --do_create --do_generate --do_split --model_name_or_path Xenova/sponsorblock-small
+     ```
+
+
+     1. `--do_transcribe` - Downloads and parses the transcripts from YouTube.
+     2. `--do_create` - Processes the database (removing unwanted and duplicate segments) and creates the labelled dataset.
+     3. `--do_generate` - Uses the downloaded transcripts and labelled segment data to extract positive (sponsors, unpaid/self-promos and interaction reminders) and negative (normal video content) text segments, then creates large lists of input and target texts.
+     4. `--do_split` - Splits the generated positive and negative segments into training, validation and testing sets (according to the specified ratios).
+
+ Each of the above steps can be run independently (as separate commands, e.g. `python src/preprocess.py --do_transcribe`), but they should be performed in order.
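
To make the `--do_generate` step more concrete, here is a rough sketch of how timestamped transcript words could be divided into positive and negative text. The data layout (word tuples and segment tuples) and the function name `label_words` are hypothetical; the real preprocessing code may represent and group the data differently.

```python
def label_words(words, segments):
    """Split timestamped transcript words into positive and negative text.

    words:    list of (start_seconds, text) tuples, in transcript order
    segments: list of (start, end, category) tuples from the labelled data
    Returns ({category: positive_text}, negative_text).
    """
    positive = {}  # category -> words that fall inside a labelled segment
    negative = []  # words outside every labelled segment (normal content)
    for start, text in words:
        # Find the first labelled segment that contains this word's start time.
        category = next(
            (cat for seg_start, seg_end, cat in segments if seg_start <= start < seg_end),
            None,
        )
        if category is None:
            negative.append(text)
        else:
            positive.setdefault(category, []).append(text)
    return {cat: " ".join(ws) for cat, ws in positive.items()}, " ".join(negative)
```

Words are assigned by their start time only, which is a simplification: a word that straddles a segment boundary is counted on the side where it begins.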
+
+ For more advanced preprocessing options, run `python src/preprocess.py --help`.
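
The ratio-based split performed by `--do_split` can be pictured with the sketch below. The function name `split_dataset`, the default ratios and the fixed seed are all assumptions chosen for illustration; check `--help` for the options the script actually exposes.

```python
import random


def split_dataset(items, train_ratio=0.8, valid_ratio=0.1, seed=0):
    """Shuffle items and split them into train/validation/test lists.

    The test set receives whatever remains after the train and validation shares.
    """
    if train_ratio < 0 or valid_ratio < 0 or train_ratio + valid_ratio > 1:
        raise ValueError("ratios must be non-negative and sum to at most 1")
    shuffled = list(items)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_ratio)
    n_valid = int(len(shuffled) * valid_ratio)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test
```

Shuffling before slicing keeps each set representative of the whole dataset, and every item ends up in exactly one of the three sets.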
+
+ ### Transformer
+ The transformer is used to extract relevant segments from the transcript and apply a preliminary classification to the extracted text. To start fine-tuning from the current checkpoint, run:
+
+ ```bash
+ python src/train.py --model_name_or_path Xenova/sponsorblock-small
+ ```
+
+ If you wish to fine-tune an original transformer model, pass one of the supported models (*t5-small*, *t5-base*, *t5-large*, *t5-3b*, *t5-11b*, *google/t5-v1_1-small*, *google/t5-v1_1-base*, *google/t5-v1_1-large*, *google/t5-v1_1-xl*, *google/t5-v1_1-xxl*) as the `--model_name_or_path`. For more information, check out the relevant documentation ([t5](https://huggingface.co/docs/transformers/model_doc/t5) or [t5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)).
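
If you script your training runs, a small guard against mistyped model names may help. The sketch below copies the supported base models listed above and builds the `src/train.py` command; the helper `build_train_command` is hypothetical, and it treats the `Xenova/sponsorblock-small` checkpoint as an additional valid choice.

```python
import sys

# Base models listed in this README as supported starting points.
SUPPORTED_BASE_MODELS = frozenset({
    "t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b",
    "google/t5-v1_1-small", "google/t5-v1_1-base", "google/t5-v1_1-large",
    "google/t5-v1_1-xl", "google/t5-v1_1-xxl",
})

# The current fine-tuned checkpoint referenced in this README.
CURRENT_CHECKPOINT = "Xenova/sponsorblock-small"


def build_train_command(model_name=CURRENT_CHECKPOINT, script="src/train.py"):
    """Return the argument list for a training run, rejecting unknown model names."""
    if model_name != CURRENT_CHECKPOINT and model_name not in SUPPORTED_BASE_MODELS:
        raise ValueError(f"unsupported model: {model_name!r}")
    return [sys.executable, script, "--model_name_or_path", model_name]
```

The resulting list can be passed to `subprocess.run(..., check=True)` from the repository root. A local checkpoint path would be rejected by this guard, so relax the check if you train from your own checkpoints.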
+
+
+
+ ### Classifier
+ The classifier is used to add probabilities to the category predictions. Train the classifier using:
+ ```bash
+ python src/train.py --do_train_classifier --skip_train_transformer
+ ```