Joshua Lochner

Fix classifier train command

90d506c over 2 years ago


	---
	title: Sponsorblock ML
	emoji: 🤖
	colorFrom: yellow
	colorTo: indigo
	sdk: streamlit
	app_file: app.py
	pinned: true
	---

	# SponsorBlock-ML
	Automatically detect in-video YouTube sponsorships, self/unpaid promotions, and interaction reminders. The model was trained using the [SponsorBlock](https://sponsor.ajay.app/) [database](https://sponsor.ajay.app/database), used under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).

	Check out the online demo application at [https://xenova.github.io/sponsorblock-ml/](https://xenova.github.io/sponsorblock-ml/), or follow the instructions below to run it locally.

	---
	## Installation

	1. Download the repository:
	```bash
	git clone https://github.com/xenova/sponsorblock-ml.git
	cd sponsorblock-ml
	```

	2. Install the necessary dependencies:
	```bash
	pip install -r requirements.txt
	```

	3. Run the application:
	```bash
	streamlit run app.py
	```

	---
	## Predicting

	- Predict for a single video using the `--video_id` argument. For example:
	```bash
	python src/predict.py --video_id zo_uoFI1WXM
	```

	- Predict for multiple videos using the `--video_ids` argument. For example:
	```bash
	python src/predict.py --video_ids IgF3OX8nT0w ao2Jfm35XeE
	```

	- Predict for a whole channel using the `--channel_id` argument. For example:

	```bash
	python src/predict.py --channel_id UCHnyfMqiRRG1u-2MsSQLbXA
	```

	Note that on the first run, the program will download the necessary models (which may take some time).
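
For scripted use, the three invocation styles above can be assembled programmatically. The following is a minimal sketch and is not part of the repository: `build_predict_command` is a hypothetical helper that only arranges the `--video_id`, `--video_ids` and `--channel_id` flags shown above into an argument list. It assumes a channel ID and video IDs are used as alternatives, which the script itself may not require.

```python
def build_predict_command(video_ids=(), channel_id=None):
    """Hypothetical helper: build the argument list for src/predict.py.

    Uses only the flags documented above. Treating a channel ID and
    video IDs as mutually exclusive is an assumption of this sketch.
    """
    cmd = ["python", "src/predict.py"]
    video_ids = list(video_ids)
    if channel_id is not None:
        cmd += ["--channel_id", channel_id]
    elif len(video_ids) == 1:
        cmd += ["--video_id", video_ids[0]]
    elif len(video_ids) > 1:
        cmd += ["--video_ids", *video_ids]
    else:
        raise ValueError("Provide at least one video ID or a channel ID")
    return cmd


# Example: the multi-video command from above, as an argument list
cmd = build_predict_command(["IgF3OX8nT0w", "ao2Jfm35XeE"])
```

To actually run the prediction, the resulting list could be passed to `subprocess.run(cmd, check=True)` from the repository root.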


	---

	## Evaluating

	### Measuring Accuracy
	The evaluation script is primarily used to measure the accuracy (and other metrics) of the model (defaults to [Xenova/sponsorblock-small](https://huggingface.co/Xenova/sponsorblock-small)).
	```bash
	python src/evaluate.py
	```
	In addition to the calculated metrics, missing and incorrect segments are output, allowing for improvements to be made to the database:
	- Missing segments: Segments which were predicted by the model, but are not in the database.
	- Incorrect segments: Segments which are in the database, but the model did not predict (meaning that the model thinks those segments are incorrect).
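
The two categories above can be illustrated with a small sketch. It is not the logic of `src/evaluate.py`: the `(start, end)` tuple representation, the function names, and the rule that any time overlap counts as a match are all assumptions made for illustration, and the actual script may match segments differently.

```python
def overlaps(a, b):
    """Return True if two (start, end) segments, in seconds, overlap."""
    return a[0] < b[1] and b[0] < a[1]


def compare_segments(predicted, database):
    """Split disagreements into 'missing' and 'incorrect' segments.

    missing:   predicted by the model, with no overlapping database segment
    incorrect: in the database, with no overlapping predicted segment
    """
    missing = [p for p in predicted if not any(overlaps(p, d) for d in database)]
    incorrect = [d for d in database if not any(overlaps(d, p) for p in predicted)]
    return missing, incorrect


# Made-up example data (times in seconds)
predicted = [(10.0, 25.0), (300.0, 330.0)]
database = [(12.0, 24.0), (600.0, 615.0)]

missing, incorrect = compare_segments(predicted, database)
print("Missing:", missing)
print("Incorrect:", incorrect)
```

In this example the first predicted segment matches a database entry, the second predicted segment is reported as missing from the database, and the second database entry is reported as incorrect.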

	### Moderation
	The evaluation script can also be used to moderate parts of the database. To moderate the whole database, first run:
	```bash
	python src/preprocess.py --do_process_database --processed_database whole_database.json --min_votes -1 --min_views 0 --min_date 01/01/2000 --max_date 01/01/9999 --keep_duplicate_segments
	```

	followed by
	```bash
	python src/evaluate.py --processed_file data/whole_database.json
	```

	The `--video_ids` and `--channel_id` arguments can also be used here. Remember to update your database and processed database file before running evaluations.

	---

	## Training
	### Preprocessing

	1. Download the SponsorBlock database:
	```bash
	python src/preprocess.py --update_database
	```

	2. Preprocess the database and generate training, testing and validation data:

	```bash
	python src/preprocess.py --do_transcribe --do_create --do_generate --do_split --model_name_or_path Xenova/sponsorblock-small
	```


	1. `--do_transcribe` - Downloads and parses the transcripts from YouTube.
	2. `--do_create` - Processes the database (removing unwanted and duplicate segments) and creates the labelled dataset.
	3. `--do_generate` - Uses the downloaded transcripts and labelled segment data to extract positive (sponsors, unpaid/self-promos and interaction reminders) and negative (normal video content) text segments, and creates large lists of input and target texts.
	4. `--do_split` - Splits the generated positive and negative segments into training, validation and testing sets (according to the specified ratios).

	Each of the above steps can be run independently (as separate commands, e.g. `python src/preprocess.py --do_transcribe`), but should be performed in order.
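
As a rough picture of what the `--do_generate` and `--do_split` steps do conceptually, here is a minimal sketch. It is not the code in `src/preprocess.py`: the function names, the `(time, word)` transcript format, and the 0.8/0.1/0.1 ratios are assumptions chosen for illustration, and the real script's data structures and defaults may differ.

```python
import random


def label_words(words, segments):
    """Split timestamped words into positive and negative text.

    words:    list of (start_time_in_seconds, text) pairs
    segments: list of (start, end) pairs taken from the labelled dataset
    """
    positive, negative = [], []
    for time, text in words:
        inside = any(start <= time < end for start, end in segments)
        (positive if inside else negative).append(text)
    return " ".join(positive), " ".join(negative)


def split_dataset(items, train=0.8, valid=0.1, seed=0):
    """Shuffle items and split them into train/validation/test lists.

    The test set receives whatever remains after train and validation.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * train)
    n_valid = int(len(items) * valid)
    return (
        items[:n_train],
        items[n_train:n_train + n_valid],
        items[n_train + n_valid:],
    )


# Made-up transcript: (start time in seconds, word)
words = [(0.0, "welcome"), (1.0, "back"), (5.0, "this"), (5.5, "video"),
         (6.0, "is"), (6.5, "sponsored"), (9.0, "today")]
positive, negative = label_words(words, segments=[(5.0, 8.0)])

# Ten placeholder examples split with the assumed ratios
train_set, valid_set, test_set = split_dataset(range(10))
```

Words whose timestamps fall inside a labelled segment become positive text and the rest become negative text. The split is seeded so that repeated runs produce the same partition.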

	For more advanced preprocessing options, run `python src/preprocess.py --help`.

	### Transformer
	The transformer is used to extract relevant segments from the transcript and apply a preliminary classification to the extracted text. To start finetuning from the current checkpoint, run:

	```bash
	python src/train.py --model_name_or_path Xenova/sponsorblock-small
	```

	If you wish to finetune an original transformer model, use one of the supported models (t5-small, t5-base, t5-large, t5-3b, t5-11b, google/t5-v1_1-small, google/t5-v1_1-base, google/t5-v1_1-large, google/t5-v1_1-xl, google/t5-v1_1-xxl) as the `--model_name_or_path`. For more information, check out the relevant documentation ([t5](https://huggingface.co/docs/transformers/model_doc/t5) or [t5v1.1](https://huggingface.co/docs/transformers/model_doc/t5v1.1)).



	### Classifier
	The classifier is used to add probabilities to the category predictions. Train the classifier using:
	```bash
	python src/train.py --train_classifier --skip_train_transformer
	```