# mini_datathon/config.py
# Presentation of the challenge
context_markdown = """
The goal of this first challenge is to predict the category of an uploaded YouTube video.
"""
content_markdown = """
### Multi-class Problem
The dataset has the following features and target:
#### Features
- local_path: The local path to the upload's data.
- upload_id: The unique identifier of the upload.
- clean_upload_id: The upload_id with the "suicide_out_" prefix removed.
- upload_type: An enumeration representing the type of upload. Default is UploadType.GENERAL.
- features: A dictionary containing additional features associated with the upload.
- title: The title of the upload.
- playlist_title: The title of the playlist the upload belongs to.
- description: The description of the upload.
- duration_string: The duration of the upload in string format.
- duration: The duration of the upload in seconds.
- upload_date: The date when the upload was uploaded.
- view_count: The number of views the upload has received.
- comment_count: The number of comments on the upload.
- like_count: The number of likes on the upload.
- tags: The tags associated with the upload.
#### Target
- categories: The categories associated with the upload.
You can find the details about the context/data/challenge [here](https://drive.google.com/file/d/1qyEmi6UUWlyzeVPhPnqY2JNRHBPutak-/view?usp=sharing)
"""
#------------------------------------------------------------------------------------------------------------------#
# Guide for the participants to get X_train, y_train and X_test
# Upload the data files to your Google Drive, get the shared links, and place them here.
data_instruction_commands = """
The data can be parsed using the [youtube_modules.py](https://drive.google.com/file/d/1FCKpBTvTdL2RoNpIp9fHY18006CiglT2/view?usp=drive_link) script.
You can find the readme [here](https://drive.google.com/file/d/1wBJmwfZ9JzcQ0MxvwamYxBjwYbpEsgMx/view?usp=drive_link).
```python
from typing import List
import pickle

from youtube_modules import Upload

with open("<path/to/data>/train_uploads.pkl", "rb") as f:
    train_uploads: List[Upload] = pickle.load(f)
with open("<path/to/data>/test_uploads.pkl", "rb") as f:
    test_uploads: List[Upload] = pickle.load(f)
```
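To explore the data in tabular form, here is a minimal sketch that flattens the uploads into a pandas DataFrame. It assumes each `Upload` exposes attributes named as in the feature list above (`upload_id`, `title`, `duration`, `view_count`, ...); adjust the column list to the real class.
```python
# Sketch: flatten Upload objects into a pandas DataFrame.
# Assumes attribute names match the feature list above; adjust as needed.
import pandas as pd

def uploads_to_frame(uploads):
    """Build a DataFrame with one row per upload and one column per feature."""
    return pd.DataFrame(
        {
            "upload_id": [u.upload_id for u in uploads],
            "title": [u.title for u in uploads],
            "duration": [u.duration for u in uploads],
            "view_count": [u.view_count for u in uploads],
        }
    )
```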
Make sure to upload your predictions as a .csv file with two columns: "id" (0 to len(test_uploads) - 1) and "label" (the predicted category: 1, 2, or 3).
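For example, a minimal sketch of writing the submission file with pandas (`y_pred` is assumed to be your predicted labels for the test uploads, in order):
```python
# Sketch: write predictions in the required submission format
# ("id" = 0..len(test_uploads)-1, "label" = predicted category).
import pandas as pd

def write_submission(y_pred, path="submission.csv"):
    """Save predictions as a two-column CSV in the expected format."""
    submission = pd.DataFrame({"id": range(len(y_pred)), "label": y_pred})
    submission.to_csv(path, index=False)
    return submission
```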
## Quickstart: use the notebook remotely
1. Activate the environment: `conda activate py38_default`
2. Start the notebook server on the remote machine: `jupyter notebook --ip=0.0.0.0 --no-browser`
3. Copy the URL it prints, e.g.
   https://127.0.0.1:8888/?token=7de849a953befd20682d57ac33b3e6cd9024ca25eed2433
   then replace 127.0.0.1 with the remote machine's IP, e.g.
   https://1.222.333.4:8888/?token=7de849a953befd20682d57ac33b3e6cd9024ca25eed24336
   and open it in your browser.
"""
# Target on test (hidden from the participants)
Y_TEST_GOOGLE_PUBLIC_LINK = 'https://drive.google.com/file/d/1gQ3_ywJElpcBrewCFhVUM-fnV4SN62na/view?usp=sharing'
#------------------------------------------------------------------------------------------------------------------#
# Evaluation metric and content
from sklearn.metrics import f1_score
GREATER_IS_BETTER = True  # True when higher scores are better (e.g. ROC-AUC); False for error metrics (e.g. MSE)
SKLEARN_SCORER = f1_score
SKLEARN_ADDITIONAL_PARAMETERS = {'average': 'weighted'}
evaluation_content = """
The predictions are evaluated with the weighted F1 score.
You can compute it with:
```python
from sklearn.metrics import f1_score
f1_score(y_train, y_pred_train, average='weighted')
```
More details [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score).
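For example, a small worked example (the labels are illustrative toy data, not from the challenge):
```python
from sklearn.metrics import f1_score

y_true = [1, 1, 2, 3]
y_pred = [1, 2, 2, 3]
# Per-class F1: class 1 -> 2/3, class 2 -> 2/3, class 3 -> 1.0;
# weighting by support (2, 1, 1) gives (2*2/3 + 2/3 + 1) / 4 = 0.75
print(f1_score(y_true, y_pred, average="weighted"))  # 0.75
```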
"""
#------------------------------------------------------------------------------------------------------------------#
# leaderboard benchmark score, will be displayed to everyone
BENCHMARK_SCORE = 0.2
#------------------------------------------------------------------------------------------------------------------#