jeremyadd committed on
Commit b5f6a08 · verified · 1 Parent(s): 2355079

Upload 11 files

Files changed (11)
  1. LICENSE +21 -0
  2. README.md +107 -14
  3. STATUS_DATATHON.txt +1 -0
  4. config.py +93 -0
  5. leaderboard.csv +2 -0
  6. leaderboard.py +53 -0
  7. requirements.txt +5 -0
  8. runtime.txt +1 -0
  9. users.csv +32 -0
  10. users.py +38 -0
  11. utils.py +30 -0
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2017 Geoff Pleiss
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,14 +1,107 @@
- ---
- title: Mini Datathon
- emoji: 💻
- colorFrom: purple
- colorTo: red
- sdk: streamlit
- sdk_version: 1.40.1
- app_file: app.py
- pinned: false
- license: mit
- short_description: A platform to manage your datathon
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+
+ [Heroku web app](https://minidatathon.herokuapp.com/)
+
+ ![](mini_datathon.gif)
+
+ # Mini Datathon
+
+ This datathon platform is fully developed in Python using *streamlit*, with very few lines of code!
+
+ As the title suggests, it is designed for *small datathons* (but can easily scale), and the scripts are easy to understand.
+
+ ## Installation
+
+ 1) Easy way => using Docker Hub:
+ `docker pull spotep/mini_datathon:latest`
+
+ 2) Alternative way => clone the repo onto your server:
+ `git clone mini_datathon; cd mini_datathon`
+
+ ## Usage
+
+ You need 3 simple steps to set up your mini hackathon:
+
+ 1) Edit the password of the **admin** user in [users.csv](users.csv) and the logins & passwords for the participants
+ 2) Edit the [config.py](config.py) file\
+ a) The **presentation** & the **context** of the challenge \
+ b) The **data content** and `X_train`, `y_train`, `X_test` & `y_test`, which you can upload to Google Drive and just **share the links**. \
+ c) The **evaluation metric** & **benchmark score**
+ 3) Run the scripts\
+ a) If you installed it the _alternative way_: `streamlit run main.py` \
+ b) If you pulled the docker image, just **build** and **run** the container.
+
+ Please do not forget to notify the participants that the submission file needs to be a csv **ordered the same way as given
+ in `y_train`** (see the sketch below).
+
+ _Ps: the admin user can **pause** the challenge at any time; in that case the participants won't be able to upload their submissions._
+
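A minimal sketch of how a participant could produce a correctly ordered submission file; the classifier and the tiny arrays below are placeholders, not files from this repository:

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Placeholder training/test data standing in for the real challenge files.
X_train, y_train = [[0], [1], [2]], [1, 2, 3]
X_test = [[0], [1]]

clf = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Predict in the original row order of the test set so the file stays aligned
# with the hidden target used for scoring.
submission = pd.DataFrame({
    "id": range(len(X_test)),      # sequential ids, as described in config.py
    "label": clf.predict(X_test),  # one predicted class per row
})
submission.to_csv("submission.csv", index=False)
```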
+ ## Example
+
+ An example version of the code is deployed on Heroku here: [web app](https://minidatathon.herokuapp.com/)
+
+ In the deployed version, we use the [UCI Secom](https://archive.ics.uci.edu/ml/datasets/SECOM)
+ imbalanced dataset (binary classification), evaluated by the [PR-AUC score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score):
+
+ In the [config.py](config.py) file you would need to fill in the following parameters:
+
+ - `GREATER_IS_BETTER = True`
+ - `SKLEARN_SCORER = average_precision_score`
+ - `SKLEARN_ADDITIONAL_PARAMETERS = {'average': 'micro'}`
+ - upload the relevant data to your Google Drive & share the links.
+
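As a rough illustration of how these settings fit together (the scoring code itself lives in the app and is not part of this commit), the configured scorer would presumably be called like this; `y_test` and `y_pred` are made-up arrays:

```python
from sklearn.metrics import average_precision_score

GREATER_IS_BETTER = True
SKLEARN_SCORER = average_precision_score
SKLEARN_ADDITIONAL_PARAMETERS = {'average': 'micro'}

# Made-up ground truth and predicted scores for a binary task like Secom.
y_test = [0, 1, 0, 1, 1]
y_pred = [0.1, 0.8, 0.3, 0.6, 0.9]

score = SKLEARN_SCORER(y_test, y_pred, **SKLEARN_ADDITIONAL_PARAMETERS)
print(score)  # higher is better, consistent with GREATER_IS_BETTER = True
```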
+ ## Behind the scenes
+ ### Databases
+ The platform only needs 2 components to be saved:
+ #### The leaderboard
+ The leaderboard is in fact a csv file that is updated every time a user submits predictions.
+ The csv file contains 4 columns:
+ - _id_: the login of the team
+ - _score_: the **best** score of the team
+ - _nb\_submissions_: the number of submissions the team has uploaded
+ - _rank_: the live rank of the team
+
+ There is only 1 row per team since only the best score is saved.
+
+ By default, a benchmark score is pushed to the leaderboard:
+
+ | id        | score |
+ |-----------|-------|
+ | benchmark | 0.6   |
+
+ For more details, please refer to the script [leaderboard](leaderboard.py).
+
+ #### The users
+ Like the leaderboard, it is a csv file.
+ It is supposed to be defined by the admin of the competition.
+ It contains 2 columns:
+ - login
+ - password
+
+ A default user is created at first so you can start playing with the platform:
+
+ | login | password |
+ |-----------|----------|
+ | admin | password |
+
+ In order to add new participants, simply add rows to the current users.csv file.
+
+ For more details, please refer to the script [users](users.py).
+
+ ## Next steps
+
+ - [ ] allow a *private* and a *public* leaderboard, as done on kaggle.com
+ - [ ] allow connecting using OAuth
+
+
+ ## License
+ MIT License [here](LICENSE).
+
+ ## Credits
+ We could not find an easy implementation for our yearly internal hackathon at Intel.
+ The idea originally came from my dear DevOps coworker [Elhay Efrat](https://github.com/shdowofdeath)
+ and I took on the responsibility of developing it.
+
+ If you like this project, let me know by [buying me a coffee](https://www.buymeacoffee.com/jeremyatia) :)
+
+ <a href="https://www.buymeacoffee.com/jeremyatia" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/v2/default-yellow.png" alt="Buy Me A Coffee" style="height: 100px !important;width: 300px !important;" ></a>
STATUS_DATATHON.txt ADDED
@@ -0,0 +1 @@
+ running
config.py ADDED
@@ -0,0 +1,93 @@
+ # Presentation of the challenge
+ context_markdown = """
+ The goal of the first challenge is to estimate the category of the uploaded YouTube video.
+
+ """
+ content_markdown = """
+ ### Multi-class Problem
+
+ It has the following features/target:
+
+ #### Features
+
+ - local_path: The local path to the upload's data.
+ - upload_id: The unique identifier of the upload.
+ - clean_upload_id: The upload_id with the "suicide_out_" prefix removed.
+ - upload_type: An enumeration representing the type of upload. Default is UploadType.GENERAL.
+ - features: A dictionary containing additional features associated with the upload.
+ - title: The title of the upload.
+ - playlist_title: The title of the playlist the upload belongs to.
+ - description: The description of the upload.
+ - duration_string: The duration of the upload in string format.
+ - duration: The duration of the upload in seconds.
+ - upload_date: The date when the upload was uploaded.
+ - view_count: The number of views the upload has received.
+ - comment_count: The number of comments on the upload.
+ - like_count: The number of likes on the upload.
+ - tags: The tags associated with the upload.
+
+ #### Target
+ - categories: The categories associated with the upload.
+
+
+ You can find the details about the context/data/challenge [here](https://drive.google.com/file/d/1qyEmi6UUWlyzeVPhPnqY2JNRHBPutak-/view?usp=sharing)
+ """
+ #------------------------------------------------------------------------------------------------------------------#
+
+ # Guide for the participants to get X_train, y_train and X_test
+ # The data can be uploaded to your Google Drive => get the shared links and place them here.
+ data_instruction_commands = """
+ The data can be parsed using the [youtube_modules.py](https://drive.google.com/file/d/1FCKpBTvTdL2RoNpIp9fHY18006CiglT2/view?usp=drive_link) script.
+ You can find the readme [here](https://drive.google.com/file/d/1wBJmwfZ9JzcQ0MxvwamYxBjwYbpEsgMx/view?usp=drive_link)
+
+ ```python
+ from youtube_modules import *
+ import pickle
+ import random
+ import numpy as np
+
+ train_uploads: List[Upload] = pickle.load(open("<path/to/data>/train_uploads.pkl", 'rb'))
+
+ test_uploads: List[Upload] = pickle.load(open("<path/to/data>/test_uploads.pkl", 'rb'))
+ ```
+
+ Make sure to upload your predictions as a .csv file with the columns: "id" (range(len(test_file))) and "label" (1, 2, 3).
+
+ ## Quickstart: use the notebook remotely
+ 1. conda activate py38_default
+ 2. launch the notebook from the remote machine
+ $ jupyter notebook --ip=0.0.0.0 --no-browser
+
+ then, after receiving the URL, copy it and paste it into your browser
+ https://127.0.0.1:8888/?token=7de849a953befd20682d57ac33b3e6cd9024ca25eed2433
+
+ Then replace 127.0.0.1 with your IP, e.g.
+ https://1.222.333.4:8888/?token=7de849a953befd20682d57ac33b3e6cd9024ca25eed24336
+ """
+
+ # Target on test (hidden from the participants)
+ Y_TEST_GOOGLE_PUBLIC_LINK = 'https://drive.google.com/file/d/1gQ3_ywJElpcBrewCFhVUM-fnV4SN62na/view?usp=sharing'
+ #------------------------------------------------------------------------------------------------------------------#
+
+ # Evaluation metric and content
+ from sklearn.metrics import f1_score
+ GREATER_IS_BETTER = True  # example: for ROC-AUC == True, for MSE == False, etc.
+ SKLEARN_SCORER = f1_score
+ SKLEARN_ADDITIONAL_PARAMETERS = {'average': 'weighted'}
+
+ evaluation_content = """
+ The predictions are evaluated according to the f1-score (weighted).
+
+ You can get it using
+ ```python
+ from sklearn.metrics import f1_score
+
+ f1_score(y_train, y_pred_train, average='weighted')
+ ```
+ More details [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score).
+ """
+ #------------------------------------------------------------------------------------------------------------------#
+
+ # leaderboard benchmark score, will be displayed to everyone
+ BENCHMARK_SCORE = 0.2
+ #------------------------------------------------------------------------------------------------------------------#
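config.py only stores Google Drive "view" links; the code that actually downloads them is not included in this commit. One common way to read such a link with pandas (an assumption about the surrounding app, and the helper name below is made up) is to rewrite it as a direct-download URL:

```python
import re
import pandas as pd

def drive_view_link_to_download_url(view_link: str) -> str:
    """Turn a Google Drive '.../file/d/<id>/view' share link into a direct-download URL."""
    file_id = re.search(r"/d/([^/]+)", view_link).group(1)
    return f"https://drive.google.com/uc?export=download&id={file_id}"

# The Y_TEST link from config.py, rewritten so it can be read directly.
url = drive_view_link_to_download_url(
    'https://drive.google.com/file/d/1gQ3_ywJElpcBrewCFhVUM-fnV4SN62na/view?usp=sharing'
)
# y_test = pd.read_csv(url)  # works for small public files; large files hit Drive's virus-scan page
```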
leaderboard.csv ADDED
@@ -0,0 +1,2 @@
+ id,score,nb_submissions
+ benchmark,0.2,1
leaderboard.py ADDED
@@ -0,0 +1,53 @@
+ from dataclasses import dataclass
+ from config import BENCHMARK_SCORE, GREATER_IS_BETTER
+ import numpy as np
+ import pandas as pd
+ import os
+
+
+ @dataclass
+ class LeaderBoard:
+     benchmark_score: float = BENCHMARK_SCORE
+     db_file: str = 'leaderboard.csv'
+     current_path: str = os.path.abspath(os.path.dirname(__file__))
+
+     def get(self):
+         try:
+             leaderboard = pd.read_csv(os.path.join(self.current_path, self.db_file))
+         except FileNotFoundError:
+             leaderboard = self.create()
+         return leaderboard
+
+     def create(self):
+         ldb = pd.DataFrame(columns=['id', 'score', 'nb_submissions'], index=[0])
+         ldb.loc[0, 'id'] = 'benchmark'
+         ldb.loc[0, 'score'] = self.benchmark_score
+         ldb.loc[0, 'nb_submissions'] = 1
+         ldb.to_csv(os.path.join(self.current_path, self.db_file), index=False)
+         return ldb
+
+     def edit(self, leaderboard: pd.DataFrame, id: str, score: float) -> pd.DataFrame:
+         new_lb = leaderboard.copy()
+         if new_lb[new_lb.id == id].shape[0] == 0:
+             new_lb = new_lb.append({'id': id, 'score': score, 'nb_submissions': 1}, ignore_index=True)
+         else:
+             current_score = new_lb.loc[new_lb.id == id, 'score'].values[0]
+             if self.compare_score(score, current_score, greater_is_better=GREATER_IS_BETTER):
+                 new_lb.loc[new_lb.id == id, 'score'] = score
+             new_lb.loc[new_lb.id == id, 'nb_submissions'] += 1
+         new_lb.to_csv(os.path.join(self.current_path, self.db_file), index=False)
+         return new_lb
+
+     @staticmethod
+     def show(leaderboard: pd.DataFrame, ascending: bool) -> pd.DataFrame:
+         new_lb = leaderboard.sort_values('score', ascending=ascending, ignore_index=True)
+         new_lb['rank'] = np.arange(1, new_lb.shape[0] + 1)
+         return new_lb
+
+     @staticmethod
+     def compare_score(new_score: float, current_score: float, greater_is_better: bool = True) -> bool:
+         if greater_is_better:
+             return new_score > current_score
+         else:
+             return new_score < current_score
+
+
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ pandas==1.4.4
+ streamlit==1.11.0
+ scikit-learn==1.3.0
+ numpy==1.24.2
+ altair<5
runtime.txt ADDED
@@ -0,0 +1 @@
+ python-3.9.6
users.csv ADDED
@@ -0,0 +1,32 @@
+ login,password
+ admin,Datathon2023!
+ team1,zZS@LH804J
+ team2,7#onYaC3j5
+ team3,t6uN*4458f
+ team4,00Ne&o0x5!
+ team5,1P4&H4LVu$
+ team6,02#WZqB4s*
+ team7,66D5zCw!Y0
+ team8,5yW0*a&1MB
+ team9,k1A76TQ02&
+ team10,I1HuK8681@
+ team11,Eb$7$730Ty
+ team12,6%9PhRv4^A
+ team13,875^GBFi^l
+ team14,hT3^Va66!K
+ team15,Jib3W59@N0
+ team16,9d^D39a6fw
+ team17,iM@1$962U9
+ team18,&35$8f@Y$L
+ team19,r77h2E6qr^
+ team20,t28*8HAS5n
+ team21,33Ws2E&60*
+ team22,5$3M5BYTt@
+ team23,1Y2lf&9hlf
+ team24,N1g$4R9@2*
+ team25,b0*Dd5786B
+ team26,#8M03A0Bak
+ team27,84F!7f47aQ
+ team28,4f0H3Gi@Hv
+ team29,H7zC8%t2Y3
+ team30,7CZ7$40Wyl
users.py ADDED
@@ -0,0 +1,38 @@
+ from dataclasses import dataclass
+ import pandas as pd
+ import os
+
+
+ @dataclass
+ class Users:
+     db_file: str = 'users.csv'
+     current_path: str = os.path.abspath(os.path.dirname(__file__))
+
+     def get_db(self):
+         try:
+             db = pd.read_csv(os.path.join(self.current_path, self.db_file))
+         except FileNotFoundError:
+             db = self.create_db()
+         return db
+
+     def create_db(self):
+         db = pd.DataFrame(columns=['login', 'password'])
+         db = db.append({'login': 'admin', 'password': 'password'}, ignore_index=True)
+         db.to_csv(os.path.join(self.current_path, self.db_file), index=False)
+         return db
+
+     def exists(self, login: str, password: str) -> bool:
+         db = self.get_db()
+         if db.loc[db.login == login].shape[0] != 0:
+             if db.loc[db.login == login, 'password'].values[0] == password:
+                 return True
+         return False
+
+     def is_admin(self, login: str, password: str) -> bool:
+         db = self.get_db()
+         if db.loc[db.login == login, 'login'].values[0] == 'admin' and \
+                 db.loc[db.login == login, 'password'].values[0] == password:
+             return True
+         else:
+             return False
+
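Similarly, a short sketch of how the login helpers above can be called (credentials taken from the bundled users.csv). Note that `is_admin` indexes the lookup result without checking that the login exists, so it is only safe to call after `exists` has returned True:

```python
from users import Users

users = Users()

if users.exists(login='team1', password='zZS@LH804J'):
    print('participant login OK')

# Guard with exists() first: is_admin() would raise an IndexError for an unknown login.
if users.exists('admin', 'Datathon2023!') and users.is_admin('admin', 'Datathon2023!'):
    print('admin access granted')
```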
utils.py ADDED
@@ -0,0 +1,30 @@
+ import base64
+
+ # @st.cache
+ # def load_data():
+ #     df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom.data',
+ #                      sep=' ', header=None)
+ #     labels = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom_labels.data',
+ #                          header=None, sep=' ', names=['target', 'time']).iloc[:, 0]
+ #     X_train = df.sample(**train_test_sampling)
+ #     y_train = labels.loc[X_train.index]
+ #     X_test = df.loc[~df.index.isin(X_train.index), :]
+ #     y_test = labels.loc[X_test.index]
+ #     return X_train, y_train, X_test, y_test
+
+ # st.markdown(data, unsafe_allow_html=True)
+ # if st.button('Load data'):
+ #     X_train, y_train, X_test, y_test = load_data()
+ #     st.markdown(get_table_download_link(X_train, filename='X_train.csv'), unsafe_allow_html=True)
+ #     st.markdown(get_table_download_link(y_train, filename='y_train.csv'), unsafe_allow_html=True)
+ #     st.markdown(get_table_download_link(X_test, filename='X_test.csv'), unsafe_allow_html=True)
+
+
+ def get_table_download_link(df, filename):
+     """Generates a link allowing the data in a given pandas dataframe to be downloaded
+     in: dataframe
+     out: href string
+     """
+     csv = df.to_csv(index=False)
+     b64 = base64.b64encode(csv.encode()).decode()  # some strings <-> bytes conversions necessary here
+     return f'<a href="data:file/csv;base64,{b64}" download="{filename}">Download {filename}</a>'
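A tiny, self-contained example of the helper above; outside Streamlit it simply returns the HTML string, which the app would pass to `st.markdown(..., unsafe_allow_html=True)` as in the commented-out block:

```python
import pandas as pd
from utils import get_table_download_link

df = pd.DataFrame({'id': [0, 1], 'label': [1, 3]})
link = get_table_download_link(df, filename='sample.csv')
print(link[:50])  # '<a href="data:file/csv;base64,...' download markup
```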