jeremyadd committed on
Commit b5f6a08 · verified · 1 Parent(s): 2355079

Upload 11 files

Files changed (11)
  1. LICENSE +21 -0
  2. README.md +107 -14
  3. STATUS_DATATHON.txt +1 -0
  4. config.py +93 -0
  5. leaderboard.csv +2 -0
  6. leaderboard.py +53 -0
  7. requirements.txt +5 -0
  8. runtime.txt +1 -0
  9. users.csv +32 -0
  10. users.py +38 -0
  11. utils.py +30 -0
LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2017 Geoff Pleiss
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,14 +1,107 @@
- ---
- title: Mini Datathon
- emoji: 💻
- colorFrom: purple
- colorTo: red
- sdk: streamlit
- sdk_version: 1.40.1
- app_file: app.py
- pinned: false
- license: mit
- short_description: A platform to manage your datathon
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
+
+ [Heroku web app](https://minidatathon.herokuapp.com/)
+
+ ![](mini_datathon.gif)
+
+ # Mini Datathon
+
+ This datathon platform is fully developed in Python using *streamlit*, with very few lines of code!
+
+ As the title suggests, it is designed for *small datathons* (but can easily scale), and the scripts are easy to understand.
+
+ ## Installation
+
+ 1) Easy way => using Docker Hub:
+ `docker pull spotep/mini_datathon:latest`
+
+ 2) Alternative way => clone the repo onto your server:
+ `git clone mini_datathon; cd mini_datathon`
+
+ ## Usage
+
+ You need 3 simple steps to set up your mini hackathon:
+
+ 1) Edit the password of the **admin** user in [users.csv](users.csv) and the logins & passwords for the participants
+ 2) Edit the [config.py](config.py) file\
+ a) The **presentation** & the **context** of the challenge \
+ b) The **data content** and `X_train`, `y_train`, `X_test` & `y_test`, which you can upload to Google Drive and just **share the links**. \
+ c) The **evaluation metric** & **benchmark score**
+ 3) Run the scripts\
+ a) If you installed it the _alternative way_: `streamlit run main.py` \
+ b) If you pulled the docker image, just **build** and **run** the container.
+
+ Please do not forget to notify the participants that the submission file needs to be a csv **ordered the same way as given
+ in `y_train`** (see the sketch below).
+
+ _Ps: the admin user can **pause** the challenge at any time; in that case the participants won't be able to upload their submissions._
+
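A minimal sketch of how a participant could produce a correctly ordered submission file; the classifier and the tiny arrays below are placeholders, not files from this repository:

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Placeholder training/test data standing in for the real challenge files.
X_train, y_train = [[0], [1], [2]], [1, 2, 3]
X_test = [[0], [1]]

clf = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Predict in the original row order of the test set so the file stays aligned
# with the hidden target used for scoring.
submission = pd.DataFrame({
    "id": range(len(X_test)),      # sequential ids, as described in config.py
    "label": clf.predict(X_test),  # one predicted class per row
})
submission.to_csv("submission.csv", index=False)
```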
+ ## Example
+
+ An example version of the code is deployed on Heroku here: [web app](https://minidatathon.herokuapp.com/)
+
+ In the deployed version, we use the [UCI Secom](https://archive.ics.uci.edu/ml/datasets/SECOM)
+ imbalanced dataset (binary classification), evaluated by the [PR-AUC score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score):
+
+ In the [config.py](config.py) file you would need to fill in the following parameters:
+
+ - `GREATER_IS_BETTER = True`
+ - `SKLEARN_SCORER = average_precision_score`
+ - `SKLEARN_ADDITIONAL_PARAMETERS = {'average': 'micro'}`
+ - upload the relevant data to your Google Drive & share the links.
+
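As a rough illustration of how these settings fit together (the scoring code itself lives in the app and is not part of this commit), the configured scorer would presumably be called like this; `y_test` and `y_pred` are made-up arrays:

```python
from sklearn.metrics import average_precision_score

GREATER_IS_BETTER = True
SKLEARN_SCORER = average_precision_score
SKLEARN_ADDITIONAL_PARAMETERS = {'average': 'micro'}

# Made-up ground truth and predicted scores for a binary task like Secom.
y_test = [0, 1, 0, 1, 1]
y_pred = [0.1, 0.8, 0.3, 0.6, 0.9]

score = SKLEARN_SCORER(y_test, y_pred, **SKLEARN_ADDITIONAL_PARAMETERS)
print(score)  # higher is better, consistent with GREATER_IS_BETTER = True
```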
+ ## Behind the scenes
+ ### Databases
+ The platform only needs 2 components to be saved:
+ #### The leaderboard
+ The leaderboard is in fact a csv file that is updated every time a user submits predictions.
+ The csv file contains 4 columns:
+ - _id_: the login of the team
+ - _score_: the **best** score of the team
+ - _nb\_submissions_: the number of submissions the team has uploaded
+ - _rank_: the live rank of the team
+
+ There is only 1 row per team since only the best score is saved.
+
+ By default, a benchmark score is pushed to the leaderboard:
+
+ | id        | score |
+ |-----------|-------|
+ | benchmark | 0.6   |
+
+ For more details, please refer to the script [leaderboard](leaderboard.py).
+
+ #### The users
+ Like the leaderboard, it is a csv file.
+ It is supposed to be defined by the admin of the competition.
+ It contains 2 columns:
+ - login
+ - password
+
+ A default user is created at first so you can start playing with the platform:
+
+ | login | password |
+ |-----------|----------|
+ | admin | password |
+
+ In order to add new participants, simply add rows to the current users.csv file.
+
+ For more details, please refer to the script [users](users.py).
+
+ ## Next steps
+
+ - [ ] allow a *private* and a *public* leaderboard, as done on kaggle.com
+ - [ ] allow connecting using OAuth
+
+
+ ## License
+ MIT License [here](LICENSE).
+
+ ## Credits
+ We could not find an easy implementation for our yearly internal hackathon at Intel.
+ The idea originally came from my dear DevOps coworker [Elhay Efrat](https://github.com/shdowofdeath)
+ and I took on the responsibility of developing it.
+
+ If you like this project, let me know by [buying me a coffee](https://www.buymeacoffee.com/jeremyatia) :)
+
+ <a href="https://www.buymeacoffee.com/jeremyatia" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/v2/default-yellow.png" alt="Buy Me A Coffee" style="height: 100px !important;width: 300px !important;" ></a>
STATUS_DATATHON.txt ADDED
@@ -0,0 +1 @@
+ running
config.py ADDED
@@ -0,0 +1,93 @@
+ # Presentation of the challenge
+ context_markdown = """
+ The goal of the first challenge is to estimate the category of the uploaded YouTube video.
+
+ """
+ content_markdown = """
+ ### Multi-class Problem
+
+ It has the following features/target:
+
+ #### Features
+
+ - local_path: The local path to the upload's data.
+ - upload_id: The unique identifier of the upload.
+ - clean_upload_id: The upload_id with the "suicide_out_" prefix removed.
+ - upload_type: An enumeration representing the type of upload. Default is UploadType.GENERAL.
+ - features: A dictionary containing additional features associated with the upload.
+ - title: The title of the upload.
+ - playlist_title: The title of the playlist the upload belongs to.
+ - description: The description of the upload.
+ - duration_string: The duration of the upload in string format.
+ - duration: The duration of the upload in seconds.
+ - upload_date: The date when the upload was uploaded.
+ - view_count: The number of views the upload has received.
+ - comment_count: The number of comments on the upload.
+ - like_count: The number of likes on the upload.
+ - tags: The tags associated with the upload.
+
+ #### Target
+ - categories: The categories associated with the upload.
+
+
+ You can find the details about the context/data/challenge [here](https://drive.google.com/file/d/1qyEmi6UUWlyzeVPhPnqY2JNRHBPutak-/view?usp=sharing)
+ """
+ #------------------------------------------------------------------------------------------------------------------#
+
+ # Guide for the participants to get X_train, y_train and X_test
+ # The data can be uploaded to your Google Drive => get the shared links and place them here.
+ data_instruction_commands = """
+ The data can be parsed using the [youtube_modules.py](https://drive.google.com/file/d/1FCKpBTvTdL2RoNpIp9fHY18006CiglT2/view?usp=drive_link) script.
+ You can find the readme [here](https://drive.google.com/file/d/1wBJmwfZ9JzcQ0MxvwamYxBjwYbpEsgMx/view?usp=drive_link)
+
+ ```python
+ from youtube_modules import *
+ import pickle
+ import random
+ import numpy as np
+
+ train_uploads: List[Upload] = pickle.load(open("<path/to/data>/train_uploads.pkl", 'rb'))
+
+ test_uploads: List[Upload] = pickle.load(open("<path/to/data>/test_uploads.pkl", 'rb'))
+ ```
+
+ Make sure to upload your predictions as a .csv file with the columns: "id" (range(len(test_file))) and "label" (1, 2, 3).
+
+ ## Quickstart: use the notebook remotely
+ 1. conda activate py38_default
+ 2. launch the notebook from the remote machine
+ $ jupyter notebook --ip=0.0.0.0 --no-browser
+
+ then, after receiving the URL, copy it and paste it into your browser
+ https://127.0.0.1:8888/?token=7de849a953befd20682d57ac33b3e6cd9024ca25eed2433
+
+ Then replace 127.0.0.1 with your IP, e.g.
+ https://1.222.333.4:8888/?token=7de849a953befd20682d57ac33b3e6cd9024ca25eed24336
+ """
+
+ # Target on test (hidden from the participants)
+ Y_TEST_GOOGLE_PUBLIC_LINK = 'https://drive.google.com/file/d/1gQ3_ywJElpcBrewCFhVUM-fnV4SN62na/view?usp=sharing'
+ #------------------------------------------------------------------------------------------------------------------#
+
+ # Evaluation metric and content
+ from sklearn.metrics import f1_score
+ GREATER_IS_BETTER = True  # example: for ROC-AUC == True, for MSE == False, etc.
+ SKLEARN_SCORER = f1_score
+ SKLEARN_ADDITIONAL_PARAMETERS = {'average': 'weighted'}
+
+ evaluation_content = """
+ The predictions are evaluated according to the f1-score (weighted).
+
+ You can get it using
+ ```python
+ from sklearn.metrics import f1_score
+
+ f1_score(y_train, y_pred_train, average='weighted')
+ ```
+ More details [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score).
+ """
+ #------------------------------------------------------------------------------------------------------------------#
+
+ # leaderboard benchmark score, will be displayed to everyone
+ BENCHMARK_SCORE = 0.2
+ #------------------------------------------------------------------------------------------------------------------#
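config.py only stores Google Drive "view" links; the code that actually downloads them is not included in this commit. One common way to read such a link with pandas (an assumption about the surrounding app, and the helper name below is made up) is to rewrite it as a direct-download URL:

```python
import re
import pandas as pd

def drive_view_link_to_download_url(view_link: str) -> str:
    """Turn a Google Drive '.../file/d/<id>/view' share link into a direct-download URL."""
    file_id = re.search(r"/d/([^/]+)", view_link).group(1)
    return f"https://drive.google.com/uc?export=download&id={file_id}"

# The Y_TEST link from config.py, rewritten so it can be read directly.
url = drive_view_link_to_download_url(
    'https://drive.google.com/file/d/1gQ3_ywJElpcBrewCFhVUM-fnV4SN62na/view?usp=sharing'
)
# y_test = pd.read_csv(url)  # works for small public files; large files hit Drive's virus-scan page
```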
leaderboard.csv ADDED
@@ -0,0 +1,2 @@
+ id,score,nb_submissions
+ benchmark,0.2,1
leaderboard.py ADDED
@@ -0,0 +1,53 @@
+ from dataclasses import dataclass
+ from config import BENCHMARK_SCORE, GREATER_IS_BETTER
+ import numpy as np
+ import pandas as pd
+ import os
+
+
+ @dataclass
+ class LeaderBoard:
+     benchmark_score: float = BENCHMARK_SCORE
+     db_file: str = 'leaderboard.csv'
+     current_path: str = os.path.abspath(os.path.dirname(__file__))
+
+     def get(self):
+         try:
+             leaderboard = pd.read_csv(os.path.join(self.current_path, self.db_file))
+         except FileNotFoundError:
+             leaderboard = self.create()
+         return leaderboard
+
+     def create(self):
+         ldb = pd.DataFrame(columns=['id', 'score', 'nb_submissions'], index=[0])
+         ldb.loc[0, 'id'] = 'benchmark'
+         ldb.loc[0, 'score'] = self.benchmark_score
+         ldb.loc[0, 'nb_submissions'] = 1
+         ldb.to_csv(os.path.join(self.current_path, self.db_file), index=False)
+         return ldb
+
+     def edit(self, leaderboard: pd.DataFrame, id: str, score: float) -> pd.DataFrame:
+         new_lb = leaderboard.copy()
+         if new_lb[new_lb.id == id].shape[0] == 0:
+             new_lb = new_lb.append({'id': id, 'score': score, 'nb_submissions': 1}, ignore_index=True)
+         else:
+             current_score = new_lb.loc[new_lb.id == id, 'score'].values[0]
+             if self.compare_score(score, current_score, greater_is_better=GREATER_IS_BETTER):
+                 new_lb.loc[new_lb.id == id, 'score'] = score
+             new_lb.loc[new_lb.id == id, 'nb_submissions'] += 1
+         new_lb.to_csv(os.path.join(self.current_path, self.db_file), index=False)
+         return new_lb
+
+     @staticmethod
+     def show(leaderboard: pd.DataFrame, ascending: bool) -> pd.DataFrame:
+         new_lb = leaderboard.sort_values('score', ascending=ascending, ignore_index=True)
+         new_lb['rank'] = np.arange(1, new_lb.shape[0] + 1)
+         return new_lb
+
+     @staticmethod
+     def compare_score(new_score: float, current_score: float, greater_is_better: bool = True) -> bool:
+         if greater_is_better:
+             return new_score > current_score
+         else:
+             return new_score < current_score
+
+
requirements.txt ADDED
@@ -0,0 +1,5 @@
+ pandas==1.4.4
+ streamlit==1.11.0
+ scikit-learn==1.3.0
+ numpy==1.24.2
+ altair<5
runtime.txt ADDED
@@ -0,0 +1 @@
+ python-3.9.6
users.csv ADDED
@@ -0,0 +1,32 @@
+ login,password
+ admin,Datathon2023!
+ team1,zZS@LH804J
+ team2,7#onYaC3j5
+ team3,t6uN*4458f
+ team4,00Ne&o0x5!
+ team5,1P4&H4LVu$
+ team6,02#WZqB4s*
+ team7,66D5zCw!Y0
+ team8,5yW0*a&1MB
+ team9,k1A76TQ02&
+ team10,I1HuK8681@
+ team11,Eb$7$730Ty
+ team12,6%9PhRv4^A
+ team13,875^GBFi^l
+ team14,hT3^Va66!K
+ team15,Jib3W59@N0
+ team16,9d^D39a6fw
+ team17,iM@1$962U9
+ team18,&35$8f@Y$L
+ team19,r77h2E6qr^
+ team20,t28*8HAS5n
+ team21,33Ws2E&60*
+ team22,5$3M5BYTt@
+ team23,1Y2lf&9hlf
+ team24,N1g$4R9@2*
+ team25,b0*Dd5786B
+ team26,#8M03A0Bak
+ team27,84F!7f47aQ
+ team28,4f0H3Gi@Hv
+ team29,H7zC8%t2Y3
+ team30,7CZ7$40Wyl
users.py ADDED
@@ -0,0 +1,38 @@
+ from dataclasses import dataclass
+ import pandas as pd
+ import os
+
+
+ @dataclass
+ class Users:
+     db_file: str = 'users.csv'
+     current_path: str = os.path.abspath(os.path.dirname(__file__))
+
+     def get_db(self):
+         try:
+             db = pd.read_csv(os.path.join(self.current_path, self.db_file))
+         except FileNotFoundError:
+             db = self.create_db()
+         return db
+
+     def create_db(self):
+         db = pd.DataFrame(columns=['login', 'password'])
+         db = db.append({'login': 'admin', 'password': 'password'}, ignore_index=True)
+         db.to_csv(os.path.join(self.current_path, self.db_file), index=False)
+         return db
+
+     def exists(self, login: str, password: str) -> bool:
+         db = self.get_db()
+         if db.loc[db.login == login].shape[0] != 0:
+             if db.loc[db.login == login, 'password'].values[0] == password:
+                 return True
+         return False
+
+     def is_admin(self, login: str, password: str) -> bool:
+         db = self.get_db()
+         if db.loc[db.login == login, 'login'].values[0] == 'admin' and \
+                 db.loc[db.login == login, 'password'].values[0] == password:
+             return True
+         else:
+             return False
+
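Similarly, a short sketch of how the login helpers above can be called (credentials taken from the bundled users.csv). Note that `is_admin` indexes the lookup result without checking that the login exists, so it is only safe to call after `exists` has returned True:

```python
from users import Users

users = Users()

if users.exists(login='team1', password='zZS@LH804J'):
    print('participant login OK')

# Guard with exists() first: is_admin() would raise an IndexError for an unknown login.
if users.exists('admin', 'Datathon2023!') and users.is_admin('admin', 'Datathon2023!'):
    print('admin access granted')
```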
utils.py ADDED
@@ -0,0 +1,30 @@
+ import base64
+
+ # @st.cache
+ # def load_data():
+ #     df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom.data',
+ #                      sep=' ', header=None)
+ #     labels = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom_labels.data',
+ #                          header=None, sep=' ', names=['target', 'time']).iloc[:, 0]
+ #     X_train = df.sample(**train_test_sampling)
+ #     y_train = labels.loc[X_train.index]
+ #     X_test = df.loc[~df.index.isin(X_train.index), :]
+ #     y_test = labels.loc[X_test.index]
+ #     return X_train, y_train, X_test, y_test
+
+ # st.markdown(data, unsafe_allow_html=True)
+ # if st.button('Load data'):
+ #     X_train, y_train, X_test, y_test = load_data()
+ #     st.markdown(get_table_download_link(X_train, filename='X_train.csv'), unsafe_allow_html=True)
+ #     st.markdown(get_table_download_link(y_train, filename='y_train.csv'), unsafe_allow_html=True)
+ #     st.markdown(get_table_download_link(X_test, filename='X_test.csv'), unsafe_allow_html=True)
+
+
+ def get_table_download_link(df, filename):
+     """Generates a link allowing the data in a given pandas dataframe to be downloaded
+     in: dataframe
+     out: href string
+     """
+     csv = df.to_csv(index=False)
+     b64 = base64.b64encode(csv.encode()).decode()  # some strings <-> bytes conversions necessary here
+     return f'<a href="data:file/csv;base64,{b64}" download="{filename}">Download {filename}</a>'
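A tiny, self-contained example of the helper above; outside Streamlit it simply returns the HTML string, which the app would pass to `st.markdown(..., unsafe_allow_html=True)` as in the commented-out block:

```python
import pandas as pd
from utils import get_table_download_link

df = pd.DataFrame({'id': [0, 1], 'label': [1, 3]})
link = get_table_download_link(df, filename='sample.csv')
print(link[:50])  # '<a href="data:file/csv;base64,...' download markup
```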