Upload 11 files
- LICENSE +21 -0
- README.md +107 -14
- STATUS_DATATHON.txt +1 -0
- config.py +93 -0
- leaderboard.csv +2 -0
- leaderboard.py +53 -0
- requirements.txt +5 -0
- runtime.txt +1 -0
- users.csv +32 -0
- users.py +38 -0
- utils.py +30 -0
LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2017 Geoff Pleiss

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md
CHANGED
@@ -1,14 +1,107 @@
[](https://opensource.org/licenses/MIT)

[Heroku web app](https://minidatathon.herokuapp.com/)

# Mini Datathon

This datathon platform is fully developed in Python using *Streamlit*, with very few lines of code!

As the title suggests, it is designed for *small datathons* (but can easily scale), and the scripts are easy to understand.

## Installation

1) Easy way => using Docker Hub:
`docker pull spotep/mini_datathon:latest`

2) Alternative way => clone the repo onto your server:
`git clone mini_datathon; cd mini_datathon`

## Usage

You need 3 simple steps to set up your mini hackathon:

1) Edit the password of the **admin** user in [users.csv](users.csv) and the logins & passwords for the participants
2) Edit the [config.py](config.py) file (see the sketch after this list)\
   a) The **presentation** & the **context** of the challenge \
   b) The **data content** and `X_train`, `y_train`, `X_test` & `y_test`, which you can upload to Google Drive and just **share the links** \
   c) The **evaluation metric** & **benchmark score**
3) Run the scripts\
   a) If you installed it the _alternative way_: `streamlit run main.py` \
   b) If you pulled the docker image, just **build** and **run** the container.
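For step 2, here is a minimal sketch of the fields you would edit in [config.py](config.py); the variable names match the example config shipped in this repo, while the values below are only placeholders:

```python
# config.py -- placeholder values for illustration only
from sklearn.metrics import average_precision_score

context_markdown = """Describe the goal of the challenge here."""
content_markdown = """Describe the features and the target here."""
data_instruction_commands = """Explain how to get X_train, y_train and X_test (e.g. Google Drive links)."""

# shared link to the hidden test target (never shown to the participants)
Y_TEST_GOOGLE_PUBLIC_LINK = '<your shared Google Drive link to y_test>'

# evaluation metric and benchmark displayed on the leaderboard
GREATER_IS_BETTER = True
SKLEARN_SCORER = average_precision_score
SKLEARN_ADDITIONAL_PARAMETERS = {'average': 'micro'}
BENCHMARK_SCORE = 0.6
```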
Please do not forget to notify the participants that the submission file needs to be a CSV **ordered the same way as given in `y_train`**.

_PS: at any time, the admin user can **pause** the challenge; in that case, participants won't be able to upload their submissions._

## Example

An example version of the code is deployed on Heroku here: [web app](https://minidatathon.herokuapp.com/)

The deployed version uses the [UCI Secom](https://archive.ics.uci.edu/ml/datasets/SECOM)
imbalanced dataset (binary classification), evaluated with the [PR-AUC score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html#sklearn.metrics.average_precision_score).

In the [config.py](config.py) file you would need to fill in the following parameters (a sketch of how they are applied follows the list):

- `GREATER_IS_BETTER = True`
- `SKLEARN_SCORER = average_precision_score`
- `SKLEARN_ADDITIONAL_PARAMETERS = {'average': 'micro'}`
- upload the relevant data to your Google Drive & share the links.
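For intuition, the scorer and its extra parameters are combined the way scikit-learn metrics are normally called; `main.py` is not part of this upload, so the exact call below is an illustrative assumption, with toy `y_true`/`y_score` values:

```python
from sklearn.metrics import average_precision_score

GREATER_IS_BETTER = True
SKLEARN_SCORER = average_precision_score
SKLEARN_ADDITIONAL_PARAMETERS = {'average': 'micro'}

# toy ground truth and predicted scores, for illustration only
y_true = [0, 1, 0, 1]
y_score = [0.1, 0.8, 0.3, 0.6]

score = SKLEARN_SCORER(y_true, y_score, **SKLEARN_ADDITIONAL_PARAMETERS)
print(score)  # higher is better, per GREATER_IS_BETTER
```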
## Behind the scenes
### Databases
The platform needs only 2 components to be saved:
#### The leaderboard
The leaderboard is in fact a CSV file that is updated every time a user submits predictions.
The leaderboard table contains 4 columns:
- _id_: the login of the team
- _score_: the **best** score of the team
- _nb\_submissions_: the number of submissions the team has uploaded
- _rank_: the live rank of the team

There is only 1 row per team, since only the best score is saved.

By default, a benchmark score is pushed to the leaderboard:

| id | score |
|-----------|-------|
| benchmark | 0.6 |

For more details, please refer to the script [leaderboard](leaderboard.py).
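Note that _rank_ is not stored in the CSV (which only keeps `id`, `score` and `nb_submissions`); it is recomputed when the leaderboard is displayed by sorting on the score, as `LeaderBoard.show` does. A minimal sketch of that idea, using a toy leaderboard:

```python
import numpy as np
import pandas as pd

# toy leaderboard, for illustration only
lb = pd.DataFrame({'id': ['benchmark', 'team1'],
                   'score': [0.6, 0.7],
                   'nb_submissions': [1, 3]})

# sort best-first (descending when greater is better) and assign live ranks 1..N
shown = lb.sort_values('score', ascending=False, ignore_index=True)
shown['rank'] = np.arange(1, shown.shape[0] + 1)
print(shown)
```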
#### The users
Like the leaderboard, it is a CSV file.
It is supposed to be defined by the admin of the competition.
It contains 2 columns:
- login
- password

A default user is created at first so you can start playing with the platform:

| login | password |
|-----------|----------|
| admin | password |

In order to add new participants, simply add rows to the current users.csv file.

For more details, please refer to the script [users](users.py).

## Next steps

- [ ] allow a *private* and a *public* leaderboard, as done on kaggle.com
- [ ] allow connecting using OAuth

## License
MIT License [here](LICENSE).

## Credits
We could not find an easy implementation for our yearly internal hackathon at Intel.
The idea originally came from my dear devops coworker [Elhay Efrat](https://github.com/shdowofdeath),
and I took the responsibility to develop it.

If you like this project, let me know by [buying me a coffee](https://www.buymeacoffee.com/jeremyatia) :)

<a href="https://www.buymeacoffee.com/jeremyatia" target="_blank"><img src="https://cdn.buymeacoffee.com/buttons/v2/default-yellow.png" alt="Buy Me A Coffee" style="height: 100px !important;width: 300px !important;" ></a>
STATUS_DATATHON.txt
ADDED
@@ -0,0 +1 @@
running
config.py
ADDED
@@ -0,0 +1,93 @@
# Presentation of the challenge
context_markdown = """
The goal of the first challenge is to estimate the category of the uploaded YouTube video.
"""
content_markdown = """
### Multi-class Problem

It has the following features/target:

#### Features

- local_path: The local path to the upload's data.
- upload_id: The unique identifier of the upload.
- clean_upload_id: The upload_id with the "suicide_out_" prefix removed.
- upload_type: An enumeration representing the type of upload. Default is UploadType.GENERAL.
- features: A dictionary containing additional features associated with the upload.
- title: The title of the upload.
- playlist_title: The title of the playlist the upload belongs to.
- description: The description of the upload.
- duration_string: The duration of the upload in string format.
- duration: The duration of the upload in seconds.
- upload_date: The date when the upload was uploaded.
- view_count: The number of views the upload has received.
- comment_count: The number of comments on the upload.
- like_count: The number of likes on the upload.
- tags: The tags associated with the upload.

### Target
- categories: The categories associated with the upload.

You can find the details about the context/data/challenge [here](https://drive.google.com/file/d/1qyEmi6UUWlyzeVPhPnqY2JNRHBPutak-/view?usp=sharing)
"""
#------------------------------------------------------------------------------------------------------------------#

# Guide for the participants to get X_train, y_train and X_test
# The data can be placed in your Google Drive => get the shared links and place them here.
data_instruction_commands = """
The data can be parsed using the [youtube_modules.py](https://drive.google.com/file/d/1FCKpBTvTdL2RoNpIp9fHY18006CiglT2/view?usp=drive_link) script.
You can find the readme [here](https://drive.google.com/file/d/1wBJmwfZ9JzcQ0MxvwamYxBjwYbpEsgMx/view?usp=drive_link)

```python
from youtube_modules import *
import pickle
import random
import numpy as np

train_uploads: List[Upload] = pickle.load(open("<path/to/data>/train_uploads.pkl", 'rb'))

test_uploads: List[Upload] = pickle.load(open("<path/to/data>/test_uploads.pkl", 'rb'))
```

Make sure to upload your predictions as a .csv file with the columns: "id" (range(len(test_file))) and "label" (1, 2, 3).

## Quickstart: use a notebook remotely
1. conda activate py38_default
2. start the notebook on the remote machine:
   $ jupyter notebook --ip=0.0.0.0 --no-browser

Then copy the URL you receive and paste it into your browser, e.g.
https://127.0.0.1:8888/?token=7de849a953befd20682d57ac33b3e6cd9024ca25eed2433

Then replace 127.0.0.1 with your IP, e.g.
https://1.222.333.4:8888/?token=7de849a953befd20682d57ac33b3e6cd9024ca25eed24336
"""

# Target on test (hidden from the participants)
Y_TEST_GOOGLE_PUBLIC_LINK = 'https://drive.google.com/file/d/1gQ3_ywJElpcBrewCFhVUM-fnV4SN62na/view?usp=sharing'
#------------------------------------------------------------------------------------------------------------------#

# Evaluation metric and content
from sklearn.metrics import f1_score
GREATER_IS_BETTER = True  # example: True for ROC-AUC, False for MSE, etc.
SKLEARN_SCORER = f1_score
SKLEARN_ADDITIONAL_PARAMETERS = {'average': 'weighted'}

evaluation_content = """
The predictions are evaluated according to the f1-score (weighted).

You can compute it using
```python
from sklearn.metrics import f1_score

f1_score(y_train, y_pred_train, average='weighted')
```
More details [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score).
"""
#------------------------------------------------------------------------------------------------------------------#

# leaderboard benchmark score, will be displayed to everyone
BENCHMARK_SCORE = 0.2
#------------------------------------------------------------------------------------------------------------------#
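For participants, a minimal sketch of producing the submission file described in `data_instruction_commands` above; the predicted labels are placeholders and `submission.csv` is just an example filename:

```python
import pandas as pd

# placeholder predictions: one category label (1, 2 or 3) per test upload
y_pred = [1, 3, 2]  # replace with your model's output, length == len(test_uploads)

submission = pd.DataFrame({
    'id': range(len(y_pred)),  # "id" is simply range(len(test_file))
    'label': y_pred,
})
submission.to_csv('submission.csv', index=False)
```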
leaderboard.csv
ADDED
@@ -0,0 +1,2 @@
id,score,nb_submissions
benchmark,0.2,1
leaderboard.py
ADDED
@@ -0,0 +1,53 @@
from dataclasses import dataclass
from config import BENCHMARK_SCORE, GREATER_IS_BETTER
import numpy as np
import pandas as pd
import os


@dataclass
class LeaderBoard:
    benchmark_score: float = BENCHMARK_SCORE
    db_file: str = 'leaderboard.csv'
    current_path: str = os.path.abspath(os.path.dirname(__file__))

    def get(self):
        try:
            leaderboard = pd.read_csv(os.path.join(self.current_path, self.db_file))
        except FileNotFoundError:
            leaderboard = self.create()
        return leaderboard

    def create(self):
        ldb = pd.DataFrame(columns=['id', 'score', 'nb_submissions'], index=[0])
        ldb.loc[0, 'id'] = 'benchmark'
        ldb.loc[0, 'score'] = self.benchmark_score
        ldb.loc[0, 'nb_submissions'] = 1
        ldb.to_csv(os.path.join(self.current_path, self.db_file), index=False)
        return ldb

    def edit(self, leaderboard: pd.DataFrame, id: str, score: float) -> pd.DataFrame:
        new_lb = leaderboard.copy()
        if new_lb[new_lb.id == id].shape[0] == 0:
            new_lb = new_lb.append({'id': id, 'score': score, 'nb_submissions': 1}, ignore_index=True)
        else:
            current_score = new_lb.loc[new_lb.id == id, 'score'].values[0]
            if self.compare_score(score, current_score, greater_is_better=GREATER_IS_BETTER):
                new_lb.loc[new_lb.id == id, 'score'] = score
            new_lb.loc[new_lb.id == id, 'nb_submissions'] += 1
        new_lb.to_csv(os.path.join(self.current_path, self.db_file), index=False)
        return new_lb

    @staticmethod
    def show(leaderboard: pd.DataFrame, ascending: bool) -> pd.DataFrame:
        new_lb = leaderboard.sort_values('score', ascending=ascending, ignore_index=True)
        new_lb['rank'] = np.arange(1, new_lb.shape[0] + 1)
        return new_lb

    @staticmethod
    def compare_score(new_score: float, current_score: float, greater_is_better: bool = True) -> bool:
        if greater_is_better:
            return new_score > current_score
        else:
            return new_score < current_score
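A short usage sketch for this class; the actual wiring lives in the Streamlit app (`main.py`, not part of this upload), so the calls below are illustrative. Note that `DataFrame.append` used above requires the pinned pandas 1.4.x (it was removed in pandas 2.0):

```python
from config import GREATER_IS_BETTER
from leaderboard import LeaderBoard

lb = LeaderBoard()
board = lb.get()                                # loads leaderboard.csv, creating it with the benchmark row if missing
board = lb.edit(board, id='team1', score=0.42)  # records a submission; only the team's best score is kept
ranked = lb.show(board, ascending=not GREATER_IS_BETTER)  # sorts and adds the live 'rank' column
print(ranked)
```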
requirements.txt
ADDED
@@ -0,0 +1,5 @@
pandas==1.4.4
streamlit==1.11.0
scikit-learn==1.3.0
numpy==1.24.2
altair<5
runtime.txt
ADDED
@@ -0,0 +1 @@
python-3.9.6
users.csv
ADDED
@@ -0,0 +1,32 @@
login,password
admin,Datathon2023!
team1,zZS@LH804J
team2,7#onYaC3j5
team3,t6uN*4458f
team4,00Ne&o0x5!
team5,1P4&H4LVu$
team6,02#WZqB4s*
team7,66D5zCw!Y0
team8,5yW0*a&1MB
team9,k1A76TQ02&
team10,I1HuK8681@
team11,Eb$7$730Ty
team12,6%9PhRv4^A
team13,875^GBFi^l
team14,hT3^Va66!K
team15,Jib3W59@N0
team16,9d^D39a6fw
team17,iM@1$962U9
team18,&35$8f@Y$L
team19,r77h2E6qr^
team20,t28*8HAS5n
team21,33Ws2E&60*
team22,5$3M5BYTt@
team23,1Y2lf&9hlf
team24,N1g$4R9@2*
team25,b0*Dd5786B
team26,#8M03A0Bak
team27,84F!7f47aQ
team28,4f0H3Gi@Hv
team29,H7zC8%t2Y3
team30,7CZ7$40Wyl
users.py
ADDED
@@ -0,0 +1,38 @@
from dataclasses import dataclass
import pandas as pd
import os


@dataclass
class Users:
    db_file: str = 'users.csv'
    current_path: str = os.path.abspath(os.path.dirname(__file__))

    def get_db(self):
        try:
            db = pd.read_csv(os.path.join(self.current_path, self.db_file))
        except FileNotFoundError:
            db = self.create_db()
        return db

    def create_db(self):
        db = pd.DataFrame(columns=['login', 'password'])
        db = db.append({'login': 'admin', 'password': 'password'}, ignore_index=True)
        db.to_csv(os.path.join(self.current_path, self.db_file), index=False)
        return db

    def exists(self, login: str, password: str) -> bool:
        db = self.get_db()
        if db.loc[db.login == login].shape[0] != 0:
            if db.loc[db.login == login, 'password'].values[0] == password:
                return True
        return False

    def is_admin(self, login: str, password: str) -> bool:
        db = self.get_db()
        if db.loc[db.login == login, 'login'].values[0] == 'admin' and \
                db.loc[db.login == login, 'password'].values[0] == password:
            return True
        else:
            return False
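A short usage sketch (again illustrative, since the login form itself lives in `main.py`); note that `is_admin` assumes the login already exists in users.csv:

```python
from users import Users

users = Users()                                  # reads users.csv next to this file, creating a default admin if missing
print(users.exists('team1', 'zZS@LH804J'))       # True only if the login/password pair matches a row
print(users.is_admin('admin', 'Datathon2023!'))  # True only for the admin row with the right password
```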
utils.py
ADDED
@@ -0,0 +1,30 @@
import base64

# @st.cache
# def load_data():
#     df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom.data',
#                      sep=' ', header=None)
#     labels = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom_labels.data',
#                          header=None, sep=' ', names=['target', 'time']).iloc[:, 0]
#     X_train = df.sample(**train_test_sampling)
#     y_train = labels.loc[X_train.index]
#     X_test = df.loc[~df.index.isin(X_train.index), :]
#     y_test = labels.loc[X_test.index]
#     return X_train, y_train, X_test, y_test

# st.markdown(data, unsafe_allow_html=True)
# if st.button('Load data'):
#     X_train, y_train, X_test, y_test = load_data()
#     st.markdown(get_table_download_link(X_train, filename='X_train.csv'), unsafe_allow_html=True)
#     st.markdown(get_table_download_link(y_train, filename='y_train.csv'), unsafe_allow_html=True)
#     st.markdown(get_table_download_link(X_test, filename='X_test.csv'), unsafe_allow_html=True)


def get_table_download_link(df, filename):
    """Generates a link allowing the data in a given pandas dataframe to be downloaded
    in: dataframe
    out: href string
    """
    csv = df.to_csv(index=False)
    b64 = base64.b64encode(csv.encode()).decode()  # some strings <-> bytes conversions necessary here
    return f'<a href="data:file/csv;base64,{b64}" download="{filename}">Download {filename}</a>'