jeremyadd committed
Commit ad5fa96 · verified · 1 Parent(s): 8166422

Update config.py

Files changed (1):
  1. config.py +61 -92
config.py CHANGED
@@ -1,93 +1,62 @@
- # Presentation of the challenge
- context_markdown = """
- The goal of the first challenge is to estimate the category of the uploaded YouTube video.
-
- """
- content_markdown = """
- ### Multi-class Problem
-
- It has the following features/target:
-
- #### Features
-
- - local_path: The local path to the upload's data.
- - upload_id: The unique identifier of the upload.
- - clean_upload_id: The upload_id with the "suicide_out_" prefix removed.
- - upload_type: An enumeration representing the type of upload. Default is UploadType.GENERAL.
- - features: A dictionary containing additional features associated with the upload.
- - title: The title of the upload.
- - playlist_title: The title of the playlist the upload belongs to.
- - description: The description of the upload.
- - duration_string: The duration of the upload in string format.
- - duration: The duration of the upload in seconds.
- - upload_date: The date when the upload was uploaded.
- - view_count: The number of views the upload has received.
- - comment_count: The number of comments on the upload.
- - like_count: The number of likes on the upload.
- - tags: The tags associated with the upload.
-
- ### Target
- - categories: The categories associated with the upload.
-
-
- You can find the details about the context/data/challenge [here](https://drive.google.com/file/d/1qyEmi6UUWlyzeVPhPnqY2JNRHBPutak-/view?usp=sharing)
- """
- #------------------------------------------------------------------------------------------------------------------#
-
- # Guide for the participants to get X_train, y_train and X_test
- # The google link can be placed in your google drive => get the shared links and place them here.
- data_instruction_commands = """
- The data can be parsed using the [youtube_modules.py](https://drive.google.com/file/d/1FCKpBTvTdL2RoNpIp9fHY18006CiglT2/view?usp=drive_link) script.
- You can find the readme [here](https://drive.google.com/file/d/1wBJmwfZ9JzcQ0MxvwamYxBjwYbpEsgMx/view?usp=drive_link)
-
- ```python
- from youtube_modules import *
- from typing import List
- import pickle
- import random
- import numpy as np
-
- train_uploads: List[Upload] = pickle.load(open("<path/to/data>/train_uploads.pkl", 'rb'))
-
- test_uploads: List[Upload] = pickle.load(open("<path/to/data>/test_uploads.pkl", 'rb'))
- ```
-
- Make sure to upload your predictions as a .csv file with the columns: "id" (range(len(test_file))) and "label" (1, 2, 3).
-
- ## Quickstart: use notebook remotely
- 1. conda activate py38_default
- 2. Launch the notebook server for remote access:
- $ jupyter notebook --ip=0.0.0.0 --no-browser
-
- Then copy the URL you receive and paste it into your browser:
- https://127.0.0.1:8888/?token=7de849a953befd20682d57ac33b3e6cd9024ca25eed2433
-
- Then replace 127.0.0.1 with your machine's IP, e.g.
- https://1.222.333.4:8888/?token=7de849a953befd20682d57ac33b3e6cd9024ca25eed24336
- """
-
- # Target on test (hidden from the participants)
- Y_TEST_GOOGLE_PUBLIC_LINK = 'https://drive.google.com/file/d/1gQ3_ywJElpcBrewCFhVUM-fnV4SN62na/view?usp=sharing'
- #------------------------------------------------------------------------------------------------------------------#
-
- # Evaluation metric and content
- from sklearn.metrics import f1_score
- GREATER_IS_BETTER = True # example for ROC-AUC == True, for MSE == False, etc.
- SKLEARN_SCORER = f1_score
- SKLEARN_ADDITIONAL_PARAMETERS = {'average': 'weighted'}
-
- evaluation_content = """
- The predictions are evaluated according to the f1-score (weighted).
-
- You can get it using
- ```python
- from sklearn.metrics import f1_score
-
- f1_score(y_train, y_pred_train, average='weighted')
- ```
- More details [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score).
- """
- #------------------------------------------------------------------------------------------------------------------#
-
- # leaderboard benchmark score, will be displayed to everyone
- BENCHMARK_SCORE = 0.2
+ # Presentation of the challenge
+ context_markdown = """
+ Manufacturing process feature selection and categorization
+ """
+ content_markdown = """
+ Abstract: Data from a semi-conductor manufacturing process
+ - Data Set Characteristics: Multivariate
+ - Number of Instances: 1567
+ - Area: Computer
+ - Attribute Characteristics: Real
+ - Number of Attributes: 591
+ - Date Donated: 2008-11-19
+ - Associated Tasks: Classification, Causal-Discovery
+ - Missing Values: Yes
+
+ A complex modern semi-conductor manufacturing process is normally under consistent
+ surveillance via the monitoring of signals/variables collected from sensors and/or
+ process measurement points. However, not all of these signals are equally valuable
+ in a specific monitoring system. The measured signals contain a combination of
+ useful information, irrelevant information, as well as noise. It is often the case
+ that useful information is buried in the latter two. Engineers typically have a
+ much larger number of signals than are actually required. If we consider each type
+ of signal as a feature, then feature selection may be applied to identify the most
+ relevant signals. The process engineers may then use these signals to determine key
+ factors contributing to yield excursions downstream in the process. This will
+ enable increased process throughput, decreased time to learning and reduced
+ per-unit production costs.
+ """
+ #------------------------------------------------------------------------------------------------------------------#
+
+ # Guide for the participants to get X_train, y_train and X_test
+ # The google link can be placed in your google drive => get the shared links and place them here.
+ data_instruction_commands = """
+ In order to get the data, simply run the following commands:
+ ```python
+ import pandas as pd
+
+ df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/secom/secom.data', sep=' ', header=None)
+ ```
+ Please ask the admin in order to get the target and the random seed used for the train/test split.
+ """
+
+ # Target on test (hidden from the participants)
+ Y_TEST_GOOGLE_PUBLIC_LINK = 'https://drive.google.com/file/d/1-3X4eN_xk00GY4Bf6YU4mGtvQ8s_MDCQ/view?usp=sharing'
+ #------------------------------------------------------------------------------------------------------------------#
+
+ # Evaluation metric and content
+ from sklearn.metrics import average_precision_score as prauc
+ GREATER_IS_BETTER = True # example for ROC-AUC == True, for MSE == False, etc.
+ SKLEARN_SCORER = prauc
+ SKLEARN_ADDITIONAL_PARAMETERS = {}
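config.py only declares these constants; how the grading harness consumes them is not part of this file. A plausible sketch of that call, with the helper names being assumptions rather than the platform's actual API:

```python
from sklearn.metrics import average_precision_score as prauc

# Mirrors the constants defined above.
SKLEARN_SCORER = prauc
SKLEARN_ADDITIONAL_PARAMETERS = {}
GREATER_IS_BETTER = True

def leaderboard_score(y_true, y_scored):
    # Hypothetical helper: apply the configured scorer with its extra keyword arguments.
    return SKLEARN_SCORER(y_true, y_scored, **SKLEARN_ADDITIONAL_PARAMETERS)

def beats(new_score, best_score):
    # Hypothetical helper: GREATER_IS_BETTER flips the comparison for error-style metrics (e.g. MSE).
    return new_score > best_score if GREATER_IS_BETTER else new_score < best_score
```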
+
+ evaluation_content = """
+ The predictions are evaluated according to the PR-AUC (average precision) score.
+ You can compute it using
+ ```python
+ from sklearn.metrics import average_precision_score as prauc
+
+ prauc(y_train, y_score_train)
+ ```
+ Note that PR-AUC is computed from predicted scores/probabilities for the positive class, not from hard labels.
+ More details [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html).
+ """
+ #------------------------------------------------------------------------------------------------------------------#
+
+ # leaderboard benchmark score, will be displayed to everyone
+ BENCHMARK_SCORE = 0.7
  #------------------------------------------------------------------------------------------------------------------#