diff --git "a/midterm/midterm/take_at_home_(1).ipynb" "b/midterm/midterm/take_at_home_(1).ipynb" deleted file mode 100644--- "a/midterm/midterm/take_at_home_(1).ipynb" +++ /dev/null @@ -1,1335 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "id": "oZ6_2B0E1DAh" - }, - "source": [ - "# Midterm - Spring 2023\n", - "\n", - "## Problem 1: Take-at-home (45 points total)\n", - "\n", - "You are applying for a position on the data science team of USDA and you are given data associated with determining appropriate parasite treatment of canines. The suggested treatment options are determined based on a **logistic regression** model that predicts if the canine is infected with a parasite. \n", - "\n", - "The data is available at https://data.world/ehales/grls-parasite-study/workspace/file?filename=CBC_data.csv, specifically in the CBC_data.csv file. Log in using your University Google account to access the data and the description, which includes a paper on the study (**you don't need to read the paper to solve this problem**). Your target variable $y$ column is titled `parasite_status`. \n", - "\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Aq8bln4u1DAo" - }, - "source": [ - "### Question 1 - Feature Engineering (5 points)\n", - "\n", - "Write the posterior probability expressions for logistic regression for the problem you are given to solve."
- ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "_kd85pkA1DA3" - }, - "source": [ - "$$p(y=1| \\mathbf{x}, \\mathbf w)= \\frac{p(\\mathbf{x}| y=1)p(y=1)}{p(\\mathbf{x}|y=1)p(y=1)+p(\\mathbf{x}|y=0)p(y=0)}=\\frac{1}{1+\\exp(-\\alpha)}=\\sigma(\\alpha), \\quad \\alpha=\\mathbf{w}^T\\mathbf{x}$$\n", - "\n", - "$$p(y=0| \\mathbf{x}, \\mathbf w)=1-p(y=1|\\mathbf{x}, \\mathbf w)=1-\\sigma(\\alpha)=\\sigma(-\\alpha)$$" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "Gh6Fi5hz1DA6" - }, - "source": [ - "\n", - "\n", - "### Question 2 - Decision Boundary (5 points)\n", - "\n", - "Write the expression for the decision boundary assuming that $p(y=1)=p(y=0)$. The decision boundary is the line that separates the two classes.\n", - "\n", - "\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "HMr2tF_J1DA-" - }, - "source": [ - "$$p(y=1|\\mathbf{x}, \\mathbf w)=p(y=0|\\mathbf{x}, \\mathbf w)→\\sigma(\\alpha)=1-\\sigma(\\alpha)→2\\sigma(\\alpha)=1→\\sigma(\\alpha)=0.5≡\\sigma(\\mathbf{w}^T\\mathbf{x})=0.5≡\\mathbf{w}^T\\mathbf{x}=0$$" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "750Hn0iC1DBA" - }, - "source": [ - "\n", - "\n", - "### Question 3 - Loss function (5 points)\n", - "\n", - "Write the expression of the loss as a function of $\\mathbf w$ that makes sense for you to use in this problem. 
\n", - "\n", - "NOTE: The loss will be a function that will include this function: \n", - "\n", - "$$\\sigma(a) = \\frac{1}{1+e^{-a}}$$\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "jxiR0jEh1DBD" - }, - "source": [ - "$$\n", - "\\begin{align}\n", - "L_{CE} = -[\\sum_{i=1}^m \\{y_i\\ln \\hat{y}_i + (1-y_i)\\ln(1-\\hat{y}_i)\\}]\\\\\n", - "= -[\\sum_{i=1}^m\\{y_i\\ln\\frac{1}{1+\\exp(-\\mathbf{w}^T\\mathbf{x}_i)}+(1-y_i)\\ln(1-\\frac{1}{1+\\exp(-\\mathbf{w}^T\\mathbf{x}_i)})\\}] \\\\\n", - "= -[\\sum_{i=1}^m\\{y_i[\\ln\\frac{1}{1+\\exp(-\\alpha_i)}-\\ln(1-\\frac{1}{1+\\exp(-\\alpha_i)})]+\\ln(1-\\frac{1}{1+\\exp(-\\alpha_i)})\\}], \\quad \\alpha_i=\\mathbf{w}^T\\mathbf{x}_i \\\\\n", - "= -[\\sum_{i=1}^m\\{y_i\\mathbf{w}^T\\mathbf{x}_i-\\ln(1+\\exp(\\mathbf{w}^T\\mathbf{x}_i))\\}]\n", - "\\end{align}\n", - "$$\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "AW4xA4221DBF" - }, - "source": [ - "\n", - "### Question 4 - Gradient (5 points)\n", - "\n", - "Write the expression of the gradient of the loss with respect to the parameters - show all your work.\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "bo0YDA0i1DBJ" - }, - "source": [ - "$$\n", - "\\begin{align}\n", - "\\nabla_\\mathbf{w} L_{CE} = -\\nabla_\\mathbf{w}[\\sum_{i=1}^m\\{y_i\\mathbf{w}^T\\mathbf{x}_i-\\ln(1+\\exp(\\mathbf{w}^T\\mathbf{x}_i))\\}] \\\\\n", - "= [-\\sum_{i=1}^my_i\\mathbf{x}_i] + [\\sum_{i=1}^m\\frac{\\exp(\\mathbf{w}^T\\mathbf{x}_i)}{1+\\exp(\\mathbf{w}^T\\mathbf{x}_i)}\\mathbf{x}_i] \\\\\n", - "= [-\\sum_{i=1}^my_i\\mathbf{x}_i] + [\\sum_{i=1}^m\\sigma(\\mathbf{w}^T\\mathbf{x}_i)\\mathbf{x}_i] \\\\\n", - "= \\sum_{i=1}^m (\\sigma(\\mathbf{w}^T\\mathbf{x}_i)-y_i)\\mathbf{x}_i= \\sum_{i=1}^m(\\hat{y}_i-y_i)\\mathbf{x}_i\n", - "\\end{align}\n", - "$$" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "BpUryvTT1DBM" - }, - "source": [ - "### Question 5 - Imbalanced dataset (10 points)\n", - "\n", - "You are now told that in the dataset \n", - "\n", - "$$p(y=0) \\gg p(y=1)$$\n", - "\n", - "Can you comment if the accuracy of Logistic 
Regression will be affected by such imbalance?\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "TqdImYQf1DBP" - }, - "source": [ - "The cross-entropy loss heavily penalizes confident wrong predictions. Because negative examples dominate the sum, we expect the model to be strongly incentivized to predict 0 far more often than 1, regardless of the true outcome, as this minimizes the loss. This causes more false negatives, which must be taken into account when choosing an operating point on the ROC curve. Overall accuracy can stay deceptively high, because predicting 0 is right most of the time, but accuracy on the positive class suffers: there are so few positive examples that the model cannot learn them well. Recall and the $F_1$ score are therefore more informative than accuracy here." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "id": "FK1Su76R1DBS" - }, - "source": [ - "\n", - "### Question 6 - SGD (15 points)\n", - "\n", - "The interviewer was impressed with your answers and wants to test your programming skills. \n", - "\n", - "1. Use the dataset to train a logistic regressor that will predict the target variable $y$. \n", - "\n", - "2. Report the harmonic mean of precision (p) and recall (r), i.e., the [metric called $F_1$ score](https://en.wikipedia.org/wiki/F-score), calculated as shown below using a test dataset that is 20% of each group. Plot the $F_1$ score vs the iteration number $t$. \n", - "\n", - "$$F_1 = \\frac{2}{r^{-1} + p^{-1}}$$\n", - "\n", - "Your code should include hyperparameter optimization of the learning rate and mini-batch size. 
Please read about cross-validation, a splitting strategy for tuning models, [here](https://scikit-learn.org/stable/modules/cross_validation.html).\n", - "\n", - "You are allowed to use any library you want to code this problem.\n", - "\n" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": { - "id": "cnxqYSvL1DBV" - }, - "outputs": [], - "source": [ - "# write your code here" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Preprocessing" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": { - "tags": [] - }, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " ID SEX TYPEAREA SEX.REPRO REPRO.STATUS AGE \\\n", - "0 grls5ZUT2BYY Male Suburban IntactMale Intact 9 \n", - "1 grls8DCONYUU Female Rural NeuteredFemale Neutered 6 \n", - "2 grlsUC5R4PTT Male Suburban IntactMale Intact 14 \n", - "3 grlsXUR2PY88 Male Rural IntactMale Intact 6 \n", - "4 grlsTBZUF3GG Female Rural IntactFemale Intact 18 \n", - "\n", - " PARASITE_STATUS RBC HGB WBC EOS.CNT MONO.CNT NUT.CNT PL.CNT \\\n", - "0 Negative 6.4 16.6 14.2 142.0 852.0 6390.0 210.0 \n", - "1 Negative 4.8 12.5 10.0 400.0 300.0 4800.0 209.0 \n", - "2 Negative 6.2 17.3 9.5 190.0 475.0 7315.0 164.0 \n", - "3 Negative 5.4 13.8 14.1 1692.0 423.0 7755.0 254.0 \n", - "4 Negative 5.9 14.4 6.5 390.0 130.0 2795.0 213.0 \n", - "\n", - " LYMP.CNT \n", - "0 6816.0 \n", - "1 4500.0 \n", - "2 1520.0 \n", - "3 4230.0 \n", - "4 3185.0 " - ] - }, - "execution_count": 2, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import pandas as pd\n", - "\n", - "df = pd.read_csv('../data/01_raw/CBC_data.csv')\n", - "df.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": { - "tags": [] - }, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/tmp/ipykernel_79816/1867621695.py:5: SettingWithCopyWarning: \n", - "A value is trying to be set on a copy of a slice from a DataFrame.\n", - "Try using .loc[row_indexer,col_indexer] = value instead\n", - "\n", - "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", - " df2[x] = LabelEncoder().fit_transform(df2[x])\n", - "/tmp/ipykernel_79816/1867621695.py:5: SettingWithCopyWarning: \n", - "A value is trying to be set on a copy of a slice from a DataFrame.\n", - "Try using .loc[row_indexer,col_indexer] = value instead\n", - "\n", - "See the caveats in the documentation: 
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", - " df2[x] = LabelEncoder().fit_transform(df2[x])\n", - "/tmp/ipykernel_79816/1867621695.py:5: SettingWithCopyWarning: \n", - "A value is trying to be set on a copy of a slice from a DataFrame.\n", - "Try using .loc[row_indexer,col_indexer] = value instead\n", - "\n", - "See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy\n", - " df2[x] = LabelEncoder().fit_transform(df2[x])\n" - ] - }, - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SEX TYPEAREA SEX.REPRO REPRO.STATUS AGE PARASITE_STATUS RBC \\\n", - "0 1 Suburban IntactMale 0 9 0 6.4 \n", - "1 0 Rural NeuteredFemale 1 6 0 4.8 \n", - "2 1 Suburban IntactMale 0 14 0 6.2 \n", - "3 1 Rural IntactMale 0 6 0 5.4 \n", - "4 0 Rural IntactFemale 0 18 0 5.9 \n", - "\n", - " HGB WBC EOS.CNT MONO.CNT NUT.CNT PL.CNT LYMP.CNT \n", - "0 16.6 14.2 142.0 852.0 6390.0 210.0 6816.0 \n", - "1 12.5 10.0 400.0 300.0 4800.0 209.0 4500.0 \n", - "2 17.3 9.5 190.0 475.0 7315.0 164.0 1520.0 \n", - "3 13.8 14.1 1692.0 423.0 7755.0 254.0 4230.0 \n", - "4 14.4 6.5 390.0 130.0 2795.0 213.0 3185.0 " - ] - }, - "execution_count": 3, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from sklearn.preprocessing import LabelEncoder\n", - "le = LabelEncoder()\n", - "df2 = df.loc[:, df.columns != 'ID']\n", - "for x in ['SEX', 'REPRO.STATUS', 'PARASITE_STATUS']:\n", - " df2[x] = LabelEncoder().fit_transform(df2[x])\n", - "df2.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": { - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "Index(['SEX', 'REPRO.STATUS', 'AGE', 'PARASITE_STATUS', 'TYPEAREA_Rural',\n", - " 'TYPEAREA_Suburban', 'TYPEAREA_Urban', 'SEX.REPRO_IntactFemale',\n", - " 'SEX.REPRO_IntactMale', 'SEX.REPRO_NeuteredFemale',\n", - " 'SEX.REPRO_NeuteredMale'],\n", - " dtype='object')" - ] - }, - "execution_count": 4, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df3 = pd.get_dummies(df2).dropna(how='any', axis=1)\n", - "df3.columns" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": { - "tags": [] - }, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " SEX REPRO.STATUS AGE PARASITE_STATUS TYPEAREA_Rural TYPEAREA_Suburban \\\n", - "0 1 0 9 0 0 1 \n", - "1 0 1 6 0 1 0 \n", - "2 1 0 14 0 0 1 \n", - "3 1 0 6 0 1 0 \n", - "4 0 0 18 0 1 0 \n", - "\n", - " TYPEAREA_Urban SEX.REPRO_IntactFemale SEX.REPRO_IntactMale \\\n", - "0 0 0 1 \n", - "1 0 0 0 \n", - "2 0 0 1 \n", - "3 0 0 1 \n", - "4 0 1 0 \n", - "\n", - " SEX.REPRO_NeuteredFemale SEX.REPRO_NeuteredMale \n", - "0 0 0 \n", - "1 1 0 \n", - "2 0 0 \n", - "3 0 0 \n", - "4 0 0 " - ] - }, - "execution_count": 5, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df3.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from sklearn.model_selection import train_test_split\n", - "X = df3.drop(['PARASITE_STATUS'], axis=1)\n", - "y = df3['PARASITE_STATUS']\n", - "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Logreg Model" - ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "from sklearn.base import BaseEstimator, ClassifierMixin\n", - "from sklearn import metrics\n", - "\n", - "\n", - "class MyLogisticRegression(BaseEstimator, ClassifierMixin):\n", - " def __init__(self, lr=0.01, batch_size=None, max_iter=100, tol=1e-4):\n", - " self.lr = lr\n", - " self.batch_size = batch_size\n", - " self.max_iter = max_iter\n", - " self.tol = tol\n", - " self.f1_score_history = []\n", - "\n", - " def _sigmoid(self, z):\n", - " z = np.clip(z, -1e2, 1e2)\n", - " return 1 / (1 + np.exp(-z))\n", - "\n", - " def _add_intercept(self, X):\n", - " return np.concatenate((np.ones((X.shape[0], 1)), X), axis=1)\n", - "\n", - " def _compute_gradient(self, X, y):\n", - " y_pred = self._sigmoid(np.dot(X, self.coef_))\n", - " grad = np.dot(X.T, (y_pred - y)) / X.shape[0]\n", - " 
return grad\n", - "\n", - " def fit(self, X, y):\n", - " X = self._add_intercept(X)\n", - "\n", - " # Initialize weights to zeros\n", - " self.coef_ = np.zeros(X.shape[1])\n", - "\n", - " # Mini-batch gradient descent\n", - " for epoch in range(self.max_iter):\n", - " if self.batch_size is not None:\n", - " batch_indices = np.random.choice(X.shape[0], size=self.batch_size, replace=False)\n", - " X_batch = X[batch_indices]\n", - " y_batch = y[batch_indices]\n", - " else:\n", - " X_batch = X\n", - " y_batch = y\n", - "\n", - " # Compute gradient\n", - " grad = self._compute_gradient(X_batch, y_batch)\n", - "\n", - " # Update weights\n", - " self.coef_ -= self.lr * grad\n", - "\n", - " # Check for convergence\n", - " if np.abs(grad).max() < self.tol:\n", - " break\n", - " \n", - " # Compute f1 score\n", - " y_pred = self.predict(X_test)\n", - " f1_score = metrics.f1_score(y_test, y_pred, average='weighted')\n", - " self.f1_score_history.append({'epoch': epoch, 'f1 score': f1_score})\n", - "\n", - "\n", - " return self\n", - "\n", - " def predict(self, X):\n", - " X = self._add_intercept(X)\n", - " return np.round(self._sigmoid(np.dot(X, self.coef_)))" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "tags": [] - }, - "source": [ - "## Test" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": { - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "0.9290780141843972" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "import numpy as np\n", - "from sklearn.datasets import make_classification\n", - "from sklearn.model_selection import train_test_split\n", - "from sklearn.metrics import accuracy_score\n", - "from sklearn.utils import check_random_state\n", - "\n", - "\n", - "#def test_minibatch_logistic_regression():\n", - "# Generate some random classification data\n", - "X, y = make_classification(n_samples=1000, n_features=10, random_state=42)\n", - "\n", - "# 
Split data into training and test sets\n", - "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)\n", - "\n", - "# Instantiate the MyLogisticRegression model\n", - "model = MyLogisticRegression(lr=0.01, batch_size=32, max_iter=100)\n", - "\n", - "# Fit the model on the training data\n", - "model.fit(X_train, y_train)\n", - "\n", - "# Predict the labels for the test data\n", - "y_pred = model.predict(X_test)\n", - "\n", - "# Check that the predicted labels are binary\n", - "assert set(np.unique(y_pred)) == {0, 1}\n", - "\n", - "# Calculate the accuracy of the predictions\n", - "accuracy = accuracy_score(y_test, y_pred)\n", - "\n", - "# Check that the accuracy is greater than chance level\n", - "assert accuracy > 0.5\n", - "\n", - "\n", - "from sklearn.metrics import precision_score, recall_score, f1_score\n", - "from sklearn.metrics import confusion_matrix\n", - "\n", - "y_pred = model.predict(X_test)\n", - "precision = precision_score(y_test, y_pred)\n", - "precision" - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": { - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "0.8533333333333334" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "accuracy" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": { - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "array([[125, 10],\n", - " [ 34, 131]])" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "confusion_matrix(y_test, y_pred)" - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": { - "tags": [] - }, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " epoch f1 score\n", - "0 0 0.830236\n", - "1 1 0.846667\n", - "2 2 0.846844\n", - "3 3 0.846762\n", - "4 4 0.850135" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "f1s = pd.DataFrame(model.f1_score_history)\n", - "f1s.head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## plot F1 vs iteration number" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": { - "tags": [] - }, - "outputs": [ - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "import seaborn as sns\n", - "import matplotlib.pyplot as plt\n", - "\n", - "def plot_f1(f1s: pd.DataFrame):\n", - " sns.scatterplot(x=f1s['epoch'], y=f1s['f1 score'])\n", - " sns.lineplot(x=f1s['epoch'], y=f1s['f1 score'])\n", - " plt.show()\n", - "plot_f1(f1s)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Hyperparameter optimization" - ] - }, - { - "cell_type": "code", - "execution_count": 41, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "from sklearn.pipeline import Pipeline #make_pipeline\n", - "from sklearn import preprocessing\n", - "from imblearn.over_sampling import RandomOverSampler\n", - "#from imblearn.pipeline import Pipeline\n", - "#from sklearn.linear_model import LogisticRegression, SGDClassifier\n", - "\n", - "#i = 20000\n", - "\n", - "pipe = Pipeline([\n", - "# ('oversampler', RandomOverSampler()),\n", - " ('preprocessor', preprocessing.StandardScaler()),\n", - "# ('classifier', LogisticRegression(max_iter=i, class_weight='balanced')),\n", - " ('classifier', MyLogisticRegression())\n", - "])\n", - "\n", - "\n", - "\n", - "#model = LogisticRegression(penalty='l1', solver='saga', max_iter=i)\n", - "#pipe.fit(X_train, y_train)" - ] - }, - { - "cell_type": "code", - "execution_count": 42, - "metadata": { - "scrolled": true, - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "{'classifier__batch_size': 82, 'classifier__lr': 0.1}" - ] - }, - "execution_count": 42, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from sklearn.model_selection import GridSearchCV\n", - "import numpy as np\n", - "\n", - "# optimize hyperparameter batch size and learning rate\n", - "param_grid = {\n", - " #'classifier__penalty': ['l1', 'l2'],\n", - " #'classifier__C': [1e-4, 1e-3, 1e-2, 0.1, 1, 10],\n", - " #'classifier__solver': ['sag'],\n", - " #'classifier__class_weight': [None, 
'balanced', *[{0: 1, 1:10**x} for x in range(-5, 5)]],\n", - " #'classifier__eta0': [10**x for x in range(-5, 5)],\n", - " 'classifier__batch_size': np.linspace(1, X_train.shape[0], 70, dtype=int)[:10],\n", - " 'classifier__lr': [10**x for x in range(-5, 5)],\n", - "}\n", - "\n", - "grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='f1', n_jobs=-1)\n", - "#grid_search = GridSearchCV(MyLogisticRegression(), param_grid, cv=5, scoring='f1', n_jobs=-1)\n", - "\n", - "grid_search.fit(X_train, y_train)\n", - "grid_search.best_params_" - ] - }, - { - "cell_type": "code", - "execution_count": 43, - "metadata": { - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "0.9166666666666666" - ] - }, - "execution_count": 43, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "from sklearn.metrics import precision_score, recall_score, f1_score\n", - "from sklearn.metrics import confusion_matrix\n", - "\n", - "y_pred = grid_search.predict(X_test)\n", - "precision = precision_score(y_test, y_pred)\n", - "precision" - ] - }, - { - "cell_type": "code", - "execution_count": 44, - "metadata": { - "tags": [] - }, - "outputs": [ - { - "data": { - "text/plain": [ - "array([[123, 12],\n", - " [ 33, 132]])" - ] - }, - "execution_count": 44, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "# Rows are true labels, columns are predicted labels\n", - "confusion_matrix(y_test, y_pred)" - ] - }, - { - "cell_type": "code", - "execution_count": 45, - "metadata": { - "tags": [] - }, - "outputs": [], - "source": [ - "#from sklearn.model_selection import cross_val_score\n", - "#\n", - "#scores = cross_val_score(grid_search, X, y, cv=5)\n", - "#scores" - ] - }, - { - "cell_type": "code", - "execution_count": 46, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
" - ], - "text/plain": [ - " epoch f1 score\n", - "0 0 0.813799\n", - "1 1 0.823768\n", - "2 2 0.833749\n", - "3 3 0.837076\n", - "4 4 0.837068" - ] - }, - "execution_count": 46, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "best_f1 = pd.DataFrame(grid_search.best_estimator_.named_steps['classifier'].f1_score_history)\n", - "best_f1.head()" - ] - }, - { - "cell_type": "code", - "execution_count": 47, - "metadata": { - "tags": [] - }, - "outputs": [ - { - "data": { - "image/png": "", - "text/plain": [ - "
" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "plot_f1(best_f1)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "colab": { - "provenance": [] - }, - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.4" - }, - "vscode": { - "interpreter": { - "hash": "62556f7a043365a66e0918c892755cfafede529a87e97207556f006a109bade4" - } - } - }, - "nbformat": 4, - "nbformat_minor": 4 -}