{"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"name":"python","version":"3.6.6","mimetype":"text/x-python","codemirror_mode":{"name":"ipython","version":3},"pygments_lexer":"ipython3","nbconvert_exporter":"python","file_extension":".py"},"kaggle":{"accelerator":"none","dataSources":[{"sourceId":10683,"databundleVersionId":220065,"sourceType":"competition"},{"sourceId":254252,"sourceType":"datasetVersion","datasetId":106679},{"sourceId":262671,"sourceType":"datasetVersion","datasetId":109784}],"dockerImageVersionId":22557,"isInternetEnabled":false,"language":"python","sourceType":"notebook","isGpuEnabled":false}},"nbformat_minor":4,"nbformat":4,"cells":[{"cell_type":"markdown","source":"# Microsoft Malware Prediction Competition\n# How To Score Over 0.702 LB with LGBM!\nDuring the Microsoft Malware Prediction competition, everyone wanted to know how to score over 0.700 LB with a single LGBM. Since the competition is over, we can now discuss the secret 12 engineered features! This model scores 0.702 public LB and 0.775 private LB with a single LGBM. (As part of an ensemble, it scores 0.707 LB [here][2].)\n\n# Why did this model come in 1435th place?\nThe private dataset contains 33% outliers. Those outliers ruin every model's score. This is explained [here][1]. When those outliers are removed, this model scores an amazing 0.775 AUC !!  \n  \nEven if you fit the clean data (sine wave pattern below) perfectly, the presence of outliers adds so much error that your model's AUC is terrible. See model one below. The presence of outliers hurt the most accurate solutions the most. If your solution only fits the data (sine wave pattern) loosely, then you have a chance of being near the outliers and may receive a better private LB score as seen in model two below.\n  \n  \n![image](http://playagricola.com/Kaggle/errorone31919.png)  \n  \n![image](http://playagricola.com/Kaggle/errortwo31919.png)\n\n# Load Data and Fix Pandas Memory Leak  \n  \n[1]: https://www.kaggle.com/c/microsoft-malware-prediction/discussion/84745\n[2]: https://www.kaggle.com/c/microsoft-malware-prediction/discussion/84135","metadata":{}},{"cell_type":"code","source":"# monkeypatches.py\n\n# Solving memory leak problem in pandas\n# https://github.com/pandas-dev/pandas/issues/2659#issuecomment-12021083\nimport pandas as pd, sys\nfrom ctypes import cdll, CDLL\ntry:\n    cdll.LoadLibrary(\"libc.so.6\")\n    libc = CDLL(\"libc.so.6\")\n    libc.malloc_trim(0)\nexcept (OSError, AttributeError):\n    libc = None\n\n__old_del = getattr(pd.DataFrame, '__del__', None)\n\ndef __new_del(self):\n    if __old_del:\n        __old_del(self)\n    libc.malloc_trim(0)\n\nif libc:\n    print('Applying monkeypatch for pd.DataFrame.__del__', file=sys.stderr)\n    pd.DataFrame.__del__ = __new_del\nelse:\n    print('Skipping monkeypatch for pd.DataFrame.__del__: libc or malloc_trim() not found', file=sys.stderr)","metadata":{"_uuid":"8f2839f25d086af736a60e9eeb907d3b93b6e0e5","_cell_guid":"b1076dfc-b9ad-4769-8c92-a6c4dae69d19","_kg_hide-input":true,"execution":{"iopub.status.busy":"2024-07-11T20:54:34.397824Z","iopub.execute_input":"2024-07-11T20:54:34.398252Z","iopub.status.idle":"2024-07-11T20:54:34.425402Z","shell.execute_reply.started":"2024-07-11T20:54:34.398110Z","shell.execute_reply":"2024-07-11T20:54:34.424423Z"},"trusted":true},"execution_count":1,"outputs":[{"name":"stderr","text":"Applying monkeypatch for pd.DataFrame.__del__\n","output_type":"stream"}]},{"cell_type":"code","source":"# SET THIS VARIABLE TO TRUE TO RUN KERNEL QUICKLY AND FIND BUGS\n# ONLY 10000 ROWS OF DATA IS LOADED\nDebug = False\n\nimport numpy as np, pandas as pd, gc, random\nimport matplotlib.pyplot as plt\n\ndef load(x):\n    ignore = ['MachineIdentifier']\n    if x in ignore: return False\n    else: return True\n\n# LOAD TRAIN AND TEST\nif Debug:\n    df_train = pd.read_csv('../input/microsoft-malware-prediction/train.csv',dtype='category',usecols=load,nrows=10000)\nelse:\n    df_train = pd.read_csv('../input/microsoft-malware-prediction/train.csv',dtype='category',usecols=load, nrows=1000000)\ndf_train['HasDetections'] = df_train['HasDetections'].astype('int8')\nif 5244810 in df_train.index:\n    df_train.loc[5244810,'AvSigVersion'] = '1.273.1144.0'\n    df_train['AvSigVersion'].cat.remove_categories('1.2&#x17;3.1144.0',inplace=True)\n\nif Debug:\n    df_test = pd.read_csv('../input/microsoft-malware-prediction/test.csv',dtype='category',usecols=load,nrows=10000)\nelse:\n    df_test = pd.read_csv('../input/microsoft-malware-prediction/test.csv',dtype='category',usecols=load, nrows=1000000)\n    \nprint('Loaded',len(df_train),'rows of TRAIN and',len(df_test),'rows of TEST')","metadata":{"_kg_hide-input":true,"execution":{"iopub.status.busy":"2024-07-11T20:54:34.426895Z","iopub.execute_input":"2024-07-11T20:54:34.427333Z","iopub.status.idle":"2024-07-11T20:56:15.256715Z","shell.execute_reply.started":"2024-07-11T20:54:34.427246Z","shell.execute_reply":"2024-07-11T20:56:15.255594Z"},"trusted":true},"execution_count":2,"outputs":[{"name":"stdout","text":"Loaded 1000000 rows of TRAIN and 1000000 rows of TEST\n","output_type":"stream"}]},{"cell_type":"markdown","source":"# Define Encoding Functions","metadata":{}},{"cell_type":"code","source":"# FREQUENCY ENCODE SEPARATELY\ndef encode_FE(df,col):\n    vc = df[col].value_counts(dropna=False, normalize=True).to_dict()\n    nm = col+'_FE'\n    df[nm] = df[col].map(vc)\n    df[nm] = df[nm].astype('float32')\n    return [nm]\n\n# FREQUENCY ENCODE TOGETHER\ndef encode_FE2(df1, df2, col):\n    df = pd.concat([df1[col],df2[col]])\n    vc = df.value_counts(dropna=False, normalize=True).to_dict()\n    nm = col+'_FE2'\n    df1[nm] = df1[col].map(vc)\n    df1[nm] = df1[nm].astype('float32')\n    df2[nm] = df2[col].map(vc)\n    df2[nm] = df2[nm].astype('float32')\n    return [nm]\n\n# FACTORIZE\ndef factor_data(df_train, df_test, col):\n    df_comb = pd.concat([df_train[col],df_test[col]],axis=0)\n    df_comb,_ = df_comb.factorize(sort=True)\n    # MAKE SMALLEST LABEL 1, RESERVE 0\n    df_comb += 1\n    # MAKE NAN LARGEST LABEL (need to remove attype('str') above)\n    df_comb = np.where(df_comb==0, df_comb.max()+1, df_comb)\n    df_train[col] = df_comb[:len(df_train)]\n    df_test[col] = df_comb[len(df_train):]\n    del df_comb\n    \n# OPTIMIZE MEMORY\ndef reduce_memory(df,col):\n    mx = df[col].max()\n    if mx<256:\n            df[col] = df[col].astype('uint8')\n    elif mx<65536:\n        df[col] = df[col].astype('uint16')\n    else:\n        df[col] = df[col].astype('uint32')\n        \n# REDUCE CATEGORY CARDINALITY\ndef relax_data(df_train, df_test, col):\n    cv1 = pd.DataFrame(df_train[col].value_counts().reset_index().rename({col:'train'},axis=1))\n    cv2 = pd.DataFrame(df_test[col].value_counts().reset_index().rename({col:'test'},axis=1))\n    cv3 = pd.merge(cv1,cv2,on='index',how='outer')\n    factor = len(df_test)/len(df_train)\n    cv3['train'].fillna(0,inplace=True)\n    cv3['test'].fillna(0,inplace=True)\n    cv3['remove'] = False\n    cv3['remove'] = cv3['remove'] | (cv3['train'] < len(df_train)/10000)\n    cv3['remove'] = cv3['remove'] | (factor*cv3['train'] < cv3['test']/3)\n    cv3['remove'] = cv3['remove'] | (factor*cv3['train'] > 3*cv3['test'])\n    cv3['new'] = cv3.apply(lambda x: x['index'] if x['remove']==False else 0,axis=1)\n    cv3['new'],_ = cv3['new'].factorize(sort=True)\n    cv3.set_index('index',inplace=True)\n    cc = cv3['new'].to_dict()\n    df_train[col] = df_train[col].map(cc)\n    reduce_memory(df_train,col)\n    df_test[col] = df_test[col].map(cc)\n    reduce_memory(df_test,col)\n    \n# DISPLAY MEMORY STATISTICS\ndef display_memory(df_train, df_test):\n    print(len(df_train),'rows of training data use',df_train.memory_usage(deep=True).sum()//1e6,'Mb memory!')\n    print(len(df_test),'rows of test data use',df_test.memory_usage(deep=True).sum()//1e6,'Mb memory!')\n\n# CONVERT TO CATEGORIES\ndef categorize(df_train, df_test, cols):\n    for col in cols:\n        df_train[col] = df_train[col].astype('category')\n        df_test[col] = df_test[col].astype('category')","metadata":{"execution":{"iopub.status.busy":"2024-07-11T20:56:15.258369Z","iopub.execute_input":"2024-07-11T20:56:15.258679Z","iopub.status.idle":"2024-07-11T20:56:15.279917Z","shell.execute_reply.started":"2024-07-11T20:56:15.258620Z","shell.execute_reply":"2024-07-11T20:56:15.278520Z"},"trusted":true},"execution_count":3,"outputs":[]},{"cell_type":"markdown","source":"# Feature Engineering\nHere is half of the secret sauce. You must engineer the following five variables. (The other half is frequency encoding 7 variables described below.). These variables were discovered by trying hundreds of engineered variables to see which ones increased Time Split Validation (TSV explained [here][1]). Each variable was added to the model one at a time and validation score was recorded. Only the follow 5 variables increased validation score.\n\n* `AppVersion2` indicates whether your Windows Defender is up to date. This is the second number from AppVersion. Regardless of your operating system and version, you can always have AppVersion with second number equal 18. For example, you should have `4.18.1807.18075` instead of `4.12.xx.xx`.\n\n* `Lag1` is the difference between AvSigVersion_Date and Census_OSVersion_Date. Since AvSigVersion is the virus definitions for Windows Defender, this variable indicates whether Windows Defender is out-of-date by comparing it's last install with the date of the operating system. Out-of-date antivirus indicates that a user either has better antivirus or they don't use their computer often. In either case, they have less HasDetections.\n\n* `Lag5` is the difference between AvSigVersion_Date and July 26, 2018. The first observation in Microsoft's training data is July 26, 2018. Therefore if a computer has AvSigVersion_Date before this then their antivirus is out-of-date. (The first observation in the test data is September 27, so you use this difference when encoding the test data.)\n\n* `driveA` is the ratio of harddrive partition used for the operating system with the total hard drive. Savy users install multiple operating systems and have a lower ratio. Savy users have reduced HasDetections.\n\n* `driveB` is the difference between harddrive partition used for the operating system and total hard drive. Responsible users manager their hard drives well. Responsible users have reduced HasDetections.  \n  \n[1]: https://www.kaggle.com/cdeotte/time-split-validation-malware-0-68","metadata":{}},{"cell_type":"code","source":"from datetime import datetime, date, timedelta\n\n# AS timestamp\ndatedictAS = np.load('../input/malware-timestamps/AvSigVersionTimestamps.npy')[()]\ndf_train['DateAS'] = df_train['AvSigVersion'].map(datedictAS)\ndf_test['DateAS'] = df_test['AvSigVersion'].map(datedictAS)\n\n# OS timestamp\ndatedictOS = np.load('../input/malware-timestamps-2/OSVersionTimestamps.npy')[()]\ndf_train['DateOS'] = df_train['Census_OSVersion'].map(datedictOS)\ndf_test['DateOS'] = df_test['Census_OSVersion'].map(datedictOS)\n\n# ENGINEERED FEATURE #1\ndf_train['AppVersion2'] = df_train['AppVersion'].map(lambda x: np.int(x.split('.')[1]))\ndf_test['AppVersion2'] = df_test['AppVersion'].map(lambda x: np.int(x.split('.')[1]))\n\n# ENGINEERED FEATURE #2\ndf_train['Lag1'] = df_train['DateAS'] - df_train['DateOS']\ndf_train['Lag1'] = df_train['Lag1'].map(lambda x: x.days//7)\ndf_test['Lag1'] = df_test['DateAS'] - df_test['DateOS']\ndf_test['Lag1'] = df_test['Lag1'].map(lambda x: x.days//7)\n\n# ENGINEERED FEATURE #3\ndf_train['Lag5'] = datetime(2018,7,26) - df_train['DateAS']\ndf_train['Lag5'] = df_train['Lag5'].map(lambda x: x.days//1)\ndf_train.loc[ df_train['Lag5']<0, 'Lag5' ] = 0\ndf_test['Lag5'] = datetime(2018,9,27) - df_test['DateAS'] #PUBLIC TEST\ndf_test['Lag5'] = df_test['Lag5'].map(lambda x: x.days//1)\ndf_test.loc[ df_test['Lag5']<0, 'Lag5' ] = 0\ndf_train['Lag5'] = df_train['Lag5'].astype('float32') # allow for NAN\ndf_test['Lag5'] = df_test['Lag5'].astype('float32') # allow for NAN\n\n# ENGINEERED FEATURE #4\ndf_train['driveA'] = df_train['Census_SystemVolumeTotalCapacity'].astype('float')/df_train['Census_PrimaryDiskTotalCapacity'].astype('float')\ndf_test['driveA'] = df_test['Census_SystemVolumeTotalCapacity'].astype('float')/df_test['Census_PrimaryDiskTotalCapacity'].astype('float')\ndf_train['driveA'] = df_train['driveA'].astype('float32') \ndf_test['driveA'] = df_test['driveA'].astype('float32') \n\n# ENGINNERED FEATURE #5\ndf_train['driveB'] = df_train['Census_PrimaryDiskTotalCapacity'].astype('float') - df_train['Census_SystemVolumeTotalCapacity'].astype('float')\ndf_test['driveB'] = df_test['Census_PrimaryDiskTotalCapacity'].astype('float') - df_test['Census_SystemVolumeTotalCapacity'].astype('float')\ndf_train['driveB'] = df_train['driveB'].astype('float32') \ndf_test['driveB'] = df_test['driveB'].astype('float32') \n\ncols6=['Lag1']\ncols8=['Lag5','driveB','driveA']\n\ndel df_train['DateAS'], df_train['DateOS'] #, df_train['DateBL']\ndel df_test['DateAS'], df_test['DateOS'] #, df_test['DateBL']\ndel datedictAS, datedictOS\nx=gc.collect()","metadata":{"_cell_guid":"79c7e3d0-c299-4dcb-8224-4455121ee9b0","_uuid":"d629ff2d2480ee46fbb7e2d37f6b5fab8052498a","execution":{"iopub.status.busy":"2024-07-11T20:56:15.281743Z","iopub.execute_input":"2024-07-11T20:56:15.282086Z"},"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# Frequency Engineering\nThis is the second half of the secret sauce. There is a correlation between the frequency of time variables (`EngineVersion`, `AvSigVersion`, `AppVersion`, `Census_OSVersion`, `Census_OSBuildRevision`) and `HasDetections`. This occurs for two reasons explained [here][1]. One reason is a leak and will lead to overfitting train and public test because I believe that private test is a random sample and therefore does not benefit from the leak. The second reason is because out-of-date antivirus produces a front tail with both reduced frequency and reduced HasDetections (explained and verified [here][1]). To take advantage of these two correlations, you need to frequency encode the time variables **separately** for train and test. (Also since EngineVersion and AppVersion have low cardinality, I found that frequency encoded them doesn't increase CV or LB.)  \n\nThere is an actual correlation between frequency of `CountryIdentifier` and `HasDetections`. This is probably because popular countries are targeted. Also there is a true correlation between frequency of `Census_InternalBatteryNumberOfCharges` and `HasDetections`. I have no idea why this is the case. To take advantage of these correlations, frequency encode these variables in train and test **together**. \n  \nThese frequency engineered variables were discovered by trying every variable as FE encoded **separately** and FE encoded **together** to see which ones increased Time Split Validation (TSV explained [here][1]). Each variable was added to the model one at a time and validation score was recorded. Only the follow 7 FE variables increased validation score.\n  \n[1]: https://www.kaggle.com/c/microsoft-malware-prediction/discussion/84287","metadata":{}},{"cell_type":"code","source":"cols3 = []\n# ENGINEERED FEATURES #6, #7, #8, #9, #10\nFE = ['Census_OSVersion', 'Census_OSBuildRevision', 'Census_InternalBatteryNumberOfCharges', 'AvSigVersion', 'Lag1']\nfor col in FE:\n    cols3 += encode_FE(df_train, col)\n    encode_FE(df_test, col)\n    \n# ENGINEERED FEATURES #11, #12\nFE2 = ['CountryIdentifier', 'Census_InternalBatteryNumberOfCharges']\nfor col in FE2:\n    cols3 += encode_FE2(df_train, df_test, col)","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# Categorical versus Numerical Encode\nAll 82 explanatory variables except `Census_PrimaryDiskTotalCapacity` and `Census_SystemVolumeTotalCapacity` are actually categorical and not numerical. However if you input them into LGBM as categorical, since their cardinality is so high, it leads to overfitting. Therefore validation shows that it is best to label encode variables (`pd.factorize(df[col],sort=True)`) into ordered integers and input them into LGBM as integers. Then set `num_leaves=2048` and LGBM has no trouble finding the categories anyway.  \n  \nAfter every variable was changed to dtype integer, each variable was tested one by one to see if making it categorical increases LGBM validation score. It was determined that the following 6 variables increased validation score. By making them categorical you prioritize them. So presumably these are important variables. Notice that they cover the variety of flavors of features. We have two geographical variables, one hardware variable, two software/virus variables, and one miscellenous. Flavors of variables are explained [here][1] called groupings.  \n  \n[1]: https://www.kaggle.com/cdeotte/embeddings-network-malware-0-697-0-773","metadata":{}},{"cell_type":"code","source":"CE = ['CountryIdentifier', 'SkuEdition', 'Firewall', 'Census_ProcessorCoreCount', 'Census_OSUILocaleIdentifier', 'Census_FlightRing']","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# Remove Variables\nEvery variable was removed one at a time and validation score was recorded. Validation verified that we can remove the following 18 variables without decreasing model accuracy. `MachineIdentifier`, `ProductName`, `IsBeta`, `IsSxsPassiveMode`, `HasTpm`, `AutoSampleOptIn`, `PuaMode`, `UacLuaenable`, `Census_DeviceFamily`, `Census_ProcessorClass`, `Census_IsPortableOperatingSystem`, `Census_IsFlightsDisabled`, `Census_IsVirtualDevice`, `Census_OSSkuName`, `OsVer`, `Census_OSArchitecture`, `Census_OSInstallLanguageIdentifier`, `SMode`. The first one is obvious. The next 12 have more than 98% of their data in one category value. The next 4 correlate with another variable greater than `r=0.99`. And the last variable `SMode` changes is distribution radically over time and leads to decreased CV and LB. The other variables don't decrease validation per say but they don't increase it either, so we remove them.","metadata":{}},{"cell_type":"code","source":"cols = [x for x in df_train.columns if x not in ['HasDetections']+CE+cols3+cols6+cols8]\ncols2 = CE; ct = 1\n    \nfor col in cols.copy():\n    rate = df_train[col].value_counts(normalize=True, dropna=False).values[0]\n    if rate > 0.98:\n        del df_train[col]\n        del df_test[col]\n        cols.remove(col)\n        ct += 1\n\nrmv3=['Census_OSSkuName', 'OsVer', 'Census_OSArchitecture', 'Census_OSInstallLanguageIdentifier']\nrmv4=['SMode']\nfor col in rmv3+rmv4:\n    del df_train[col]\n    del df_test[col]\n    cols.remove(col)\n    ct +=1\n    \nprint('Removed',ct,'variables')\nx=gc.collect()","metadata":{"_kg_hide-input":true,"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"markdown","source":"# Reduce Feature Cardinality\nMany variables have over 1000 category values!! We must reduce cardinality. We will remove any category value that contains 0.01% or less of total data. Additionally to prevent overfitting the train data, we will remove any category value that appears 4x more frequenly in train than test. And we will remove any category value that appears 4x less frequency in train than test. Explained [here][1].\n  \nFor example, `EngineVersion` has many category values like `1.1.15100.1`, `1.1.15200.1`, `1.1.15300.6`, etc. The category value `1.1.15300.6` appears 4x more frequently in test than train, therefore we will remove it. We \"remove it\" by changing that computer's `EngineVersion` to a new category value of `TrashCan`.\n  \n[1]: https://www.kaggle.com/cdeotte/time-split-validation-malware-0-68","metadata":{}},{"cell_type":"code","source":"print('Factorizing...')\nfor col in cols+cols2+cols6:\n    factor_data(df_train, df_test, col)\nprint('Relaxing data...')\nfor col in cols+cols2: relax_data(df_train, df_test, col)\nprint('Optimizing memory...')\nfor col in cols+cols2+cols6:\n    reduce_memory(df_train, col)\n    reduce_memory(df_test, col)\n# Converting 6 variables to categorical\ncategorize(df_train, df_test, cols2)\n    \nprint('Number of variables is',len(cols+cols2+cols3+cols6+cols8))\ndisplay_memory(df_train, df_test)","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"microsoft_100w_train_data = df_train[cols+cols2+cols3+cols6+cols8]\nmicrosoft_100w_train_label = df_train['HasDetections']","metadata":{"trusted":true},"execution_count":null,"outputs":[]},{"cell_type":"code","source":"microsoft_100w_train_data.to_csv(\"/kaggle/working/microsoft_100w_train_data.csv\", index=False)\nmicrosoft_100w_train_label.to_csv(\"/kaggle/working/microsoft_100w_train_label.csv\", index=False)","metadata":{"trusted":true},"execution_count":null,"outputs":[]}]}