Priyanka-Kumavat-At-TE committed on
Commit e03eaf2
1 Parent(s): 8c99283

Upload 7 files

matumizi/__init__.py ADDED
File without changes
matumizi/daexp.py ADDED
@@ -0,0 +1,3121 @@
#!/usr/local/bin/python3

# Author: Pranab Ghosh
#
# Licensed under the Apache License, Version 2.0 (the "License"); you
# may not use this file except in compliance with the License. You may
# obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
# implied. See the License for the specific language governing
# permissions and limitations under the License.

# Package imports
import os
import sys
import numpy as np
import pandas as pd
import sklearn as sk
from sklearn import preprocessing
from sklearn import metrics
import random
from math import *
from decimal import Decimal
import pprint
from statsmodels.graphics import tsaplots
from statsmodels.tsa import stattools as stt
from statsmodels.stats import stattools as sstt
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt
from scipy import stats as sta
from statsmodels.tsa.seasonal import seasonal_decompose
import statsmodels.api as sm
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from sklearn.covariance import EllipticEnvelope
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import hurst
from .util import *
from .mlutil import *
from .sampler import *
from .stats import *

"""
Load data from a CSV file, data frame, numpy array or list.
Each data set (array like) is given a name while loading.
Perform various data exploration operations, referring to the data sets by name.
Save and restore the workspace if needed.
"""
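
# A minimal usage sketch (illustrative only, not part of the original module):
# columns are registered as named data sets, then explored by name. The file
# name and column names below are hypothetical.
#
#   from matumizi.daexp import DataExplorer
#   expl = DataExplorer()
#   expl.addFileNumericData("sales.csv", "price", "demand")
#   expl.getStats("price")
#   expl.plot("demand")
#   expl.save("./wspace.mod")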
class DataSetMetaData:
    """
    data set meta data
    """
    dtypeNum = 1
    dtypeCat = 2
    dtypeBin = 3

    def __init__(self, dtype):
        self.notes = list()
        self.dtype = dtype

    def addNote(self, note):
        """
        add note
        """
        self.notes.append(note)


class DataExplorer:
    """
    various data exploration functions
    """
    def __init__(self, verbose=True):
        """
        initialize

        Parameters
        verbose : True for verbosity
        """
        self.dataSets = dict()
        self.metaData = dict()
        self.pp = pprint.PrettyPrinter(indent=4)
        self.verbose = verbose

    def setVerbose(self, verbose):
        """
        sets verbose

        Parameters
        verbose : True for verbosity
        """
        self.verbose = verbose

    def save(self, filePath):
        """
        save checkpoint

        Parameters
        filePath : path of file where workspace is saved
        """
        self.__printBanner("saving workspace")
        ws = dict()
        ws["data"] = self.dataSets
        ws["metaData"] = self.metaData
        saveObject(ws, filePath)
        self.__printDone()

    def restore(self, filePath):
        """
        restore checkpoint

        Parameters
        filePath : path of file from where workspace is restored
        """
        self.__printBanner("restoring workspace")
        ws = restoreObject(filePath)
        self.dataSets = ws["data"]
        self.metaData = ws["metaData"]
        self.__printDone()

    def queryFileData(self, filePath, *columns):
        """
        query column data type from a data file

        Parameters
        filePath : path of file with data
        columns : indexes followed by column names, or column names
        """
        self.__printBanner("querying column data type from a file")
        lcolumns = list(columns)
        noHeader = type(lcolumns[0]) == int
        if noHeader:
            df = pd.read_csv(filePath, header=None)
        else:
            df = pd.read_csv(filePath, header=0)
        return self.queryDataFrameData(df, *columns)

    def queryDataFrameData(self, df, *columns):
        """
        query column data type from a data frame

        Parameters
        df : data frame with data
        columns : indexes followed by column names, or column names
        """
        self.__printBanner("querying column data type from a data frame")
        columns = list(columns)
        noHeader = type(columns[0]) == int
        dtypes = list()
        if noHeader:
            nCols = int(len(columns) / 2)
            colIndexes = columns[:nCols]
            cnames = columns[nCols:]
            nColsDf = len(df.columns)
            for i in range(nCols):
                ci = colIndexes[i]
                assert ci < nColsDf, "col index {} outside range".format(ci)
                col = df.loc[ : , ci]
                dtypes.append(self.getDataType(col))
        else:
            cnames = columns
            for c in columns:
                col = df[c]
                dtypes.append(self.getDataType(col))

        nt = list(zip(cnames, dtypes))
        result = self.__printResult("columns and data types", nt)
        return result

    def getDataType(self, col):
        """
        get data type

        Parameters
        col : array like containing data
        """
        if isBinary(col):
            dtype = "binary"
        elif isInteger(col):
            dtype = "integer"
        elif isFloat(col):
            dtype = "float"
        elif isCategorical(col):
            dtype = "categorical"
        else:
            dtype = "mixed"
        return dtype

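    # Illustrative example (hypothetical file and column names): for a headerless
    # CSV, pass column indexes followed by the names to assign; for a CSV with a
    # header, pass the column names directly.
    #
    #   expl.queryFileData("data.csv", 0, 2, "age", "income")   # no header
    #   expl.queryFileData("data.csv", "age", "income")         # with header
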
    def addFileNumericData(self, filePath, *columns):
        """
        add numeric columns from a file

        Parameters
        filePath : path of file with data
        columns : indexes followed by column names, or column names
        """
        self.__printBanner("adding numeric columns from a file")
        self.addFileData(filePath, True, *columns)
        self.__printDone()

    def addFileBinaryData(self, filePath, *columns):
        """
        add binary columns from a file

        Parameters
        filePath : path of file with data
        columns : indexes followed by column names, or column names
        """
        self.__printBanner("adding binary columns from a file")
        self.addFileData(filePath, False, *columns)
        self.__printDone()

    def addFileData(self, filePath, numeric, *columns):
        """
        add columns from a file

        Parameters
        filePath : path of file with data
        numeric : True if numeric, False if binary
        columns : indexes followed by column names, or column names
        """
        columns = list(columns)
        noHeader = type(columns[0]) == int
        if noHeader:
            df = pd.read_csv(filePath, header=None)
        else:
            df = pd.read_csv(filePath, header=0)
        self.addDataFrameData(df, numeric, *columns)

    def addDataFrameNumericData(self, df, *columns):
        """
        add numeric columns from a data frame

        Parameters
        df : data frame with data
        columns : indexes followed by column names, or column names
        """
        self.__printBanner("adding numeric columns from a data frame")
        self.addDataFrameData(df, True, *columns)

    def addDataFrameBinaryData(self, df, *columns):
        """
        add binary columns from a data frame

        Parameters
        df : data frame with data
        columns : indexes followed by column names, or column names
        """
        self.__printBanner("adding binary columns from a data frame")
        self.addDataFrameData(df, False, *columns)

    def addDataFrameData(self, df, numeric, *columns):
        """
        add columns from a data frame

        Parameters
        df : data frame with data
        numeric : True if numeric, False if binary
        columns : indexes followed by column names, or column names
        """
        columns = list(columns)
        noHeader = type(columns[0]) == int
        if noHeader:
            nCols = int(len(columns) / 2)
            colIndexes = columns[:nCols]
            nColsDf = len(df.columns)
            for i in range(nCols):
                ci = colIndexes[i]
                assert ci < nColsDf, "col index {} outside range".format(ci)
                col = df.loc[ : , ci]
                if numeric:
                    assert isNumeric(col), "data is not numeric"
                else:
                    assert isBinary(col), "data is not binary"
                col = col.to_numpy()
                cn = columns[i + nCols]
                dtype = DataSetMetaData.dtypeNum if numeric else DataSetMetaData.dtypeBin
                self.__addDataSet(cn, col, dtype)
        else:
            for c in columns:
                col = df[c]
                if numeric:
                    assert isNumeric(col), "data is not numeric"
                else:
                    assert isBinary(col), "data is not binary"
                col = col.to_numpy()
                dtype = DataSetMetaData.dtypeNum if numeric else DataSetMetaData.dtypeBin
                self.__addDataSet(c, col, dtype)

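    # Illustrative example (hypothetical names): the add methods register columns
    # under the given names so later calls can refer to them.
    #
    #   expl.addFileNumericData("data.csv", 0, 1, "age", "income")  # headerless file
    #   expl.addDataFrameBinaryData(df, "clicked")                  # data frame with header
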
    def __addDataSet(self, dsn, data, dtype):
        """
        add data set

        Parameters
        dsn : data set name
        data : numpy array data
        dtype : data type
        """
        self.dataSets[dsn] = data
        self.metaData[dsn] = DataSetMetaData(dtype)

    def addListNumericData(self, ds, name):
        """
        add numeric data from a list

        Parameters
        ds : list with data
        name : name of data set
        """
        self.__printBanner("adding numeric data from a list")
        self.addListData(ds, True, name)
        self.__printDone()

    def addListBinaryData(self, ds, name):
        """
        add binary data from a list

        Parameters
        ds : list with data
        name : name of data set
        """
        self.__printBanner("adding binary data from a list")
        self.addListData(ds, False, name)
        self.__printDone()

    def addListData(self, ds, numeric, name):
        """
        adds list data

        Parameters
        ds : list with data
        numeric : True if numeric, False if binary
        name : name of data set
        """
        assert type(ds) == list, "data not a list"
        if numeric:
            assert isNumeric(ds), "data is not numeric"
        else:
            assert isBinary(ds), "data is not binary"
        dtype = DataSetMetaData.dtypeNum if numeric else DataSetMetaData.dtypeBin
        self.dataSets[name] = np.array(ds)
        self.metaData[name] = DataSetMetaData(dtype)

    def addFileCatData(self, filePath, *columns):
        """
        add categorical columns from a file

        Parameters
        filePath : path of file with data
        columns : indexes followed by column names, or column names
        """
        self.__printBanner("adding categorical columns from a file")
        columns = list(columns)
        noHeader = type(columns[0]) == int
        if noHeader:
            df = pd.read_csv(filePath, header=None)
        else:
            df = pd.read_csv(filePath, header=0)

        self.addDataFrameCatData(df, *columns)
        self.__printDone()

    def addDataFrameCatData(self, df, *columns):
        """
        add categorical columns from a data frame

        Parameters
        df : data frame with data
        columns : indexes followed by column names, or column names
        """
        self.__printBanner("adding categorical columns from a data frame")
        columns = list(columns)
        noHeader = type(columns[0]) == int
        if noHeader:
            nCols = int(len(columns) / 2)
            colIndexes = columns[:nCols]
            nColsDf = len(df.columns)
            for i in range(nCols):
                ci = colIndexes[i]
                assert ci < nColsDf, "col index {} outside range".format(ci)
                col = df.loc[ : , ci]
                assert isCategorical(col), "data is not categorical"
                col = col.tolist()
                cn = columns[i + nCols]
                self.__addDataSet(cn, col, DataSetMetaData.dtypeCat)
        else:
            for c in columns:
                col = df[c].tolist()
                self.__addDataSet(c, col, DataSetMetaData.dtypeCat)

    def addListCatData(self, ds, name):
        """
        add categorical list data

        Parameters
        ds : list with data
        name : name of data set
        """
        self.__printBanner("adding categorical list data")
        assert type(ds) == list, "data not a list"
        assert isCategorical(ds), "data is not categorical"
        self.__addDataSet(name, ds, DataSetMetaData.dtypeCat)
        self.__printDone()

    def remData(self, ds):
        """
        removes data set

        Parameters
        ds : data set name
        """
        self.__printBanner("removing data set", ds)
        assert ds in self.dataSets, "data set {} does not exist, please add it first".format(ds)
        self.dataSets.pop(ds)
        self.metaData.pop(ds)
        names = self.showNames()
        self.__printDone()
        return names

    def addNote(self, ds, note):
        """
        add note for a data set

        Parameters
        ds : data set name
        note : note text
        """
        self.__printBanner("adding note")
        assert ds in self.dataSets, "data set {} does not exist, please add it first".format(ds)
        mdata = self.metaData[ds]
        mdata.addNote(note)
        self.__printDone()

    def getNotes(self, ds):
        """
        get notes for a data set

        Parameters
        ds : data set name
        """
        self.__printBanner("getting notes")
        assert ds in self.dataSets, "data set {} does not exist, please add it first".format(ds)
        mdata = self.metaData[ds]
        dnotes = mdata.notes
        if self.verbose:
            for dn in dnotes:
                print(dn)
        return dnotes

    def getNumericData(self, ds):
        """
        get numeric data

        Parameters
        ds : data set name or list or numpy array with data
        """
        if type(ds) == str:
            assert ds in self.dataSets, "data set {} does not exist, please add it first".format(ds)
            assert self.metaData[ds].dtype == DataSetMetaData.dtypeNum, "data set {} is expected to be numerical type for this operation".format(ds)
            data = self.dataSets[ds]
        elif type(ds) == list:
            assert isNumeric(ds), "data is not numeric"
            data = np.array(ds)
        elif type(ds) == np.ndarray:
            data = ds
        else:
            raise ValueError("invalid type, expecting data set name, list or ndarray")
        return data

    def getCatData(self, ds):
        """
        get categorical data

        Parameters
        ds : data set name or list with data
        """
        if type(ds) == str:
            assert ds in self.dataSets, "data set {} does not exist, please add it first".format(ds)
            assert self.metaData[ds].dtype == DataSetMetaData.dtypeCat, "data set {} is expected to be categorical type for this operation".format(ds)
            data = self.dataSets[ds]
        elif type(ds) == list:
            assert isCategorical(ds), "data is not categorical"
            data = ds
        else:
            raise ValueError("invalid type, expecting data set name or list")
        return data

    def getAnyData(self, ds):
        """
        get any data

        Parameters
        ds : data set name or list with data
        """
        if type(ds) == str:
            assert ds in self.dataSets, "data set {} does not exist, please add it first".format(ds)
            data = self.dataSets[ds]
        elif type(ds) == list:
            data = ds
        else:
            raise ValueError("invalid type, expecting data set name or list")
        return data

    def loadCatFloatDataFrame(self, ds1, ds2):
        """
        loads float and categorical data into a data frame

        Parameters
        ds1 : data set name or list
        ds2 : data set name or list or numpy array
        """
        data1 = self.getCatData(ds1)
        data2 = self.getNumericData(ds2)
        self.ensureSameSize([data1, data2])
        df1 = pd.DataFrame(data=data1)
        df2 = pd.DataFrame(data=data2)
        df = pd.concat([df1, df2], axis=1)
        df.columns = range(df.shape[1])
        return df

    def showNames(self):
        """
        lists data set names
        """
        self.__printBanner("listing data set names")
        names = self.dataSets.keys()
        if self.verbose:
            print("data sets")
            for ds in names:
                print(ds)
        self.__printDone()
        return names

    def plot(self, ds, yscale=None):
        """
        plots data

        Parameters
        ds : data set name or list or numpy array
        yscale : y scale
        """
        self.__printBanner("plotting data", ds)
        data = self.getNumericData(ds)
        drawLine(data, yscale)

    def plotZoomed(self, ds, beg, end, yscale=None):
        """
        plots zoomed data

        Parameters
        ds : data set name or list or numpy array
        beg : begin offset
        end : end offset
        yscale : y scale
        """
        self.__printBanner("plotting data", ds)
        data = self.getNumericData(ds)
        drawLine(data[beg:end], yscale)

    def scatterPlot(self, ds1, ds2):
        """
        scatter plots data

        Parameters
        ds1 : data set name or list or numpy array
        ds2 : data set name or list or numpy array
        """
        self.__printBanner("scatter plotting data", ds1, ds2)
        data1 = self.getNumericData(ds1)
        data2 = self.getNumericData(ds2)
        self.ensureSameSize([data1, data2])
        x = np.arange(1, len(data1) + 1, 1)
        plt.scatter(x, data1, color="red")
        plt.scatter(x, data2, color="blue")
        plt.show()

    def print(self, ds):
        """
        prints data

        Parameters
        ds : data set name or list or numpy array
        """
        self.__printBanner("printing data", ds)
        assert ds in self.dataSets, "data set {} does not exist, please add it first".format(ds)
        data = self.dataSets[ds]
        if self.verbose:
            print(formatAny(len(data), "size"))
            print("showing first 50 elements")
            print(data[:50])

    def plotHist(self, ds, cumulative, density, nbins=20):
        """
        plots histogram

        Parameters
        ds : data set name or list or numpy array
        cumulative : True if cumulative
        density : True to normalize for probability density
        nbins : num of bins
        """
        self.__printBanner("plotting histogram", ds)
        data = self.getNumericData(ds)
        plt.hist(data, bins=nbins, cumulative=cumulative, density=density)
        plt.show()

    def isMonotonicallyChanging(self, ds):
        """
        checks if monotonically increasing or decreasing

        Parameters
        ds : data set name or list or numpy array
        """
        self.__printBanner("checking monotonic change", ds)
        data = self.getNumericData(ds)
        monoIncreasing = all(list(map(lambda i : data[i] >= data[i-1], range(1, len(data), 1))))
        monoDecreasing = all(list(map(lambda i : data[i] <= data[i-1], range(1, len(data), 1))))
        result = self.__printResult("monoIncreasing", monoIncreasing, "monoDecreasing", monoDecreasing)
        return result

    def getFreqDistr(self, ds, nbins=20):
        """
        get histogram

        Parameters
        ds : data set name or list or numpy array
        nbins : num of bins
        """
        self.__printBanner("getting histogram", ds)
        data = self.getNumericData(ds)
        frequency, lowLimit, binsize, extraPoints = sta.relfreq(data, numbins=nbins)
        result = self.__printResult("frequency", frequency, "lowLimit", lowLimit, "binsize", binsize, "extraPoints", extraPoints)
        return result

    def getCumFreqDistr(self, ds, nbins=20):
        """
        get cumulative freq distribution

        Parameters
        ds : data set name or list or numpy array
        nbins : num of bins
        """
        self.__printBanner("getting cumulative freq distribution", ds)
        data = self.getNumericData(ds)
        cumFrequency, lowLimit, binsize, extraPoints = sta.cumfreq(data, numbins=nbins)
        result = self.__printResult("cumFrequency", cumFrequency, "lowLimit", lowLimit, "binsize", binsize, "extraPoints", extraPoints)
        return result

    def getExtremeValue(self, ds, ensamp, nsamp, polarity, doPlotDistr, nbins=20):
        """
        get extreme values

        Parameters
        ds : data set name or list or numpy array
        ensamp : num of samples for extreme values
        nsamp : num of samples
        polarity : max or min
        doPlotDistr : True to plot distribution
        nbins : num of bins
        """
        self.__printBanner("getting extreme values", ds)
        data = self.getNumericData(ds)
        evalues = list()
        for _ in range(ensamp):
            values = selectRandomSubListFromListWithRepl(data, nsamp)
            if polarity == "max":
                evalues.append(max(values))
            else:
                evalues.append(min(values))
        if doPlotDistr:
            plt.hist(evalues, bins=nbins, cumulative=False, density=True)
            plt.show()
        result = self.__printResult("extremeValues", evalues)
        return result

    def getEntropy(self, ds, nbins=20):
        """
        get entropy

        Parameters
        ds : data set name or list or numpy array
        nbins : num of bins
        """
        self.__printBanner("getting entropy", ds)
        data = self.getNumericData(ds)
        result = self.getFreqDistr(data, nbins)
        entropy = sta.entropy(result["frequency"])
        result = self.__printResult("entropy", entropy)
        return result

    def getRelEntropy(self, ds1, ds2, nbins=20):
        """
        get relative entropy or KL divergence with both data sets numeric

        Parameters
        ds1 : data set name or list or numpy array
        ds2 : data set name or list or numpy array
        nbins : num of bins
        """
        self.__printBanner("getting relative entropy or KL divergence", ds1, ds2)
        data1 = self.getNumericData(ds1)
        data2 = self.getNumericData(ds2)
        result1 = self.getFreqDistr(data1, nbins)
        freq1 = result1["frequency"]
        result2 = self.getFreqDistr(data2, nbins)
        freq2 = result2["frequency"]
        entropy = sta.entropy(freq1, freq2)
        result = self.__printResult("relEntropy", entropy)
        return result

    def getAnyEntropy(self, ds, dt, nbins=20):
        """
        get entropy of any data type, numeric or categorical

        Parameters
        ds : data set name or list or numpy array
        dt : data type num or cat
        nbins : num of bins
        """
        entropy = self.getEntropy(ds, nbins)["entropy"] if dt == "num" else self.getStatsCat(ds)["entropy"]
        result = self.__printResult("entropy", entropy)
        return result

    def getJointEntropy(self, ds1, ds2, nbins=20):
        """
        get joint entropy with both data sets numeric

        Parameters
        ds1 : data set name or list or numpy array
        ds2 : data set name or list or numpy array
        nbins : num of bins
        """
        self.__printBanner("getting joint entropy", ds1, ds2)
        data1 = self.getNumericData(ds1)
        data2 = self.getNumericData(ds2)
        self.ensureSameSize([data1, data2])
        hist, xedges, yedges = np.histogram2d(data1, data2, bins=nbins)
        hist = hist.flatten()
        ssize = len(data1)
        hist = hist / ssize
        entropy = sta.entropy(hist)
        result = self.__printResult("jointEntropy", entropy)
        return result

    def getAllNumMutualInfo(self, ds1, ds2, nbins=20):
        """
        get mutual information for both numeric data

        Parameters
        ds1 : data set name or list or numpy array
        ds2 : data set name or list or numpy array
        nbins : num of bins
        """
        self.__printBanner("getting mutual information", ds1, ds2)
        en1 = self.getEntropy(ds1, nbins)
        en2 = self.getEntropy(ds2, nbins)
        en = self.getJointEntropy(ds1, ds2, nbins)

        mutInfo = en1["entropy"] + en2["entropy"] - en["jointEntropy"]
        result = self.__printResult("mutInfo", mutInfo)
        return result

    def getNumCatMutualInfo(self, nds, cds, nbins=20):
        """
        get mutual information between numeric and categorical data

        Parameters
        nds : numeric data set name or list or numpy array
        cds : categorical data set name or list
        nbins : num of bins
        """
        self.__printBanner("getting mutual information of numerical and categorical data", nds, cds)
        ndata = self.getNumericData(nds)
        cds = self.getCatData(cds)
        nentr = self.getEntropy(nds)["entropy"]

        # conditional entropy
        cdistr = self.getStatsCat(cds)["distr"]
        grdata = self.getGroupByData(nds, cds, True)["groupedData"]
        cnentr = 0
        for gr, data in grdata.items():
            self.addListNumericData(data, "grdata")
            gnentr = self.getEntropy("grdata")["entropy"]
            cnentr += gnentr * cdistr[gr]

        mutInfo = nentr - cnentr
        result = self.__printResult("mutInfo", mutInfo, "entropy", nentr, "condEntropy", cnentr)
        return result

    def getTwoCatMutualInfo(self, cds1, cds2):
        """
        get mutual information between 2 categorical data sets

        Parameters
        cds1 : categorical data set name or list
        cds2 : categorical data set name or list
        """
        self.__printBanner("getting mutual information of two categorical data sets", cds1, cds2)
        cdata1 = self.getCatData(cds1)
        cdata2 = self.getCatData(cds2)
        centr = self.getStatsCat(cds1)["entropy"]

        # conditional entropy
        cdistr = self.getStatsCat(cds2)["distr"]
        grdata = self.getGroupByData(cds1, cds2, True)["groupedData"]
        ccentr = 0
        for gr, data in grdata.items():
            self.addListCatData(data, "grdata")
            gcentr = self.getStatsCat("grdata")["entropy"]
            ccentr += gcentr * cdistr[gr]

        mutInfo = centr - ccentr
        result = self.__printResult("mutInfo", mutInfo, "entropy", centr, "condEntropy", ccentr)
        return result

    def getMutualInfo(self, dst, nbins=20):
        """
        get mutual information between 2 data sets, any combination of numerical and categorical

        Parameters
        dst : data source, data type, data source, data type
        nbins : num of bins
        """
        assertEqual(len(dst), 4, "invalid data source and data type list size")
        dtypes = ["num", "cat"]
        assertInList(dst[1], dtypes, "invalid data type")
        assertInList(dst[3], dtypes, "invalid data type")
        self.__printBanner("getting mutual information of any mix of numerical and categorical data", dst[0], dst[2])

        if dst[1] == "num":
            mutInfo = self.getAllNumMutualInfo(dst[0], dst[2], nbins)["mutInfo"] if dst[3] == "num" \
            else self.getNumCatMutualInfo(dst[0], dst[2], nbins)["mutInfo"]
        else:
            mutInfo = self.getNumCatMutualInfo(dst[2], dst[0], nbins)["mutInfo"] if dst[3] == "num" \
            else self.getTwoCatMutualInfo(dst[2], dst[0])["mutInfo"]

        result = self.__printResult("mutInfo", mutInfo)
        return result

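    # Illustrative example (hypothetical data set names): dst lists each source
    # followed by its type, so numeric and categorical sets mix freely.
    #
    #   dst = ["income", "num", "education", "cat"]
    #   expl.getMutualInfo(dst, nbins=20)
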
    def getCondMutualInfo(self, dst, nbins=20):
        """
        get conditional mutual information between 2 data sets, any combination of numerical and categorical

        Parameters
        dst : data source, data type, data source, data type, data source, data type
        nbins : num of bins
        """
        assertEqual(len(dst), 6, "invalid data source and data type list size")
        dtypes = ["num", "cat"]
        assertInList(dst[1], dtypes, "invalid data type")
        assertInList(dst[3], dtypes, "invalid data type")
        assertInList(dst[5], dtypes, "invalid data type")
        self.__printBanner("getting conditional mutual information of any mix of numerical and categorical data", dst[0], dst[2])

        if dst[5] == "cat":
            cdistr = self.getStatsCat(dst[4])["distr"]
            grdata1 = self.getGroupByData(dst[0], dst[4], True)["groupedData"]
            grdata2 = self.getGroupByData(dst[2], dst[4], True)["groupedData"]
        else:
            gdata = self.getNumericData(dst[4])
            hist = Histogram.createWithNumBins(gdata, nbins)
            cdistr = hist.distr()
            grdata1 = self.getGroupByData(dst[0], dst[4], False)["groupedData"]
            grdata2 = self.getGroupByData(dst[2], dst[4], False)["groupedData"]

        cminfo = 0
        for gr in grdata1.keys():
            data1 = grdata1[gr]
            data2 = grdata2[gr]
            if dst[1] == "num":
                self.addListNumericData(data1, "grdata1")
            else:
                self.addListCatData(data1, "grdata1")

            if dst[3] == "num":
                self.addListNumericData(data2, "grdata2")
            else:
                self.addListCatData(data2, "grdata2")
            gdst = ["grdata1", dst[1], "grdata2", dst[3]]
            minfo = self.getMutualInfo(gdst, nbins)["mutInfo"]
            cminfo += minfo * cdistr[gr]

        result = self.__printResult("condMutInfo", cminfo)
        return result

    def getPercentile(self, ds, value):
        """
        gets percentile of a value

        Parameters
        ds : data set name or list or numpy array
        value : the value
        """
        self.__printBanner("getting percentile", ds)
        data = self.getNumericData(ds)
        percent = sta.percentileofscore(data, value)
        result = self.__printResult("value", value, "percentile", percent)
        return result

    def getValueRangePercentile(self, ds, value1, value2):
        """
        gets percentile difference for a value range

        Parameters
        ds : data set name or list or numpy array
        value1 : first value
        value2 : second value
        """
        self.__printBanner("getting percentile difference for value range", ds)
        if value1 < value2:
            v1 = value1
            v2 = value2
        else:
            v1 = value2
            v2 = value1
        data = self.getNumericData(ds)
        per1 = sta.percentileofscore(data, v1)
        per2 = sta.percentileofscore(data, v2)
        result = self.__printResult("valueFirst", value1, "valueSecond", value2, "percentileDiff", per2 - per1)
        return result

    def getValueAtPercentile(self, ds, percent):
        """
        gets value at percentile

        Parameters
        ds : data set name or list or numpy array
        percent : percentile
        """
        self.__printBanner("getting value at percentile", ds)
        data = self.getNumericData(ds)
        assert isInRange(percent, 0, 100), "percent should be between 0 and 100"
        value = sta.scoreatpercentile(data, percent)
        result = self.__printResult("value", value, "percentile", percent)
        return result

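    # Illustrative example (hypothetical data set name): percentile queries in
    # both directions.
    #
    #   expl.getPercentile("income", 50000)        # percentile of a value
    #   expl.getValueAtPercentile("income", 90)    # value at a percentile
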
    def getLessThanValues(self, ds, cvalue):
        """
        gets values less than given value

        Parameters
        ds : data set name or list or numpy array
        cvalue : condition value
        """
        self.__printBanner("getting values less than", ds)
        fdata = self.__getCondValues(ds, cvalue, "lt")
        result = self.__printResult("count", len(fdata), "lessThanValues", fdata)
        return result

    def getGreaterThanValues(self, ds, cvalue):
        """
        gets values greater than given value

        Parameters
        ds : data set name or list or numpy array
        cvalue : condition value
        """
        self.__printBanner("getting values greater than", ds)
        fdata = self.__getCondValues(ds, cvalue, "gt")
        result = self.__printResult("count", len(fdata), "greaterThanValues", fdata)
        return result

    def __getCondValues(self, ds, cvalue, cond):
        """
        gets conditional values

        Parameters
        ds : data set name or list or numpy array
        cvalue : condition value
        cond : condition
        """
        data = self.getNumericData(ds)
        if cond == "lt":
            ind = np.where(data < cvalue)
        else:
            ind = np.where(data > cvalue)
        fdata = data[ind]
        return fdata

    def getUniqueValueCounts(self, ds, maxCnt=10):
        """
        gets unique values and counts

        Parameters
        ds : data set name or list or numpy array
        maxCnt : max value count pairs to return
        """
        self.__printBanner("getting unique values and counts", ds)
        data = self.getNumericData(ds)
        values, counts = sta.find_repeats(data)
        cardinality = len(values)
        vc = list(zip(values, counts))
        vc.sort(key = lambda v : v[1], reverse = True)
        result = self.__printResult("cardinality", cardinality, "unique values and repeat counts", vc[:maxCnt])
        return result

    def getCatUniqueValueCounts(self, ds, maxCnt=10):
        """
        gets unique categorical values and counts

        Parameters
        ds : data set name or list or numpy array
        maxCnt : max value count pairs to return
        """
        self.__printBanner("getting unique categorical values and counts", ds)
        data = self.getCatData(ds)
        series = pd.Series(data)
        uvalues = series.value_counts()
        values = uvalues.index.tolist()
        counts = uvalues.tolist()
        vc = list(zip(values, counts))
        vc.sort(key = lambda v : v[1], reverse = True)
        result = self.__printResult("cardinality", len(values), "unique values and repeat counts", vc[:maxCnt])
        return result

    def getCatAlphaValueCounts(self, ds):
        """
        gets alphabetic value count

        Parameters
        ds : data set name or list or numpy array
        """
        self.__printBanner("getting alphabetic value counts", ds)
        data = self.getCatData(ds)
        series = pd.Series(data)
        flags = series.str.isalpha().tolist()
        count = sum(flags)
        result = self.__printResult("alphabeticValueCount", count)
        return result

    def getCatNumValueCounts(self, ds):
        """
        gets numeric value count

        Parameters
        ds : data set name or list or numpy array
        """
        self.__printBanner("getting numeric value counts", ds)
        data = self.getCatData(ds)
        series = pd.Series(data)
        flags = series.str.isnumeric().tolist()
        count = sum(flags)
        result = self.__printResult("numericValueCount", count)
        return result

    def getCatAlphaNumValueCounts(self, ds):
        """
        gets alphanumeric value count

        Parameters
        ds : data set name or list or numpy array
        """
        self.__printBanner("getting alphanumeric value counts", ds)
        data = self.getCatData(ds)
        series = pd.Series(data)
        flags = series.str.isalnum().tolist()
        count = sum(flags)
        result = self.__printResult("alphaNumericValueCount", count)
        return result

    def getCatAllCharCounts(self, ds):
        """
        gets alphabetic, numeric and special char count list

        Parameters
        ds : data set name or list or numpy array
        """
        self.__printBanner("getting alphabetic, numeric and special char counts", ds)
        data = self.getCatData(ds)
        counts = list()
        for d in data:
            r = getAlphaNumCharCount(d)
            counts.append(r)
        result = self.__printResult("allTypeCharCounts", counts)
        return result

    def getCatAlphaCharCounts(self, ds):
        """
        gets alphabetic char count list

        Parameters
        ds : data set name or list or numpy array
        """
        self.__printBanner("getting alphabetic char counts", ds)
        data = self.getCatData(ds)
        counts = self.getCatAllCharCounts(ds)["allTypeCharCounts"]
        counts = list(map(lambda r : r[0], counts))
        result = self.__printResult("alphaCharCounts", counts)
        return result

    def getCatNumCharCounts(self, ds):
        """
        gets numeric char count list

        Parameters
        ds : data set name or list or numpy array
        """
        self.__printBanner("getting numeric char counts", ds)
        data = self.getCatData(ds)
        counts = self.getCatAllCharCounts(ds)["allTypeCharCounts"]
        counts = list(map(lambda r : r[1], counts))
        result = self.__printResult("numCharCounts", counts)
        return result

    def getCatSpecialCharCounts(self, ds):
        """
        gets special char count list

        Parameters
        ds : data set name or list or numpy array
        """
        self.__printBanner("getting special char counts", ds)
        counts = self.getCatAllCharCounts(ds)["allTypeCharCounts"]
        counts = list(map(lambda r : r[2], counts))
        result = self.__printResult("specialCharCounts", counts)
        return result

    def getCatAlphaCharCountStats(self, ds):
        """
        gets alphabetic char count stats

        Parameters
        ds : data set name or list or numpy array
        """
        self.__printBanner("getting alphabetic char count stats", ds)
        counts = self.getCatAlphaCharCounts(ds)["alphaCharCounts"]
        nz = counts.count(0)
        st = self.__getBasicStats(np.array(counts))
        result = self.__printResult("mean", st[0], "std dev", st[1], "max", st[2], "min", st[3], "zeroCount", nz)
        return result

    def getCatNumCharCountStats(self, ds):
        """
        gets numeric char count stats

        Parameters
        ds : data set name or list or numpy array
        """
        self.__printBanner("getting numeric char count stats", ds)
        counts = self.getCatNumCharCounts(ds)["numCharCounts"]
        nz = counts.count(0)
        st = self.__getBasicStats(np.array(counts))
        result = self.__printResult("mean", st[0], "std dev", st[1], "max", st[2], "min", st[3], "zeroCount", nz)
        return result

    def getCatSpecialCharCountStats(self, ds):
        """
        gets special char count stats

        Parameters
        ds : data set name or list or numpy array
        """
        self.__printBanner("getting special char count stats", ds)
        counts = self.getCatSpecialCharCounts(ds)["specialCharCounts"]
        nz = counts.count(0)
        st = self.__getBasicStats(np.array(counts))
        result = self.__printResult("mean", st[0], "std dev", st[1], "max", st[2], "min", st[3], "zeroCount", nz)
        return result

    def getCatFldLenStats(self, ds):
        """
        gets field length stats

        Parameters
        ds : data set name or list or numpy array
        """
        self.__printBanner("getting field length stats", ds)
        data = self.getCatData(ds)
        le = list(map(lambda d : len(d), data))
        st = self.__getBasicStats(np.array(le))
        result = self.__printResult("mean", st[0], "std dev", st[1], "max", st[2], "min", st[3])
        return result

    def getCatCharCountStats(self, ds, ch):
        """
        gets occurrence count stats for a specified char

        Parameters
        ds : data set name or list or numpy array
        ch : character
        """
        self.__printBanner("getting char occurrence count stats", ds)
        data = self.getCatData(ds)
        counts = list(map(lambda d : d.count(ch), data))
        nz = counts.count(0)
        st = self.__getBasicStats(np.array(counts))
        result = self.__printResult("mean", st[0], "std dev", st[1], "max", st[2], "min", st[3], "zeroCount", nz)
        return result

    def getStats(self, ds, nextreme=5):
        """
        gets summary statistics

        Parameters
        ds : data set name or list or numpy array
        nextreme : num of extreme values
        """
        self.__printBanner("getting summary statistics", ds)
        data = self.getNumericData(ds)
        stat = dict()
        stat["length"] = len(data)
        stat["min"] = data.min()
        stat["max"] = data.max()
        series = pd.Series(data)
        stat["n smallest"] = series.nsmallest(n=nextreme).tolist()
        stat["n largest"] = series.nlargest(n=nextreme).tolist()
        stat["mean"] = data.mean()
        stat["median"] = np.median(data)
        mode, modeCnt = sta.mode(data)
        stat["mode"] = mode[0]
        stat["mode count"] = modeCnt[0]
        stat["std"] = np.std(data)
        stat["skew"] = sta.skew(data)
        stat["kurtosis"] = sta.kurtosis(data)
        stat["mad"] = sta.median_absolute_deviation(data)
        self.pp.pprint(stat)
        return stat

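    # Illustrative example (hypothetical data set name):
    #
    #   st = expl.getStats("price", nextreme=3)
    #   print(st["mean"], st["std"])
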
    def getStatsCat(self, ds):
        """
        gets summary statistics for categorical data

        Parameters
        ds : data set name or list or numpy array
        """
        self.__printBanner("getting summary statistics for categorical data", ds)
        data = self.getCatData(ds)
        ch = CatHistogram()
        for d in data:
            ch.add(d)
        mode = ch.getMode()
        entr = ch.getEntropy()
        uvalues = ch.getUniqueValues()
        distr = ch.getDistr()
        result = self.__printResult("entropy", entr, "mode", mode, "uniqueValues", uvalues, "distr", distr)
        return result

    def getGroupByData(self, ds, gds, gdtypeCat, numBins=20):
        """
        group by

        Parameters
        ds : data set name or list or numpy array
        gds : group by data set name or list or numpy array
        gdtypeCat : True if group by data is categorical
        numBins : num of bins for numeric group by data
        """
        self.__printBanner("getting group by data", ds)
        data = self.getAnyData(ds)
        if gdtypeCat:
            gdata = self.getCatData(gds)
        else:
            gdata = self.getNumericData(gds)
            hist = Histogram.createWithNumBins(gdata, numBins)
            gdata = list(map(lambda d : hist.bin(d), gdata))

        self.ensureSameSize([data, gdata])
        groups = dict()
        for g, d in zip(gdata, data):
            appendKeyedList(groups, g, d)

        ve = self.verbose
        self.verbose = False
        result = self.__printResult("groupedData", groups)
        self.verbose = ve
        return result

    def getDifference(self, ds, order, doPlot=False):
        """
        gets difference of given order

        Parameters
        ds : data set name or list or numpy array
        order : order of difference
        doPlot : True for plot
        """
        self.__printBanner("getting difference of given order", ds)
        data = self.getNumericData(ds)
        diff = difference(data, order)
        if doPlot:
            drawLine(diff)
        return diff

    def getTrend(self, ds, doPlot=False):
        """
        get trend

        Parameters
        ds : data set name or list or numpy array
        doPlot : True if plotting needed
        """
        self.__printBanner("getting trend")
        data = self.getNumericData(ds)
        sz = len(data)
        X = list(range(0, sz))
        X = np.reshape(X, (sz, 1))
        model = LinearRegression()
        model.fit(X, data)
        trend = model.predict(X)
        sc = model.score(X, data)
        coef = model.coef_
        intc = model.intercept_
        result = self.__printResult("coeff", coef, "intercept", intc, "r squared", sc, "trend", trend)

        if doPlot:
            plt.plot(data)
            plt.plot(trend)
            plt.show()
        return result

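    # Illustrative example (hypothetical data set name): the fitted trend returned
    # by getTrend can be passed to deTrend, defined below.
    #
    #   res = expl.getTrend("sales", doPlot=True)
    #   detrended = expl.deTrend("sales", res["trend"])
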
    def getDiffSdNoisiness(self, ds):
        """
        get noisiness based on std dev of first order difference

        Parameters
        ds : data set name or list or numpy array
        """
        diff = self.getDifference(ds, 1)
        noise = np.std(np.array(diff))
        result = self.__printResult("noisiness", noise)
        return result

    def getMaRmseNoisiness(self, ds, wsize=5):
        """
        gets noisiness based on RMSE with moving average

        Parameters
        ds : data set name or list or numpy array
        wsize : window size
        """
        assert wsize % 2 == 1, "window size must be odd"
        data = self.getNumericData(ds)
        wind = data[:wsize]
        wstat = SlidingWindowStat.initialize(wind.tolist())

        whsize = int(wsize / 2)
        beg = whsize
        end = len(data) - whsize - 1
        sumSq = 0.0
        mean = wstat.getStat()[0]
        diff = data[beg] - mean
        sumSq += diff * diff
        for i in range(beg + 1, end, 1):
            mean = wstat.addGetStat(data[i + whsize])[0]
            diff = data[i] - mean
            sumSq += (diff * diff)

        noise = sqrt(sumSq / (len(data) - 2 * whsize))
        result = self.__printResult("noisiness", noise)
        return result

    def deTrend(self, ds, trend, doPlot=False):
        """
        de trend

        Parameters
        ds : data set name or list or numpy array
        trend : trend data
        doPlot : True if plotting needed
        """
        self.__printBanner("doing de trend", ds)
        data = self.getNumericData(ds)
        sz = len(data)
        detrended = list(map(lambda i : data[i] - trend[i], range(sz)))
        if doPlot:
            drawLine(detrended)
        return detrended

    def getTimeSeriesComponents(self, ds, model, freq, summaryOnly, doPlot=False):
        """
        extracts trend, cycle and residue components of time series

        Parameters
        ds : data set name or list or numpy array
        model : model type
        freq : seasonality period
        summaryOnly : True if only summary needed in output
        doPlot : True if plotting needed
        """
        self.__printBanner("extracting trend, cycle and residue components of time series", ds)
        assert model == "additive" or model == "multiplicative", "model must be additive or multiplicative"
        data = self.getNumericData(ds)
        res = seasonal_decompose(data, model=model, period=freq)
        if doPlot:
            res.plot()
            plt.show()

        # summary of components
        trend = np.array(removeNan(res.trend))
        trendMean = trend.mean()
        trendSlope = (trend[-1] - trend[0]) / (len(trend) - 1)
        seasonal = np.array(removeNan(res.seasonal))
        seasonalAmp = (seasonal.max() - seasonal.min()) / 2
        resid = np.array(removeNan(res.resid))
        residueMean = resid.mean()
        residueStdDev = np.std(resid)

        if summaryOnly:
            result = self.__printResult("trendMean", trendMean, "trendSlope", trendSlope, "seasonalAmp", seasonalAmp,
            "residueMean", residueMean, "residueStdDev", residueStdDev)
        else:
            result = self.__printResult("trendMean", trendMean, "trendSlope", trendSlope, "seasonalAmp", seasonalAmp,
            "residueMean", residueMean, "residueStdDev", residueStdDev, "trend", res.trend, "seasonal", res.seasonal,
            "residual", res.resid)
        return result

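    # Illustrative example (hypothetical data set name): monthly data with a
    # yearly cycle would use a period of 12.
    #
    #   expl.getTimeSeriesComponents("sales", "additive", 12, True, doPlot=False)
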
    def getGausianMixture(self, ncomp, cvType, ninit, *dsl):
        """
        finds gaussian mixture parameters

        Parameters
        ncomp : num of gaussian components
        cvType : covariance type
        ninit : num of initializations
        dsl : list of data set name or list or numpy array
        """
        self.__printBanner("getting gaussian mixture parameters", *dsl)
        assertInList(cvType, ["full", "tied", "diag", "spherical"], "invalid covariance type")
        dmat = self.__stackData(*dsl)

        gm = GaussianMixture(n_components=ncomp, covariance_type=cvType, n_init=ninit)
        gm.fit(dmat)
        weights = gm.weights_
        means = gm.means_
        covars = gm.covariances_
        converged = gm.converged_
        niter = gm.n_iter_
        aic = gm.aic(dmat)
        result = self.__printResult("weights", weights, "mean", means, "covariance", covars, "converged", converged, "num iterations", niter, "aic", aic)
        return result

    def getKmeansCluster(self, nclust, ninit, *dsl):
        """
        gets cluster parameters

        Parameters
        nclust : num of clusters
        ninit : num of initializations
        dsl : list of data set name or list or numpy array
        """
        self.__printBanner("getting kmeans cluster parameters", *dsl)
        dmat = self.__stackData(*dsl)
        nsamp = dmat.shape[0]

        km = KMeans(n_clusters=nclust, n_init=ninit)
        km.fit(dmat)
        centers = km.cluster_centers_
        avdist = sqrt(km.inertia_ / nsamp)
        niter = km.n_iter_
        score = km.score(dmat)
        result = self.__printResult("centers", centers, "average distance", avdist, "num iterations", niter, "score", score)
        return result

    def getPrincComp(self, ncomp, *dsl):
        """
        finds principal component parameters

        Parameters
        ncomp : num of principal components
        dsl : list of data set name or list or numpy array
        """
        self.__printBanner("getting principal component parameters", *dsl)
        dmat = self.__stackData(*dsl)
        nfeat = dmat.shape[1]
        assertGreater(nfeat, 1, "requires multiple features")
        assertLesserEqual(ncomp, nfeat, "num of components greater than num of features")

        pca = PCA(n_components=ncomp)
        pca.fit(dmat)
        comps = pca.components_
        var = pca.explained_variance_
        varr = pca.explained_variance_ratio_
        svalues = pca.singular_values_
        result = self.__printResult("components", comps, "variance", var, "variance ratio", varr, "singular values", svalues)
        return result

    def getOutliersWithIsoForest(self, contamination, *dsl):
        """
        finds outliers using isolation forest

        Parameters
        contamination : proportion of outliers in the data set
        dsl : list of data set name or list or numpy array
        """
        self.__printBanner("getting outliers using isolation forest", *dsl)
        assert contamination >= 0 and contamination <= 0.5, "contamination outside valid range"
        dmat = self.__stackData(*dsl)

        # the "behaviour" argument of older scikit-learn versions is omitted here,
        # since it was removed in later releases
        isf = IsolationForest(contamination=contamination)
        ypred = isf.fit_predict(dmat)
        mask = ypred == -1
        doul = dmat[mask, :]
        mask = ypred != -1
        dwoul = dmat[mask, :]
        result = self.__printResult("numOutliers", doul.shape[0], "outliers", doul, "dataWithoutOutliers", dwoul)
        return result

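    # Illustrative example (hypothetical data set names): multiple data sets are
    # stacked as features, with an assumed 5 percent contamination.
    #
    #   res = expl.getOutliersWithIsoForest(0.05, "price", "demand")
    #   print(res["numOutliers"])
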
+     def getOutliersWithLocalFactor(self, contamination, *dsl):
+         """
+         gets outliers using local outlier factor
+
+         Parameters
+             contamination : proportion of outliers in the data set
+             dsl : list of data set name or list or numpy array
+         """
+         self.__printBanner("getting outliers using local outlier factor", *dsl)
+         assert contamination >= 0 and contamination <= 0.5, "contamination outside valid range"
+         dmat = self.__stackData(*dsl)
+
+         lof = LocalOutlierFactor(contamination=contamination)
+         ypred = lof.fit_predict(dmat)
+         mask = ypred == -1
+         doul = dmat[mask, :]
+         mask = ypred != -1
+         dwoul = dmat[mask, :]
+         result = self.__printResult("numOutliers", doul.shape[0], "outliers", doul, "dataWithoutOutliers", dwoul)
+         return result
+
+     def getOutliersWithSupVecMach(self, nu, *dsl):
+         """
+         gets outliers using one class svm
+
+         Parameters
+             nu : upper bound on the fraction of training errors and a lower bound of the fraction of support vectors
+             dsl : list of data set name or list or numpy array
+         """
+         self.__printBanner("getting outliers using one class svm", *dsl)
+         assert nu >= 0 and nu <= 0.5, "error upper bound outside valid range"
+         dmat = self.__stackData(*dsl)
+
+         svm = OneClassSVM(nu=nu)
+         ypred = svm.fit_predict(dmat)
+         mask = ypred == -1
+         doul = dmat[mask, :]
+         mask = ypred != -1
+         dwoul = dmat[mask, :]
+         result = self.__printResult("numOutliers", doul.shape[0], "outliers", doul, "dataWithoutOutliers", dwoul)
+         return result
+
+     def getOutliersWithCovarDeterminant(self, contamination, *dsl):
+         """
+         gets outliers using covariance determinant
+
+         Parameters
+             contamination : proportion of outliers in the data set
+             dsl : list of data set name or list or numpy array
+         """
+         self.__printBanner("getting outliers using covariance determinant", *dsl)
+         assert contamination >= 0 and contamination <= 0.5, "contamination outside valid range"
+         dmat = self.__stackData(*dsl)
+
+         env = EllipticEnvelope(contamination=contamination)
+         ypred = env.fit_predict(dmat)
+         mask = ypred == -1
+         doul = dmat[mask, :]
+         mask = ypred != -1
+         dwoul = dmat[mask, :]
+         result = self.__printResult("numOutliers", doul.shape[0], "outliers", doul, "dataWithoutOutliers", dwoul)
+         return result
+
+     def getOutliersWithZscore(self, ds, zthreshold, stats=None):
+         """
+         gets outliers using zscore
+
+         Parameters
+             ds : data set name or list or numpy array
+             zthreshold : z score threshold
+             stats : tuple containing mean and std dev
+         """
+         self.__printBanner("getting outliers using zscore", ds)
+         data = self.getNumericData(ds)
+         if stats is None:
+             mean = data.mean()
+             sd = np.std(data)
+         else:
+             mean = stats[0]
+             sd = stats[1]
+
+         zs = list(map(lambda d : abs((d - mean) / sd), data))
+         outliers = list(filter(lambda r : r[1] > zthreshold, enumerate(zs)))
+         result = self.__printResult("outliers", outliers)
+         return result
+
+     def getOutliersWithRobustZscore(self, ds, zthreshold, stats=None):
+         """
+         gets outliers using robust zscore
+
+         Parameters
+             ds : data set name or list or numpy array
+             zthreshold : z score threshold
+             stats : tuple containing median and median absolute deviation
+         """
+         self.__printBanner("getting outliers using robust zscore", ds)
+         data = self.getNumericData(ds)
+         if stats is None:
+             med = np.median(data)
+             dev = np.array(list(map(lambda d : abs(d - med), data)))
+             #1.4826 is the standard consistency constant for normally distributed data
+             mad = 1.4826 * np.median(dev)
+         else:
+             med = stats[0]
+             mad = stats[1]
+
+         rzs = list(map(lambda d : abs((d - med) / mad), data))
+         outliers = list(filter(lambda r : r[1] > zthreshold, enumerate(rzs)))
+         result = self.__printResult("outliers", outliers)
+         return result
+
+
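A small numeric sketch of the robust z-score above; the median absolute deviation barely moves when one gross outlier is present, so that point gets a very large score.

import numpy as np

data = np.array([10.1, 9.8, 10.3, 9.9, 10.0, 25.0])
med = np.median(data)
mad = 1.4826 * np.median(np.abs(data - med))
rzs = np.abs(data - med) / mad
print([(i, round(z, 1)) for i, z in enumerate(rzs) if z > 3.0])  # flags only index 5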
+     def getSubsequenceOutliersWithDissimilarity(self, subSeqSize, ds):
+         """
+         gets subsequence outlier with subsequence pairwise dissimilarity
+
+         Parameters
+             subSeqSize : sub sequence size
+             ds : data set name or list or numpy array
+         """
+         self.__printBanner("doing sub sequence anomaly detection with dissimilarity", ds)
+         data = self.getNumericData(ds)
+         sz = len(data)
+         dist = dict()
+         minDist = dict()
+         for i in range(sz - subSeqSize):
+             #first window
+             w1 = data[i : i + subSeqSize]
+             dmin = None
+             for j in range(sz - subSeqSize):
+                 #second window not overlapping with the first
+                 if j + subSeqSize <= i or j >= i + subSeqSize:
+                     w2 = data[j : j + subSeqSize]
+                     k = (j, i)
+                     if k in dist:
+                         d = dist[k]
+                     else:
+                         d = euclideanDistance(w1, w2)
+                         k = (i, j)
+                         dist[k] = d
+                     if dmin is None:
+                         dmin = d
+                     else:
+                         dmin = d if d < dmin else dmin
+             minDist[i] = dmin
+
+         #find max of min
+         dmax = None
+         offset = None
+         for k in minDist.keys():
+             d = minDist[k]
+             if dmax is None:
+                 dmax = d
+                 offset = k
+             else:
+                 if d > dmax:
+                     dmax = d
+                     offset = k
+         result = self.__printResult("subSeqOffset", offset, "outlierScore", dmax)
+         return result
+
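A compact numpy sketch of the same max-of-min-distances discord idea on a synthetic series with an injected anomaly; the reported offset should land near index 50.

import numpy as np

rng = np.random.default_rng(3)
data = np.sin(np.linspace(0, 20, 200)) + 0.05 * rng.normal(size=200)
data[50:58] += 2.0  # injected anomaly
w = 8
wins = [data[i:i + w] for i in range(len(data) - w)]
minDist = [min(np.linalg.norm(wins[i] - wins[j]) for j in range(len(wins))
           if j + w <= i or j >= i + w) for i in range(len(wins))]
print(int(np.argmax(minDist)))  # offset of the most dissimilar window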
+     def getNullCount(self, ds):
+         """
+         get count of null fields
+
+         Parameters
+             ds : data set name or list or numpy array with data
+         """
+         self.__printBanner("getting null value count", ds)
+         if type(ds) == str:
+             assert ds in self.dataSets, "data set {} does not exist, please add it first".format(ds)
+             data = self.dataSets[ds]
+             ser = pd.Series(data)
+         elif type(ds) == list or type(ds) == np.ndarray:
+             ser = pd.Series(ds)
+             data = ds
+         else:
+             raise ValueError("invalid data type")
+         nv = ser.isnull().tolist()
+         nullCount = nv.count(True)
+         nullFraction = nullCount / len(data)
+         result = self.__printResult("nullFraction", nullFraction, "nullCount", nullCount)
+         return result
+
+
+     def fitLinearReg(self, dsx, ds, doPlot=False):
+         """
+         fits linear regression
+
+         Parameters
+             dsx : x data set name or None
+             ds : data set name or list or numpy array
+             doPlot : true if plotting needed
+         """
+         self.__printBanner("fitting linear regression", ds)
+         data = self.getNumericData(ds)
+         if dsx is None:
+             x = np.arange(len(data))
+         else:
+             x = self.getNumericData(dsx)
+         slope, intercept, rvalue, pvalue, stderr = sta.linregress(x, data)
+         result = self.__printResult("slope", slope, "intercept", intercept, "rvalue", rvalue, "pvalue", pvalue, "stderr", stderr)
+         if doPlot:
+             self.plotRegFit(x, data, slope, intercept)
+         return result
+
+     def fitSiegelRobustLinearReg(self, ds, doPlot=False):
+         """
+         Siegel robust linear regression fit based on median
+
+         Parameters
+             ds : data set name or list or numpy array
+             doPlot : true if plotting needed
+         """
+         self.__printBanner("fitting Siegel robust linear regression based on median", ds)
+         data = self.getNumericData(ds)
+         slope, intercept = sta.siegelslopes(data)
+         result = self.__printResult("slope", slope, "intercept", intercept)
+         if doPlot:
+             x = np.arange(len(data))
+             self.plotRegFit(x, data, slope, intercept)
+         return result
+
+     def fitTheilSenRobustLinearReg(self, ds, doPlot=False):
+         """
+         Theil-Sen robust linear regression fit based on median
+
+         Parameters
+             ds : data set name or list or numpy array
+             doPlot : true if plotting needed
+         """
+         self.__printBanner("fitting Theil-Sen robust linear regression based on median", ds)
+         data = self.getNumericData(ds)
+         slope, intercept, loSlope, upSlope = sta.theilslopes(data)
+         result = self.__printResult("slope", slope, "intercept", intercept, "lower slope", loSlope, "upper slope", upSlope)
+         if doPlot:
+             x = np.arange(len(data))
+             self.plotRegFit(x, data, slope, intercept)
+         return result
+
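A sketch contrasting the three fits above on data with a contaminated tail; the median based estimators should stay near the true slope of 2 while ordinary least squares is pulled away.

import numpy as np
import scipy.stats as sta

rng = np.random.default_rng(5)
x = np.arange(50, dtype=float)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=50)
y[45:] += 40.0  # contaminate the tail
print(sta.linregress(x, y).slope)  # biased upward by the outliers
print(sta.siegelslopes(y, x)[0])   # robust, near 2
print(sta.theilslopes(y, x)[0])    # robust, near 2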
+     def plotRegFit(self, x, y, slope, intercept):
+         """
+         plots linear regression fit line
+
+         Parameters
+             x : x values
+             y : y values
+             slope : slope
+             intercept : intercept
+         """
+         self.__printBanner("plotting linear regression fit line")
+         fig = plt.figure()
+         ax = fig.add_subplot(111)
+         ax.plot(x, y, "b.")
+         ax.plot(x, intercept + slope * x, "r-")
+         plt.show()
+
+     def getRegFit(self, xvalues, yvalues, slope, intercept):
+         """
+         gets fitted line and residue
+
+         Parameters
+             xvalues : x values
+             yvalues : y values
+             slope : regression slope
+             intercept : regression intercept
+         """
+         yfit = list()
+         residue = list()
+         for x, y in zip(xvalues, yvalues):
+             yf = x * slope + intercept
+             yfit.append(yf)
+             r = y - yf
+             residue.append(r)
+         result = self.__printResult("fitted line", yfit, "residue", residue)
+         return result
+
+     def getInfluentialPoints(self, dsx, dsy):
+         """
+         gets influential points in regression model with Cook's distance
+
+         Parameters
+             dsx : data set name or list or numpy array for x
+             dsy : data set name or list or numpy array for y
+         """
+         self.__printBanner("finding influential points for linear regression", dsx, dsy)
+         y = self.getNumericData(dsy)
+         x = np.arange(len(y)) if dsx is None else self.getNumericData(dsx)
+         model = sm.OLS(y, x).fit()
+         np.set_printoptions(suppress=True)
+         influence = model.get_influence()
+         cooks = influence.cooks_distance
+         result = self.__printResult("Cook distance", cooks)
+         return result
+
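A hedged follow-up sketch: a common rule of thumb, not prescribed by this module, flags observations with Cook's distance above 4/n as influential.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = np.arange(30, dtype=float)
y = 3.0 * x + rng.normal(0, 1.0, size=30)
y[10] += 25.0  # one planted influential point
cooks = sm.OLS(y, sm.add_constant(x)).fit().get_influence().cooks_distance[0]
print(np.where(cooks > 4.0 / len(x))[0])  # expected to include index 10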
+     def getCovar(self, *dsl):
+         """
+         gets covariance
+
+         Parameters
+             dsl : list of data set name or list or numpy array
+         """
+         self.__printBanner("getting covariance", *dsl)
+         data = list(map(lambda ds : self.getNumericData(ds), dsl))
+         self.ensureSameSize(data)
+         data = np.vstack(data)
+         cv = np.cov(data)
+         print(cv)
+         return cv
+
+     def getPearsonCorr(self, ds1, ds2, sigLev=.05):
+         """
+         gets pearson correlation coefficient
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("getting pearson correlation coefficient", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         self.ensureSameSize([data1, data2])
+         stat, pvalue = sta.pearsonr(data1, data2)
+         result = self.__printResult("stat", stat, "pvalue", pvalue)
+         self.__printStat(stat, pvalue, "probably uncorrelated", "probably correlated", sigLev)
+         return result
+
+
+     def getSpearmanRankCorr(self, ds1, ds2, sigLev=.05):
+         """
+         gets spearman correlation coefficient
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("getting spearman correlation coefficient", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         self.ensureSameSize([data1, data2])
+         stat, pvalue = sta.spearmanr(data1, data2)
+         result = self.__printResult("stat", stat, "pvalue", pvalue)
+         self.__printStat(stat, pvalue, "probably uncorrelated", "probably correlated", sigLev)
+         return result
+
+     def getKendalRankCorr(self, ds1, ds2, sigLev=.05):
+         """
+         Kendall's tau, a correlation measure for ordinal data
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("getting Kendall's tau, a correlation measure for ordinal data", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         self.ensureSameSize([data1, data2])
+         stat, pvalue = sta.kendalltau(data1, data2)
+         result = self.__printResult("stat", stat, "pvalue", pvalue)
+         self.__printStat(stat, pvalue, "probably uncorrelated", "probably correlated", sigLev)
+         return result
+
+     def getPointBiserialCorr(self, ds1, ds2, sigLev=.05):
+         """
+         point biserial correlation between binary and numeric
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("getting point biserial correlation between binary and numeric", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         assert isBinary(data1), "first data set is not binary"
+         self.ensureSameSize([data1, data2])
+         stat, pvalue = sta.pointbiserialr(data1, data2)
+         result = self.__printResult("stat", stat, "pvalue", pvalue)
+         self.__printStat(stat, pvalue, "probably uncorrelated", "probably correlated", sigLev)
+         return result
+
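A sketch of why the rank based measures above differ from Pearson: on monotonic but nonlinear data the rank correlation is exactly 1 while the linear correlation is not.

import numpy as np
import scipy.stats as sta

x = np.linspace(1.0, 5.0, 100)
y = np.exp(x)  # monotonic but nonlinear
print(sta.pearsonr(x, y)[0])   # noticeably below 1
print(sta.spearmanr(x, y)[0])  # exactly 1.0, the ranks agree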
+     def getConTab(self, ds1, ds2):
+         """
+         gets contingency table for categorical data pair
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+         """
+         self.__printBanner("getting contingency table for categorical data", ds1, ds2)
+         data1 = self.getCatData(ds1)
+         data2 = self.getCatData(ds2)
+         self.ensureSameSize([data1, data2])
+         crosstab = pd.crosstab(pd.Series(data1), pd.Series(data2), margins=False)
+         ctab = crosstab.values
+         print("contingency table")
+         print(ctab)
+         return ctab
+
+     def getChiSqCorr(self, ds1, ds2, sigLev=.05):
+         """
+         chi square correlation for categorical data pair
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("getting chi square correlation for two categorical", ds1, ds2)
+         ctab = self.getConTab(ds1, ds2)
+         stat, pvalue, dof, expctd = sta.chi2_contingency(ctab)
+         result = self.__printResult("stat", stat, "pvalue", pvalue, "dof", dof, "expected", expctd)
+         self.__printStat(stat, pvalue, "probably uncorrelated", "probably correlated", sigLev)
+         return result
+
+     def getSizeCorrectChiSqCorr(self, ds1, ds2, chisq):
+         """
+         Cramer's V, size corrected chi square correlation for categorical data pair
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             chisq : chisq stat
+         """
+         self.__printBanner("getting size corrected chi square correlation for two categorical", ds1, ds2)
+         c1 = self.getCatUniqueValueCounts(ds1)["cardinality"]
+         c2 = self.getCatUniqueValueCounts(ds2)["cardinality"]
+         c = min(c1, c2)
+         assertGreater(c, 1, "min cardinality should be greater than 1")
+         l = len(self.getCatData(ds1))
+         t = l * (c - 1)
+         stat = math.sqrt(chisq / t)
+         result = self.__printResult("stat", stat)
+         return result
+
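A self-contained sketch of the Cramer's V computation above, starting from a contingency table; the table values are made up for illustration.

import math
import numpy as np
import scipy.stats as sta

ctab = np.array([[30, 10], [10, 30]])
chisq = sta.chi2_contingency(ctab)[0]
n = ctab.sum()
c = min(ctab.shape)  # smaller cardinality
print(math.sqrt(chisq / (n * (c - 1))))  # Cramer's V, in [0, 1]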
+     def getAnovaCorr(self, ds1, ds2, grByCol, sigLev=.05):
+         """
+         anova correlation between numerical and categorical
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             grByCol : group by column
+             sigLev : statistical significance level
+         """
+         self.__printBanner("anova correlation between numerical and categorical", ds1, ds2)
+         df = self.loadCatFloatDataFrame(ds1, ds2) if grByCol == 0 else self.loadCatFloatDataFrame(ds2, ds1)
+         grByCol = 0
+         dCol = 1
+         grouped = df.groupby([grByCol])
+         dlist = list(map(lambda v : v[1].loc[:, dCol].values, grouped))
+         stat, pvalue = sta.f_oneway(*dlist)
+         result = self.__printResult("stat", stat, "pvalue", pvalue)
+         self.__printStat(stat, pvalue, "probably uncorrelated", "probably correlated", sigLev)
+         return result
+
+
+     def plotAutoCorr(self, ds, lags, alpha, diffOrder=0):
+         """
+         plots auto correlation
+
+         Parameters
+             ds : data set name or list or numpy array
+             lags : num of lags
+             alpha : confidence level
+             diffOrder : order of differencing applied before plotting
+         """
+         self.__printBanner("plotting auto correlation", ds)
+         data = self.getNumericData(ds)
+         ddata = difference(data, diffOrder) if diffOrder > 0 else data
+         tsaplots.plot_acf(ddata, lags=lags, alpha=alpha)
+         plt.show()
+
+     def getAutoCorr(self, ds, lags, alpha=.05):
+         """
+         gets auto correlation
+
+         Parameters
+             ds : data set name or list or numpy array
+             lags : num of lags
+             alpha : confidence level
+         """
+         self.__printBanner("getting auto correlation", ds)
+         data = self.getNumericData(ds)
+         autoCorr, confIntv = stt.acf(data, nlags=lags, fft=False, alpha=alpha)
+         result = self.__printResult("autoCorr", autoCorr, "confIntv", confIntv)
+         return result
+
+
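A sketch of the underlying statsmodels call on a synthetic AR(1) series, whose autocorrelation should decay roughly as 0.8 raised to the lag.

import numpy as np
import statsmodels.tsa.stattools as stt

rng = np.random.default_rng(11)
x = np.zeros(500)
for i in range(1, 500):
    x[i] = 0.8 * x[i - 1] + rng.normal()  # AR(1) with coefficient 0.8
ac, conf = stt.acf(x, nlags=5, fft=False, alpha=0.05)
print(ac)  # roughly 1, 0.8, 0.64, ...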
+     def plotParAcf(self, ds, lags, alpha):
+         """
+         plots partial auto correlation
+
+         Parameters
+             ds : data set name or list or numpy array
+             lags : num of lags
+             alpha : confidence level
+         """
+         self.__printBanner("plotting partial auto correlation", ds)
+         data = self.getNumericData(ds)
+         tsaplots.plot_pacf(data, lags=lags, alpha=alpha)
+         plt.show()
+
+     def getParAutoCorr(self, ds, lags, alpha=.05):
+         """
+         gets partial auto correlation
+
+         Parameters
+             ds : data set name or list or numpy array
+             lags : num of lags
+             alpha : confidence level
+         """
+         self.__printBanner("getting partial auto correlation", ds)
+         data = self.getNumericData(ds)
+         partAutoCorr, confIntv = stt.pacf(data, nlags=lags, alpha=alpha)
+         result = self.__printResult("partAutoCorr", partAutoCorr, "confIntv", confIntv)
+         return result
+
+     def getHurstExp(self, ds, kind, doPlot=True):
+         """
+         gets Hurst exponent of time series
+
+         Parameters
+             ds : data set name or list or numpy array
+             kind : kind of data change, random_walk, price
+             doPlot : True for plot
+         """
+         self.__printBanner("getting Hurst exponent", ds)
+         data = self.getNumericData(ds)
+         h, c, odata = hurst.compute_Hc(data, kind=kind, simplified=False)
+         if doPlot:
+             f, ax = plt.subplots()
+             ax.plot(odata[0], c * odata[0] ** h, color="deepskyblue")
+             ax.scatter(odata[0], odata[1], color="purple")
+             ax.set_xscale("log")
+             ax.set_yscale("log")
+             ax.set_xlabel("time interval")
+             ax.set_ylabel("cum dev range and std dev ratio")
+             ax.grid(True)
+             plt.show()
+
+         result = self.__printResult("hurstExponent", h, "hurstConstant", c)
+         return result
+
+     def approxEntropy(self, ds, m, r):
+         """
+         gets approximate entropy of time series (ref: wikipedia)
+
+         Parameters
+             ds : data set name or list or numpy array
+             m : length of compared run of data
+             r : filtering level
+         """
+         self.__printBanner("getting approximate entropy", ds)
+         ldata = self.getNumericData(ds)
+         aent = abs(self.__phi(ldata, m + 1, r) - self.__phi(ldata, m, r))
+         result = self.__printResult("approxEntropy", aent)
+         return result
+
+     def __phi(self, ldata, m, r):
+         """
+         phi function for approximate entropy
+
+         Parameters
+             ldata : data array
+             m : length of compared run of data
+             r : filtering level
+         """
+         le = len(ldata)
+         x = [[ldata[j] for j in range(i, i + m)] for i in range(le - m + 1)]
+         lex = len(x)
+         c = list()
+         for i in range(lex):
+             cnt = 0
+             for j in range(lex):
+                 cnt += (1 if maxListDist(x[i], x[j]) <= r else 0)
+             cnt /= (le - m + 1.0)
+             c.append(cnt)
+         return sum(np.log(c)) / (le - m + 1.0)
+
+
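A standalone sketch of the same approximate entropy definition; a regular signal should score lower than noise. The helper below is an independent reimplementation for illustration, not the module's code.

import numpy as np

def apen(u, m, r):
    #approximate entropy per the standard definition
    n = len(u)
    def phi(m):
        x = np.array([u[i:i + m] for i in range(n - m + 1)])
        c = [(np.max(np.abs(x - xi), axis=1) <= r).mean() for xi in x]  # Chebyshev matches
        return np.mean(np.log(c))
    return abs(phi(m + 1) - phi(m))

t = np.arange(300)
print(apen(np.sin(0.3 * t), 2, 0.2))  # regular, small
print(apen(np.random.default_rng(13).normal(size=300), 2, 0.2))  # irregular, larger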
+     def oneSpaceEntropy(self, ds, scaMethod="zscale"):
+         """
+         gets one space entropy (ref: Estimating mutual information by Kraskov)
+
+         Parameters
+             ds : data set name or list or numpy array
+             scaMethod : scaling method
+         """
+         self.__printBanner("getting one space entropy", ds)
+         data = self.getNumericData(ds)
+         sdata = sorted(data)
+         sdata = scaleData(sdata, scaMethod)
+         su = 0
+         n = len(sdata)
+         for i in range(1, n, 1):
+             t = abs(sdata[i] - sdata[i-1])
+             if t > 0:
+                 su += log(t)
+         su /= (n - 1)
+         ose = digammaFun(n) - digammaFun(1) + su
+         result = self.__printResult("entropy", ose)
+         return result
+
+
+     def plotCrossCorr(self, ds1, ds2, normed, lags):
+         """
+         plots cross correlation
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             normed : if True, input vectors are normalised to unit length
+             lags : num of lags
+         """
+         self.__printBanner("plotting cross correlation between two numeric", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         self.ensureSameSize([data1, data2])
+         plt.xcorr(data1, data2, normed=normed, maxlags=lags)
+         plt.show()
+
+     def getCrossCorr(self, ds1, ds2):
+         """
+         gets cross correlation
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+         """
+         self.__printBanner("getting cross correlation", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         self.ensureSameSize([data1, data2])
+         crossCorr = stt.ccf(data1, data2)
+         result = self.__printResult("crossCorr", crossCorr)
+         return result
+
+     def getFourierTransform(self, ds):
+         """
+         gets fast fourier transform
+
+         Parameters
+             ds : data set name or list or numpy array
+         """
+         self.__printBanner("getting fourier transform", ds)
+         data = self.getNumericData(ds)
+         ft = np.fft.rfft(data)
+         result = self.__printResult("fourierTransform", ft)
+         return result
+
+
+     def testStationaryAdf(self, ds, regression, autolag, sigLev=.05):
+         """
+         ADF stationarity test, null hypothesis is not stationary
+
+         Parameters
+             ds : data set name or list or numpy array
+             regression : constant and trend order to include in regression
+             autolag : method to use when automatically determining the lag
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing ADF stationary test", ds)
+         relist = ["c", "ct", "ctt", "nc"]
+         assert regression in relist, "invalid regression value"
+         alList = ["AIC", "BIC", "t-stat", None]
+         assert autolag in alList, "invalid autolag value"
+
+         data = self.getNumericData(ds)
+         re = stt.adfuller(data, regression=regression, autolag=autolag)
+         result = self.__printResult("stat", re[0], "pvalue", re[1], "num lags", re[2], "num observation for regression", re[3],
+         "critical values", re[4])
+         self.__printStat(re[0], re[1], "probably not stationary", "probably stationary", sigLev)
+         return result
+
+     def testStationaryKpss(self, ds, regression, nlags, sigLev=.05):
+         """
+         KPSS stationarity test, null hypothesis is stationary
+
+         Parameters
+             ds : data set name or list or numpy array
+             regression : constant and trend order to include in regression
+             nlags : no of lags
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing KPSS stationary test", ds)
+         relist = ["c", "ct"]
+         assert regression in relist, "invalid regression value"
+         nlList = [None, "auto", "legacy"]
+         assert nlags in nlList or type(nlags) == int, "invalid nlags value"
+
+         data = self.getNumericData(ds)
+         #the keyword is nlags in statsmodels 0.11 and later (it was lags before)
+         stat, pvalue, nLags, criticalValues = stt.kpss(data, regression=regression, nlags=nlags)
+         result = self.__printResult("stat", stat, "pvalue", pvalue, "num lags", nLags, "critical values", criticalValues)
+         self.__printStat(stat, pvalue, "probably stationary", "probably not stationary", sigLev)
+         return result
+
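A sketch of the stationarity diagnosis above: an ADF test on a random walk should fail to reject the unit root, while white noise rejects it decisively.

import numpy as np
import statsmodels.tsa.stattools as stt

rng = np.random.default_rng(17)
walk = np.cumsum(rng.normal(size=500))  # unit root, not stationary
noise = rng.normal(size=500)            # stationary
print(stt.adfuller(walk, regression="c", autolag="AIC")[1])   # large p-value
print(stt.adfuller(noise, regression="c", autolag="AIC")[1])  # tiny p-value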
+     def testNormalJarqBera(self, ds, sigLev=.05):
+         """
+         Jarque-Bera normality test
+
+         Parameters
+             ds : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing Jarque-Bera normality test", ds)
+         data = self.getNumericData(ds)
+         jb, jbpv, skew, kurtosis = sstt.jarque_bera(data)
+         result = self.__printResult("stat", jb, "pvalue", jbpv, "skew", skew, "kurtosis", kurtosis)
+         self.__printStat(jb, jbpv, "probably gaussian", "probably not gaussian", sigLev)
+         return result
+
+
+     def testNormalShapWilk(self, ds, sigLev=.05):
+         """
+         Shapiro-Wilk normality test
+
+         Parameters
+             ds : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing Shapiro-Wilk normality test", ds)
+         data = self.getNumericData(ds)
+         stat, pvalue = sta.shapiro(data)
+         result = self.__printResult("stat", stat, "pvalue", pvalue)
+         self.__printStat(stat, pvalue, "probably gaussian", "probably not gaussian", sigLev)
+         return result
+
+     def testNormalDagast(self, ds, sigLev=.05):
+         """
+         D'Agostino's K square normality test
+
+         Parameters
+             ds : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing D'Agostino's K square normality test", ds)
+         data = self.getNumericData(ds)
+         stat, pvalue = sta.normaltest(data)
+         result = self.__printResult("stat", stat, "pvalue", pvalue)
+         self.__printStat(stat, pvalue, "probably gaussian", "probably not gaussian", sigLev)
+         return result
+
+     def testDistrAnderson(self, ds, dist, sigLev=.05):
+         """
+         Anderson test for normal, expon, logistic, gumbel, gumbel_l, gumbel_r
+
+         Parameters
+             ds : data set name or list or numpy array
+             dist : type of distribution
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing Anderson test for various distributions", ds)
+         diList = ["norm", "expon", "logistic", "gumbel", "gumbel_l", "gumbel_r", "extreme1"]
+         assert dist in diList, "invalid distribution"
+
+         data = self.getNumericData(ds)
+         #the distribution needs to be passed through, otherwise normal is always assumed
+         re = sta.anderson(data, dist=dist)
+         slAlpha = int(100 * sigLev)
+         msg = "significant value not found"
+         for i in range(len(re.critical_values)):
+             sl, cv = re.significance_level[i], re.critical_values[i]
+             if int(sl) == slAlpha:
+                 if re.statistic < cv:
+                     msg = "probably {} at the {:.3f} significance level".format(dist, sl)
+                 else:
+                     msg = "probably not {} at the {:.3f} significance level".format(dist, sl)
+         result = self.__printResult("stat", re.statistic, "test", msg)
+         print(msg)
+         return result
+
+     def testSkew(self, ds, sigLev=.05):
+         """
+         test skew wrt normal distr
+
+         Parameters
+             ds : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("testing skew wrt normal distr", ds)
+         data = self.getNumericData(ds)
+         stat, pvalue = sta.skewtest(data)
+         result = self.__printResult("stat", stat, "pvalue", pvalue)
+         self.__printStat(stat, pvalue, "probably same skew as normal distribution", "probably not same skew as normal distribution", sigLev)
+         return result
+
+     def testTwoSampleStudent(self, ds1, ds2, sigLev=.05):
+         """
+         student t 2 sample test
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing student t 2 sample test", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         stat, pvalue = sta.ttest_ind(data1, data2)
+         result = self.__printResult("stat", stat, "pvalue", pvalue)
+         self.__printStat(stat, pvalue, "probably same distribution", "probably not same distribution", sigLev)
+         return result
+
+     def testTwoSampleKs(self, ds1, ds2, sigLev=.05):
+         """
+         Kolmogorov Smirnov 2 sample statistic
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing Kolmogorov Smirnov 2 sample test", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         stat, pvalue = sta.ks_2samp(data1, data2)
+         result = self.__printResult("stat", stat, "pvalue", pvalue)
+         self.__printStat(stat, pvalue, "probably same distribution", "probably not same distribution", sigLev)
+         return result
+
+
+     def testTwoSampleMw(self, ds1, ds2, sigLev=.05):
+         """
+         Mann-Whitney 2 sample statistic
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing Mann-Whitney 2 sample test", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         stat, pvalue = sta.mannwhitneyu(data1, data2)
+         result = self.__printResult("stat", stat, "pvalue", pvalue)
+         self.__printStat(stat, pvalue, "probably same distribution", "probably not same distribution", sigLev)
+         return result
+
+     def testTwoSampleWilcox(self, ds1, ds2, sigLev=.05):
+         """
+         Wilcoxon Signed-Rank 2 sample statistic
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing Wilcoxon Signed-Rank 2 sample test", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         stat, pvalue = sta.wilcoxon(data1, data2)
+         result = self.__printResult("stat", stat, "pvalue", pvalue)
+         self.__printStat(stat, pvalue, "probably same distribution", "probably not same distribution", sigLev)
+         return result
+
+
+     def testTwoSampleKw(self, ds1, ds2, sigLev=.05):
+         """
+         Kruskal-Wallis 2 sample statistic
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing Kruskal-Wallis 2 sample test", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         stat, pvalue = sta.kruskal(data1, data2)
+         result = self.__printResult("stat", stat, "pvalue", pvalue)
+         self.__printStat(stat, pvalue, "probably same distribution", "probably not same distribution", sigLev)
+         return result
+
+     def testTwoSampleFriedman(self, ds1, ds2, ds3, sigLev=.05):
+         """
+         Friedman statistic for 3 samples
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             ds3 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing Friedman test for 3 samples", ds1, ds2, ds3)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         data3 = self.getNumericData(ds3)
+         stat, pvalue = sta.friedmanchisquare(data1, data2, data3)
+         result = self.__printResult("stat", stat, "pvalue", pvalue)
+         self.__printStat(stat, pvalue, "probably same distribution", "probably not same distribution", sigLev)
+         return result
+
+     def testTwoSampleEs(self, ds1, ds2, sigLev=.05):
+         """
+         Epps Singleton 2 sample statistic
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing Epps Singleton 2 sample test", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         stat, pvalue = sta.epps_singleton_2samp(data1, data2)
+         result = self.__printResult("stat", stat, "pvalue", pvalue)
+         self.__printStat(stat, pvalue, "probably same distribution", "probably not same distribution", sigLev)
+         return result
+
+     def testTwoSampleAnderson(self, ds1, ds2, sigLev=.05):
+         """
+         Anderson 2 sample statistic
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing Anderson 2 sample test", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         dseq = (data1, data2)
+         stat, critValues, sLev = sta.anderson_ksamp(dseq)
+         slAlpha = 100 * sigLev
+
+         if slAlpha == 10:
+             cv = critValues[1]
+         elif slAlpha == 5:
+             cv = critValues[2]
+         elif slAlpha == 2.5:
+             cv = critValues[3]
+         elif slAlpha == 1:
+             cv = critValues[4]
+         else:
+             cv = None
+
+         result = self.__printResult("stat", stat, "critValues", critValues, "critValue", cv, "significanceLevel", sLev)
+         print("stat: {:.3f}".format(stat))
+         if cv is None:
+             msg = "critical value not found for the provided significance level"
+         else:
+             if stat < cv:
+                 msg = "probably same distribution at the {:.3f} significance level".format(sigLev)
+             else:
+                 msg = "probably not same distribution at the {:.3f} significance level".format(sigLev)
+         print(msg)
+         return result
+
+
+     def testTwoSampleScaleAb(self, ds1, ds2, sigLev=.05):
+         """
+         Ansari Bradley 2 sample scale statistic
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing Ansari Bradley 2 sample scale test", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         stat, pvalue = sta.ansari(data1, data2)
+         result = self.__printResult("stat", stat, "pvalue", pvalue)
+         self.__printStat(stat, pvalue, "probably same scale", "probably not same scale", sigLev)
+         return result
+
+     def testTwoSampleScaleMood(self, ds1, ds2, sigLev=.05):
+         """
+         Mood 2 sample scale statistic
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing Mood 2 sample scale test", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         stat, pvalue = sta.mood(data1, data2)
+         result = self.__printResult("stat", stat, "pvalue", pvalue)
+         self.__printStat(stat, pvalue, "probably same scale", "probably not same scale", sigLev)
+         return result
+
+     def testTwoSampleVarBartlet(self, ds1, ds2, sigLev=.05):
+         """
+         Bartlett 2 sample variance statistic
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing Bartlett 2 sample variance test", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         stat, pvalue = sta.bartlett(data1, data2)
+         result = self.__printResult("stat", stat, "pvalue", pvalue)
+         self.__printStat(stat, pvalue, "probably same variance", "probably not same variance", sigLev)
+         return result
+
+     def testTwoSampleVarLevene(self, ds1, ds2, sigLev=.05):
+         """
+         Levene 2 sample variance statistic
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing Levene 2 sample variance test", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         stat, pvalue = sta.levene(data1, data2)
+         result = self.__printResult("stat", stat, "pvalue", pvalue)
+         self.__printStat(stat, pvalue, "probably same variance", "probably not same variance", sigLev)
+         return result
+
+     def testTwoSampleVarFk(self, ds1, ds2, sigLev=.05):
+         """
+         Fligner-Killeen 2 sample variance statistic
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing Fligner-Killeen 2 sample variance test", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         stat, pvalue = sta.fligner(data1, data2)
+         result = self.__printResult("stat", stat, "pvalue", pvalue)
+         self.__printStat(stat, pvalue, "probably same variance", "probably not same variance", sigLev)
+         return result
+
+     def testTwoSampleMedMood(self, ds1, ds2, sigLev=.05):
+         """
+         Mood 2 sample median statistic
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing Mood 2 sample median test", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         stat, pvalue, median, ctable = sta.median_test(data1, data2)
+         result = self.__printResult("stat", stat, "pvalue", pvalue, "median", median, "contingencyTable", ctable)
+         self.__printStat(stat, pvalue, "probably same median", "probably not same median", sigLev)
+         return result
+
+     def testTwoSampleZc(self, ds1, ds2, sigLev=.05):
+         """
+         Zhang-C 2 sample statistic
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing Zhang-C 2 sample test", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         l1 = len(data1)
+         l2 = len(data2)
+         l = l1 + l2
+
+         #find ranks
+         pooled = np.concatenate([data1, data2])
+         ranks = findRanks(data1, pooled)
+         ranks.extend(findRanks(data2, pooled))
+
+         s1 = 0.0
+         for i in range(1, l1+1):
+             s1 += math.log(l1 / (i - 0.5) - 1.0) * math.log(l / (ranks[i-1] - 0.5) - 1.0)
+
+         s2 = 0.0
+         for i in range(1, l2+1):
+             s2 += math.log(l2 / (i - 0.5) - 1.0) * math.log(l / (ranks[l1 + i - 1] - 0.5) - 1.0)
+         stat = (s1 + s2) / l
+         print(formatFloat(3, stat, "stat:"))
+         return stat
+
+     def testTwoSampleZa(self, ds1, ds2, sigLev=.05):
+         """
+         Zhang-A 2 sample statistic
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing Zhang-A 2 sample test", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         l1 = len(data1)
+         l2 = len(data2)
+         l = l1 + l2
+         pooled = np.concatenate([data1, data2])
+         cd1 = CumDistr(data1)
+         cd2 = CumDistr(data2)
+         asum = 0.0
+         for i in range(1, l+1):
+             v = pooled[i-1]
+             f1 = cd1.getDistr(v)
+             f2 = cd2.getDistr(v)
+
+             #guard the end points where a distribution value of 0 or 1 would make the log blow up
+             t1 = 0 if f1 == 0 else f1 * math.log(f1)
+             t2 = 0 if f1 == 1.0 else (1.0 - f1) * math.log(1.0 - f1)
+             asum += l1 * (t1 + t2) / ((i - 0.5) * (l - i + 0.5))
+             t1 = 0 if f2 == 0 else f2 * math.log(f2)
+             t2 = 0 if f2 == 1.0 else (1.0 - f2) * math.log(1.0 - f2)
+             asum += l2 * (t1 + t2) / ((i - 0.5) * (l - i + 0.5))
+         stat = -asum
+         print(formatFloat(3, stat, "stat:"))
+         return stat
+
+     def testTwoSampleZk(self, ds1, ds2, sigLev=.05):
+         """
+         Zhang-K 2 sample statistic
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing Zhang-K 2 sample test", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         l1 = len(data1)
+         l2 = len(data2)
+         l = l1 + l2
+         pooled = np.concatenate([data1, data2])
+         cd1 = CumDistr(data1)
+         cd2 = CumDistr(data2)
+         cd = CumDistr(pooled)
+
+         maxStat = None
+         for i in range(1, l+1):
+             v = pooled[i-1]
+             f1 = cd1.getDistr(v)
+             f2 = cd2.getDistr(v)
+             f = cd.getDistr(v)
+
+             t1 = 0 if f1 == 0 else f1 * math.log(f1 / f)
+             t2 = 0 if f1 == 1.0 else (1.0 - f1) * math.log((1.0 - f1) / (1.0 - f))
+             stat = l1 * (t1 + t2)
+             t1 = 0 if f2 == 0 else f2 * math.log(f2 / f)
+             t2 = 0 if f2 == 1.0 else (1.0 - f2) * math.log((1.0 - f2) / (1.0 - f))
+             stat += l2 * (t1 + t2)
+             if maxStat is None or stat > maxStat:
+                 maxStat = stat
+         print(formatFloat(3, maxStat, "stat:"))
+         return maxStat
+
+
+     def testTwoSampleCvm(self, ds1, ds2, sigLev=.05):
+         """
+         2 sample Cramer von Mises statistic
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+             sigLev : statistical significance level
+         """
+         self.__printBanner("doing 2 sample CVM test", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         data = np.concatenate((data1, data2))
+         rdata = sta.rankdata(data)
+         n = len(data1)
+         m = len(data2)
+         l = n + m
+
+         s1 = 0
+         for i in range(n):
+             t = rdata[i] - (i+1)
+             s1 += (t * t)
+         s1 *= n
+
+         s2 = 0
+         for i in range(m):
+             t = rdata[i + n] - (i+1)
+             s2 += (t * t)
+         s2 *= m
+
+         u = s1 + s2
+         stat = u / (n * m * l) - (4 * m * n - 1) / (6 * l)
+         result = self.__printResult("stat", stat)
+         return result
+
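A quick sketch of the two-sample machinery in this block: a half-sigma shift is comfortably detected by the Kolmogorov-Smirnov test at these sample sizes, while two draws from the same distribution are not.

import numpy as np
import scipy.stats as sta

rng = np.random.default_rng(19)
a = rng.normal(0.0, 1.0, size=300)
b = rng.normal(0.5, 1.0, size=300)  # shifted by half a standard deviation
print(sta.ks_2samp(a, b).pvalue)                               # small
print(sta.ks_2samp(a, rng.normal(0.0, 1.0, size=300)).pvalue)  # large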
+     def ensureSameSize(self, dlist):
+         """
+         ensures all data sets are of same size
+
+         Parameters
+             dlist : data source list
+         """
+         le = None
+         for d in dlist:
+             cle = len(d)
+             if le is None:
+                 le = cle
+             else:
+                 assert cle == le, "all data sets need to be of same size"
+
+
+     def testTwoSampleWasserstein(self, ds1, ds2):
+         """
+         Wasserstein 2 sample statistic
+
+         Parameters
+             ds1 : data set name or list or numpy array
+             ds2 : data set name or list or numpy array
+         """
+         self.__printBanner("doing Wasserstein distance 2 sample test", ds1, ds2)
+         data1 = self.getNumericData(ds1)
+         data2 = self.getNumericData(ds2)
+         stat = sta.wasserstein_distance(data1, data2)
+         sd = np.std(np.concatenate([data1, data2]))
+         nstat = stat / sd
+         result = self.__printResult("stat", stat, "normalizedStat", nstat)
+         return result
+
+     def getMaxRelMinRedFeatures(self, fdst, tdst, nfeatures, nbins=20):
+         """
+         gets top n features based on max relevance and min redundancy algorithm
+
+         Parameters
+             fdst : list of pair of data set name or list or numpy array and data type
+             tdst : target data set name or list or numpy array and data type (cat for classification num for regression)
+             nfeatures : desired no of features
+             nbins : no of bins for numerical data
+         """
+         self.__printBanner("doing max relevance min redundancy feature selection")
+         return self.getMutInfoFeatures(fdst, tdst, nfeatures, "mrmr", nbins)
+
+     def getJointMutInfoFeatures(self, fdst, tdst, nfeatures, nbins=20):
+         """
+         gets top n features based on joint mutual information algorithm
+
+         Parameters
+             fdst : list of pair of data set name or list or numpy array and data type
+             tdst : target data set name or list or numpy array and data type (cat for classification num for regression)
+             nfeatures : desired no of features
+             nbins : no of bins for numerical data
+         """
+         self.__printBanner("doing joint mutual info feature selection")
+         return self.getMutInfoFeatures(fdst, tdst, nfeatures, "jmi", nbins)
+
+     def getCondMutInfoMaxFeatures(self, fdst, tdst, nfeatures, nbins=20):
+         """
+         gets top n features based on conditional mutual information maximization algorithm
+
+         Parameters
+             fdst : list of pair of data set name or list or numpy array and data type
+             tdst : target data set name or list or numpy array and data type (cat for classification num for regression)
+             nfeatures : desired no of features
+             nbins : no of bins for numerical data
+         """
+         self.__printBanner("doing conditional mutual info max feature selection")
+         return self.getMutInfoFeatures(fdst, tdst, nfeatures, "cmim", nbins)
+
+     def getInteractCapFeatures(self, fdst, tdst, nfeatures, nbins=20):
+         """
+         gets top n features based on interaction capping algorithm
+
+         Parameters
+             fdst : list of pair of data set name or list or numpy array and data type
+             tdst : target data set name or list or numpy array and data type (cat for classification num for regression)
+             nfeatures : desired no of features
+             nbins : no of bins for numerical data
+         """
+         self.__printBanner("doing interaction capped feature selection")
+         return self.getMutInfoFeatures(fdst, tdst, nfeatures, "icap", nbins)
+
+     def getMutInfoFeatures(self, fdst, tdst, nfeatures, algo, nbins=20):
+         """
+         gets top n features based on various mutual information based algorithms
+         ref: Conditional likelihood maximisation : A unifying framework for information
+         theoretic feature selection, Gavin Brown
+
+         Parameters
+             fdst : list of pair of data set name or list or numpy array and data type
+             tdst : target data set name or list or numpy array and data type (cat for classification num for regression)
+             nfeatures : desired no of features
+             algo : mi based feature selection algorithm
+             nbins : no of bins for numerical data
+         """
+         #verify data source types
+         le = len(fdst)
+         nfeatGiven = int(le / 2)
+         assertGreater(nfeatGiven, nfeatures, "no of available features should be greater than no of features to be selected")
+         fds = list()
+         types = ["num", "cat"]
+         for i in range(0, le, 2):
+             ds = fdst[i]
+             dt = fdst[i+1]
+             assertInList(dt, types, "invalid type for data source " + dt)
+             data = self.getNumericData(ds) if dt == "num" else self.getCatData(ds)
+             p = (ds, dt)
+             fds.append(p)
+         algos = ["mrmr", "jmi", "cmim", "icap"]
+         assertInList(algo, algos, "invalid feature selection algo " + algo)
+
+         assertInList(tdst[1], types, "invalid type for data source " + tdst[1])
+         data = self.getNumericData(tdst[0]) if tdst[1] == "num" else self.getCatData(tdst[0])
+
+         sfds = list()
+         selected = set()
+         relevancies = dict()
+         for i in range(nfeatures):
+             scorem = None
+             dsm = None
+             dsmt = None
+             for ds, dt in fds:
+                 if ds not in selected:
+                     #relevancy
+                     if ds in relevancies:
+                         mutInfo = relevancies[ds]
+                     else:
+                         mutInfo = self.getMutualInfo([ds, dt, tdst[0], tdst[1]], nbins)["mutInfo"]
+                         relevancies[ds] = mutInfo
+                     relev = mutInfo
+
+                     #redundancy with respect to already selected features
+                     reds = list()
+                     for sds, sdt, _ in sfds:
+                         mutInfo = self.getMutualInfo([ds, dt, sds, sdt], nbins)["mutInfo"]
+                         mutInfoCnd = self.getCondMutualInfo([ds, dt, sds, sdt, tdst[0], tdst[1]], nbins)["condMutInfo"] \
+                         if algo != "mrmr" else 0
+
+                         red = mutInfo - mutInfoCnd
+                         reds.append(red)
+
+                     if algo == "mrmr" or algo == "jmi":
+                         redun = sum(reds) / len(sfds) if len(sfds) > 0 else 0
+                     elif algo == "cmim" or algo == "icap":
+                         redun = max(reds) if len(sfds) > 0 else 0
+                         if algo == "icap":
+                             redun = max(0, redun)
+                     score = relev - redun
+                     if scorem is None or score > scorem:
+                         scorem = score
+                         dsm = ds
+                         dsmt = dt
+
+             pa = (dsm, dsmt, scorem)
+             sfds.append(pa)
+             selected.add(dsm)
+
+         selFeatures = list(map(lambda r : (r[0], r[2]), sfds))
+         result = self.__printResult("selFeatures", selFeatures)
+         return result
+
+
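A self-contained sketch of the relevance-minus-redundancy score that drives the selection loop above, using scikit-learn's mutual_info_score on synthetic labels; after an informative feature f1 is selected, an exact copy of it scores worse than pure noise.

import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(23)
t = rng.integers(0, 2, size=1000)  # binary target
f1 = t ^ (rng.random(1000) < 0.1)  # informative feature, selected first
f2 = f1.copy()                     # redundant copy
f3 = rng.integers(0, 2, size=1000) # noise
selected = [f1]
for name, f in [("f2", f2), ("f3", f3)]:
    relev = mutual_info_score(f, t)
    redun = np.mean([mutual_info_score(f, s) for s in selected])
    print(name, relev - redun)     # f3 beats the redundant f2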
+     def getFastCorrFeatures(self, fdst, tdst, delta, nbins=20):
+         """
+         gets top features based on Fast Correlation Based Filter (FCBF)
+         ref: Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution,
+         Lei Yu
+
+         Parameters
+             fdst : list of pair of data set name or list or numpy array and data type
+             tdst : target data set name or list or numpy array and data type (cat for classification num for regression)
+             delta : feature, target correlation threshold
+             nbins : no of bins for numerical data
+         """
+         le = len(fdst)
+         nfeatGiven = int(le / 2)
+         fds = list()
+         types = ["num", "cat"]
+         for i in range(0, le, 2):
+             ds = fdst[i]
+             dt = fdst[i+1]
+             assertInList(dt, types, "invalid type for data source " + dt)
+             data = self.getNumericData(ds) if dt == "num" else self.getCatData(ds)
+             p = (ds, dt)
+             fds.append(p)
+
+         assertInList(tdst[1], types, "invalid type for data source " + tdst[1])
+         data = self.getNumericData(tdst[0]) if tdst[1] == "num" else self.getCatData(tdst[0])
+
+         #get features with symmetric uncertainty above threshold
+         tentr = self.getAnyEntropy(tdst[0], tdst[1], nbins)["entropy"]
+         rfeatures = list()
+         fentrs = dict()
+         for ds, dt in fds:
+             mutInfo = self.getMutualInfo([ds, dt, tdst[0], tdst[1]], nbins)["mutInfo"]
+             fentr = self.getAnyEntropy(ds, dt, nbins)["entropy"]
+             sunc = 2 * mutInfo / (tentr + fentr)
+             if sunc >= delta:
+                 f = [ds, dt, sunc, False]
+                 rfeatures.append(f)
+                 fentrs[ds] = fentr
+
+         #sort descending by symmetric uncertainty
+         rfeatures.sort(key=lambda e : e[2], reverse=True)
+
+         #discard redundant features
+         le = len(rfeatures)
+         for i in range(le):
+             if rfeatures[i][3]:
+                 continue
+             for j in range(i+1, le, 1):
+                 if rfeatures[j][3]:
+                     continue
+                 mutInfo = self.getMutualInfo([rfeatures[i][0], rfeatures[i][1], rfeatures[j][0], rfeatures[j][1]], nbins)["mutInfo"]
+                 sunc = 2 * mutInfo / (fentrs[rfeatures[i][0]] + fentrs[rfeatures[j][0]])
+                 if sunc >= rfeatures[j][2]:
+                     rfeatures[j][3] = True
+
+         frfeatures = list(filter(lambda f : not f[3], rfeatures))
+         selFeatures = list(map(lambda f : [f[0], f[2]], frfeatures))
+         result = self.__printResult("selFeatures", selFeatures)
+         return result
+
+     def getInfoGainFeatures(self, fdst, tdst, nfeatures, nsplit, nbins=20):
+         """
+         gets top n features based on information gain or entropy loss
+
+         Parameters
+             fdst : list of pair of data set name or list or numpy array and data type
+             tdst : target data set name or list or numpy array and data type (cat for classification num for regression)
+             nsplit : num of splits
+             nfeatures : desired no of features
+             nbins : no of bins for numerical data
+         """
+         le = len(fdst)
+         nfeatGiven = int(le / 2)
+         assertGreater(nfeatGiven, nfeatures, "available features should be greater than desired")
+         fds = list()
+         types = ["num", "cat"]
+         for i in range(0, le, 2):
+             ds = fdst[i]
+             dt = fdst[i+1]
+             assertInList(dt, types, "invalid type for data source " + dt)
+             data = self.getNumericData(ds) if dt == "num" else self.getCatData(ds)
+             p = (ds, dt)
+             fds.append(p)
+
+         assertInList(tdst[1], types, "invalid type for data source " + tdst[1])
+         assertGreater(nsplit, 3, "minimum 4 splits necessary")
+         tdata = self.getNumericData(tdst[0]) if tdst[1] == "num" else self.getCatData(tdst[0])
+         tentr = self.getAnyEntropy(tdst[0], tdst[1], nbins)["entropy"]
+         sz = len(tdata)
+
+         sfds = list()
+         for ds, dt in fds:
+             if dt == "num":
+                 fd = self.getNumericData(ds)
+                 _, _, vmax, vmin = self.__getBasicStats(fd)
+                 intv = (vmax - vmin) / nsplit
+                 maxig = None
+                 spmin = vmin + intv
+                 spmax = vmax - 0.9 * intv
+
+                 #iterate all splits
+                 for sp in np.arange(spmin, spmax, intv):
+                     ltvals = list()
+                     gevals = list()
+                     for i in range(len(fd)):
+                         if fd[i] < sp:
+                             ltvals.append(tdata[i])
+                         else:
+                             gevals.append(tdata[i])
+
+                     self.addListNumericData(ltvals, "spds") if tdst[1] == "num" else self.addListCatData(ltvals, "spds")
+                     lten = self.getAnyEntropy("spds", tdst[1], nbins)["entropy"]
+                     self.addListNumericData(gevals, "spds") if tdst[1] == "num" else self.addListCatData(gevals, "spds")
+                     geen = self.getAnyEntropy("spds", tdst[1], nbins)["entropy"]
+
+                     #info gain
+                     ig = tentr - (len(ltvals) * lten / sz + len(gevals) * geen / sz)
+                     if maxig is None or ig > maxig:
+                         maxig = ig
+
+                 pa = (ds, maxig)
+                 sfds.append(pa)
+             else:
+                 fd = self.getCatData(ds)
+                 #use a separate name for the unique values so the feature list fds is not clobbered
+                 fvals = set(fd)
+                 fdps = genPowerSet(fvals)
+                 maxig = None
+
+                 #iterate all subsets
+                 for s in fdps:
+                     if len(s) == len(fvals):
+                         continue
+                     invals = list()
+                     exvals = list()
+                     for i in range(len(fd)):
+                         if fd[i] in s:
+                             invals.append(tdata[i])
+                         else:
+                             exvals.append(tdata[i])
+
+                     self.addListNumericData(invals, "spds") if tdst[1] == "num" else self.addListCatData(invals, "spds")
+                     inen = self.getAnyEntropy("spds", tdst[1], nbins)["entropy"]
+                     self.addListNumericData(exvals, "spds") if tdst[1] == "num" else self.addListCatData(exvals, "spds")
+                     exen = self.getAnyEntropy("spds", tdst[1], nbins)["entropy"]
+
+                     ig = tentr - (len(invals) * inen / sz + len(exvals) * exen / sz)
+                     if maxig is None or ig > maxig:
+                         maxig = ig
+
+                 pa = (ds, maxig)
+                 sfds.append(pa)
+
+         #sort by info gain
+         sfds.sort(key=lambda v : v[1], reverse=True)
+
+         result = self.__printResult("selFeatures", sfds[:nfeatures])
+         return result
+
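A compact sketch of the split-based information gain computed above for a numeric feature; the gain should peak at the split that actually generated the labels.

import numpy as np
from scipy.stats import entropy

def ent(labels):
    _, cnt = np.unique(labels, return_counts=True)
    return entropy(cnt / cnt.sum())

rng = np.random.default_rng(29)
feat = rng.normal(size=500)
target = (feat > 0.2).astype(int)  # labels generated by a split at 0.2
te = ent(target)
for sp in [-1.0, 0.2, 1.0]:
    lt, ge = target[feat < sp], target[feat >= sp]
    ig = te - (len(lt) * ent(lt) + len(ge) * ent(ge)) / len(target)
    print(sp, round(ig, 3))  # maximal at sp = 0.2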
+     def __stackData(self, *dsl):
+         """
+         stacks columns to create a matrix
+
+         Parameters
+             dsl : data source list
+         """
+         dlist = tuple(map(lambda ds : self.getNumericData(ds), dsl))
+         self.ensureSameSize(dlist)
+         dmat = np.column_stack(dlist)
+         return dmat
+
+     def __printBanner(self, msg, *dsl):
+         """
+         prints banner for any function
+
+         Parameters
+             msg : message
+             dsl : list of data set name or list or numpy array
+         """
+         tags = list(map(lambda ds : ds if type(ds) == str else "anonymous", dsl))
+         forData = " for data sets " if tags else ""
+         msg = msg + forData + " ".join(tags)
+         if self.verbose:
+             print("\n== " + msg + " ==")
+
+
+     def __printDone(self):
+         """
+         prints done message
+         """
+         if self.verbose:
+             print("done")
+
+     def __printStat(self, stat, pvalue, nhMsg, ahMsg, sigLev=.05):
+         """
+         generic stat and pvalue output
+
+         Parameters
+             stat : stat value
+             pvalue : p value
+             nhMsg : message printed when the null hypothesis holds (pvalue above sigLev)
+             ahMsg : message printed when the null hypothesis is rejected
+             sigLev : significance level
+         """
+         if self.verbose:
+             print("\ntest result:")
+             print("stat: {:.3f}".format(stat))
+             print("pvalue: {:.3f}".format(pvalue))
+             print("significance level: {:.3f}".format(sigLev))
+             print(nhMsg if pvalue > sigLev else ahMsg)
+
+     def __printResult(self, *values):
+         """
+         prints results
+
+         Parameters
+             values : flattened key and value pairs
+         """
+         result = dict()
+         assert len(values) % 2 == 0, "key value list should have even number of items"
+         for i in range(0, len(values), 2):
+             result[values[i]] = values[i+1]
+         if self.verbose:
+             print("result details:")
+             self.pp.pprint(result)
+         return result
+
3109
+ def __getBasicStats(self, data):
3110
+ """
3111
+ get mean and std dev
3112
+
3113
+ Parameters
3114
+ data : numpy array
3115
+ """
3116
+ mean = np.average(data)
3117
+ sd = np.std(data)
3118
+ r = (mean, sd, np.max(data), np.min(data))
3119
+ return r
3120
+
3121
+
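+ # Illustrative sketch (not part of the library API): the information gain used by
+ # selFeatures above is the parent entropy minus the size weighted entropies of the
+ # two partitions created by a candidate split. The helper below recomputes that
+ # quantity for one binary split, assuming nothing beyond the formula itself.
+ def _demoInfoGain():
+     import math
+     from collections import Counter
+
+     def entropy(vals):
+         #shannon entropy of a categorical value list
+         cnt = Counter(vals)
+         n = len(vals)
+         return -sum((c / n) * math.log2(c / n) for c in cnt.values())
+
+     feature = [2.0, 3.5, 1.0, 4.2, 5.1, 0.5]
+     target = ["a", "b", "a", "b", "b", "a"]
+     split = 3.0
+     lt = [t for f, t in zip(feature, target) if f < split]
+     ge = [t for f, t in zip(feature, target) if f >= split]
+     sz = len(target)
+     ig = entropy(target) - (len(lt) * entropy(lt) / sz + len(ge) * entropy(ge) / sz)
+     print("info gain {:.3f}".format(ig))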
matumizi/mcsim.py ADDED
@@ -0,0 +1,552 @@
+ #!/usr/local/bin/python3
+
+ # avenir-python: Machine Learning
+ # Author: Pranab Ghosh
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License"); you
+ # may not use this file except in compliance with the License. You may
+ # obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ # implied. See the License for the specific language governing
+ # permissions and limitations under the License.
+
+ # Package imports
+ import os
+ import sys
+ import matplotlib.pyplot as plt
+ import numpy as np
+ import matplotlib
+ import random
+ import jprops
+ import statistics
+ from matplotlib import pyplot
+ from .util import *
+ from .mlutil import *
+ from .sampler import *
+
+ class MonteCarloSimulator(object):
+     """
+     monte carlo simulator for integration and various statistics of complex functions
+     """
+     def __init__(self, numIter, callback, logFilePath, logLevName):
+         """
+         constructor
+
+         Parameters
+         numIter : num of iterations
+         callback : call back method
+         logFilePath : log file path
+         logLevName : log level
+         """
+         self.samplers = list()
+         self.numIter = numIter
+         self.callback = callback
+         self.extraArgs = None
+         self.output = list()
+         self.sum = None
+         self.mean = None
+         self.sd = None
+         self.replSamplers = dict()
+         self.prSamples = None
+
+         self.logger = None
+         if logFilePath is not None:
+             self.logger = createLogger(__name__, logFilePath, logLevName)
+             self.logger.info("******** starting new session of MonteCarloSimulator")
+
+     def registerBernoulliTrialSampler(self, pr):
+         """
+         bernoulli trial sampler
+
+         Parameters
+         pr : probability
+         """
+         self.samplers.append(BernoulliTrialSampler(pr))
+
+     def registerPoissonSampler(self, rateOccur, maxSamp):
+         """
+         poisson sampler
+
+         Parameters
+         rateOccur : rate of occurence
+         maxSamp : max limit on no of samples
+         """
+         self.samplers.append(PoissonSampler(rateOccur, maxSamp))
+
+     def registerUniformSampler(self, minv, maxv):
+         """
+         uniform sampler
+
+         Parameters
+         minv : min value
+         maxv : max value
+         """
+         self.samplers.append(UniformNumericSampler(minv, maxv))
+
+     def registerTriangularSampler(self, xmin, xmax, vertexValue, vertexPos=None):
+         """
+         triangular sampler
+
+         Parameters
+         xmin : min value
+         xmax : max value
+         vertexValue : distr value at vertex
+         vertexPos : vertex position
+         """
+         self.samplers.append(TriangularRejectSampler(xmin, xmax, vertexValue, vertexPos))
+
+     def registerGaussianSampler(self, mean, sd):
+         """
+         gaussian sampler
+
+         Parameters
+         mean : mean
+         sd : std deviation
+         """
+         self.samplers.append(GaussianRejectSampler(mean, sd))
+
+     def registerNormalSampler(self, mean, sd):
+         """
+         gaussian sampler using numpy
+
+         Parameters
+         mean : mean
+         sd : std deviation
+         """
+         self.samplers.append(NormalSampler(mean, sd))
+
+     def registerLogNormalSampler(self, mean, sd):
+         """
+         log normal sampler using numpy
+
+         Parameters
+         mean : mean
+         sd : std deviation
+         """
+         self.samplers.append(LogNormalSampler(mean, sd))
+
+     def registerParetoSampler(self, mode, shape):
+         """
+         pareto sampler using numpy
+
+         Parameters
+         mode : mode
+         shape : shape
+         """
+         self.samplers.append(ParetoSampler(mode, shape))
+
+     def registerGammaSampler(self, shape, scale):
+         """
+         gamma sampler using numpy
+
+         Parameters
+         shape : shape
+         scale : scale
+         """
+         self.samplers.append(GammaSampler(shape, scale))
+
+     def registerDiscreteRejectSampler(self, xmin, xmax, step, *values):
+         """
+         discrete int sampler
+
+         Parameters
+         xmin : min value
+         xmax : max value
+         step : discrete step
+         values : distr values
+         """
+         self.samplers.append(DiscreteRejectSampler(xmin, xmax, step, *values))
+
+     def registerNonParametricSampler(self, minv, binWidth, *values):
+         """
+         nonparametric sampler
+
+         Parameters
+         minv : min value
+         binWidth : bin width
+         values : distr values
+         """
+         sampler = NonParamRejectSampler(minv, binWidth, *values)
+         sampler.sampleAsFloat()
+         self.samplers.append(sampler)
+
+     def registerMultiVarNormalSampler(self, numVar, *values):
+         """
+         multi var gaussian sampler using numpy
+
+         Parameters
+         numVar : no of variables
+         values : numVar mean values followed by numVar x numVar values for covar matrix
+         """
+         self.samplers.append(MultiVarNormalSampler(numVar, *values))
+
+     def registerJointNonParamRejectSampler(self, xmin, xbinWidth, xnbin, ymin, ybinWidth, ynbin, *values):
+         """
+         joint nonparametric sampler
+
+         Parameters
+         xmin : min value for x
+         xbinWidth : bin width for x
+         xnbin : no of bins for x
+         ymin : min value for y
+         ybinWidth : bin width for y
+         ynbin : no of bins for y
+         values : distr values
+         """
+         self.samplers.append(JointNonParamRejectSampler(xmin, xbinWidth, xnbin, ymin, ybinWidth, ynbin, *values))
+
+     def registerRangePermutationSampler(self, minv, maxv, *numShuffles):
+         """
+         permutation sampler with range
+
+         Parameters
+         minv : min of range
+         maxv : max of range
+         numShuffles : no of shuffles or range of no of shuffles
+         """
+         self.samplers.append(PermutationSampler.createSamplerWithRange(minv, maxv, *numShuffles))
+
+     def registerValuesPermutationSampler(self, values, *numShuffles):
+         """
+         permutation sampler with values
+
+         Parameters
+         values : list data
+         numShuffles : no of shuffles or range of no of shuffles
+         """
+         self.samplers.append(PermutationSampler.createSamplerWithValues(values, *numShuffles))
+
+     def registerNormalSamplerWithTrendCycle(self, mean, stdDev, trend, cycle, step=1):
+         """
+         normal sampler with trend and cycle
+
+         Parameters
+         mean : mean
+         stdDev : std deviation
+         trend : trend delta
+         cycle : cycle values wrt base mean
+         step : adjustment step for cycle and trend
+         """
+         self.samplers.append(NormalSamplerWithTrendCycle(mean, stdDev, trend, cycle, step))
+
+     def registerCustomSampler(self, sampler):
+         """
+         custom sampler
+
+         Parameters
+         sampler : sampler with sample() method
+         """
+         self.samplers.append(sampler)
+
+     def registerEventSampler(self, intvSampler, valSampler=None):
+         """
+         event sampler
+
+         Parameters
+         intvSampler : interval sampler
+         valSampler : value sampler
+         """
+         self.samplers.append(EventSampler(intvSampler, valSampler))
+
+     def registerMetropolitanSampler(self, propStdDev, minv, binWidth, values):
+         """
+         metropolitan sampler
+
+         Parameters
+         propStdDev : proposal distr std dev
+         minv : min domain value for target distr
+         binWidth : bin width
+         values : target distr values
+         """
+         self.samplers.append(MetropolitanSampler(propStdDev, minv, binWidth, values))
+
+     def setSampler(self, var, iter, sampler):
+         """
+         set sampler for some variable when iteration reaches certain point
+
+         Parameters
+         var : sampler index
+         iter : iteration count
+         sampler : new sampler
+         """
+         key = (var, iter)
+         self.replSamplers[key] = sampler
+
+     def registerExtraArgs(self, *args):
+         """
+         extra args
+
+         Parameters
+         args : extra argument list
+         """
+         self.extraArgs = args
+
+     def replSampler(self, iter):
+         """
+         replace sampler for this iteration
+
+         Parameters
+         iter : iteration number
+         """
+         if len(self.replSamplers) > 0:
+             for v in range(self.numVars):
+                 key = (v, iter)
+                 if key in self.replSamplers:
+                     sampler = self.replSamplers[key]
+                     self.samplers[v] = sampler
+
+     def run(self):
+         """
+         run simulator
+         """
+         self.sum = None
+         self.mean = None
+         self.sd = None
+         self.numVars = len(self.samplers)
+         vOut = 0
+
+         for i in range(self.numIter):
+             self.replSampler(i)
+             args = list()
+             for s in self.samplers:
+                 arg = s.sample()
+                 if type(arg) is list:
+                     args.extend(arg)
+                 else:
+                     args.append(arg)
+
+             slen = len(args)
+             if self.extraArgs:
+                 args.extend(self.extraArgs)
+             args.append(self)
+             args.append(i)
+             vOut = self.callback(args)
+             self.output.append(vOut)
+             self.prSamples = args[:slen]
+
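+     # Illustrative note on the callback contract used by run() above: the callback
+     # receives one flat list holding the sampled values first, then any registered
+     # extra args, then the simulator instance and the iteration count. A sketch,
+     # with a hypothetical cost function:
+     #
+     #   def cost(args):
+     #       x = args[0]    #value from the first registered sampler
+     #       y = args[1]    #value from the second registered sampler
+     #       #args[-2] is the simulator instance, args[-1] the iteration count
+     #       return x + 2.0 * y
+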
+     def getOutput(self):
+         """
+         get raw output
+         """
+         return self.output
+
+     def setOutput(self, values):
+         """
+         set raw output
+
+         Parameters
+         values : output values
+         """
+         self.output = values
+         self.numIter = len(values)
+
+     def drawHist(self, myTitle, myXlabel, myYlabel):
+         """
+         draw histogram
+
+         Parameters
+         myTitle : title
+         myXlabel : label for x
+         myYlabel : label for y
+         """
+         pyplot.hist(self.output, density=True)
+         pyplot.title(myTitle)
+         pyplot.xlabel(myXlabel)
+         pyplot.ylabel(myYlabel)
+         pyplot.show()
+
+     def getSum(self):
+         """
+         get sum
+         """
+         if not self.sum:
+             self.sum = sum(self.output)
+         return self.sum
+
+     def getMean(self):
+         """
+         get average
+         """
+         if self.mean is None:
+             self.mean = statistics.mean(self.output)
+         return self.mean
+
+     def getStdDev(self):
+         """
+         get std dev
+         """
+         if self.sd is None:
+             self.sd = statistics.stdev(self.output, xbar=self.mean) if self.mean else statistics.stdev(self.output)
+         return self.sd
+
+     def getMedian(self):
+         """
+         get median
+         """
+         med = statistics.median(self.output)
+         return med
+
+     def getMax(self):
+         """
+         get max
+         """
+         return max(self.output)
+
+     def getMin(self):
+         """
+         get min
+         """
+         return min(self.output)
+
+     def getIntegral(self, bounds):
+         """
+         integral
+
+         Parameters
+         bounds : width of the integration interval
+         """
+         if not self.sum:
+             self.sum = sum(self.output)
+         return self.sum * bounds / self.numIter
+
+     def getLowerTailStat(self, zvalue, numIntPoints=50):
+         """
+         get lower tail stat
+
+         Parameters
+         zvalue : zscore upper bound
+         numIntPoints : no of interpolation point for cum distribution
+         """
+         mean = self.getMean()
+         sd = self.getStdDev()
+         tailStart = self.getMin()
+         tailEnd = mean - zvalue * sd
+         cvaCounts = self.cumDistr(tailStart, tailEnd, numIntPoints)
+
+         reqConf = floatRange(0.0, 0.150, .01)
+         msg = "p value outside interpolation range, reduce zvalue and try again {:.5f} {:.5f}".format(reqConf[-1], cvaCounts[-1][1])
+         assert reqConf[-1] < cvaCounts[-1][1], msg
+         critValues = self.interpolateCritValues(reqConf, cvaCounts, True, tailStart, tailEnd)
+         return critValues
+
+     def getPercentile(self, cvalue):
+         """
+         percentile
+
+         Parameters
+         cvalue : value for percentile
+         """
+         count = 0
+         for v in self.output:
+             if v < cvalue:
+                 count += 1
+         percent = int(count * 100.0 / self.numIter)
+         return percent
+
+     def getCritValue(self, pvalue):
+         """
+         critical value for probability threshold
+
+         Parameters
+         pvalue : pvalue
+         """
+         assertWithinRange(pvalue, 0.0, 1.0, "invalid probability value")
+         svalues = sorted(self.output)
+         ppval = 0.0
+         cval = None
+         intv = 1.0 / self.numIter
+         for i in range(self.numIter - 1):
+             cpval = (i + 1) / self.numIter
+             if cpval > pvalue:
+                 if i == 0:
+                     cval = svalues[0]
+                 else:
+                     #linear interpolation between neighboring sorted values
+                     sl = svalues[i] - svalues[i-1]
+                     cval = svalues[i-1] + sl * (pvalue - ppval) / intv
+                 break
+             ppval = cpval
+         return cval
+
+     def getUpperTailStat(self, zvalue, numIntPoints=50):
+         """
+         upper tail stat
+
+         Parameters
+         zvalue : zscore upper bound
+         numIntPoints : no of interpolation point for cum distribution
+         """
+         mean = self.getMean()
+         sd = self.getStdDev()
+         tailStart = mean + zvalue * sd
+         tailEnd = self.getMax()
+         cvaCounts = self.cumDistr(tailStart, tailEnd, numIntPoints)
+
+         reqConf = floatRange(0.85, 1.0, .01)
+         msg = "p value outside interpolation range, reduce zvalue and try again {:.5f} {:.5f}".format(reqConf[0], cvaCounts[0][1])
+         assert reqConf[0] > cvaCounts[0][1], msg
+         critValues = self.interpolateCritValues(reqConf, cvaCounts, False, tailStart, tailEnd)
+         return critValues
+
+     def cumDistr(self, tailStart, tailEnd, numIntPoints):
+         """
+         cumulative distribution at tail
+
+         Parameters
+         tailStart : tail start
+         tailEnd : tail end
+         numIntPoints : no of interpolation points
+         """
+         delta = (tailEnd - tailStart) / numIntPoints
+         cvalues = floatRange(tailStart, tailEnd, delta)
+         cvaCounts = list()
+         for cv in cvalues:
+             count = 0
+             for v in self.output:
+                 if v < cv:
+                     count += 1
+             p = (cv, count/self.numIter)
+             if self.logger is not None:
+                 self.logger.info("{:.3f} {:.3f}".format(p[0], p[1]))
+             cvaCounts.append(p)
+         return cvaCounts
+
+     def interpolateCritValues(self, reqConf, cvaCounts, lowerTail, tailStart, tailEnd):
+         """
+         interpolate for specific confidence limits
+
+         Parameters
+         reqConf : confidence level values
+         cvaCounts : cum values
+         lowerTail : True if lower tail
+         tailStart : tail start
+         tailEnd : tail end
+         """
+         critValues = list()
+         if self.logger is not None:
+             self.logger.info("target conf limit " + str(reqConf))
+         reqConfSub = reqConf[1:] if lowerTail else reqConf[:-1]
+         for rc in reqConfSub:
+             for i in range(len(cvaCounts) - 1):
+                 if rc >= cvaCounts[i][1] and rc < cvaCounts[i+1][1]:
+                     slope = (cvaCounts[i+1][0] - cvaCounts[i][0]) / (cvaCounts[i+1][1] - cvaCounts[i][1])
+                     cval = cvaCounts[i][0] + slope * (rc - cvaCounts[i][1])
+                     p = (rc, cval)
+                     if self.logger is not None:
+                         self.logger.debug("interpolated crit values {:.3f} {:.3f}".format(p[0], p[1]))
+                     critValues.append(p)
+                     break
+         if lowerTail:
+             p = (0.0, tailStart)
+             critValues.insert(0, p)
+         else:
+             p = (1.0, tailEnd)
+             critValues.append(p)
+         return critValues
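+
+ # Usage sketch (illustrative, not part of the library): register two samplers,
+ # run the simulation with a simple callback following the contract noted in run(),
+ # then query summary statistics. All parameter values here are made up.
+ def _demoSimulation():
+     def cost(args):
+         #sampled values arrive first in the args list
+         return args[0] + 2.0 * args[1]
+
+     sim = MonteCarloSimulator(1000, cost, None, None)
+     sim.registerNormalSampler(10.0, 2.0)
+     sim.registerUniformSampler(0.0, 1.0)
+     sim.run()
+     print("mean {:.3f} sd {:.3f}".format(sim.getMean(), sim.getStdDev()))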
matumizi/mlutil.py ADDED
@@ -0,0 +1,1500 @@
+ #!/usr/local/bin/python3
+
+ # avenir-python: Machine Learning
+ # Author: Pranab Ghosh
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License"); you
+ # may not use this file except in compliance with the License. You may
+ # obtain a copy of the License at
+ #
+ # http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+ # implied. See the License for the specific language governing
+ # permissions and limitations under the License.
+
+ # Package imports
+ import os
+ import sys
+ import numpy as np
+ from sklearn import preprocessing
+ from sklearn import metrics
+ from sklearn.datasets import make_blobs
+ from sklearn.datasets import make_classification
+ import random
+ from math import *
+ from decimal import Decimal
+ import statistics
+ import jprops
+ from Levenshtein import distance as ld
+ from .util import *
+ from .sampler import *
+
+ class Configuration:
+     """
+     Configuration management. Supports default value, mandatory value and typed value.
+     """
+     def __init__(self, configFile, defValues, verbose=False):
+         """
+         initializer
+
+         Parameters
+         configFile : config file path
+         defValues : dictionary of default values
+         verbose : verbosity flag
+         """
+         configs = {}
+         with open(configFile) as fp:
+             for key, value in jprops.iter_properties(fp):
+                 configs[key] = value
+         self.configs = configs
+         self.defValues = defValues
+         self.verbose = verbose
+
+     def override(self, configFile):
+         """
+         override configuration from file
+
+         Parameters
+         configFile : override config file path
+         """
+         with open(configFile) as fp:
+             for key, value in jprops.iter_properties(fp):
+                 self.configs[key] = value
+
+     def setParam(self, name, value):
+         """
+         override individual configuration
+
+         Parameters
+         name : config param name
+         value : config param value
+         """
+         self.configs[name] = value
+
+     def getStringConfig(self, name):
+         """
+         get string param
+
+         Parameters
+         name : config param name
+         """
+         if self.isNone(name):
+             val = (None, False)
+         elif self.isDefault(name):
+             val = (self.handleDefault(name), True)
+         else:
+             val = (self.configs[name], False)
+         if self.verbose:
+             print("{} {} {}".format(name, self.configs[name], val[0]))
+         return val
+
+     def getIntConfig(self, name):
+         """
+         get int param
+
+         Parameters
+         name : config param name
+         """
+         if self.isNone(name):
+             val = (None, False)
+         elif self.isDefault(name):
+             val = (self.handleDefault(name), True)
+         else:
+             val = (int(self.configs[name]), False)
+         if self.verbose:
+             print("{} {} {}".format(name, self.configs[name], val[0]))
+         return val
+
+     def getFloatConfig(self, name):
+         """
+         get float param
+
+         Parameters
+         name : config param name
+         """
+         if self.isNone(name):
+             val = (None, False)
+         elif self.isDefault(name):
+             val = (self.handleDefault(name), True)
+         else:
+             val = (float(self.configs[name]), False)
+         if self.verbose:
+             print("{} {} {:06.3f}".format(name, self.configs[name], val[0]))
+         return val
+
+     def getBooleanConfig(self, name):
+         """
+         get boolean param
+
+         Parameters
+         name : config param name
+         """
+         if self.isNone(name):
+             val = (None, False)
+         elif self.isDefault(name):
+             val = (self.handleDefault(name), True)
+         else:
+             bVal = self.configs[name].lower() == "true"
+             val = (bVal, False)
+         if self.verbose:
+             print("{} {} {}".format(name, self.configs[name], val[0]))
+         return val
+
+     def getIntListConfig(self, name, delim=","):
+         """
+         get int list param
+
+         Parameters
+         name : config param name
+         delim : delimiter
+         """
+         if self.isNone(name):
+             val = (None, False)
+         elif self.isDefault(name):
+             val = (self.handleDefault(name), True)
+         else:
+             delSepStr = self.getStringConfig(name)
+
+             #specified as list or range
+             intList = strListOrRangeToIntArray(delSepStr[0])
+             val = (intList, delSepStr[1])
+         return val
+
+     def getFloatListConfig(self, name, delim=","):
+         """
+         get float list param
+
+         Parameters
+         name : config param name
+         delim : delimiter
+         """
+         delSepStr = self.getStringConfig(name)
+         if self.isNone(name):
+             val = (None, False)
+         elif self.isDefault(name):
+             val = (self.handleDefault(name), True)
+         else:
+             flList = strToFloatArray(delSepStr[0], delim)
+             val = (flList, delSepStr[1])
+         return val
+
+     def getStringListConfig(self, name, delim=","):
+         """
+         get string list param
+
+         Parameters
+         name : config param name
+         delim : delimiter
+         """
+         delSepStr = self.getStringConfig(name)
+         if self.isNone(name):
+             val = (None, False)
+         elif self.isDefault(name):
+             val = (self.handleDefault(name), True)
+         else:
+             strList = delSepStr[0].split(delim)
+             val = (strList, delSepStr[1])
+         return val
+
+     def handleDefault(self, name):
+         """
+         handles default
+
+         Parameters
+         name : config param name
+         """
+         dVal = self.defValues[name]
+         if (dVal[1] is None):
+             val = dVal[0]
+         else:
+             raise ValueError(dVal[1])
+         return val
+
+     def isNone(self, name):
+         """
+         true if value is None
+
+         Parameters
+         name : config param name
+         """
+         return self.configs[name].lower() == "none"
+
+     def isDefault(self, name):
+         """
+         true if the value is default
+
+         Parameters
+         name : config param name
+         """
+         de = self.configs[name] == "_"
+         return de
+
+     def eitherOrStringConfig(self, firstName, secondName):
+         """
+         returns one of two string parameters
+
+         Parameters
+         firstName : first parameter name
+         secondName : second parameter name
+         """
+         if not self.isNone(firstName):
+             first = self.getStringConfig(firstName)[0]
+             second = None
+             if not self.isNone(secondName):
+                 raise ValueError("only one of the two parameters should be set and not both " + firstName + " " + secondName)
+         else:
+             if not self.isNone(secondName):
+                 second = self.getStringConfig(secondName)[0]
+                 first = None
+             else:
+                 raise ValueError("at least one of the two parameters should be set " + firstName + " " + secondName)
+         return (first, second)
+
+     def eitherOrIntConfig(self, firstName, secondName):
+         """
+         returns one of two int parameters
+
+         Parameters
+         firstName : first parameter name
+         secondName : second parameter name
+         """
+         if not self.isNone(firstName):
+             first = self.getIntConfig(firstName)[0]
+             second = None
+             if not self.isNone(secondName):
+                 raise ValueError("only one of the two parameters should be set and not both " + firstName + " " + secondName)
+         else:
+             if not self.isNone(secondName):
+                 second = self.getIntConfig(secondName)[0]
+                 first = None
+             else:
+                 raise ValueError("at least one of the two parameters should be set " + firstName + " " + secondName)
+         return (first, second)
+
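+ # Usage sketch (illustrative): the properties file name and parameter names are
+ # hypothetical. Each default maps a name to (value, error-message-or-None); a
+ # value of "_" in the file falls back to the default, "none" yields None.
+ def _demoConfiguration():
+     defValues = dict()
+     defValues["train.num.iter"] = (100, None)
+     defValues["train.data.file"] = (None, "missing training data file")
+     config = Configuration("app.properties", defValues)
+     numIter = config.getIntConfig("train.num.iter")[0]
+     dataFile = config.getStringConfig("train.data.file")[0]
+     print(numIter, dataFile)
+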
+ class CatLabelGenerator:
+     """
+     label generator for categorical variables
+     """
+     def __init__(self, catValues, delim):
+         """
+         initializer
+
+         Parameters
+         catValues : dictionary of categorical values
+         delim : delimiter
+         """
+         self.encoders = {}
+         self.catValues = catValues
+         self.delim = delim
+         for k in self.catValues.keys():
+             le = preprocessing.LabelEncoder()
+             le.fit(self.catValues[k])
+             self.encoders[k] = le
+
+     def processRow(self, row):
+         """
+         encode row categorical values
+
+         Parameters
+         row : data row
+         """
+         rowArr = row.split(self.delim)
+         for i in range(len(rowArr)):
+             if (i in self.catValues):
+                 curVal = rowArr[i]
+                 assert curVal in self.catValues[i], "categorical value invalid"
+                 encVal = self.encoders[i].transform([curVal])
+                 rowArr[i] = str(encVal[0])
+         return self.delim.join(rowArr)
+
+     def getOrigLabels(self, indx):
+         """
+         get original labels
+
+         Parameters
+         indx : column index
+         """
+         return self.encoders[indx].classes_
+
+ class SupvLearningDataGenerator:
+     """
+     data generator for supervised learning
+     """
+     def __init__(self, configFile):
+         """
+         initializer
+
+         Parameters
+         configFile : config file path
+         """
+         defValues = dict()
+         defValues["common.num.samp"] = (100, None)
+         defValues["common.num.feat"] = (5, None)
+         defValues["common.feat.trans"] = (None, None)
+         defValues["common.feat.types"] = (None, "missing feature types")
+         defValues["common.cat.feat.distr"] = (None, None)
+         defValues["common.output.precision"] = (3, None)
+         defValues["common.error"] = (0.01, None)
+         defValues["class.gen.technique"] = ("blob", None)
+         defValues["class.num.feat.informative"] = (2, None)
+         defValues["class.num.feat.redundant"] = (2, None)
+         defValues["class.num.feat.repeated"] = (0, None)
+         defValues["class.num.feat.cat"] = (0, None)
+         defValues["class.num.class"] = (2, None)
+
+         self.config = Configuration(configFile, defValues)
+
+     def genClassifierData(self):
+         """
+         generates classifier data
+         """
+         nsamp = self.config.getIntConfig("common.num.samp")[0]
+         nfeat = self.config.getIntConfig("common.num.feat")[0]
+         nclass = self.config.getIntConfig("class.num.class")[0]
+         #transform with shift and scale
+         ftrans = self.config.getFloatListConfig("common.feat.trans")[0]
+         feTrans = dict()
+         for i in range(0, len(ftrans), 2):
+             tr = (ftrans[i], ftrans[i+1])
+             indx = int(i/2)
+             feTrans[indx] = tr
+
+         ftypes = self.config.getStringListConfig("common.feat.types")[0]
+
+         #categorical feature distribution
+         feCatDist = dict()
+         fcatdl = self.config.getStringListConfig("common.cat.feat.distr")[0]
+         for fcatds in fcatdl:
+             fcatd = fcatds.split(":")
+             feInd = int(fcatd[0])
+             clVal = int(fcatd[1])
+             key = (feInd, clVal)    #feature index and class value
+             dist = list(map(lambda i : (fcatd[i], float(fcatd[i+1])), range(2, len(fcatd), 2)))
+             feCatDist[key] = CategoricalRejectSampler(*dist)
+
+         #generate feature and class data
+         genTechnique = self.config.getStringConfig("class.gen.technique")[0]
+         error = self.config.getFloatConfig("common.error")[0]
+         if genTechnique == "blob":
+             features, claz = make_blobs(n_samples=nsamp, centers=nclass, n_features=nfeat)
+             for i in range(nsamp):    #shift and scale
+                 for j in range(nfeat):
+                     tr = feTrans[j]
+                     features[i,j] = (features[i,j] + tr[0]) * tr[1]
+             claz = np.array(list(map(lambda c : random.randint(0, nclass-1) if random.random() < error else c, claz)))
+         elif genTechnique == "classify":
+             nfeatInfo = self.config.getIntConfig("class.num.feat.informative")[0]
+             nfeatRed = self.config.getIntConfig("class.num.feat.redundant")[0]
+             nfeatRep = self.config.getIntConfig("class.num.feat.repeated")[0]
+             shifts = list(map(lambda i : feTrans[i][0], range(nfeat)))
+             scales = list(map(lambda i : feTrans[i][1], range(nfeat)))
+             features, claz = make_classification(n_samples=nsamp, n_features=nfeat, n_informative=nfeatInfo, n_redundant=nfeatRed,
+                 n_repeated=nfeatRep, n_classes=nclass, flip_y=error, shift=shifts, scale=scales)
+         else:
+             raise ValueError("invalid generation technique")
+
+         #add categorical features and format
+         nCatFeat = self.config.getIntConfig("class.num.feat.cat")[0]
+         prec = self.config.getIntConfig("common.output.precision")[0]
+         for f, c in zip(features, claz):
+             nfs = list(map(lambda i : self.numFeToStr(f[i], ftypes[i], prec), range(nfeat)))
+             if nCatFeat > 0:
+                 cfs = list(map(lambda i : self.catFe(i, c, ftypes[i], feCatDist), range(nfeat, nfeat + nCatFeat, 1)))
+                 rec = ",".join(nfs) + "," + ",".join(cfs) + "," + str(c)
+             else:
+                 rec = ",".join(nfs) + "," + str(c)
+             yield rec
+
+     def numFeToStr(self, fv, ft, prec):
+         """
+         numeric feature value to string
+
+         Parameters
+         fv : field value
+         ft : field data type
+         prec : precision
+         """
+         if ft == "float":
+             s = formatFloat(prec, fv)
+         elif ft == "int":
+             s = str(int(fv))
+         else:
+             raise ValueError("invalid type, expecting float or int")
+         return s
+
+     def catFe(self, i, cv, ft, feCatDist):
+         """
+         generate categorical feature
+
+         Parameters
+         i : col index
+         cv : class value
+         ft : field data type
+         feCatDist : cat value distribution
+         """
+         if ft == "cat":
+             key = (i, cv)
+             s = feCatDist[key].sample()
+         else:
+             raise ValueError("invalid type, expecting categorical")
+         return s
+
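+ # Usage sketch (illustrative): the config file name is hypothetical and must define
+ # the common.* and class.* parameters consumed above; genClassifierData() is a
+ # generator yielding one delimited record per sample.
+ def _demoClassifierGen():
+     gen = SupvLearningDataGenerator("class_gen.properties")
+     for rec in gen.genClassifierData():
+         print(rec)
+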
+ class RegressionDataGenerator:
+     """
+     data generator for regression, including square terms, cross terms, bias, noise, correlated variables
+     and user defined function
+     """
+     def __init__(self, configFile, callback=None):
+         """
+         initializer
+
+         Parameters
+         configFile : config file path
+         callback : user defined function
+         """
+         defValues = dict()
+         defValues["common.pvar.samplers"] = (None, None)
+         defValues["common.pvar.ranges"] = (None, None)
+         defValues["common.linear.weights"] = (None, None)
+         defValues["common.square.weights"] = (None, None)
+         defValues["common.crterm.weights"] = (None, None)
+         defValues["common.corr.params"] = (None, None)
+         defValues["common.bias"] = (0, None)
+         defValues["common.noise"] = (None, None)
+         defValues["common.tvar.range"] = (None, None)
+         defValues["common.weight.niter"] = (20, None)
+         self.config = Configuration(configFile, defValues)
+         self.callback = callback
+
+         #samplers for predictor variables
+         items = self.config.getStringListConfig("common.pvar.samplers")[0]
+         self.samplers = list(map(lambda s : createSampler(s), items))
+         self.npvar = len(self.samplers)
+
+         #value ranges for predictor variables
+         items = self.config.getStringListConfig("common.pvar.ranges")[0]
+         self.pvranges = list()
+         for i in range(0, len(items), 2):
+             if items[i] == "none":
+                 r = None
+             else:
+                 vmin = float(items[i])
+                 vmax = float(items[i+1])
+                 r = (vmin, vmax, vmax-vmin)
+             self.pvranges.append(r)
+         assertEqual(len(self.pvranges), self.npvar, "no of predictor var ranges provided is invalid")
+
+         #linear weights for predictor variables
+         self.lweights = self.config.getFloatListConfig("common.linear.weights")[0]
+         assertEqual(len(self.lweights), self.npvar, "no of linear weights provided is invalid")
+
+         #square weights for predictor variables
+         items = self.config.getStringListConfig("common.square.weights")[0]
+         self.sqweight = dict()
+         for i in range(0, len(items), 2):
+             vi = int(items[i])
+             assertLesser(vi, self.npvar, "invalid predictor var index")
+             wt = float(items[i+1])
+             self.sqweight[vi] = wt
+
+         #cross term weights for predictor variables
+         items = self.config.getStringListConfig("common.crterm.weights")[0]
+         self.crweight = dict()
+         for i in range(0, len(items), 3):
+             vi = int(items[i])
+             assertLesser(vi, self.npvar, "invalid predictor var index")
+             vj = int(items[i+1])
+             assertLesser(vj, self.npvar, "invalid predictor var index")
+             wt = float(items[i+2])
+             vp = (vi, vj)
+             self.crweight[vp] = wt
+
+         #correlated variables
+         items = self.config.getStringListConfig("common.corr.params")[0]
+         self.corrparams = dict()
+         for co in items:
+             cparam = co.split(":")
+             vi = int(cparam[0])
+             vj = int(cparam[1])
+             k = (vi,vj)
+             bias = float(cparam[2])
+             wt = float(cparam[3])
+             noise = float(cparam[4])
+             roundoff = cparam[5] == "true"
+             v = (bias, wt, noise, roundoff)
+             self.corrparams[k] = v
+
+         #bias, noise and target range values
+         self.bias = self.config.getFloatConfig("common.bias")[0]
+         noise = self.config.getStringListConfig("common.noise")[0]
+         self.ndistr = noise[0]
+         self.noise = float(noise[1])
+         self.tvarlim = self.config.getFloatListConfig("common.tvar.range")[0]
+
+         #sample
+         niter = self.config.getIntConfig("common.weight.niter")[0]
+         yvals = list()
+         for i in range(niter):
+             y = self.sample()[1]
+             yvals.append(y)
+
+         #scale weights by sampled mean and target range midpoint
+         my = statistics.mean(yvals)
+         myt = (self.tvarlim[0] + self.tvarlim[1]) / 2
+         sc = (myt - self.bias) / (my - self.bias)
+         self.lweights = list(map(lambda w : w * sc, self.lweights))
+
+         for k in self.sqweight.keys():
+             self.sqweight[k] *= sc
+
+         for k in self.crweight.keys():
+             self.crweight[k] *= sc
+
+     def sample(self):
+         """
+         sample predictor variables and target variable
+         """
+         pvd = list(map(lambda s : s.sample(), self.samplers))
+
+         #correct for correlated variables
+         for k in self.corrparams.keys():
+             vi = k[0]
+             vj = k[1]
+             v = self.corrparams[k]
+             bias = v[0]
+             wt = v[1]
+             noise = v[2]
+             roundoff = v[3]
+             nv = bias + wt * pvd[vi]
+             pvd[vj] = preturbScalar(nv, noise, "normal")
+             if roundoff:
+                 pvd[vj] = round(pvd[vj])
+
+         spvd = list()
+         lsum = self.bias
+         for i in range(self.npvar):
+             #range limit
+             if self.pvranges[i] is not None:
+                 pvd[i] = rangeLimit(pvd[i], self.pvranges[i][0], self.pvranges[i][1])
+             spvd.append(pvd[i])
+
+             #scale
+             if self.pvranges[i] is not None:
+                 pvd[i] = scaleMinMaxScaData(pvd[i], self.pvranges[i])
+             lsum += self.lweights[i] * pvd[i]
+
+         #square terms
+         ssum = 0
+         for k in self.sqweight.keys():
+             ssum += self.sqweight[k] * pvd[k] * pvd[k]
+
+         #cross terms
+         crsum = 0
+         for k in self.crweight.keys():
+             vi = k[0]
+             vj = k[1]
+             crsum += self.crweight[k] * pvd[vi] * pvd[vj]
+
+         y = lsum + ssum + crsum
+         y = preturbScalar(y, self.noise, self.ndistr)
+         if self.callback is not None:
+             ufy = self.callback(spvd)
+             y += ufy
+         r = (spvd, y)
+         return r
+
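+ # Usage sketch (illustrative): the config file name is hypothetical and must supply
+ # the common.* parameters consumed by the initializer; each sample() call returns
+ # a (predictor value list, target value) pair.
+ def _demoRegressionGen():
+     gen = RegressionDataGenerator("regr_gen.properties")
+     for _ in range(5):
+         pvd, y = gen.sample()
+         print(pvd, y)
+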
+ def loadDataFile(file, delim, cols, colIndices):
+     """
+     loads delim separated file and extracts columns
+
+     Parameters
+     file : file path
+     delim : delimiter
+     cols : columns to use from file
+     colIndices : columns to extract
+     """
+     data = np.loadtxt(file, delimiter=delim, usecols=cols)
+     extrData = data[:,colIndices]
+     return (data, extrData)
+
+ def loadFeatDataFile(file, delim, cols):
+     """
+     loads delim separated file and extracts columns
+
+     Parameters
+     file : file path
+     delim : delimiter
+     cols : columns to use from file
+     """
+     data = np.loadtxt(file, delimiter=delim, usecols=cols)
+     return data
+
+ def extrColumns(arr, columns):
+     """
+     extracts columns
+
+     Parameters
+     arr : 2D array
+     columns : columns
+     """
+     return arr[:, columns]
+
+ def subSample(featData, clsData, subSampleRate, withReplacement):
+     """
+     subsamples feature and class label data
+
+     Parameters
+     featData : 2D array of feature data
+     clsData : array of class labels
+     subSampleRate : fraction to be sampled
+     withReplacement : true if sampling with replacement
+     """
+     sampSize = int(featData.shape[0] * subSampleRate)
+     sampledIndx = np.random.choice(featData.shape[0], sampSize, replace=withReplacement)
+     sampFeat = featData[sampledIndx]
+     sampCls = clsData[sampledIndx]
+     return (sampFeat, sampCls)
+
+ def euclideanDistance(x, y):
+     """
+     euclidean distance
+
+     Parameters
+     x : first vector
+     y : second vector
+     """
+     return sqrt(sum(pow(a-b, 2) for a, b in zip(x, y)))
+
+ def squareRooted(x):
+     """
+     square root of sum of squares
+
+     Parameters
+     x : data vector
+     """
+     return round(sqrt(sum([a*a for a in x])), 3)
+
+ def cosineSimilarity(x, y):
+     """
+     cosine similarity
+
+     Parameters
+     x : first vector
+     y : second vector
+     """
+     numerator = sum(a*b for a,b in zip(x,y))
+     denominator = squareRooted(x) * squareRooted(y)
+     return round(numerator / float(denominator), 3)
+
+ def cosineDistance(x, y):
+     """
+     cosine distance
+
+     Parameters
+     x : first vector
+     y : second vector
+     """
+     return 1.0 - cosineSimilarity(x,y)
+
+ def manhattanDistance(x, y):
+     """
+     manhattan distance
+
+     Parameters
+     x : first vector
+     y : second vector
+     """
+     return sum(abs(a-b) for a,b in zip(x,y))
+
+ def nthRoot(value, nRoot):
+     """
+     nth root
+
+     Parameters
+     value : data value
+     nRoot : root
+     """
+     rootValue = 1/float(nRoot)
+     return round(Decimal(value) ** Decimal(rootValue), 3)
+
+ def minkowskiDistance(x, y, pValue):
+     """
+     minkowski distance
+
+     Parameters
+     x : first vector
+     y : second vector
+     pValue : power factor
+     """
+     return nthRoot(sum(pow(abs(a-b), pValue) for a,b in zip(x, y)), pValue)
+
+ def jaccardSimilarityX(x, y):
+     """
+     jaccard similarity
+
+     Parameters
+     x : first vector
+     y : second vector
+     """
+     intersectionCardinality = len(set.intersection(*[set(x), set(y)]))
+     unionCardinality = len(set.union(*[set(x), set(y)]))
+     return intersectionCardinality/float(unionCardinality)
+
+ def jaccardSimilarity(x, y, wx=1.0, wy=1.0):
+     """
+     weighted jaccard similarity
+
+     Parameters
+     x : first vector
+     y : second vector
+     wx : weight for x
+     wy : weight for y
+     """
+     sx = set(x)
+     sy = set(y)
+     sxyInt = sx.intersection(sy)
+     intCardinality = len(sxyInt)
+     sxIntDiff = sx.difference(sxyInt)
+     syIntDiff = sy.difference(sxyInt)
+     unionCardinality = len(sx.union(sy))
+     return intCardinality/float(intCardinality + wx * len(sxIntDiff) + wy * len(syIntDiff))
+
+ def levenshteinSimilarity(s1, s2):
+     """
+     Levenshtein similarity for strings
+
+     Parameters
+     s1 : first string
+     s2 : second string
+     """
+     assert type(s1) == str and type(s2) == str, "Levenshtein similarity is for strings only"
+     d = ld(s1,s2)
+     l = max(len(s1),len(s2))
+     d = 1.0 - min(d/l, 1.0)
+     return d
+
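+ # Quick demonstration of the distance and similarity helpers above on two small
+ # vectors; purely illustrative. With pValue 2, minkowskiDistance should agree
+ # with euclideanDistance up to rounding.
+ def _demoDistances():
+     x = [1.0, 2.0, 3.0]
+     y = [2.0, 4.0, 6.0]
+     print("euclidean {:.3f}".format(euclideanDistance(x, y)))
+     print("manhattan {:.3f}".format(manhattanDistance(x, y)))
+     print("cosine distance {:.3f}".format(cosineDistance(x, y)))
+     print("minkowski p=2 {}".format(minkowskiDistance(x, y, 2)))
+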
+ def norm(values, po=2):
+     """
+     normalizes a vector to unit norm
+
+     Parameters
+     values : list of values
+     po : power
+     """
+     no = sum(list(map(lambda v: pow(v,po), values)))
+     no = pow(no, 1.0/po)
+     return list(map(lambda v: v/no, values))
+
+ def createOneHotVec(size, indx = -1):
+     """
+     random one hot vector
+
+     Parameters
+     size : vector size
+     indx : one hot position
+     """
+     vec = [0] * size
+     s = random.randint(0, size - 1) if indx < 0 else indx
+     vec[s] = 1
+     return vec
+
+ def createAllOneHotVec(size):
+     """
+     create all one hot vectors
+
+     Parameters
+     size : vector size and no of vectors
+     """
+     vecs = list()
+     for i in range(size):
+         vec = [0] * size
+         vec[i] = 1
+         vecs.append(vec)
+     return vecs
+
+ def blockShuffle(data, blockSize):
+     """
+     block shuffle
+
+     Parameters
+     data : list data
+     blockSize : block size
+     """
+     numBlock = int(len(data) / blockSize)
+     remain = len(data) % blockSize
+     numBlock += (1 if remain > 0 else 0)
+     shuffled = list()
+     for i in range(numBlock):
+         b = random.randint(0, numBlock-1)
+         beg = b * blockSize
+         if (b < numBlock-1):
+             end = beg + blockSize
+             shuffled.extend(data[beg:end])
+         else:
+             shuffled.extend(data[beg:])
+     return shuffled
+
+ def shuffle(data, numShuffle):
+     """
+     shuffle data by random swapping
+
+     Parameters
+     data : list data
+     numShuffle : no of pairwise swaps
+     """
+     sz = len(data)
+     if numShuffle is None:
+         numShuffle = int(sz / 2)
+     for i in range(numShuffle):
+         fi = random.randint(0, sz - 1)
+         se = random.randint(0, sz - 1)
+         tmp = data[fi]
+         data[fi] = data[se]
+         data[se] = tmp
+
+ def randomWalk(size, start, lowStep, highStep):
+     """
+     random walk
+
+     Parameters
+     size : no of steps
+     start : initial position
+     lowStep : step min
+     highStep : step max
+     """
+     cur = start
+     for i in range(size):
+         yield cur
+         cur += randomFloat(lowStep, highStep)
+
+ def binaryEcodeCategorical(values, value):
+     """
+     one hot binary encoding
+
+     Parameters
+     values : list of values
+     value : value to be replaced with 1
+     """
+     size = len(values)
+     vec = [0] * size
+     for i in range(size):
+         if (values[i] == value):
+             vec[i] = 1
+     return vec
+
+ def createLabeledSeq(inputData, tw):
+     """
+     creates feature, label pairs from sequence data, where tw features are followed by the output
+
+     Parameters
+     inputData : list containing features and labels
+     tw : no of features
+     """
+     features = list()
+     labels = list()
+     l = len(inputData)
+     for i in range(l - tw):
+         trainSeq = inputData[i:i+tw]
+         trainLabel = inputData[i+tw]
+         features.append(trainSeq)
+         labels.append(trainLabel)
+     return (features, labels)
+
+ def createLabeledSeqFromFile(filePath, delim, index, tw):
+     """
+     creates feature, label pairs from 1D sequence data in a file
+
+     Parameters
+     filePath : file path
+     delim : delimiter
+     index : column index
+     tw : no of features
+     """
+     seqData = getFileColumnAsFloat(filePath, delim, index)
+     return createLabeledSeq(seqData, tw)
+
+ def fromMultDimSeqToTabular(data, inpSize, seqLen):
+     """
+     reshapes input of shape (nrow, inpSize * seqLen) to shape (nrow * seqLen, inpSize)
+
+     Parameters
+     data : 2D array
+     inpSize : each input size in sequence
+     seqLen : sequence length
+     """
+     nrow = data.shape[0]
+     assert data.shape[1] == inpSize * seqLen, "invalid input size or sequence length"
+     return data.reshape(nrow * seqLen, inpSize)
+
+ def fromTabularToMultDimSeq(data, inpSize, seqLen):
+     """
+     reshapes input of shape (nrow * seqLen, inpSize) to shape (nrow, inpSize * seqLen)
+
+     Parameters
+     data : 2D array
+     inpSize : each input size in sequence
+     seqLen : sequence length
+     """
+     nrow = int(data.shape[0] / seqLen)
+     assert data.shape[1] == inpSize, "invalid input size"
+     return data.reshape(nrow, seqLen * inpSize)
+
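+ # Round trip demonstration of the two reshaping helpers above: a (2, 6) array
+ # holding two 3 step sequences with 2 inputs per step flattens to (6, 2) and back.
+ def _demoSeqReshape():
+     data = np.arange(12).reshape(2, 6)
+     tab = fromMultDimSeqToTabular(data, 2, 3)
+     print(tab.shape)    #(6, 2)
+     seq = fromTabularToMultDimSeq(tab, 2, 3)
+     print(seq.shape)    #(2, 6)
+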
+ def difference(data, interval=1):
+     """
+     takes difference in time series data
+
+     Parameters
+     data : list data
+     interval : interval for difference
+     """
+     diff = list()
+     for i in range(interval, len(data)):
+         value = data[i] - data[i - interval]
+         diff.append(value)
+     return diff
+
+ def normalizeMatrix(data, norm, axis=1):
+     """
+     normalizes each row of the matrix
+
+     Parameters
+     data : 2D data
+     norm : normalization method
+     axis : row or column
+     """
+     normalized = preprocessing.normalize(data, norm=norm, axis=axis)
+     return normalized
+
+ def standardizeMatrix(data, axis=0):
+     """
+     standardizes each column of the matrix with mean and std deviation
+
+     Parameters
+     data : 2D data
+     axis : row or column
+     """
+     standardized = preprocessing.scale(data, axis=axis)
+     return standardized
+
+ def asNumpyArray(data):
+     """
+     converts to numpy array
+
+     Parameters
+     data : array
+     """
+     return np.array(data)
+
+ def perfMetric(metric, yActual, yPred, clabels=None):
+     """
+     predictive model accuracy metric
+
+     Parameters
+     metric : accuracy metric
+     yActual : actual values array
+     yPred : predicted values array
+     clabels : class labels
+     """
+     if metric == "rsquare":
+         score = metrics.r2_score(yActual, yPred)
+     elif metric == "mae":
+         score = metrics.mean_absolute_error(yActual, yPred)
+     elif metric == "mse":
+         score = metrics.mean_squared_error(yActual, yPred)
+     elif metric == "acc":
+         yPred = np.rint(yPred)
+         score = metrics.accuracy_score(yActual, yPred)
+     elif metric == "mlAcc":
+         yPred = np.argmax(yPred, axis=1)
+         score = metrics.accuracy_score(yActual, yPred)
+     elif metric == "prec":
+         yPred = np.argmax(yPred, axis=1)
+         score = metrics.precision_score(yActual, yPred)
+     elif metric == "rec":
+         yPred = np.argmax(yPred, axis=1)
+         score = metrics.recall_score(yActual, yPred)
+     elif metric == "fone":
+         yPred = np.argmax(yPred, axis=1)
+         score = metrics.f1_score(yActual, yPred)
+     elif metric == "confm":
+         yPred = np.argmax(yPred, axis=1)
+         score = metrics.confusion_matrix(yActual, yPred)
+     elif metric == "clarep":
+         yPred = np.argmax(yPred, axis=1)
+         score = metrics.classification_report(yActual, yPred)
+     elif metric == "bce":
+         if clabels is None:
+             clabels = [0, 1]
+         score = metrics.log_loss(yActual, yPred, labels=clabels)
+     elif metric == "ce":
+         assert clabels is not None, "labels must be provided"
+         score = metrics.log_loss(yActual, yPred, labels=clabels)
+     else:
+         exitWithMsg("invalid prediction performance metric " + metric)
+     return score
+
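+ # Small demonstration of perfMetric above for two regression style metrics;
+ # the values are made up for illustration.
+ def _demoPerfMetric():
+     yActual = [1.0, 2.0, 3.0, 4.0]
+     yPred = [1.1, 1.9, 3.2, 3.8]
+     print("rsquare {:.3f}".format(perfMetric("rsquare", yActual, yPred)))
+     print("mae {:.3f}".format(perfMetric("mae", yActual, yPred)))
+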
+ def scaleData(data, method):
+     """
+     scales feature data column wise
+
+     Parameters
+     data : 2D array
+     method : scaling method
+     """
+     if method == "minmax":
+         scaler = preprocessing.MinMaxScaler()
+         data = scaler.fit_transform(data)
+     elif method == "zscale":
+         data = preprocessing.scale(data)
+     else:
+         raise ValueError("invalid scaling method")
+     return data
+
+ def scaleDataWithParams(data, method, scParams):
+     """
+     scales feature data column wise
+
+     Parameters
+     data : 2D array
+     method : scaling method
+     scParams : scaling parameters
+     """
+     if method == "minmax":
+         data = scaleMinMaxTabData(data, scParams)
+     elif method == "zscale":
+         raise ValueError("zscale method not supported with parameters")
+     else:
+         raise ValueError("invalid scaling method")
+     return data
+
+ def scaleMinMaxScaData(data, minMax):
+     """
+     min max scales scalar data
+
+     Parameters
+     data : scalar data
+     minMax : min, max and range
+     """
+     sd = (data - minMax[0]) / minMax[2]
+     return sd
+
+ def scaleMinMaxTabData(tdata, minMax):
+     """
+     scales tabular feature data column wise using min max values for each field
+
+     Parameters
+     tdata : 2D array
+     minMax : min, max and range for each column
+     """
+     stdata = list()
+     for r in tdata:
+         srdata = list()
+         for i, c in enumerate(r):
+             sd = (c - minMax[i][0]) / minMax[i][2]
+             srdata.append(sd)
+         stdata.append(srdata)
+     return stdata
+
+ def scaleMinMax(rdata, minMax):
+     """
+     scales feature data column wise using min max values for each field
+
+     Parameters
+     rdata : data array
+     minMax : min, max and range for each column
+     """
+     srdata = list()
+     for i in range(len(rdata)):
+         d = rdata[i]
+         sd = (d - minMax[i][0]) / minMax[i][2]
+         srdata.append(sd)
+     return srdata
+
+ def harmonicNum(n):
+     """
+     harmonic number
+
+     Parameters
+     n : number
+     """
+     h = 0
+     for i in range(1, n+1, 1):
+         h += 1.0 / i
+     return h
+
+ def digammaFun(n):
+     """
+     digamma function
+
+     Parameters
+     n : number
+     """
+     #Euler Mascheroni constant
+     ec = 0.577216
+     return harmonicNum(n - 1) - ec
+
+ def getDataPartitions(tdata, types, columns = None):
+     """
+     partitions data with the given columns and random split points defined with predicates
+
+     Parameters
+     tdata : 2D array
+     types : data types
+     columns : column indexes
+     """
+     (dtypes, cvalues) = extractTypesFromString(types)
+     if columns is None:
+         ncol = len(tdata[0])
+         columns = list(range(ncol))
+     ncol = len(columns)
+
+     #partition predicates
+     partitions = None
+     for c in columns:
+         dtype = dtypes[c]
+         pred = list()
+         if dtype == "int" or dtype == "float":
+             (vmin, vmax) = getColMinMax(tdata, c)
+             r = vmax - vmin
+             rmin = vmin + .2 * r
+             rmax = vmax - .2 * r
+             sp = randomFloat(rmin, rmax)
+             if dtype == "int":
+                 sp = int(sp)
+             else:
+                 sp = "{:.3f}".format(sp)
+                 sp = float(sp)
+             pred.append([c, "LT", sp])
+             pred.append([c, "GE", sp])
+         elif dtype == "cat":
+             cv = cvalues[c]
+             card = len(cv)
+             if card < 3:
+                 num = 1
+             else:
+                 num = randomInt(1, card - 1)
+             sp = selectRandomSubListFromList(cv, num)
+             sp = " ".join(sp)
+             pred.append([c, "IN", sp])
+             pred.append([c, "NOTIN", sp])
+
+         if partitions is None:
+             partitions = pred.copy()
+         else:
+             #extend each existing partition with both new predicates
+             tparts = list()
+             for p in partitions:
+                 l1 = p.copy()
+                 l1.extend(pred[0])
+                 l2 = p.copy()
+                 l2.extend(pred[1])
+                 tparts.append(l1)
+                 tparts.append(l2)
+             partitions = tparts
+     return partitions
+
+ def genAlmostUniformDistr(size, nswap=50):
+     """
+     generates an almost uniform probability distribution
+
+     Parameters
+     size : distr size
+     nswap : no of mass swaps
+     """
+     un = 1.0 / size
+     distr = [un] * size
+     distr = mutDistr(distr, 0.1 * un, nswap)
+     return distr
+
+ def mutDistr(distr, shift, nswap=50):
+     """
+     mutates a probability distribution
+
+     Parameters
+     distr : distribution
+     shift : amount of shift for swap
+     nswap : no of mass swaps
+     """
+     size = len(distr)
+     for _ in range(nswap):
+         fi = randomInt(0, size - 1)
+         si = randomInt(0, size - 1)
+         while fi == si:
+             fi = randomInt(0, size - 1)
+             si = randomInt(0, size - 1)
+
+         shift = randomFloat(0, shift)
+         t = distr[fi]
+         distr[fi] -= shift
+         if (distr[fi] < 0):
+             distr[fi] = 0.0
+             shift = t
+         distr[si] += shift
+     return distr
+
+ def generateBinDistribution(size, ntrue):
+     """
+     generates binary array with some elements set to 1
+
+     Parameters
+     size : distr size
+     ntrue : no of true values
+     """
+     distr = [0] * size
+     idxs = selectRandomSubListFromList(list(range(size)), ntrue)
+     for i in idxs:
+         distr[i] = 1
+     return distr
+
+ def mutBinaryDistr(distr, nmut):
+     """
+     mutates binary distribution
+
+     Parameters
+     distr : distr
+     nmut : no of mutations
+     """
+     idxs = selectRandomSubListFromList(list(range(len(distr))), nmut)
+     for i in idxs:
+         distr[i] = distr[i] ^ 1
+     return distr
+
1305
+ def fileSelFieldSubSeqModifierGen(filePath, column, offset, seqLen, modifier, precision, delim=","):
1306
+ """
1307
+ file record generator that superimposes given data in the specified segment of a column
1308
+
1309
+ Parameters
1310
+ filePath : file path
1311
+ column : column index
1312
+ offset : offset into column values
1313
+ seqLen : length of subseq
1314
+ modifier : data to be superimposed either list or a sampler object
1315
+ precision : floating point precision
1316
+ delim : field delimiter
1317
+ """
1318
+ beg = offset
1319
+ end = beg + seqLen
1320
+ isList = type(modifier) == list
1321
+ i = 0
1322
+ for rec in fileRecGen(filePath, delim):
1323
+ if i >= beg and i < end:
1324
+ va = float(rec[column])
1325
+ if isList:
1326
+ va += modifier[i - beg]
1327
+ else:
1328
+ va += modifier.sample()
1329
+ rec[column] = formatFloat(precision, va)
1330
+ yield delim.join(rec)
1331
+ i += 1
1332
+
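A usage sketch with a hypothetical file "data.csv"; NormalSampler (from the sampler module below) superimposes zero mean gaussian noise on column 2 of records 100 through 149:

    noise = NormalSampler(0, 2.0)
    for line in fileSelFieldSubSeqModifierGen("data.csv", 2, 100, 50, noise, 3):
        print(line)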
1333
+ class ShiftedDataGenerator:
1334
+ """
1335
+ transforms data for distribution shift
1336
+ """
1337
+ def __init__(self, types, tdata, addFact, multFact):
1338
+ """
1339
+ initializer
1340
+
1341
+ Parameters
1342
+ types : data types
1343
+ tdata : 2D array
1344
+ addFact : factor for data shift
1345
+ multFact : factor for data scaling
1346
+ """
1347
+ (self.dtypes, self.cvalues) = extractTypesFromString(types)
1348
+
1349
+ self.limits = dict()
1350
+ for k,v in self.dtypes.items():
1351
+ if v == "int" or v == "false":
1352
+ (vmin, vmax) = getColMinMax(tdata, k)
1353
+ self.limits[k] = vmax - vmin
1354
+ self.addMin = - addFact / 2
1355
+ self.addMax = addFact / 2
1356
+ self.multMin = 1.0 - multFact / 2
1357
+ self.multMax = 1.0 + multFact / 2
1358
+
1359
+
1360
+
1361
+
1362
+ def transform(self, tdata):
1363
+ """
1364
+ linearly transforms data to create distribution shift with random shift and scale
1365
+
1366
+ Parameters
1367
+ tdata : 2D array
1368
+ """
1369
+ transforms = dict()
1370
+ for k,v in self.dtypes.items():
1371
+ if v == "int" or v == "false":
1372
+ shift = randomFloat(self.addMin, self.addMax) * self.limits[k]
1373
+ scale = randomFloat(self.multMin, self.multMax)
1374
+ trns = (shift, scale)
1375
+ transforms[k] = trns
1376
+ elif v == "cat":
1377
+ transforms[k] = isEventSampled(50)
1378
+
1379
+ ttdata = list()
1380
+ for rec in tdata:
1381
+ nrec = rec.copy()
1382
+ for c in range(len(rec)):
1383
+ if c in self.dtypes:
1384
+ dtype = self.dtypes[c]
1385
+ if dtype == "int" or dtype == "float":
1386
+ (shift, scale) = transforms[c]
1387
+ nval = shift + rec[c] * scale
1388
+ if dtype == "int":
1389
+ nrec[c] = int(nval)
1390
+ else:
1391
+ nrec[c] = nval
1392
+ elif dtype == "cat":
1393
+ cv = self.cvalues[c]
1394
+ if transforms[c]:
1395
+ nval = selectOtherRandomFromList(cv, rec[c])
1396
+ nrec[c] = nval
1397
+
1398
+ ttdata.append(nrec)
1399
+
1400
+ return ttdata
1401
+
1402
+ def transformSpecified(self, tdata, sshift, scale):
1403
+ """
1404
+ linearly transforms data to create distribution shift with specified shift and scale
1405
+
1406
+ Parameters
1407
+ tdata : 2D array
1408
+ sshift : shift factor
1409
+ scale : scale factor
1410
+ """
1411
+ transforms = dict()
1412
+ for k,v in self.dtypes.items():
1413
+ if v == "int" or v == "false":
1414
+ shift = sshift * self.limits[k]
1415
+ trns = (shift, scale)
1416
+ transforms[k] = trns
1417
+ elif v == "cat":
1418
+ transforms[k] = isEventSampled(50)
1419
+
1420
+ ttdata = self.__scaleShift(tdata, transforms)
1421
+ return ttdata
1422
+
1423
+ def __scaleShift(self, tdata, transforms):
1424
+ """
1425
+ shifts and scales tabular data
1426
+
1427
+ Parameters
1428
+ tdata : 2D array
1429
+ transforms : transforms to apply
1430
+ """
1431
+ ttdata = list()
1432
+ for rec in tdata:
1433
+ nrec = rec.copy()
1434
+ for c in range(len(rec)):
1435
+ if c in self.dtypes:
1436
+ dtype = self.dtypes[c]
1437
+ if dtype == "int" or dtype == "float":
1438
+ (shift, scale) = transforms[c]
1439
+ nval = shift + rec[c] * scale
1440
+ if dtype == "int":
1441
+ nrec[c] = int(nval)
1442
+ else:
1443
+ nrec[c] = nval
1444
+ elif dtype == "cat":
1445
+ cv = self.cvalues[c]
1446
+ if transforms[c]:
1447
+ #nval = selectOtherRandomFromList(cv, rec[c])
1448
+ #nrec[c] = nval
1449
+ pass
1450
+
1451
+ ttdata.append(nrec)
1452
+ return ttdata
1453
+
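A usage sketch (the types descriptor is the one accepted by extractTypesFromString; the factor values are arbitrary here): with addFact 0.2 and multFact 0.1, numeric columns get a random shift within +/- 0.1 of the column range and a random scale within 1 +/- 0.05:

    gen = ShiftedDataGenerator(types, tdata, 0.2, 0.1)
    shifted = gen.transform(tdata)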
1454
+ class RollingStat(object):
1455
+ """
1456
+ stats for rolling window
1457
+ """
1458
+ def __init__(self, wsize):
1459
+ """
1460
+ initializer
1461
+
1462
+ Parameters
1463
+ wsize : window size
1464
+ """
1465
+ self.window = list()
1466
+ self.wsize = wsize
1467
+ self.mean = None
1468
+ self.sd = None
1469
+
1470
+ def add(self, value):
1471
+ """
1472
+ add a value
1473
+
1474
+ Parameters
1475
+ value : value to add
1476
+ """
1477
+ self.window.append(value)
1478
+ if len(self.window) > self.wsize:
1479
+ self.window = self.window[1:]
1480
+
1481
+ def getStat(self):
1482
+ """
1483
+ get rolling window mean and std deviation
1484
+ """
1485
+ assertGreater(len(self.window), 0, "window is empty")
1486
+ if len(self.window) == 1:
1487
+ self.mean = self.window[0]
1488
+ self.sd = 0
1489
+ else:
1490
+ self.mean = statistics.mean(self.window)
1491
+ self.sd = statistics.stdev(self.window, xbar=self.mean)
1492
+ re = (self.mean, self.sd)
1493
+ return re
1494
+
1495
+ def getSize(self):
1496
+ """
1497
+ return window size
1498
+ """
1499
+ return len(self.window)
1500
+
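A minimal usage sketch (`values` is any numeric sequence):

    rstat = RollingStat(20)
    for v in values:
        rstat.add(v)
    # mean and std deviation over the last (at most) 20 values
    (mean, sd) = rstat.getStat()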
matumizi/sampler.py ADDED
@@ -0,0 +1,1455 @@
1
+ #!/usr/local/bin/python3
2
+
3
+ # avenir-python: Machine Learning
4
+ # Author: Pranab Ghosh
5
+ #
6
+ # Licensed under the Apache License, Version 2.0 (the "License"); you
7
+ # may not use this file except in compliance with the License. You may
8
+ # obtain a copy of the License at
9
+ #
10
+ # http://www.apache.org/licenses/LICENSE-2.0
11
+ #
12
+ # Unless required by applicable law or agreed to in writing, software
13
+ # distributed under the License is distributed on an "AS IS" BASIS,
14
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
15
+ # implied. See the License for the specific language governing
16
+ # permissions and limitations under the License.
17
+
18
+ import sys
19
+ import random
20
+ import time
21
+ import math
22
+ import random
23
+ import numpy as np
24
+ from scipy import stats
25
+ from random import randint
26
+ from .util import *
27
+ from .stats import Histogram
28
+
29
+ def randomFloat(low, high):
30
+ """
31
+ sample float within range
32
+
33
+ Parameters
34
+ low : low value
35
+ high : high value
36
+ """
37
+ return random.random() * (high-low) + low
38
+
39
+ def randomInt(minv, maxv):
40
+ """
41
+ sample int within range
42
+
43
+ Parameters
44
+ minv : low value
45
+ maxv : high value
46
+ """
47
+ return randint(minv, maxv)
48
+
49
+ def randIndex(lData):
50
+ """
51
+ random index of a list
52
+
53
+ Parameters
54
+ lData : list data
55
+ """
56
+ return randint(0, len(lData)-1)
57
+
58
+ def randomUniformSampled(low, high):
59
+ """
60
+ sample float within range
61
+
62
+ Parameters
63
+ low : low value
64
+ high : high value
65
+ """
66
+ return np.random.uniform(low, high)
67
+
68
+ def randomUniformSampledList(low, high, size):
69
+ """
70
+ sample floats within range to create list
71
+
72
+ Parameters
73
+ low : low value
74
+ high : high value
75
+ size : size of list to be returned
76
+ """
77
+ return np.random.uniform(low, high, size)
78
+
79
+ def randomNormSampled(mean, sd):
80
+ """
81
+ sample float from normal
82
+
83
+ Parameters
84
+ mean : mean
85
+ sd : std deviation
86
+ """
87
+ return np.random.normal(mean, sd)
88
+
89
+ def randomNormSampledList(mean, sd, size):
90
+ """
91
+ sample float list from normal
92
+
93
+ Parameters
94
+ mean : mean
95
+ sd : std deviation
96
+ size : size of list to be returned
97
+ """
98
+ return np.random.normal(mean, sd, size)
99
+
100
+ def randomSampledList(sampler, size):
101
+ """
102
+ sample list from given sampler
103
+
104
+ Parameters
105
+ sampler : sampler object
106
+ size : size of list to be returned
107
+ """
108
+ return list(map(lambda i : sampler.sample(), range(size)))
109
+
110
+
111
+ def minLimit(val, minv):
112
+ """
113
+ min limit
114
+
115
+ Parameters
116
+ val : value
117
+ minv : min limit
118
+ """
119
+ if (val < minv):
120
+ val = minv
121
+ return val
122
+
123
+
124
+ def rangeLimit(val, minv, maxv):
125
+ """
126
+ range limit
127
+
128
+ Parameters
129
+ val : value
130
+ minv : min limit
131
+ maxv : max limit
132
+ """
133
+ if (val < minv):
134
+ val = minv
135
+ elif (val > maxv):
136
+ val = maxv
137
+ return val
138
+
139
+
140
+ def sampleUniform(minv, maxv):
141
+ """
142
+ sample int within range
143
+
144
+ Parameters
145
+ minv : int min limit
146
+ maxv : int max limit
147
+ """
148
+ return randint(minv, maxv)
149
+
150
+
151
+ def sampleFromBase(value, dev):
152
+ """
153
+ sample int wrt base
154
+
155
+ Parameters
156
+ value : base value
157
+ dev : deviation
158
+ """
159
+ return randint(value - dev, value + dev)
160
+
161
+
162
+ def sampleFloatFromBase(value, dev):
163
+ """
164
+ sample float wrt base
165
+
166
+ Parameters
167
+ value : base value
168
+ dev : deviation
169
+ """
170
+ return randomFloat(value - dev, value + dev)
171
+
172
+
173
+ def distrUniformWithRanndom(total, numItems, noiseLevel):
174
+ """
175
+ distributes a total uniformly across bins with some added randomness, preserving the total
176
+
177
+ Parameters
178
+ total : total count
179
+ numItems : no of bins
180
+ noiseLevel : noise level fraction
181
+ """
182
+ perItem = total / numItems
183
+ var = perItem * noiseLevel
184
+ items = []
185
+ for i in range(numItems):
186
+ item = perItem + randomFloat(-var, var)
187
+ items.append(item)
188
+
189
+ #adjust last item
190
+ sm = sum(items[:-1])
191
+ items[-1] = total - sm
192
+ return items
193
+
194
+
195
+ def isEventSampled(threshold, maxv=100):
196
+ """
197
+ sample event which occurs if sampled below threshold
198
+
199
+ Parameters
200
+ threshold : threshold for sampling
201
+ maxv : maximum value
202
+ """
203
+ return randint(0, maxv) < threshold
204
+
205
+
206
+ def sampleBinaryEvents(events, probPercent):
207
+ """
208
+ sample binary events
209
+
210
+ Parameters
211
+ events : two events
212
+ probPercent : probability as percentage
213
+ """
214
+ if (randint(0, 100) < probPercent):
215
+ event = events[0]
216
+ else:
217
+ event = events[1]
218
+ return event
219
+
220
+
221
+ def addNoiseNum(value, sampler):
222
+ """
223
+ add noise to numeric value
224
+
225
+ Parameters
226
+ value : base value
227
+ sampler : sampler for noise
228
+ """
229
+ return value * (1 + sampler.sample())
230
+
231
+
232
+ def addNoiseCat(value, values, noise):
233
+ """
234
+ add noise to categorical value i.e with some probability change value
235
+
236
+ Parameters
237
+ value : cat value
238
+ values : cat values
239
+ noise : noise level fraction
240
+ """
241
+ newValue = value
242
+ threshold = int(noise * 100)
243
+ if (isEventSampled(threshold)):
244
+ newValue = selectRandomFromList(values)
245
+ while newValue == value:
246
+ newValue = selectRandomFromList(values)
247
+ return newValue
248
+
249
+
250
+ def sampleWithReplace(data, sampSize):
251
+ """
252
+ sample with replacement
253
+
254
+ Parameters
255
+ data : array
256
+ sampSize : sample size
257
+ """
258
+ sampled = list()
259
+ le = len(data)
260
+ if sampSize is None:
261
+ sampSize = le
262
+ for i in range(sampSize):
263
+ j = random.randint(0, le - 1)
264
+ sampled.append(data[j])
265
+ return sampled
266
+
267
+ class CumDistr:
268
+ """
269
+ cumulative distr
270
+ """
271
+
272
+ def __init__(self, data, numBins = None):
273
+ """
274
+ initializer
275
+
276
+ Parameters
277
+ data : array
278
+ numBins : no of bins
279
+ """
280
+ if not numBins:
281
+ numBins = int(len(data) / 5)
282
+ res = stats.cumfreq(data, numbins=numBins)
283
+ self.cdistr = res.cumcount / len(data)
284
+ self.loLim = res.lowerlimit
285
+ self.upLim = res.lowerlimit + res.binsize * res.cumcount.size
286
+ self.binWidth = res.binsize
287
+
288
+ def getDistr(self, value):
289
+ """
290
+ get cumulative distribution
291
+
292
+ Parameters
293
+ value : value
294
+ """
295
+ if value <= self.loLim:
296
+ d = 0.0
297
+ elif value >= self.upLim:
298
+ d = 1.0
299
+ else:
300
+ bin = int((value - self.loLim) / self.binWidth)
301
+ d = self.cdistr[bin]
302
+ return d
303
+
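A usage sketch: build an empirical CDF from normal samples; getDistr returns the cumulative probability P(X <= value), so the query below should come out near 0.84:

    data = randomNormSampledList(100.0, 10.0, 1000)
    cd = CumDistr(data)
    print(cd.getDistr(110.0))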
304
+ class BernoulliTrialSampler:
305
+ """
306
+ bernoulli trial sampler return True or False
307
+ """
308
+
309
+ def __init__(self, pr, events=None):
310
+ """
311
+ initializer
312
+
313
+ Parameters
314
+ pr : probability
315
+ events : event values
316
+ """
317
+ self.pr = pr
318
+ self.retEvent = False if events is None else True
319
+ self.events = events
320
+
321
+
322
+ def sample(self):
323
+ """
324
+ samples value
325
+ """
326
+ res = random.random() < self.pr
327
+ if self.retEvent:
328
+ res = self.events[0] if res else self.events[1]
329
+ return res
330
+
331
+ class PoissonSampler:
332
+ """
333
+ poisson sampler returns number of events
334
+ """
335
+ def __init__(self, rateOccur, maxSamp):
336
+ """
337
+ initializer
338
+
339
+ Parameters
340
+ rateOccur : rate of occurrence
341
+ maxSamp : max limit on no of samples
342
+ """
343
+ self.rateOccur = rateOccur
344
+ self.maxSamp = int(maxSamp)
345
+ self.pmax = self.calculatePr(rateOccur)
346
+
347
+ def calculatePr(self, numOccur):
348
+ """
349
+ calculates probability
350
+
351
+ Parameters
352
+ numOccur : no of occurrences
353
+ """
354
+ p = (self.rateOccur ** numOccur) * math.exp(-self.rateOccur) / math.factorial(numOccur)
355
+ return p
356
+
357
+ def sample(self):
358
+ """
359
+ samples value
360
+ """
361
+ done = False
362
+ samp = 0
363
+ while not done:
364
+ no = randint(0, self.maxSamp)
365
+ sp = randomFloat(0.0, self.pmax)
366
+ ap = self.calculatePr(no)
367
+ if sp < ap:
368
+ done = True
369
+ samp = no
370
+ return samp
371
+
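A quick check of the rejection based Poisson sampler: the sample mean should approach the occurrence rate (an integer rate is used here so math.factorial is happy):

    ps = PoissonSampler(4, 20)
    counts = [ps.sample() for _ in range(1000)]
    print(sum(counts) / len(counts))    # close to 4.0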
372
+ class ExponentialSampler:
373
+ """
374
+ returns interval between events
375
+ """
376
+ def __init__(self, rateOccur, maxSamp = None):
377
+ """
378
+ initializer
379
+
380
+ Parameters
381
+ rateOccur : rate of occurrence
382
+ maxSamp : max limit on interval
383
+ """
384
+ self.interval = 1.0 / rateOccur
385
+ self.maxSamp = int(maxSamp) if maxSamp is not None else None
386
+
387
+ def sample(self):
388
+ """
389
+ samples value
390
+ """
391
+ sampled = np.random.exponential(scale=self.interval)
392
+ if self.maxSamp is not None:
393
+ while sampled > self.maxSamp:
394
+ sampled = np.random.exponential(scale=self.interval)
395
+ return sampled
396
+
397
+ class UniformNumericSampler:
398
+ """
399
+ uniform sampler for numerical values
400
+ """
401
+ def __init__(self, minv, maxv):
402
+ """
403
+ initializer
404
+
405
+ Parameters
406
+ minv : min value
407
+ maxv : max value
408
+ """
409
+ self.minv = minv
410
+ self.maxv = maxv
411
+
412
+ def isNumeric(self):
413
+ """
414
+ returns true
415
+ """
416
+ return True
417
+
418
+ def sample(self):
419
+ """
420
+ samples value
421
+ """
422
+ samp = sampleUniform(self.minv, self.maxv) if isinstance(self.minv, int) else randomFloat(self.minv, self.maxv)
423
+ return samp
424
+
425
+ class UniformCategoricalSampler:
426
+ """
427
+ uniform sampler for categorical values
428
+ """
429
+ def __init__(self, cvalues):
430
+ """
431
+ initializer
432
+
433
+ Parameters
434
+ cvalues : categorical value list
435
+ """
436
+ self.cvalues = cvalues
437
+
438
+ def isNumeric(self):
439
+ return False
440
+
441
+ def sample(self):
442
+ """
443
+ samples value
444
+ """
445
+ return selectRandomFromList(self.cvalues)
446
+
447
+ class NormalSampler:
448
+ """
449
+ normal sampler
450
+ """
451
+ def __init__(self, mean, stdDev):
452
+ """
453
+ initializer
454
+
455
+ Parameters
456
+ mean : mean
457
+ stdDev : std deviation
458
+ """
459
+ self.mean = mean
460
+ self.stdDev = stdDev
461
+ self.sampleAsInt = False
462
+
463
+ def isNumeric(self):
464
+ return True
465
+
466
+ def sampleAsIntValue(self):
467
+ """
468
+ set True to sample as int
469
+ """
470
+ self.sampleAsInt = True
471
+
472
+ def sample(self):
473
+ """
474
+ samples value
475
+ """
476
+ samp = np.random.normal(self.mean, self.stdDev)
477
+ if self.sampleAsInt:
478
+ samp = int(samp)
479
+ return samp
480
+
481
+ class LogNormalSampler:
482
+ """
483
+ log normal sampler
484
+ """
485
+ def __init__(self, mean, stdDev):
486
+ """
487
+ initializer
488
+
489
+ Parameters
490
+ mean : mean
491
+ stdDev : std deviation
492
+ """
493
+ self.mean = mean
494
+ self.stdDev = stdDev
495
+
496
+ def isNumeric(self):
497
+ return True
498
+
499
+ def sample(self):
500
+ """
501
+ samples value
502
+ """
503
+ return np.random.lognormal(self.mean, self.stdDev)
504
+
505
+ class NormalSamplerWithTrendCycle:
506
+ """
507
+ normal sampler with cycle and trend
508
+ """
509
+ def __init__(self, mean, stdDev, dmean, cycle, step=1):
510
+ """
511
+ initializer
512
+
513
+ Parameters
514
+ mean : mean
515
+ stdDev : std deviation
516
+ dmean : trend delta
517
+ cycle : cycle values wrt base mean
518
+ step : adjustment step for cycle and trend
519
+ """
520
+ self.mean = mean
521
+ self.cmean = mean
522
+ self.stdDev = stdDev
523
+ self.dmean = dmean
524
+ self.cycle = cycle
525
+ self.clen = len(cycle) if cycle is not None else 0
526
+ self.step = step
527
+ self.count = 0
528
+
529
+ def isNumeric(self):
530
+ return True
531
+
532
+ def sample(self):
533
+ """
534
+ samples value
535
+ """
536
+ s = np.random.normal(self.cmean, self.stdDev)
537
+ self.count += 1
538
+ if self.count % self.step == 0:
539
+ cy = 0
540
+ if self.clen > 1:
541
+ coff = self.count % self.clen
542
+ cy = self.cycle[coff]
543
+ tr = self.count * self.dmean
544
+ self.cmean = self.mean + tr + cy
545
+ return s
546
+
547
+
548
+ class ParetoSampler:
549
+ """
550
+ pareto sampler
551
+ """
552
+ def __init__(self, mode, shape):
553
+ """
554
+ initializer
555
+
556
+ Parameters
557
+ mode : mode
558
+ shape : shape
559
+ """
560
+ self.mode = mode
561
+ self.shape = shape
562
+
563
+ def isNumeric(self):
564
+ return True
565
+
566
+ def sample(self):
567
+ """
568
+ samples value
569
+ """
570
+ return (np.random.pareto(self.shape) + 1) * self.mode
571
+
572
+ class GammaSampler:
573
+ """
574
+ gamma sampler
575
+ """
576
+ def __init__(self, shape, scale):
577
+ """
578
+ initializer
579
+
580
+ Parameters
581
+ shape : shape
582
+ scale : scale
583
+ """
584
+ self.shape = shape
585
+ self.scale = scale
586
+
587
+ def isNumeric(self):
588
+ return True
589
+
590
+ def sample(self):
591
+ """
592
+ samples value
593
+ """
594
+ return np.random.gamma(self.shape, self.scale)
595
+
596
+ class GaussianRejectSampler:
597
+ """
598
+ gaussian sampling based on rejection sampling
599
+ """
600
+ def __init__(self, mean, stdDev):
601
+ """
602
+ initializer
603
+
604
+ Parameters
605
+ mean : mean
606
+ stdDev : std deviation
607
+ """
608
+ self.mean = mean
609
+ self.stdDev = stdDev
610
+ self.xmin = mean - 3 * stdDev
611
+ self.xmax = mean + 3 * stdDev
612
+ self.ymin = 0.0
613
+ self.fmax = 1.0 / (math.sqrt(2.0 * math.pi) * stdDev)
614
+ self.ymax = 1.05 * self.fmax
615
+ self.sampleAsInt = False
616
+
617
+ def isNumeric(self):
618
+ return True
619
+
620
+ def sampleAsIntValue(self):
621
+ """
622
+ sample as int value
623
+ """
624
+ self.sampleAsInt = True
625
+
626
+ def sample(self):
627
+ """
628
+ samples value
629
+ """
630
+ done = False
631
+ samp = 0
632
+ while not done:
633
+ x = randomFloat(self.xmin, self.xmax)
634
+ y = randomFloat(self.ymin, self.ymax)
635
+ f = self.fmax * math.exp(-(x - self.mean) * (x - self.mean) / (2.0 * self.stdDev * self.stdDev))
636
+ if (y < f):
637
+ done = True
638
+ samp = x
639
+ if self.sampleAsInt:
640
+ samp = int(samp)
641
+ return samp
642
+
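Note that the uniform envelope spans only +/- 3 std deviations, so tails beyond 3 sigma are never sampled; a short sketch:

    gs = GaussianRejectSampler(50.0, 10.0)
    samples = [gs.sample() for _ in range(1000)]    # all within [20, 80]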
643
+ class DiscreteRejectSampler:
644
+ """
645
+ non parametric sampling for discrete values using given distribution based
646
+ on rejection sampling
647
+ """
648
+ def __init__(self, xmin, xmax, step, *values):
649
+ """
650
+ initializer
651
+
652
+ Parameters
653
+ xmin : min value
654
+ xmax : max value
655
+ step : discrete step
656
+ values : distr values
657
+ """
658
+ self.xmin = xmin
659
+ self.xmax = xmax
660
+ self.step = step
661
+ self.distr = values
662
+ if (len(self.distr) == 1):
663
+ self.distr = self.distr[0]
664
+ numSteps = int((self.xmax - self.xmin) / self.step)
665
+ #print("{:.3f} {:.3f} {:.3f} {}".format(self.xmin, self.xmax, self.step, numSteps))
666
+ assert len(self.distr) == numSteps + 1, "invalid number of distr values expected {}".format(numSteps + 1)
667
+ self.ximin = 0
668
+ self.ximax = numSteps
669
+ self.pmax = float(max(self.distr))
670
+
671
+ def isNumeric(self):
672
+ return True
673
+
674
+ def sample(self):
675
+ """
676
+ samples value
677
+ """
678
+ done = False
679
+ samp = None
680
+ while not done:
681
+ xi = randint(self.ximin, self.ximax)
682
+ #print(formatAny(xi, "xi"))
683
+ ps = randomFloat(0.0, self.pmax)
684
+ pa = self.distr[xi]
685
+ if ps < pa:
686
+ samp = self.xmin + xi * self.step
687
+ done = True
688
+ return samp
689
+
690
+
691
+ class TriangularRejectSampler:
692
+ """
693
+ non parametric sampling using triangular distribution based on rejection sampling
694
+ """
695
+ def __init__(self, xmin, xmax, vertexValue, vertexPos=None):
696
+ """
697
+ initializer
698
+
699
+ Parameters
700
+ xmin : min value
701
+ xmax : max value
702
+ vertexValue : distr value at vertex
703
+ vertexPos : vertex position
704
+ """
705
+ self.xmin = xmin
706
+ self.xmax = xmax
707
+ self.vertexValue = vertexValue
708
+ if vertexPos:
709
+ assert vertexPos > xmin and vertexPos < xmax, "vertex position outside bound"
710
+ self.vertexPos = vertexPos
711
+ else:
712
+ self.vertexPos = 0.5 * (xmin + xmax)
713
+ self.s1 = vertexValue / (self.vertexPos - xmin)
714
+ self.s2 = vertexValue / (xmax - self.vertexPos)
715
+
716
+ def isNumeric(self):
717
+ return True
718
+
719
+ def sample(self):
720
+ """
721
+ samples value
722
+ """
723
+ done = False
724
+ samp = None
725
+ while not done:
726
+ x = randomFloat(self.xmin, self.xmax)
727
+ y = randomFloat(0.0, self.vertexValue)
728
+ f = (x - self.xmin) * self.s1 if x < self.vertexPos else (self.xmax - x) * self.s2
729
+ if (y < f):
730
+ done = True
731
+ samp = x
732
+
733
+ return samp;
734
+
735
+ class NonParamRejectSampler:
736
+ """
737
+ non parametric sampling using given distribution based on rejection sampling
738
+ """
739
+ def __init__(self, xmin, binWidth, *values):
740
+ """
741
+ initializer
742
+
743
+ Parameters
744
+ xmin : min value
745
+ binWidth : bin width
746
+ values : distr values
747
+ """
748
+ self.values = values
749
+ if (len(self.values) == 1):
750
+ self.values = self.values[0]
751
+ self.xmin = xmin
752
+ self.xmax = xmin + binWidth * (len(self.values) - 1)
753
+ #print(self.xmin, self.xmax, binWidth)
754
+ self.binWidth = binWidth
755
+ self.fmax = 0
756
+ for v in self.values:
757
+ if (v > self.fmax):
758
+ self.fmax = v
759
+ self.ymin = 0
760
+ self.ymax = self.fmax
761
+ self.sampleAsInt = True
762
+
763
+ def isNumeric(self):
764
+ return True
765
+
766
+ def sampleAsFloat(self):
767
+ self.sampleAsInt = False
768
+
769
+ def sample(self):
770
+ """
771
+ samples value
772
+ """
773
+ done = False
774
+ samp = 0
775
+ while not done:
776
+ if self.sampleAsInt:
777
+ x = random.randint(self.xmin, self.xmax)
778
+ y = random.randint(self.ymin, self.ymax)
779
+ else:
780
+ x = randomFloat(self.xmin, self.xmax)
781
+ y = randomFloat(self.ymin, self.ymax)
782
+ bin = int((x - self.xmin) / self.binWidth)
783
+ f = self.values[bin]
784
+ if (y < f):
785
+ done = True
786
+ samp = x
787
+ return samp
788
+
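A usage sketch: a peaked histogram over x = 10..16 with unit bin width; a single list argument works because the initializer unwraps it:

    nps = NonParamRejectSampler(10, 1, [1, 2, 4, 8, 4, 2, 1])
    nps.sampleAsFloat()
    s = nps.sample()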
789
+ class JointNonParamRejectSampler:
790
+ """
791
+ non parametric sampling using given distribution based on rejection sampling
792
+ """
793
+ def __init__(self, xmin, xbinWidth, xnbin, ymin, ybinWidth, ynbin, *values):
794
+ """
795
+ initializer
796
+
797
+ Parameters
798
+ xmin : min value for x
799
+ xbinWidth : bin width for x
800
+ xnbin : no of bins for x
801
+ ymin : min value for y
802
+ ybinWidth : bin width for y
803
+ ynbin : no of bins for y
804
+ values : distr values
805
+ """
806
+ self.values = values
807
+ if (len(self.values) == 1):
808
+ self.values = self.values[0]
809
+ assert len(self.values) == xnbin * ynbin, "wrong number of values for joint distr"
810
+ self.xmin = xmin
811
+ self.xmax = xmin + xbinWidth * xnbin
812
+ self.xbinWidth = xbinWidth
813
+ self.ymin = ymin
814
+ self.ymax = ymin + ybinWidth * ynbin
815
+ self.ybinWidth = ybinWidth
816
+ self.pmax = max(self.values)
817
+ self.values = np.array(self.values).reshape(xnbin, ynbin)
818
+
819
+ def isNumeric(self):
820
+ return True
821
+
822
+ def sample(self):
823
+ """
824
+ samples value
825
+ """
826
+ done = False
827
+ samp = 0
828
+ while not done:
829
+ x = randomFloat(self.xmin, self.xmax)
830
+ y = randomFloat(self.ymin, self.ymax)
831
+ xbin = int((x - self.xmin) / self.xbinWidth)
832
+ ybin = int((y - self.ymin) / self.ybinWidth)
833
+ ap = self.values[xbin][ybin]
834
+ sp = randomFloat(0.0, self.pmax)
835
+ if (sp < ap):
836
+ done = True
837
+ samp = [x,y]
838
+ return samp
839
+
840
+
841
+ class JointNormalSampler:
842
+ """
843
+ joint normal sampler
844
+ """
845
+ def __init__(self, *values):
846
+ """
847
+ initializer
848
+
849
+ Parameters
850
+ values : 2 mean values followed by 4 values for covar matrix
851
+ """
852
+ lvalues = list(values)
853
+ assert len(lvalues) == 6, "incorrect number of arguments for joint normal sampler"
854
+ mean = lvalues[:2]
855
+ self.mean = np.array(mean)
856
+ sd = lvalues[2:]
857
+ self.sd = np.array(sd).reshape(2,2)
858
+
859
+ def isNumeric(self):
860
+ return True
861
+
862
+ def sample(self):
863
+ """
864
+ samples value
865
+ """
866
+ return list(np.random.multivariate_normal(self.mean, self.sd))
867
+
868
+
869
+ class MultiVarNormalSampler:
870
+ """
871
+ multivariate normal sampler
872
+ """
873
+ def __init__(self, numVar, *values):
874
+ """
875
+ initializer
876
+
877
+ Parameters
878
+ numVar : no of variables
879
+ values : numVar mean values followed by numVar x numVar values for covar matrix
880
+ """
881
+ lvalues = list(values)
882
+ assert len(lvalues) == numVar + numVar * numVar, "incorrect number of arguments for multi var normal sampler"
883
+ mean = lvalues[:numVar]
884
+ self.mean = np.array(mean)
885
+ sd = lvalues[numVar:]
886
+ self.sd = np.array(sd).reshape(numVar,numVar)
887
+
888
+ def isNumeric(self):
889
+ return True
890
+
891
+ def sample(self):
892
+ """
893
+ samples value
894
+ """
895
+ return list(np.random.multivariate_normal(self.mean, self.sd))
896
+
897
+ class CategoricalRejectSampler:
898
+ """
899
+ non parametric sampling for categorical attributes using given distribution based
900
+ on rejection sampling
901
+ """
902
+ def __init__(self, *values):
903
+ """
904
+ initializer
905
+
906
+ Parameters
907
+ values : list of tuples which contains a categorical value and the corresponding distr value
908
+ """
909
+ self.distr = values
910
+ if (len(self.distr) == 1):
911
+ self.distr = self.distr[0]
912
+ maxv = 0
913
+ for t in self.distr:
914
+ if t[1] > maxv:
915
+ maxv = t[1]
916
+ self.maxv = maxv
917
+
918
+ def sample(self):
919
+ """
920
+ samples value
921
+ """
922
+ done = False
923
+ samp = ""
924
+ while not done:
925
+ t = self.distr[randint(0, len(self.distr)-1)]
926
+ d = randomFloat(0, self.maxv)
927
+ if (d <= t[1]):
928
+ done = True
929
+ samp = t[0]
930
+ return samp
931
+
932
+
933
+ class CategoricalSetSampler:
934
+ """
935
+ non parametric sampler for categorical attributes that uniformly samples
936
+ a set of distinct values from the full list of values
937
+ """
938
+ def __init__(self, *values):
939
+ """
940
+ initializer
941
+
942
+ Parameters
943
+ values : list which contains a categorical values
944
+ """
945
+ self.values = values
946
+ if (len(self.values) == 1):
947
+ self.values = self.values[0]
948
+ self.sampled = list()
949
+
950
+ def sample(self):
951
+ """
952
+ samples a value only from previously unsampled values
953
+ """
954
+ samp = selectRandomFromList(self.values)
955
+ while True:
956
+ if samp in self.sampled:
957
+ samp = selectRandomFromList(self.values)
958
+ else:
959
+ self.sampled.append(samp)
960
+ break
961
+ return samp
962
+
963
+ def setSampled(self, sampled):
964
+ """
965
+ set already sampled
966
+
967
+ Parameters
968
+ sampled : already sampled list
969
+ """
970
+ self.sampled = sampled
971
+
972
+ def unsample(self, sample=None):
973
+ """
974
+ remove from sample history
975
+
976
+ Parameters
977
+ sample : sample to be removed
978
+ """
979
+ if sample is None:
980
+ self.sampled.clear()
981
+ else:
982
+ self.sampled.remove(sample)
983
+
984
+ class DistrMixtureSampler:
985
+ """
986
+ distr mixture sampler
987
+ """
988
+ def __init__(self, mixtureWtDistr, *compDistr):
989
+ """
990
+ initializer
991
+
992
+ Parameters
993
+ mixtureWtDistr : sampler that returns index into sampler list
994
+ compDistr : sampler list
995
+ """
996
+ self.mixtureWtDistr = mixtureWtDistr
997
+ self.compDistr = compDistr
998
+ if (len(self.compDistr) == 1):
999
+ self.compDistr = self.compDistr[0]
1000
+
1001
+ def isNumeric(self):
1002
+ return True
1003
+
1004
+ def sample(self):
1005
+ """
1006
+ samples value
1007
+ """
1008
+ comp = self.mixtureWtDistr.sample()
1009
+
1010
+ #sample sampled comp distr
1011
+ return self.compDistr[comp].sample()
1012
+
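A sketch of a two component gaussian mixture; the weight sampler must return an index into the component list, here 0 with probability about 0.6 and 1 with about 0.4:

    wt = DiscreteRejectSampler(0, 1, 1, [60, 40])
    mix = DistrMixtureSampler(wt, NormalSampler(10, 1), NormalSampler(25, 2))
    s = mix.sample()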
1013
+ class AncestralSampler:
1014
+ """
1015
+ ancestral sampler using conditional distribution
1016
+ """
1017
+ def __init__(self, parentDistr, childDistr, numChildren):
1018
+ """
1019
+ initializer
1020
+
1021
+ Parameters
1022
+ parentDistr : parent distr
1023
+ childDistr : childdren distribution dictionary
1024
+ numChildren : no of children
1025
+ """
1026
+ self.parentDistr = parentDistr
1027
+ self.childDistr = childDistr
1028
+ self.numChildren = numChildren
1029
+
1030
+ def sample(self):
1031
+ """
1032
+ samples value
1033
+ """
1034
+ parent = self.parentDistr.sample()
1035
+
1036
+ #sample all children conditioned on parent
1037
+ children = []
1038
+ for i in range(self.numChildren):
1039
+ key = (parent, i)
1040
+ child = self.childDistr[key].sample()
1041
+ children.append(child)
1042
+ return (parent, children)
1043
+
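A sketch with one child per parent; the child distribution dictionary is keyed by (parent value, child index):

    parent = CategoricalRejectSampler(("a", 60), ("b", 40))
    children = {("a", 0): NormalSampler(10, 1), ("b", 0): NormalSampler(20, 2)}
    asamp = AncestralSampler(parent, children, 1)
    (p, ch) = asamp.sample()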
1044
+ class ClusterSampler:
1045
+ """
1046
+ sample cluster and then sample member of sampled cluster
1047
+ """
1048
+ def __init__(self, clusters, *clustDistr):
1049
+ """
1050
+ initializer
1051
+
1052
+ Parameters
1053
+ clusters : dictionary clusters
1054
+ clustDistr : distr for clusters
1055
+ """
1056
+ self.sampler = CategoricalRejectSampler(*clustDistr)
1057
+ self.clusters = clusters
1058
+
1059
+ def sample(self):
1060
+ """
1061
+ samples value
1062
+ """
1063
+ cluster = self.sampler.sample()
1064
+ member = random.choice(self.clusters[cluster])
1065
+ return (cluster, member)
1066
+
1067
+
1068
+ class MetropolitanSampler:
1069
+ """
1070
+ Metropolis (MCMC) sampler
1071
+ """
1072
+ def __init__(self, propStdDev, min, binWidth, values):
1073
+ """
1074
+ initializer
1075
+
1076
+ Parameters
1077
+ propStdDev : proposal distr std dev
1078
+ min : min domain value for target distr
1079
+ binWidth : bin width
1080
+ values : target distr values
1081
+ """
1082
+ self.targetDistr = Histogram.createInitialized(min, binWidth, values)
1083
+ self.propsalDistr = GaussianRejectSampler(0, propStdDev)
1084
+ self.proposalMixture = False
+ self.globalProposalDistr = None
1085
+
1086
+ # bootstrap sample
1087
+ (minv, maxv) = self.targetDistr.getMinMax()
1088
+ self.curSample = random.randint(minv, maxv)
1089
+ self.curDistr = self.targetDistr.value(self.curSample)
1090
+ self.transCount = 0
1091
+
1092
+ def initialize(self):
1093
+ """
1094
+ initialize
1095
+ """
1096
+ (minv, maxv) = self.targetDistr.getMinMax()
1097
+ self.curSample = random.randint(minv, maxv)
1098
+ self.curDistr = self.targetDistr.value(self.curSample)
1099
+ self.transCount = 0
1100
+
1101
+ def setProposalDistr(self, propsalDistr):
1102
+ """
1103
+ set custom proposal distribution
1104
+
1105
+ Parameters
1106
+ propsalDistr : proposal distribution
1107
+ """
1108
+ self.propsalDistr = propsalDistr
1109
+
1110
+
1111
+ def setGlobalProposalDistr(self, globPropStdDev, proposalChoiceThreshold):
1112
+ """
1113
+ set custom proposal distribution
1114
+
1115
+ Parameters
1116
+ globPropStdDev : global proposal distr std deviation
1117
+ proposalChoiceThreshold : threshold for using global proposal distribution
1118
+ """
1119
+ self.globalProposalDistr = GaussianRejectSampler(0, globPropStdDev)
1120
+ self.proposalChoiceThreshold = proposalChoiceThreshold
1121
+ self.proposalMixture = True
1122
+
1123
+ def sample(self):
1124
+ """
1125
+ samples value
1126
+ """
1127
+ nextSample = self.proposalSample(1)
1128
+ self.targetSample(nextSample)
1129
+ return self.curSample;
1130
+
1131
+ def proposalSample(self, skip):
1132
+ """
1133
+ sample from proposal distribution
1134
+
1135
+ Parameters
1136
+ skip : no of samples to skip
1137
+ """
1138
+ for i in range(skip):
1139
+ if not self.proposalMixture:
1140
+ #one proposal distr
1141
+ nextSample = self.curSample + self.propsalDistr.sample()
1142
+ nextSample = self.targetDistr.boundedValue(nextSample)
1143
+ else:
1144
+ #mixture of proposal distr
1145
+ if random.random() < self.proposalChoiceThreshold:
1146
+ nextSample = self.curSample + self.propsalDistr.sample()
1147
+ else:
1148
+ nextSample = self.curSample + self.globalProposalDistr.sample()
1149
+ nextSample = self.targetDistr.boundedValue(nextSample)
1150
+
1151
+ return nextSample
1152
+
1153
+ def targetSample(self, nextSample):
1154
+ """
1155
+ target sample
1156
+
1157
+ Parameters
1158
+ nextSample : proposal distr sample
1159
+ """
1160
+ nextDistr = self.targetDistr.value(nextSample)
1161
+
1162
+ transition = False
1163
+ if nextDistr > self.curDistr:
1164
+ transition = True
1165
+ else:
1166
+ distrRatio = float(nextDistr) / self.curDistr
1167
+ if random.random() < distrRatio:
1168
+ transition = True
1169
+
1170
+ if transition:
1171
+ self.curSample = nextSample
1172
+ self.curDistr = nextDistr
1173
+ self.transCount += 1
1174
+
1175
+
1176
+ def subSample(self, skip):
1177
+ """
1178
+ sub sample
1179
+
1180
+ Parameters
1181
+ skip : no of samples to skip
1182
+ """
1183
+ nextSample = self.proposalSample(skip)
1184
+ self.targetSample(nextSample)
1185
+ return self.curSample;
1186
+
1187
+ def setMixtureProposal(self, globPropStdDev, mixtureThreshold):
1188
+ """
1189
+ mixture proposal
1190
+
1191
+ Parameters
1192
+ globPropStdDev : global proposal distr std deviation
1193
+ mixtureThreshold : threshold for using global proposal distribution
1194
+ """
1195
+ self.globalProposalDistr = GaussianRejectSampler(0, globPropStdDev)
1196
+ self.mixtureThreshold = mixtureThreshold
1197
+
1198
+ def samplePropsal(self):
1199
+ """
1200
+ sample from proposal distr
1201
+
1202
+ """
1203
+ if self.globalProposalDistr is None:
1204
+ proposal = self.propsalDistr.sample()
1205
+ else:
1206
+ if random.random() < self.mixtureThreshold:
1207
+ proposal = self.propsalDistr.sample()
1208
+ else:
1209
+ proposal = self.globalProposalDistr.sample()
1210
+
1211
+ return proposal
1212
+
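A sketch of Metropolis sampling from a histogram shaped target; subSample skips intermediate proposals to reduce autocorrelation between draws:

    ms = MetropolitanSampler(2.0, 0, 1, [1, 3, 8, 12, 8, 3, 1])
    draws = [ms.subSample(5) for _ in range(1000)]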
1213
+ class PermutationSampler:
1214
+ """
1215
+ permutation sampler by shuffling a list
1216
+ """
1217
+ def __init__(self):
1218
+ """
1219
+ initialize
1220
+ """
1221
+ self.values = None
1222
+ self.numShuffles = None
1223
+
1224
+ @staticmethod
1225
+ def createSamplerWithValues(values, *numShuffles):
1226
+ """
1227
+ creator with values
1228
+
1229
+ Parameters
1230
+ values : list data
1231
+ numShuffles : no of shuffles or range of no of shuffles
1232
+ """
1233
+ sampler = PermutationSampler()
1234
+ sampler.values = values
1235
+ sampler.numShuffles = numShuffles
1236
+ return sampler
1237
+
1238
+ @staticmethod
1239
+ def createSamplerWithRange(minv, maxv, *numShuffles):
1240
+ """
1241
+ creator with range min and max
1242
+
1243
+ Parameters
1244
+ minv : min of range
1245
+ maxv : max of range
1246
+ numShuffles : no of shuffles or range of no of shuffles
1247
+ """
1248
+ sampler = PermutationSampler()
1249
+ sampler.values = list(range(minv, maxv + 1))
1250
+ sampler.numShuffles = numShuffles
1251
+ return sampler
1252
+
1253
+ def sample(self):
1254
+ """
1255
+ sample new permutation
1256
+ """
1257
+ cloned = self.values.copy()
1258
+ shuffle(cloned, *self.numShuffles)
1259
+ return cloned
1260
+
1261
+ class SpikeyDataSampler:
1262
+ """
1263
+ samples spikey data
1264
+ """
1265
+ def __init__(self, intvMean, intvScale, distr, spikeValueMean, spikeValueStd, spikeMaxDuration, baseValue = 0):
1266
+ """
1267
+ initializer
1268
+
1269
+ Parameters
1270
+ intvMean : interval mean
1271
+ intvScale : interval std dev
1272
+ distr : type of distr for interval
1273
+ spikeValueMean : spike value mean
1274
+ spikeValueStd : spike value std dev
1275
+ spikeMaxDuration : max duration for spike
1276
+ baseValue : base or offset value
1277
+ """
1278
+ if distr == "norm":
1279
+ self.intvSampler = NormalSampler(intvMean, intvScale)
1280
+ elif distr == "expo":
1281
+ rate = 1.0 / intvScale
1282
+ self.intvSampler = ExponentialSampler(rate)
1283
+ else:
1284
+ raise ValueError("invalid distribution")
1285
+
1286
+ self.spikeSampler = NormalSampler(spikeValueMean, spikeValueStd)
1287
+ self.spikeMaxDuration = spikeMaxDuration
1288
+ self.baseValue = baseValue
1289
+ self.inSpike = False
1290
+ self.spikeCount = 0
1291
+ self.baseCount = 0
1292
+ self.baseLength = int(self.intvSampler.sample())
1293
+ self.spikeValues = list()
1294
+ self.spikeLength = None
1295
+
1296
+ def sample(self):
1297
+ """
1298
+ sample new value
1299
+ """
1300
+ if self.baseCount <= self.baseLength:
1301
+ sampled = self.baseValue
1302
+ self.baseCount += 1
1303
+ else:
1304
+ if not self.inSpike:
1305
+ #starting spike
1306
+ spikeVal = self.spikeSampler.sample()
1307
+ self.spikeLength = sampleUniform(1, self.spikeMaxDuration)
1308
+ spikeMaxPos = 0 if self.spikeLength == 1 else sampleUniform(0, self.spikeLength-1)
1309
+ self.spikeValues.clear()
1310
+ for i in range(self.spikeLength):
1311
+ if i < spikeMaxPos:
1312
+ frac = (i + 1) / (spikeMaxPos + 1)
1313
+ frac = sampleFloatFromBase(frac, 0.1 * frac)
1314
+ elif i > spikeMaxPos:
1315
+ frac = (self.spikeLength - i) / (self.spikeLength - spikeMaxPos)
1316
+ frac = sampleFloatFromBase(frac, 0.1 * frac)
1317
+ else:
1318
+ frac = 1.0
1319
+ self.spikeValues.append(frac * spikeVal)
1320
+ self.inSpike = True
1321
+ self.spikeCount = 0
1322
+
1323
+
1324
+ sampled = self.spikeValues[self.spikeCount]
1325
+ self.spikeCount += 1
1326
+
1327
+ if self.spikeCount == self.spikeLength:
1328
+ #ending spike
1329
+ self.baseCount = 0
1330
+ self.baseLength = int(self.intvSampler.sample())
1331
+ self.inSpike = False
1332
+
1333
+ return sampled
1334
+
1335
+
1336
+ class EventSampler:
1337
+ """
1338
+ sample event
1339
+ """
1340
+ def __init__(self, intvSampler, valSampler=None):
1341
+ """
1342
+ initializer
1343
+
1344
+ Parameters
1345
+ intvSampler : interval sampler
1346
+ valSampler : value sampler
1347
+ """
1348
+ self.intvSampler = intvSampler
1349
+ self.valSampler = valSampler
1350
+ self.trigger = int(self.intvSampler.sample())
1351
+ self.count = 0
1352
+
1353
+ def reset(self):
1354
+ """
1355
+ reset trigger
1356
+ """
1357
+ self.trigger = int(self.intvSampler.sample())
1358
+ self.count = 0
1359
+
1360
+ def sample(self):
1361
+ """
1362
+ sample event
1363
+ """
1364
+ if self.count == self.trigger:
1365
+ sampled = self.valSampler.sample() if self.valSampler is not None else 1.0
1366
+ self.trigger = int(self.intvSampler.sample())
1367
+ self.count = 0
1368
+ else:
1369
+ sampled = 0.0
1370
+ self.count += 1
1371
+ return sampled
1372
+
1373
+
1374
+
1375
+
1376
+ def createSampler(data):
1377
+ """
1378
+ create sampler
1379
+
1380
+ Parameters
1381
+ data : sampler description
1382
+ """
1383
+ #print(data)
1384
+ items = data.split(":")
1385
+ size = len(items)
1386
+ dtype = items[-1]
1387
+ stype = items[-2]
1388
+ #print("sampler data {}".format(data))
1389
+ #print("sampler {}".format(stype))
1390
+ sampler = None
1391
+ if stype == "uniform":
1392
+ if dtype == "int":
1393
+ min = int(items[0])
1394
+ max = int(items[1])
1395
+ sampler = UniformNumericSampler(min, max)
1396
+ elif dtype == "float":
1397
+ min = float(items[0])
1398
+ max = float(items[1])
1399
+ sampler = UniformNumericSampler(min, max)
1400
+ elif dtype == "categorical":
1401
+ values = items[:-2]
1402
+ sampler = UniformCategoricalSampler(values)
1403
+ elif stype == "normal":
1404
+ mean = float(items[0])
1405
+ sd = float(items[1])
1406
+ sampler = NormalSampler(mean, sd)
1407
+ if dtype == "int":
1408
+ sampler.sampleAsIntValue()
1409
+ elif stype == "nonparam":
1410
+ if dtype == "int" or dtype == "float":
1411
+ min = int(items[0])
1412
+ binWidth = int(items[1])
1413
+ values = items[2:-2]
1414
+ values = list(map(lambda v: int(v), values))
1415
+ sampler = NonParamRejectSampler(min, binWidth, values)
1416
+ if dtype == "float":
1417
+ sampler.sampleAsFloat()
1418
+ elif dtype == "categorical":
1419
+ values = list()
1420
+ for i in range(0, size-2, 2):
1421
+ cval = items[i]
1422
+ dist = int(items[i+1])
1423
+ pair = (cval, dist)
1424
+ values.append(pair)
1425
+ sampler = CategoricalRejectSampler(values)
1426
+ elif dtype == "scategorical":
1427
+ vfpath = items[0]
1428
+ values = getFileLines(vfpath, None)
1429
+ sampler = CategoricalSetSampler(values)
1430
+ elif stype == "discrete":
1431
+ vmin = int(items[0])
1432
+ vmax = int(items[1])
1433
+ step = int(items[2])
1434
+ values = list(map(lambda i : int(items[i]), range(3, len(items)-2)))
1435
+ sampler = DiscreteRejectSampler(vmin, vmax, step, values)
1436
+ elif stype == "bernauli":
1437
+ pr = float(items[0])
1438
+ events = None
1439
+ if len(items) == 5:
1440
+ events = list()
1441
+ if dtype == "int":
1442
+ events.append(int(items[1]))
1443
+ events.append(int(items[2]))
1444
+ elif dtype == "categorical":
1445
+ events.append(items[1])
1446
+ events.append(items[2])
1447
+ sampler = BernoulliTrialSampler(pr, events)
1448
+ else:
1449
+ raise ValueError("invalid sampler type " + stype)
1450
+ return sampler
1451
+
1452
+
1453
+
1454
+
1455
+
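Descriptor strings are ":" separated, ending with the sampler type and the data type; two examples grounded in the branches above:

    us = createSampler("1:10:uniform:int")       # uniform int in [1, 10]
    ns = createSampler("100:15:normal:float")    # gaussian, mean 100, sd 15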
matumizi/stats.py ADDED
@@ -0,0 +1,496 @@
1
+ #!/usr/local/bin/python3
2
+
3
+ # avenir-python: Machine Learning
4
+ # Author: Pranab Ghosh
5
+ #
6
+ # Licensed under the Apache License, Version 2.0 (the "License"); you
7
+ # may not use this file except in compliance with the License. You may
8
+ # obtain a copy of the License at
9
+ #
10
+ # http://www.apache.org/licenses/LICENSE-2.0
11
+ #
12
+ # Unless required by applicable law or agreed to in writing, software
13
+ # distributed under the License is distributed on an "AS IS" BASIS,
14
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
15
+ # implied. See the License for the specific language governing
16
+ # permissions and limitations under the License.
17
+
18
+ import sys
19
+ import random
20
+ import time
21
+ import math
22
+ import numpy as np
23
+ import statistics
24
+ from .util import *
25
+
26
+ """
27
+ histogram class
28
+ """
29
+ class Histogram:
30
+ def __init__(self, min, binWidth):
31
+ """
32
+ initializer
33
+
34
+ Parameters
35
+ min : min x
36
+ binWidth : bin width
37
+ """
38
+ self.xmin = min
39
+ self.binWidth = binWidth
40
+ self.normalized = False
41
+
42
+ @classmethod
43
+ def createInitialized(cls, xmin, binWidth, values):
44
+ """
45
+ create histogram instance with min domain, bin width and values
46
+
47
+ Parameters
48
+ min : min x
49
+ binWidth : bin width
50
+ values : y values
51
+ """
52
+ instance = cls(xmin, binWidth)
53
+ instance.xmax = xmin + binWidth * (len(values) - 1)
54
+ instance.ymin = 0
55
+ instance.bins = np.array(values)
56
+ instance.fmax = 0
57
+ for v in values:
58
+ if (v > instance.fmax):
59
+ instance.fmax = v
60
+ instance.ymin = 0.0
61
+ instance.ymax = instance.fmax
62
+ return instance
63
+
64
+ @classmethod
65
+ def createWithNumBins(cls, values, numBins=20):
66
+ """
67
+ create histogram instance from values and no of bins
68
+
69
+ Parameters
70
+ values : y values
71
+ numBins : no of bins
72
+ """
73
+ xmin = min(values)
74
+ xmax = max(values)
75
+ binWidth = (xmax + .01 - (xmin - .01)) / numBins
76
+ instance = cls(xmin, binWidth)
77
+ instance.xmax = xmax
78
+ instance.numBin = numBins
79
+ instance.bins = np.zeros(instance.numBin)
80
+ for v in values:
81
+ instance.add(v)
82
+ return instance
83
+
84
+ @classmethod
85
+ def createUninitialized(cls, xmin, xmax, binWidth):
86
+ """
87
+ create histogram instance with no y values using domain min , max and bin width
88
+
89
+ Parameters
90
+ min : min x
91
+ max : max x
92
+ binWidth : bin width
93
+ """
94
+ instance = cls(xmin, binWidth)
95
+ instance.xmax = xmax
96
+ instance.numBin = int((xmax - xmin) / binWidth) + 1
97
+ instance.bins = np.zeros(instance.numBin)
98
+ return instance
99
+
100
+ def initialize(self):
101
+ """
102
+ set y values to 0
103
+ """
104
+ self.bins = np.zeros(self.numBin)
105
+
106
+ def add(self, value):
107
+ """
108
+ adds a value to a bin
109
+
110
+ Parameters
111
+ value : value
112
+ """
113
+ bin = int((value - self.xmin) / self.binWidth)
114
+ if (bin < 0 or bin > self.numBin - 1):
115
+ print (bin)
116
+ raise ValueError("outside histogram range")
117
+ self.bins[bin] += 1.0
118
+
119
+ def normalize(self):
120
+ """
121
+ normalize bin counts
122
+ """
123
+ if not self.normalized:
124
+ total = self.bins.sum()
125
+ self.bins = np.divide(self.bins, total)
126
+ self.normalized = True
127
+
128
+ def cumDistr(self):
129
+ """
130
+ cumulative distribution
131
+ """
132
+ self.normalize()
133
+ self.cbins = np.cumsum(self.bins)
134
+ return self.cbins
135
+
136
+ def distr(self):
137
+ """
138
+ distr
139
+ """
140
+ self.normalize()
141
+ return self.bins
142
+
143
+
144
+ def percentile(self, percent):
145
+ """
146
+ return value corresponding to a percentile
147
+
148
+ Parameters
149
+ percent : percentile value
150
+ """
151
+ if self.cbins is None:
152
+ raise ValueError("cumulative distribution is not available")
153
+
154
+ for i,cuml in enumerate(self.cbins):
155
+ if percent > cuml:
156
+ value = (i * self.binWidth) - (self.binWidth / 2) + \
157
+ (percent - self.cbins[i-1]) * self.binWidth / (self.cbins[i] - self.cbins[i-1])
158
+ break
159
+ return value
160
+
161
+ def max(self):
162
+ """
163
+ return max bin value
164
+ """
165
+ return self.bins.max()
166
+
167
+ def value(self, x):
168
+ """
169
+ return a bin value
170
+
171
+ Parameters
172
+ x : x value
173
+ """
174
+ bin = int((x - self.xmin) / self.binWidth)
175
+ f = self.bins[bin]
176
+ return f
177
+
178
+ def bin(self, x):
179
+ """
180
+ return a bin index
181
+
182
+ Parameters
183
+ x : x value
184
+ """
185
+ return int((x - self.xmin) / self.binWidth)
186
+
187
+ def cumValue(self, x):
188
+ """
189
+ return a cumulative bin value
190
+
191
+ Parameters
192
+ x : x value
193
+ """
194
+ bin = int((x - self.xmin) / self.binWidth)
195
+ c = self.cbins[bin]
196
+ return c
197
+
198
+
199
+ def getMinMax(self):
200
+ """
201
+ returns x min and x max
202
+ """
203
+ return (self.xmin, self.xmax)
204
+
205
+ def boundedValue(self, x):
206
+ """
207
+ return x bounded by min and max
208
+
209
+ Parameters
210
+ x : x value
211
+ """
212
+ if x < self.xmin:
213
+ x = self.xmin
214
+ elif x > self.xmax:
215
+ x = self.xmax
216
+ return x
217
+
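A usage sketch (`values` is any numeric list):

    h = Histogram.createWithNumBins(values, 10)
    d = h.distr()        # normalized bin counts
    c = h.cumDistr()     # cumulative distribution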
218
+ """
219
+ categorical histogram class
220
+ """
221
+ class CatHistogram:
222
+ def __init__(self):
223
+ """
224
+ initializer
225
+ """
226
+ self.binCounts = dict()
227
+ self.counts = 0
228
+ self.normalized = False
229
+
230
+ def add(self, value):
231
+ """
232
+ adds a value to a bin
233
+
234
+ Parameters
235
+ x : x value
236
+ """
237
+ addToKeyedCounter(self.binCounts, value)
238
+ self.counts += 1
239
+
240
+ def normalize(self):
241
+ """
242
+ normalize
243
+ """
244
+ if not self.normalized:
245
+ self.binCounts = dict(map(lambda r : (r[0],r[1] / self.counts), self.binCounts.items()))
246
+ self.normalized = True
247
+
248
+ def getMode(self):
249
+ """
250
+ get mode
251
+ """
252
+ maxk = None
253
+ maxv = 0
254
+ #print(self.binCounts)
255
+ for k,v in self.binCounts.items():
256
+ if v > maxv:
257
+ maxk = k
258
+ maxv = v
259
+ return (maxk, maxv)
260
+
261
+ def getEntropy(self):
262
+ """
263
+ get entropy
264
+ """
265
+ self.normalize()
266
+ entr = 0
267
+ #print(self.binCounts)
268
+ for k,v in self.binCounts.items():
269
+ entr -= v * math.log(v)
270
+ return entr
271
+
272
+ def getUniqueValues(self):
273
+ """
274
+ get unique values
275
+ """
276
+ return list(self.binCounts.keys())
277
+
278
+ def getDistr(self):
279
+ """
280
+ get distribution
281
+ """
282
+ self.normalize()
283
+ return self.binCounts.copy()
284
+
285
+ class RunningStat:
286
+ """
287
+ running stat class
288
+ """
289
+ def __init__(self):
290
+ """
291
+ initializer
292
+ """
293
+ self.sum = 0.0
294
+ self.sumSq = 0.0
295
+ self.count = 0
296
+
297
+ @staticmethod
298
+ def create(count, sum, sumSq):
299
+ """
300
+ creates instance
301
+
302
+ Parameters
303
+ count : count of values
+ sum : sum of values
304
+ sumSq : sum of values squared
305
+ """
306
+ rs = RunningStat()
307
+ rs.sum = sum
308
+ rs.sumSq = sumSq
309
+ rs.count = count
310
+ return rs
311
+
312
+ def add(self, value):
313
+ """
314
+ adds new value
315
+
316
+ Parameters
317
+ value : value to add
318
+ """
319
+ self.sum += value
320
+ self.sumSq += (value * value)
321
+ self.count += 1
322
+
323
+ def getStat(self):
324
+ """
325
+ return mean and std deviation
326
+ """
327
+ mean = self.sum / self.count
328
+ t = self.sumSq / (self.count - 1) - mean * mean * self.count / (self.count - 1)
329
+ sd = math.sqrt(t)
330
+ re = (mean, sd)
331
+ return re
332
+
333
+ def addGetStat(self,value):
334
+ """
335
+ calculate mean and std deviation with new value added
336
+
337
+ Parameters
338
+ value : value to add
339
+ """
340
+ self.add(value)
341
+ re = self.getStat()
342
+ return re
343
+
344
+ def getCount(self):
345
+ """
346
+ return count
347
+ """
348
+ return self.count
349
+
350
+ def getState(self):
351
+ """
352
+ return state
353
+ """
354
+ s = (self.count, self.sum, self.sumSq)
355
+ return s
356
+
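A usage sketch: constant memory mean and std deviation; getState and create let the accumulator be persisted and restored:

    rs = RunningStat()
    for v in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        rs.add(v)
    (mean, sd) = rs.getStat()
    rs2 = RunningStat.create(*rs.getState())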
357
+ class SlidingWindowStat:
358
+ """
359
+ sliding window stats
360
+ """
361
+ def __init__(self):
362
+ """
363
+ initializer
364
+ """
365
+ self.sum = 0.0
366
+ self.sumSq = 0.0
367
+ self.count = 0
368
+ self.values = None
369
+
370
+ @staticmethod
371
+ def create(values, sum, sumSq):
372
+ """
373
+ creates instance
374
+
375
+ Parameters
376
+ values : list of values
+ sum : sum of values
377
+ sumSq : sum of values squared
378
+ """
379
+ sws = SlidingWindowStat()
380
+ sws.sum = sum
381
+ sws.sumSq = sumSq
382
+ sws.values = values.copy()
383
+ sws.count = len(sws.values)
384
+ return sws
385
+
386
+ @staticmethod
387
+ def initialize(values):
388
+ """
389
+ creates instance
390
+
391
+ Parameters
392
+ values : list of values
393
+ """
394
+ sws = SlidingWindowStat()
395
+ sws.values = values.copy()
396
+ for v in sws.values:
397
+ sws.sum += v
398
+ sws.sumSq += v * v
399
+ sws.count = len(sws.values)
400
+ return sws
401
+
402
+ @staticmethod
403
+ def createEmpty(count):
404
+ """
405
+ creates instance
406
+
407
+ Parameters
408
+ count : count of values
409
+ """
410
+ sws = SlidingWindowStat()
411
+ sws.count = count
412
+ sws.values = list()
413
+ return sws
414
+
415
+ def add(self, value):
416
+ """
417
+ adds new value
418
+
419
+ Parameters
420
+ value : value to add
421
+ """
422
+ self.values.append(value)
423
+ if len(self.values) > self.count:
424
+ self.sum += value - self.values[0]
425
+ self.sumSq += (value * value) - (self.values[0] * self.values[0])
426
+ self.values.pop(0)
427
+ else:
428
+ self.sum += value
429
+ self.sumSq += (value * value)
430
+
431
+
432
+ def getStat(self):
433
+ """
434
+ calculate mean and std deviation
435
+ """
436
+ mean = self.sum / self.count
437
+ t = self.sumSq / (self.count - 1) - mean * mean * self.count / (self.count - 1)
438
+ sd = math.sqrt(t)
439
+ re = (mean, sd)
440
+ return re
441
+
442
+ def addGetStat(self,value):
443
+ """
444
+ calculate mean and std deviation with new value added
445
+ """
446
+ self.add(value)
447
+ re = self.getStat()
448
+ return re
449
+
450
+ def getCount(self):
451
+ """
452
+ return count
453
+ """
454
+ return self.count
455
+
456
+ def getCurSize(self):
457
+ """
458
+ return current number of values in window
459
+ """
460
+ return len(self.values)
461
+
462
+ def getState(self):
463
+ """
464
+ return state
465
+ """
466
+ s = (self.count, self.sum, self.sumSq)
467
+ return s
468
+
469
+
470
+ def basicStat(ldata):
471
+ """
472
+ mean and std dev
473
+
474
+ Parameters
475
+ ldata : list of values
476
+ """
477
+ m = statistics.mean(ldata)
478
+ s = statistics.stdev(ldata, xbar=m)
479
+ r = (m, s)
480
+ return r
481
+
482
+ def getFileColumnStat(filePath, col, delem=","):
483
+ """
484
+ gets stats for a file column
485
+
486
+ Parameters
487
+ filePath : file path
488
+ col : col index
489
+ delem : field delimiter
490
+ """
491
+ rs = RunningStat()
492
+ for rec in fileRecGen(filePath, delem):
493
+ va = float(rec[col])
494
+ rs.add(va)
495
+
496
+ return rs.getStat()
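
A quick usage sketch for the streaming-stat classes above (illustrative only; assumes the package is importable as matumizi.daexp; the data values are made up):

	from matumizi.daexp import RunningStat, SlidingWindowStat

	# running stats over all values seen so far
	rs = RunningStat()
	for v in [2.0, 4.0, 4.0, 5.0]:
		rs.add(v)
	print(rs.getStat())		# (3.75, 1.258...) i.e. (mean, sample std dev)

	# stats over a sliding window holding the last 3 values
	sws = SlidingWindowStat.initialize([2.0, 4.0, 4.0])
	print(sws.addGetStat(5.0))	# computed over the window [4.0, 4.0, 5.0]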
matumizi/util.py ADDED
@@ -0,0 +1,2345 @@
1
+ #!/usr/local/bin/python3
2
+
3
+ # Author: Pranab Ghosh
4
+ #
5
+ # Licensed under the Apache License, Version 2.0 (the "License"); you
6
+ # may not use this file except in compliance with the License. You may
7
+ # obtain a copy of the License at
8
+ #
9
+ # http://www.apache.org/licenses/LICENSE-2.0
10
+ #
11
+ # Unless required by applicable law or agreed to in writing, software
12
+ # distributed under the License is distributed on an "AS IS" BASIS,
13
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
14
+ # implied. See the License for the specific language governing
15
+ # permissions and limitations under the License.
16
+
17
+ import os
18
+ import sys
19
+ from random import randint
20
+ import random
21
+ import time
22
+ import uuid
23
+ from datetime import datetime
24
+ import math
25
+ import numpy as np
26
+ import pandas as pd
27
+ import matplotlib.pyplot as plt
29
+ import logging
30
+ import logging.handlers
31
+ import pickle
32
+ from contextlib import contextmanager
33
+
34
+ tokens = ["0","1","2","3","4","5","6","7","8","9","A","B","C","D","E","F","G","H","I","J","K","L","M",
35
+ "N","O","P","Q","R","S","T","U","V","W","X","Y","Z","0","1","2","3","4","5","6","7","8","9"]
36
+ numTokens = tokens[:10]
37
+ alphaTokens = tokens[10:36]
38
+ loCaseChars = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k","l","m","n","o",
39
+ "p","q","r","s","t","u","v","w","x","y","z"]
40
+
41
+ typeInt = "int"
42
+ typeFloat = "float"
43
+ typeString = "string"
44
+
45
+ secInMinute = 60
46
+ secInHour = 60 * 60
47
+ secInDay = 24 * secInHour
48
+ secInWeek = 7 * secInDay
49
+ secInYear = 365 * secInDay
50
+ secInMonth = secInYear / 12
51
+
52
+ minInHour = 60
53
+ minInDay = 24 * minInHour
54
+
55
+ ftPerYard = 3
56
+ ftPerMile = ftPerYard * 1760
57
+
58
+
59
+ def genID(size):
60
+ """
61
+ generates ID
62
+
63
+ Parameters
64
+ size : size of ID
65
+ """
66
+ id = ""
67
+ for i in range(size):
68
+ id = id + selectRandomFromList(tokens)
69
+ return id
70
+
71
+ def genIdList(numId, idSize):
72
+ """
73
+ generate list of IDs
74
+
75
+ Parameters:
76
+ numId: number of Ids
77
+ idSize: ID size
78
+ """
79
+ iDs = []
80
+ for i in range(numId):
81
+ iDs.append(genID(idSize))
82
+ return iDs
83
+
84
+ def genNumID(size):
85
+ """
86
+ generates ID consisting of digits only
87
+
88
+ Parameters
89
+ size : size of ID
90
+ """
91
+ id = ""
92
+ for i in range(size):
93
+ id = id + selectRandomFromList(numTokens)
94
+ return id
95
+
96
+ def genLowCaseID(size):
97
+ """
98
+ generates ID consisting of lower case chars
99
+
100
+ Parameters
101
+ size : size of ID
102
+ """
103
+ id = ""
104
+ for i in range(size):
105
+ id = id + selectRandomFromList(loCaseChars)
106
+ return id
107
+
108
+ def genNumIdList(numId, idSize):
109
+ """
110
+ generate list of numeric IDs
111
+
112
+ Parameters:
113
+ numId: number of Ids
114
+ idSize: ID size
115
+ """
116
+ iDs = []
117
+ for i in range(numId):
118
+ iDs.append(genNumID(idSize))
119
+ return iDs
120
+
121
+ def genNameInitial():
122
+ """
123
+ generate name initial
124
+ """
125
+ return selectRandomFromList(alphaTokens) + selectRandomFromList(alphaTokens)
126
+
127
+ def genPhoneNum(arCode):
128
+ """
129
+ generates phone number
130
+
131
+ Parameters
132
+ arCode: area code
133
+ """
134
+ phNum = genNumID(7)
135
+ return arCode + str(phNum)
136
+
137
+ def selectRandomFromList(ldata):
138
+ """
139
+ select an element randomly from a list
140
+
141
+ Parameters
142
+ ldata : list data
143
+ """
144
+ return ldata[randint(0, len(ldata)-1)]
145
+
146
+ def selectOtherRandomFromList(ldata, cval):
147
+ """
148
+ select an element randomly from a list excluding the given one
149
+
150
+ Parameters
151
+ ldata : list data
152
+ cval : value to be excluded
153
+ """
154
+ nval = selectRandomFromList(ldata)
155
+ while nval == cval:
156
+ nval = selectRandomFromList(ldata)
157
+ return nval
158
+
159
+ def selectRandomSubListFromList(ldata, num):
160
+ """
161
+ generates random sublist from a list without replacement
162
+
163
+ Parameters
164
+ ldata : list data
165
+ num : output list size
166
+ """
167
+ assertLesser(num, len(ldata), "size of sublist to be sampled greater than or equal to main list")
168
+ i = randint(0, len(ldata)-1)
169
+ sel = ldata[i]
170
+ selSet = {i}
171
+ selList = [sel]
172
+ while (len(selSet) < num):
173
+ i = randint(0, len(ldata)-1)
174
+ if (i not in selSet):
175
+ sel = ldata[i]
176
+ selSet.add(i)
177
+ selList.append(sel)
178
+ return selList
179
+
180
+ def selectRandomSubListFromListWithRepl(ldata, num):
181
+ """
182
+ generates random sublist from a list with replacement
183
+
184
+ Parameters
185
+ ldata : list data
186
+ num : output list size
187
+
188
+ """
189
+ return list(map(lambda i : selectRandomFromList(ldata), range(num)))
190
+
191
+ def selectRandomFromDict(ddata):
192
+ """
193
+ select an element randomly from a dictionary
194
+
195
+ Parameters
196
+ ddata : dictionary data
197
+ """
198
+ dkeys = list(ddata.keys())
199
+ dk = selectRandomFromList(dkeys)
200
+ el = (dk, ddata[dk])
201
+ return el
202
+
203
+ def setListRandomFromList(ldata, ldataRepl):
204
+ """
205
+ sets some elements in the first list randomly with elements from the second list
206
+
207
+ Parameters
208
+ ldata : list data
209
+ ldataRepl : list with replacement data
210
+ """
211
+ l = len(ldata)
212
+ selSet = set()
213
+ for d in ldataRepl:
214
+ i = randint(0, l-1)
215
+ while i in selSet:
216
+ i = randint(0, l-1)
217
+ ldata[i] = d
218
+ selSet.add(i)
219
+
220
+ def genIpAddress():
221
+ """
222
+ generates IP address
223
+ """
224
+ # randint is inclusive at both ends, so octets must be sampled from [0, 255]
+ i1 = randint(0,255)
225
+ i2 = randint(0,255)
226
+ i3 = randint(0,255)
227
+ i4 = randint(0,255)
228
+ ip = "%d.%d.%d.%d" %(i1,i2,i3,i4)
229
+ return ip
230
+
231
+ def curTimeMs():
232
+ """
233
+ current time in ms
234
+ """
235
+ return int((datetime.utcnow() - datetime(1970,1,1)).total_seconds() * 1000)
236
+
237
+ def secDegPolyFit(x1, y1, x2, y2, x3, y3):
238
+ """
239
+ second deg polynomial
240
+
241
+ Parameters
242
+ x1 : 1st point x
243
+ y1 : 1st point y
244
+ x2 : 2nd point x
245
+ y2 : 2nd point y
246
+ x3 : 3rd point x
247
+ y3 : 3rd point y
248
+ """
249
+ t = (y1 - y2) / (x1 - x2)
250
+ a = t - (y2 - y3) / (x2 - x3)
251
+ a = a / (x1 - x3)
252
+ b = t - a * (x1 + x2)
253
+ c = y1 - a * x1 * x1 - b * x1
254
+ return (a, b, c)
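+ # illustrative check (comment only): the fit passes exactly through the three
+ # points, e.g. for the parabola y = x^2,
+ # secDegPolyFit(0, 0, 1, 1, 2, 4) -> (1.0, 0.0, 0.0)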
255
+
256
+ def range_limit(val, minv, maxv):
257
+ """
258
+ range limit a value
259
+
260
+ Parameters
261
+ val : data value
262
+ minv : minimum
263
+ maxv : maximum
264
+ """
265
+ if (val < minv):
266
+ val = minv
267
+ elif (val > maxv):
268
+ val = maxv
269
+ return val
270
+
271
+ def rangeLimit(val, minv, maxv):
272
+ """
273
+ range limit a value
274
+
275
+ Parameters
276
+ val : data value
277
+ minv : minimum
278
+ maxv : maximum
279
+ """
280
+ return range_limit(val, minv, maxv)
281
+
282
+ def isInRange(val, minv, maxv):
283
+ """
284
+ checks if within range
285
+
286
+ Parameters
287
+ val : data value
288
+ minv : minimum
289
+ maxv : maximum
290
+ """
291
+ return val >= minv and val <= maxv
292
+
293
+ def stripFileLines(filePath, offset):
294
+ """
295
+ strips a number of chars from both ends of each line and prints the result
296
+
297
+ Parameters
298
+ filePath : file path
299
+ offset : offset from both ends of line
300
+ """
301
+ fp = open(filePath, "r")
302
+ for line in fp:
303
+ stripped = line[offset:len(line) - 1 - offset]
304
+ print (stripped)
305
+ fp.close()
306
+
307
+ def genLatLong(lat1, long1, lat2, long2):
308
+ """
309
+ generate lat long within limits
310
+
311
+ Parameters
312
+ lat1 : lat of 1st point
313
+ long1 : long of 1st point
314
+ lat2 : lat of 2nd point
315
+ long2 : long of 2nd point
316
+ """
317
+ lat = lat1 + (lat2 - lat1) * random.random()
318
+ longg = long1 + (long2 - long1) * random.random()
319
+ return (lat, longg)
320
+
321
+ def geoDistance(lat1, long1, lat2, long2):
322
+ """
323
+ find geo distance in ft
324
+
325
+ Parameters
326
+ lat1 : lat of 1st point
327
+ long1 : long of 1st point
328
+ lat2 : lat of 2nd point
329
+ long2 : long of 2nd point
330
+ """
331
+ latDiff = math.radians(lat1 - lat2)
332
+ longDiff = math.radians(long1 - long2)
333
+ l1 = math.sin(latDiff/2.0)
334
+ l2 = math.sin(longDiff/2.0)
335
+ l3 = math.cos(math.radians(lat1))
336
+ l4 = math.cos(math.radians(lat2))
337
+ a = l1 * l1 + l3 * l4 * l2 * l2
338
+ l5 = math.sqrt(a)
339
+ l6 = math.sqrt(1.0 - a)
340
+ c = 2.0 * math.atan2(l5, l6)
341
+ r = 6371008.8 * 3.280840
342
+ return c * r
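+ # note (comment only): this is the haversine formula with mean earth radius
+ # 6371008.8 m converted to ft, so the result is in feet; e.g. one degree of
+ # latitude: geoDistance(0.0, 0.0, 1.0, 0.0) -> roughly 3.65e5 ft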
343
+
344
+ def minLimit(val, limit):
345
+ """
346
+ min limit
347
+ Parameters
348
+ val : value
+ limit : minimum limit
349
+ """
350
+ if (val < limit):
351
+ val = limit
352
+ return val
353
+
354
+ def maxLimit(val, limit):
355
+ """
356
+ max limit
357
+ Parameters
358
+ val : value
+ limit : maximum limit
359
+ """
360
+ if (val > limit):
361
+ val = limit
362
+ return val
363
+
364
+ def rangeSample(val, minLim, maxLim):
365
+ """
366
+ if out side range sample within range
367
+
368
+ Parameters
369
+ val : value
370
+ minLim : minimum
371
+ maxLim : maximum
372
+ """
373
+ if val < minLim or val > maxLim:
374
+ val = randint(minLim, maxLim)
375
+ return val
376
+
377
+ def genRandomIntListWithinRange(size, minLim, maxLim):
378
+ """
379
+ random unique list of integers within range
380
+
381
+ Parameters
382
+ size : size of returned list
383
+ minLim : minimum
384
+ maxLim : maximum
385
+ """
386
+ values = set()
387
+ # keep sampling until the requested number of unique values is collected
+ # (assumes maxLim - minLim + 1 >= size)
+ while len(values) < size:
388
+ val = randint(minLim, maxLim)
389
+ values.add(val)
390
+ return list(values)
392
+
393
+ def preturbScalar(value, vrange, distr="uniform"):
394
+ """
395
+ perturbs a value with multiplicative noise within a range
396
+
397
+ Parameters
398
+ value : data value
399
+ vrange : value delta fraction
400
+ distr : noise distribution type
401
+ """
402
+ if distr == "uniform":
403
+ scale = 1.0 - vrange + 2 * vrange * random.random()
404
+ elif distr == "normal":
405
+ scale = 1.0 + np.random.normal(0, vrange)
406
+ else:
407
+ exitWithMsg("unknown noise distr " + distr)
408
+ return value * scale
409
+
410
+ def preturbScalarAbs(value, vrange):
411
+ """
412
+ perturbs a value with additive noise within a range
413
+
414
+ Parameters
415
+ value : data value
416
+ vrange : value delta absolute
417
+
418
+ """
419
+ delta = - vrange + 2.0 * vrange * random.random()
420
+ return value + delta
421
+
422
+ def preturbVector(values, vrange):
423
+ """
424
+ perturbs a list within a range
425
+
426
+ Parameters
427
+ values : list data
428
+ vrange : value delta fraction
429
+ """
430
+ nValues = list(map(lambda va: preturbScalar(va, vrange), values))
431
+ return nValues
432
+
433
+ def randomShiftVector(values, smin, smax):
434
+ """
435
+ shifts a list by a random quantity within a range
436
+
437
+ Parameters
438
+ values : list data
439
+ smin : sampling minimum
440
+ smax : sampling maximum
441
+ """
442
+ shift = np.random.uniform(smin, smax)
443
+ return list(map(lambda va: va + shift, values))
444
+
445
+ def floatRange(beg, end, incr):
446
+ """
447
+ generates float range
448
+
449
+ Parameters
450
+ beg : range begin
451
+ end: range end
452
+ incr : range increment
453
+ """
454
+ return list(np.arange(beg, end, incr))
455
+
456
+ def shuffle(values, *numShuffles):
457
+ """
458
+ in place shuffling with swap of pairs
459
+
460
+ Parameters
461
+ values : list data
462
+ numShuffles : parameter list for number of shuffles
463
+ """
464
+ size = len(values)
465
+ if len(numShuffles) == 0:
466
+ numShuffle = int(size / 2)
467
+ elif len(numShuffles) == 1:
468
+ numShuffle = numShuffles[0]
469
+ else:
470
+ numShuffle = randint(numShuffles[0], numShuffles[1])
471
+ print("numShuffle {}".format(numShuffle))
472
+ for i in range(numShuffle):
473
+ first = random.randint(0, size - 1)
474
+ second = random.randint(0, size - 1)
475
+ while first == second:
476
+ second = random.randint(0, size - 1)
477
+ tmp = values[first]
478
+ values[first] = values[second]
479
+ values[second] = tmp
480
+
481
+
482
+ def splitList(itms, numGr):
483
+ """
484
+ splits a list into sub lists of approximately equal size, with items in sublists randomly chosen
485
+
486
+ Parameters
487
+ itms : list of values
488
+ numGr : no of groups
489
+ """
490
+ tcount = len(itms)
491
+ cItems = list(itms)
492
+ sz = int(len(cItems) / numGr)
493
+ groups = list()
494
+ count = 0
495
+ for i in range(numGr):
496
+ if (i == numGr - 1):
497
+ csz = tcount - count
498
+ else:
499
+ csz = sz + randint(-2, 2)
500
+ count += csz
501
+ gr = list()
502
+ for j in range(csz):
503
+ it = selectRandomFromList(cItems)
504
+ gr.append(it)
505
+ cItems.remove(it)
506
+ groups.append(gr)
507
+ return groups
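+ # usage sketch (comment only): splitList(list(range(10)), 3) returns 3 randomly
+ # composed groups; group sizes vary by a few elements but always sum to the
+ # original list length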
508
+
509
+ def multVector(values, vrange):
510
+ """
511
+ multiplies a list within value range
512
+
513
+ Parameters
514
+ values : list of values
515
+ vrange : fraction of value to be used to update
516
+ """
517
+ scale = 1.0 - vrange + 2 * vrange * random.random()
518
+ nValues = list(map(lambda va: va * scale, values))
519
+ return nValues
520
+
521
+ def weightedAverage(values, weights):
522
+ """
523
+ calculates weighted average
524
+
525
+ Parameters
526
+ values : list of values
527
+ weights : list of weights
528
+ """
529
+ assert len(values) == len(weights), "values and weights should be same size"
530
+ vw = zip(values, weights)
531
+ wva = list(map(lambda e : e[0] * e[1], vw))
532
+ #wa = sum(x * y for x, y in vw) / sum(weights)
533
+ wav = sum(wva) / sum(weights)
534
+ return wav
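+ # worked example (comment only):
+ # weightedAverage([1, 2], [3, 1]) -> (1*3 + 2*1) / (3 + 1) = 1.25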
535
+
536
+ def extractFields(line, delim, keepIndices):
537
+ """
538
+ breaks a line into fields and keeps only specified fields and returns new line
539
+
540
+ Parameters
541
+ line : delim separated string
542
+ delim : delimiter
543
+ keepIndices : list of indexes to fields to be retained
544
+ """
545
+ items = line.split(delim)
546
+ newLine = []
547
+ for i in keepIndices:
548
+ newLine.append(items[i])
549
+ return delim.join(newLine)
550
+
551
+ def remFields(line, delim, remIndices):
552
+ """
553
+ removes fields from delim separated string
554
+
555
+ Parameters
556
+ line : delimiter separated string
557
+ delim : delimiter
558
+ remIndices : list of indexes to fields to be removed
559
+ """
560
+ items = line.split(delim)
561
+ newLine = []
562
+ for i in range(len(items)):
563
+ if not arrayContains(remIndices, i):
564
+ newLine.append(items[i])
565
+ return delim.join(newLine)
566
+
567
+ def extractList(data, indices):
568
+ """
569
+ extracts list from another list, given indices
570
+
571
+ Parameters
572
+ data : list data
573
+ indices : list of indexes to fields to be retained
574
+ """
575
+ if areAllFieldsIncluded(data, indices):
576
+ exList = data.copy()
577
+ #print("all indices")
578
+ else:
579
+ exList = list()
580
+ le = len(data)
581
+ for i in indices:
582
+ assert i < le , "index {} out of bound {}".format(i, le)
583
+ exList.append(data[i])
584
+
585
+ return exList
586
+
587
+ def arrayContains(arr, item):
588
+ """
589
+ checks if array contains an item
590
+
591
+ Parameters
592
+ arr : list data
593
+ item : item to search
594
+ """
595
+ contains = True
596
+ try:
597
+ arr.index(item)
598
+ except ValueError:
599
+ contains = False
600
+ return contains
601
+
602
+ def strToIntArray(line, delim=","):
603
+ """
604
+ int array from delim separated string
605
+
606
+ Parameters
607
+ line ; delemeter separated string
608
+ """
609
+ arr = line.split(delim)
610
+ return [int(a) for a in arr]
611
+
612
+ def strToFloatArray(line, delim=","):
613
+ """
614
+ float array from delim separated string
615
+
616
+ Parameters
617
+ line : delimiter separated string
618
+ """
619
+ arr = line.split(delim)
620
+ return [float(a) for a in arr]
621
+
622
+ def strListOrRangeToIntArray(line):
623
+ """
624
+ int array from delim separated string or range
625
+
626
+ Parameters
627
+ line : delimiter separated string
628
+ """
629
+ varr = line.split(",")
630
+ if (len(varr) > 1):
631
+ iarr = list(map(lambda v: int(v), varr))
632
+ else:
633
+ vrange = line.split(":")
634
+ if (len(vrange) == 2):
635
+ lo = int(vrange[0])
636
+ hi = int(vrange[1])
637
+ iarr = list(range(lo, hi+1))
638
+ else:
639
+ iarr = [int(line)]
640
+ return iarr
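+ # examples (comment only): "1,3,5" -> [1, 3, 5]; "2:5" -> [2, 3, 4, 5]
+ # (the lo:hi range is inclusive at both ends); "7" -> [7]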
641
+
642
+ def toStr(val, precision):
643
+ """
644
+ converts any type to string
645
+
646
+ Parameters
647
+ val : value
648
+ precision : precision for float value
649
+ """
650
+ if type(val) == float or type(val) == np.float64 or type(val) == np.float32:
651
+ format = "%" + ".%df" %(precision)
652
+ sVal = format %(val)
653
+ else:
654
+ sVal = str(val)
655
+ return sVal
656
+
657
+ def toStrFromList(values, precision, delim=","):
658
+ """
659
+ converts list of any type to delim separated string
660
+
661
+ Parameters
662
+ values : list data
663
+ precision : precision for float value
664
+ delim : delimiter
665
+ """
666
+ sValues = list(map(lambda v: toStr(v, precision), values))
667
+ return delim.join(sValues)
668
+
669
+ def toIntList(values):
670
+ """
671
+ convert to int list
672
+
673
+ Parameters
674
+ values : list data
675
+ """
676
+ return list(map(lambda va: int(va), values))
677
+
678
+ def toFloatList(values):
679
+ """
680
+ convert to float list
681
+
682
+ Parameters
683
+ values : list data
684
+
685
+ """
686
+ return list(map(lambda va: float(va), values))
687
+
688
+ def toStrList(values, precision=None):
689
+ """
690
+ convert to string list
691
+
692
+ Parameters
693
+ values : list data
694
+ precision : precision for float value
695
+ """
696
+ return list(map(lambda va: toStr(va, precision), values))
697
+
698
+ def toIntFromBoolean(value):
699
+ """
700
+ convert to int
701
+
702
+ Parameters
703
+ value : boolean value
704
+ """
705
+ ival = 1 if value else 0
706
+ return ival
707
+
708
+ def scaleBySum(ldata):
709
+ """
710
+ scales so that sum is 1
711
+
712
+ Parameters
713
+ ldata : list data
714
+ """
715
+ s = sum(ldata)
716
+ return list(map(lambda e : e/s, ldata))
717
+
718
+ def scaleByMax(ldata):
719
+ """
720
+ scales so that max value is 1
721
+
722
+ Parameters
723
+ ldata : list data
724
+ """
725
+ m = max(ldata)
726
+ return list(map(lambda e : e/m, ldata))
727
+
728
+ def typedValue(val, dtype=None):
729
+ """
730
+ return typed value given string, discovers data type if not specified
731
+
732
+ Parameters
733
+ val : value
734
+ dtype : data type
735
+ """
736
+ tVal = None
737
+
738
+ if dtype is not None:
739
+ if dtype == "num":
740
+ dtype = "int" if dtype.find(".") == -1 else "float"
741
+
742
+ if dtype == "int":
743
+ tVal = int(val)
744
+ elif dtype == "float":
745
+ tVal = float(val)
746
+ elif dtype == "bool":
747
+ tVal = bool(val)
748
+ else:
749
+ tVal = val
750
+ else:
751
+ if type(val) == str:
752
+ lVal = val.lower()
753
+
754
+ #int
755
+ done = True
756
+ try:
757
+ tVal = int(val)
758
+ except ValueError:
759
+ done = False
760
+
761
+ #float
762
+ if not done:
763
+ done = True
764
+ try:
765
+ tVal = float(val)
766
+ except ValueError:
767
+ done = False
768
+
769
+ #boolean
770
+ if not done:
771
+ done = True
772
+ if lVal == "true":
773
+ tVal = True
774
+ elif lVal == "false":
775
+ tVal = False
776
+ else:
777
+ done = False
778
+ #None
779
+ if not done:
780
+ if lVal == "none":
781
+ tVal = None
782
+ else:
783
+ tVal = val
784
+ else:
785
+ tVal = val
786
+
787
+ return tVal
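+ # examples (comment only): typedValue("42") -> 42, typedValue("4.2") -> 4.2,
+ # typedValue("true") -> True, typedValue("none") -> None, typedValue("abc") -> "abc"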
788
+
789
+ def isInt(val):
790
+ """
791
+ return true if string is int and the typed value
792
+
793
+ Parameters
794
+ val : value
795
+ """
796
+ valInt = True
797
+ try:
798
+ tVal = int(val)
799
+ except ValueError:
800
+ valInt = False
801
+ tVal = None
802
+ r = (valInt, tVal)
803
+ return r
804
+
805
+ def isFloat(val):
806
+ """
807
+ return true if string is float
808
+
809
+ Parameters
810
+ val : value
811
+ """
812
+ valFloat = True
813
+ try:
814
+ tVal = float(val)
815
+ except ValueError:
816
+ valFloat = False
817
+ tVal = None
818
+ r = (valFloat, tVal)
819
+ return r
820
+
821
+ def getAllFiles(dirPath):
822
+ """
823
+ get all files recursively
824
+
825
+ Parameters
826
+ dirPath : directory path
827
+ """
828
+ filePaths = []
829
+ for (thisDir, subDirs, fileNames) in os.walk(dirPath):
830
+ for fileName in fileNames:
831
+ filePaths.append(os.path.join(thisDir, fileName))
832
+ filePaths.sort()
833
+ return filePaths
834
+
835
+ def getFileContent(fpath, verbose=False):
836
+ """
837
+ get file contents in directory
838
+
839
+ Parameters
840
+ fpath : directory path
841
+ verbose : verbosity flag
842
+ """
843
+ # document list
844
+ docComplete = []
845
+ filePaths = getAllFiles(fpath)
846
+
847
+ # read files
848
+ for filePath in filePaths:
849
+ if verbose:
850
+ print("next file " + filePath)
851
+ with open(filePath, 'r') as contentFile:
852
+ content = contentFile.read()
853
+ docComplete.append(content)
854
+ return (docComplete, filePaths)
855
+
856
+ def getOneFileContent(fpath):
857
+ """
858
+ get one file contents
859
+
860
+ Parameters
861
+ fpath : file path
862
+ """
863
+ with open(fpath, 'r') as contentFile:
864
+ docStr = contentFile.read()
865
+ return docStr
866
+
867
+ def getFileLines(dirPath, delim=","):
868
+ """
869
+ get lines from a file
870
+
871
+ Parameters
872
+ dirPath : file path
873
+ delim : delimiter
874
+ """
875
+ lines = list()
876
+ for li in fileRecGen(dirPath, delim):
877
+ lines.append(li)
878
+ return lines
879
+
880
+ def getFileSampleLines(dirPath, percen, delim=","):
881
+ """
882
+ get sampled lines from a file
883
+
884
+ Parameters
885
+ dirPath : file path
886
+ percen : sampling percentage
887
+ delim : delimiter
888
+ """
889
+ lines = list()
890
+ for li in fileRecGen(dirPath, delim):
891
+ if randint(0, 100) < percen:
892
+ lines.append(li)
893
+ return lines
894
+
895
+ def getFileColumnAsString(dirPath, index, delim=","):
896
+ """
897
+ get string column from a file
898
+
899
+ Parameters
900
+ dirPath : file path
901
+ index : index
902
+ delim : delimiter
903
+ """
904
+ fields = list()
905
+ for rec in fileRecGen(dirPath, delim):
906
+ fields.append(rec[index])
907
+ #print(fields)
908
+ return fields
909
+
910
+ def getFileColumnsAsString(dirPath, indexes, delim=","):
911
+ """
912
+ get multiple string columns from a file
913
+
914
+ Parameters
915
+ dirPath : file path
916
+ indexes : indexes of columns
917
+ delim : delimiter
918
+
919
+ """
920
+ nindex = len(indexes)
921
+ columns = list(map(lambda i : list(), range(nindex)))
922
+ for rec in fileRecGen(dirPath, delim):
923
+ for i in range(nindex):
924
+ columns[i].append(rec[indexes[i]])
925
+ return columns
926
+
927
+ def getFileColumnAsFloat(dirPath, index, delim=","):
928
+ """
929
+ get float fields from a file
930
+
931
+ Parameters
932
+ dirPath : file path
933
+ index : index
934
+ delim : delimiter
935
+
936
+ """
937
+ #print("{} {}".format(dirPath, index))
938
+ fields = getFileColumnAsString(dirPath, index, delim)
939
+ return list(map(lambda v:float(v), fields))
940
+
941
+ def getFileColumnAsInt(dirPath, index, delim=","):
942
+ """
943
+ get int fields from a file
944
+
945
+ Parameters
946
+ dirPath : file path
947
+ index : index
948
+ delim : delimiter
949
+ """
950
+ fields = getFileColumnAsString(dirPath, index, delim)
951
+ return list(map(lambda v:int(v), fields))
952
+
953
+ def getFileAsIntMatrix(dirPath, columns, delim=","):
954
+ """
955
+ extracts int matrix from csv file given column indices with each row being concatenation of
956
+ extracted column values row size = num of columns
957
+
958
+ Parameters
959
+ dirPath : file path
960
+ columns : indexes of columns
961
+ delim : delimiter
962
+ """
963
+ mat = list()
964
+ for rec in fileSelFieldsRecGen(dirPath, columns, delim):
965
+ mat.append(asIntList(rec))
966
+ return mat
967
+
968
+ def getFileAsFloatMatrix(dirPath, columns, delim=","):
969
+ """
970
+ extracts float matrix from csv file given column indices with each row being concatenation of
971
+ extracted column values row size = num of columns
972
+
973
+ Parameters
974
+ dirPath : file path
975
+ columns : indexes of columns
976
+ delim : delimiter
977
+ """
978
+ mat = list()
979
+ for rec in fileSelFieldsRecGen(dirPath, columns, delim):
980
+ mat.append(asFloatList(rec))
981
+ return mat
982
+
983
+ def getFileAsFloatColumn(dirPath):
984
+ """
985
+ get float list from a file with one float per row
986
+
987
+ Parameters
988
+ dirPath : file path
989
+ """
990
+ flist = list()
991
+ for rec in fileRecGen(dirPath, None):
992
+ flist.append(float(rec))
993
+ return flist
994
+
995
+ def getFileAsFiltFloatMatrix(dirPath, filt, columns, delim=","):
996
+ """
997
+ extracts float matrix from csv file given row filter and column indices with each row being
998
+ concatenation of extracted column values row size = num of columns
999
+
1000
+ Parameters
1001
+ dirPath : file path
1002
+ columns : indexes of columns
1003
+ filt : row filter lambda
1004
+ delim : delimiter
1005
+
1006
+ """
1007
+ mat = list()
1008
+ for rec in fileFiltSelFieldsRecGen(dirPath, filt, columns, delim):
1009
+ mat.append(asFloatList(rec))
1010
+ return mat
1011
+
1012
+ def getFileAsTypedRecords(dirPath, types, delim=","):
1013
+ """
1014
+ extracts typed records from csv file with each row being concatenation of
1015
+ extracted column values
1016
+
1017
+ Parameters
1018
+ dirPath : file path
1019
+ types : data types
1020
+ delim : delimiter
1021
+ """
1022
+ (dtypes, cvalues) = extractTypesFromString(types)
1023
+ tdata = list()
1024
+ for rec in fileRecGen(dirPath, delim):
1025
+ trec = list()
1026
+ for index, value in enumerate(rec):
1027
+ value = __convToTyped(index, value, dtypes)
1028
+ trec.append(value)
1029
+ tdata.append(trec)
1030
+ return tdata
1031
+
1032
+
1033
+ def getFileColsAsTypedRecords(dirPath, columns, types, delim=","):
1034
+ """
1035
+ extracts typed records from csv file given column indices with each row being concatenation of
1036
+ extracted column values
1037
+
1038
+ Parameters
1040
+ dirPath : file path
1041
+ columns : column indexes
1042
+ types : data types
1043
+ delim : delimiter
1044
+ """
1045
+ (dtypes, cvalues) = extractTypesFromString(types)
1046
+ tdata = list()
1047
+ for rec in fileSelFieldsRecGen(dirPath, columns, delim):
1048
+ trec = list()
1049
+ for indx, value in enumerate(rec):
1050
+ tindx = columns[indx]
1051
+ value = __convToTyped(tindx, value, dtypes)
1052
+ trec.append(value)
1053
+ tdata.append(trec)
1054
+ return tdata
1055
+
1056
+ def getFileColumnsMinMax(dirPath, columns, dtype, delim=","):
1057
+ """
1058
+ extracts numeric matrix from csv file given column indices. For each column return min and max
1059
+
1060
+ Parameters
1061
+ dirPath : file path
1062
+ columns : column indexes
1063
+ dtype : data type
1064
+ delim : delimiter
1065
+ """
1066
+ dtypes = list(map(lambda c : str(c) + ":" + dtype, columns))
1067
+ dtypes = ",".join(dtypes)
1068
+ #print(dtypes)
1069
+
1070
+ tdata = getFileColsAsTypedRecords(dirPath, columns, dtypes, delim)
1071
+ minMax = list()
1072
+ ncola = len(tdata[0])
1073
+ ncole = len(columns)
1074
+ assertEqual(ncola, ncole, "actual no of columns different from expected")
1075
+
1076
+ for ci in range(ncole):
1077
+ vmin = sys.float_info.max
1078
+ vmax = -sys.float_info.max
1079
+ for r in tdata:
1080
+ cv = r[ci]
1081
+ vmin = cv if cv < vmin else vmin
1082
+ vmax = cv if cv > vmax else vmax
1083
+ mm = (vmin, vmax, vmax - vmin)
1084
+ minMax.append(mm)
1085
+
1086
+ return minMax
1087
+
1088
+
1089
+ def getRecAsTypedRecord(rec, types, delim=None):
1090
+ """
1091
+ converts record to typed records
1092
+
1093
+ Parameters
1094
+ rec : delimiter separated string or list of strings
1095
+ types : field data types
1096
+ delim : delimiter
1097
+ """
1098
+ if delim is not None:
1099
+ rec = rec.split(delim)
1100
+ (dtypes, cvalues) = extractTypesFromString(types)
1101
+ #print(types)
1102
+ #print(dtypes)
1103
+ trec = list()
1104
+ for ind, value in enumerate(rec):
1105
+ tvalue = __convToTyped(ind, value, dtypes)
1106
+ trec.append(tvalue)
1107
+ return trec
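+ # example (comment only):
+ # getRecAsTypedRecord("2,3.5,x", "0:int,1:float,2:string", ",") -> [2, 3.5, "x"]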
1108
+
1109
+ def __convToTyped(index, value, dtypes):
1110
+ """
1111
+ convert to typed value
1112
+
1113
+ Parameters
1114
+ index : index in type list
1115
+ value : data value
1116
+ dtypes : data type list
1117
+ """
1118
+ #print(index, value)
1119
+ dtype = dtypes[index]
1120
+ tvalue = value
1121
+ if dtype == "int":
1122
+ tvalue = int(value)
1123
+ elif dtype == "float":
1124
+ tvalue = float(value)
1125
+ return tvalue
1126
+
1127
+
1128
+
1129
+ def extractTypesFromString(types):
1130
+ """
1131
+ extracts column data types and set values for categorical variables
1132
+
1133
+ Parameters
1134
+ types : encoded type information
1135
+ """
1136
+ ftypes = types.split(",")
1137
+ dtypes = dict()
1138
+ cvalues = dict()
1139
+ for ftype in ftypes:
1140
+ items = ftype.split(":")
1141
+ cindex = int(items[0])
1142
+ dtype = items[1]
1143
+ dtypes[cindex] = dtype
1144
+ if len(items) == 3:
1145
+ sitems = items[2].split()
1146
+ cvalues[cindex] = sitems
1147
+ return (dtypes, cvalues)
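+ # example (comment only): extractTypesFromString("0:int,1:float,2:cat:a b c")
+ # -> ({0: "int", 1: "float", 2: "cat"}, {2: ["a", "b", "c"]})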
1148
+
1149
+ def getMultipleFileAsInttMatrix(dirPathWithCol, delim=","):
1150
+ """
1151
+ extracts int matrix from csv files given column index for each file.
1152
+ num of columns = number of rows in each file and num of rows = number of files
1153
+
1154
+ Parameters
1155
+ dirPathWithCol: list of file path and column index pair
1156
+ delim : delimiter
1157
+ """
1158
+ mat = list()
1159
+ minLen = -1
1160
+ for path, col in dirPathWithCol:
1161
+ colVals = getFileColumnAsInt(path, col, delim)
1162
+ if minLen < 0 or len(colVals) < minLen:
1163
+ minLen = len(colVals)
1164
+ mat.append(colVals)
1165
+
1166
+ #make all same length
1167
+ mat = list(map(lambda li:li[:minLen], mat))
1168
+ return mat
1169
+
1170
+ def getMultipleFileAsFloatMatrix(dirPathWithCol, delim=","):
1171
+ """
1172
+ extracts float matrix from csv files given column index for each file.
1173
+ num of columns = number of rows in each file and num of rows = number of files
1174
+
1175
+ Parameters
1176
+ dirPathWithCol: list of file path and column index pair
1177
+ delim : delimiter
1178
+ """
1179
+ mat = list()
1180
+ minLen = -1
1181
+ for path, col in dirPathWithCol:
1182
+ colVals = getFileColumnAsFloat(path, col, delim)
1183
+ if minLen < 0 or len(colVals) < minLen:
1184
+ minLen = len(colVals)
1185
+ mat.append(colVals)
1186
+
1187
+ #make all same length
1188
+ mat = list(map(lambda li:li[:minLen], mat))
1189
+ return mat
1190
+
1191
+ def writeStrListToFile(ldata, filePath, delem=","):
1192
+ """
1193
+ writes a list of delem separated strings, or a list of lists of strings, to a file
1194
+
1195
+ Parameters
1196
+ ldata : list data
1197
+ filePath : file path
1198
+ delem : delimiter
1199
+ """
1200
+ with open(filePath, "w") as fh:
1201
+ for r in ldata:
1202
+ if type(r) == list:
1203
+ r = delem.join(r)
1204
+ fh.write(r + "\n")
1205
+
1206
+ def writeFloatListToFile(ldata, prec, filePath):
1207
+ """
1208
+ writes float list to file, one value per line
1209
+
1210
+ Parameters
1211
+ ldata : list data
1212
+ prec : precision
1213
+ filePath : file path
1214
+ """
1215
+ with open(filePath, "w") as fh:
1216
+ for d in ldata:
1217
+ fh.write(formatFloat(prec, d) + "\n")
1218
+
1219
+ def mutateFileLines(dirPath, mutator, marg, delim=","):
1220
+ """
1221
+ mutates lines from a file
1222
+
1223
+ Parameters
1224
+ dirPath : file path
1225
+ mutator : mutation callback
1226
+ marg : argument for mutation call back
1227
+ delim : delimiter
1228
+ """
1229
+ lines = list()
1230
+ for li in fileRecGen(dirPath, delim):
1231
+ li = mutator(li) if marg is None else mutator(li, marg)
1232
+ lines.append(li)
1233
+ return lines
1234
+
1235
+ def takeFirst(elems):
1236
+ """
1237
+ return first item
1238
+
1239
+ Parameters
1240
+ elems : list of data
1241
+ """
1242
+ return elems[0]
1243
+
1244
+ def takeSecond(elems):
1245
+ """
1246
+ return 2nd element
1247
+
1248
+ Parameters
1249
+ elems : list of data
1250
+ """
1251
+ return elems[1]
1252
+
1253
+ def takeThird(elems):
1254
+ """
1255
+ returns 3rd element
1256
+
1257
+ Parameters
1258
+ elems : list of data
1259
+ """
1260
+ return elems[2]
1261
+
1262
+ def addToKeyedCounter(dCounter, key, count=1):
1263
+ """
1264
+ add to to keyed counter
1265
+
1266
+ Parameters
1267
+ dCounter : dictionary of counters
1268
+ key : dictionary key
1269
+ count : count to add
1270
+ """
1271
+ curCount = dCounter.get(key, 0)
1272
+ dCounter[key] = curCount + count
1273
+
1274
+ def incrKeyedCounter(dCounter, key):
1275
+ """
1276
+ increment keyed counter
1277
+
1278
+ Parameters
1279
+ dCounter : dictionary of counters
1280
+ key : dictionary key
1281
+ """
1282
+ addToKeyedCounter(dCounter, key, 1)
1283
+
1284
+ def appendKeyedList(dList, key, elem):
1285
+ """
1286
+ keyed list
1287
+
1288
+ Parameters
1289
+ dList : dictionary of lists
1290
+ key : dictionary key
1291
+ elem : value to append
1292
+ """
1293
+ curList = dList.get(key, [])
1294
+ curList.append(elem)
1295
+ dList[key] = curList
1296
+
1297
+ def isNumber(st):
1298
+ """
1299
+ Returns True if string is a number
1300
+
1301
+ Parameters
1302
+ st : string value
1303
+ """
1304
+ return st.replace('.','',1).isdigit()
1305
+
1306
+ def removeNan(values):
1307
+ """
1308
+ removes nan from list
1309
+
1310
+ Parameters
1311
+ values : list data
1312
+ """
1313
+ return list(filter(lambda v: not math.isnan(v), values))
1314
+
1315
+ def fileRecGen(filePath, delim = ","):
1316
+ """
1317
+ file record generator
1318
+
1319
+ Parameters
1320
+ filePath : file path
1321
+ delim : delimiter
1322
+ """
1323
+ with open(filePath, "r") as fp:
1324
+ for line in fp:
1325
+ line = line[:-1]
1326
+ if delim is not None:
1327
+ line = line.split(delim)
1328
+ yield line
1329
+
1330
+ def fileSelFieldsRecGen(dirPath, columns, delim=","):
1331
+ """
1332
+ file record generator given column indices
1333
+
1334
+ Parameters
1335
+ filePath : file path
1336
+ columns : column indexes as int array or comma separated string
1337
+ delim : delimiter
1338
+ """
1339
+ if type(columns) == str:
1340
+ columns = strToIntArray(columns, delim)
1341
+ for rec in fileRecGen(dirPath, delim):
1342
+ extracted = extractList(rec, columns)
1343
+ yield extracted
1344
+
1345
+ def fileSelFieldValueGen(dirPath, column, delim=","):
1346
+ """
1347
+ file record generator for a given column
1348
+
1349
+ Parameters
1350
+ filePath : file path
1351
+ column : column index
1352
+ delim : delimiter
1353
+ """
1354
+ for rec in fileRecGen(dirPath, delim):
1355
+ yield rec[column]
1356
+
1357
+ def fileFiltRecGen(filePath, filt, delim = ","):
1358
+ """
1359
+ file record generator with row filter applied
1360
+
1361
+ Parameters
1362
+ filePath : file path
1363
+ filt : row filter
1364
+ delim : delimiter
1365
+ """
1366
+ with open(filePath, "r") as fp:
1367
+ for line in fp:
1368
+ line = line[:-1]
1369
+ if delim is not None:
1370
+ line = line.split(delim)
1371
+ if filt(line):
1372
+ yield line
1373
+
1374
+ def fileFiltSelFieldsRecGen(filePath, filt, columns, delim = ","):
1375
+ """
1376
+ file record generator with row and column filter applied
1377
+
1378
+ Parameters
1379
+ filePath : file path
1380
+ filt : row filter
1381
+ columns : column indexes as int array or comma separated string
1382
+ delim : delimiter
1383
+ """
1384
+ if type(columns) == str:
+ 	columns = strToIntArray(columns, delim)
1385
+ with open(filePath, "r") as fp:
1386
+ for line in fp:
1387
+ line = line[:-1]
1388
+ if delim is not None:
1389
+ line = line.split(delim)
1390
+ if filt(line):
1391
+ selected = extractList(line, columns)
1392
+ yield selected
1393
+
1394
+ def fileTypedRecGen(filePath, ftypes, delim = ","):
1395
+ """
1396
+ file typed record generator
1397
+
1398
+ Parameters
1399
+ filePath : file path
1400
+ ftypes : list of field types
1401
+ delim : delimiter
1402
+ """
1403
+ with open(filePath, "r") as fp:
1404
+ for line in fp:
1405
+ line = line[:-1]
1406
+ line = line.split(delim)
1407
+ for i in range(0, len(ftypes), 2):
1408
+ ci = ftypes[i]
1409
+ dtype = ftypes[i+1]
1410
+ assertLesser(ci, len(line), "index out of bound")
1411
+ if dtype == "int":
1412
+ line[ci] = int(line[ci])
1413
+ elif dtype == "float":
1414
+ line[ci] = float(line[ci])
1415
+ else:
1416
+ exitWithMsg("invalid data type")
1417
+ yield line
1418
+
1419
+ def fileMutatedFieldsRecGen(dirPath, mutator, delim=","):
1420
+ """
1421
+ file record generator with some columns mutated
1422
+
1423
+ Parameters
1424
+ dirPath : file path
1424
+ mutator : row field mutator
1425
+ delim : delimiter
1427
+ """
1428
+ for rec in fileRecGen(dirPath, delim):
1429
+ mutated = mutator(rec)
1430
+ yield mutated
1431
+
1432
+ def tableSelFieldsFilter(tdata, columns):
1433
+ """
1434
+ gets tabular data for selected columns
1435
+
1436
+ Parameters
1437
+ tdata : tabular data
1438
+ columns : column indexes
1439
+ """
1440
+ if areAllFieldsIncluded(tdata[0], columns):
1441
+ ntdata = tdata
1442
+ else:
1443
+ ntdata = list()
1444
+ for rec in tdata:
1445
+ #print(rec)
1446
+ #print(columns)
1447
+ nrec = extractList(rec, columns)
1448
+ ntdata.append(nrec)
1449
+ return ntdata
1450
+
1451
+
1452
+ def areAllFieldsIncluded(ldata, columns):
1453
+ """
1454
+ return True if all indexes are in the columns
1455
+
1456
+ Parameters
1457
+ ldata : list data
1458
+ columns : column indexes
1459
+ """
1460
+ return list(range(len(ldata))) == columns
1461
+
1462
+ def asIntList(items):
1463
+ """
1464
+ returns int list
1465
+
1466
+ Parameters
1467
+ items : list data
1468
+ """
1469
+ return [int(i) for i in items]
1470
+
1471
+ def asFloatList(items):
1472
+ """
1473
+ returns float list
1474
+
1475
+ Parameters
1476
+ items : list data
1477
+ """
1478
+ return [float(i) for i in items]
1479
+
1480
+ def pastTime(interval, unit):
1481
+ """
1482
+ current and past time
1483
+
1484
+ Parameters
1485
+ interval : time interval
1486
+ unit: time unit
1487
+ """
1488
+ curTime = int(time.time())
1489
+ if unit == "d":
1490
+ pastTime = curTime - interval * secInDay
1491
+ elif unit == "h":
1492
+ pastTime = curTime - interval * secInHour
1493
+ elif unit == "m":
1494
+ pastTime = curTime - interval * secInMinute
1495
+ else:
1496
+ raise ValueError("invalid time unit " + unit)
1497
+ return (curTime, pastTime)
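+ # example (comment only): pastTime(2, "d") -> (now, now - 2 * secInDay)
+ # where now is the current epoch time in sec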
1498
+
1499
+ def minuteAlign(ts):
1500
+ """
1501
+ minute aligned time
1502
+
1503
+ Parameters
1504
+ ts : time stamp in sec
1505
+ """
1506
+ return int((ts / secInMinute)) * secInMinute
1507
+
1508
+ def multMinuteAlign(ts, min):
1509
+ """
1510
+ multi minute aligned time
1511
+
1512
+ Parameters
1513
+ ts : time stamp in sec
1514
+ min : minute value
1515
+ """
1516
+ intv = secInMinute * min
1517
+ return int((ts / intv)) * intv
1518
+
1519
+ def hourAlign(ts):
1520
+ """
1521
+ hour aligned time
1522
+
1523
+ Parameters
1524
+ ts : time stamp in sec
1525
+ """
1526
+ return int((ts / secInHour)) * secInHour
1527
+
1528
+ def hourOfDayAlign(ts, hour):
1529
+ """
1530
+ hour of day aligned time
1531
+
1532
+ Parameters
1533
+ ts : time stamp in sec
1534
+ hour : hour of day
1535
+ """
1536
+ day = int(ts / secInDay)
1537
+ return (24 * day + hour) * secInHour
1538
+
1539
+ def dayAlign(ts):
1540
+ """
1541
+ day aligned time
1542
+
1543
+ Parameters
1544
+ ts : time stamp in sec
1545
+ """
1546
+ return int(ts / secInDay) * secInDay
1547
+
1548
+ def timeAlign(ts, unit):
1549
+ """
1550
+ boundary alignment of time
1551
+
1552
+ Parameters
1553
+ ts : time stamp in sec
1554
+ unit : unit of time
1555
+ """
1556
+ alignedTs = 0
1557
+ if unit == "s":
1558
+ alignedTs = ts
1559
+ elif unit == "m":
1560
+ alignedTs = minuteAlign(ts)
1561
+ elif unit == "h":
1562
+ alignedTs = hourAlign(ts)
1563
+ elif unit == "d":
1564
+ alignedTs = dayAlign(ts)
1565
+ else:
1566
+ raise ValueError("invalid time unit")
1567
+ return alignedTs
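+ # examples (comment only): timeAlign(3661, "m") -> 3660, timeAlign(3661, "h") -> 3600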
1568
+
1569
+ def monthOfYear(ts):
1570
+ """
1571
+ month of year
1572
+
1573
+ Parameters
1574
+ ts : time stamp in sec
1575
+ """
1576
+ rem = ts % secInYear
1577
+ mon = int(rem / secInMonth)
1578
+ return mon
1579
+
1580
+ def dayOfWeek(ts):
1581
+ """
1582
+ day of week
1583
+
1584
+ Parameters
1585
+ ts : time stamp in sec
1586
+ """
1587
+ rem = ts % secInWeek
1588
+ dow = int(rem / secInDay)
1589
+ return dow
1590
+
1591
+ def hourOfDay(ts):
1592
+ """
1593
+ hour of day
1594
+
1595
+ Parameters
1596
+ ts : time stamp in sec
1597
+ """
1598
+ rem = ts % secInDay
1599
+ hod = int(rem / secInHour)
1600
+ return hod
1601
+
1602
+ def processCmdLineArgs(expectedTypes, usage):
1603
+ """
1604
+ process command line args and returns args as typed values
1605
+
1606
+ Parameters
1607
+ expectedTypes : expected data types of arguments
1608
+ usage : usage message string
1609
+ """
1610
+ args = []
1611
+ numComLineArgs = len(sys.argv)
1612
+ numExpected = len(expectedTypes)
1613
+ if (numComLineArgs - 1 == len(expectedTypes)):
1614
+ try:
1615
+ for i in range(0, numExpected):
1616
+ if (expectedTypes[i] == typeInt):
1617
+ args.append(int(sys.argv[i+1]))
1618
+ elif (expectedTypes[i] == typeFloat):
1619
+ args.append(float(sys.argv[i+1]))
1620
+ elif (expectedTypes[i] == typeString):
1621
+ args.append(sys.argv[i+1])
1622
+ except ValueError:
1623
+ print ("expected number of command line arguments found but there is type mis match")
1624
+ sys.exit(1)
1625
+ else:
1626
+ print ("expected number of command line arguments not found")
1627
+ print (usage)
1628
+ sys.exit(1)
1629
+ return args
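+ # usage sketch (comment only; the arg names are made up):
+ # args = processCmdLineArgs([typeString, typeInt], "usage: prog <name> <count>")
+ # name, count = args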
1630
+
1631
+ def mutateString(val, numMutate, ctype):
1632
+ """
1633
+ mutate string multiple times
1634
+
1635
+ Parameters
1636
+ val : string value
1637
+ numMutate : num of mutations
1638
+ ctype : type of character to mutate with
1639
+ """
1640
+ mutations = set()
1641
+ count = 0
1642
+ while count < numMutate:
1643
+ j = randint(0, len(val)-1)
1644
+ if j not in mutations:
1645
+ if ctype == "alpha":
1646
+ ch = selectRandomFromList(alphaTokens)
1647
+ elif ctype == "num":
1648
+ ch = selectRandomFromList(numTokens)
1649
+ elif ctype == "any":
1650
+ ch = selectRandomFromList(tokens)
1651
+ val = val[:j] + ch + val[j+1:]
1652
+ mutations.add(j)
1653
+ count += 1
1654
+ return val
1655
+
1656
+ def mutateList(values, numMutate, vmin, vmax, rabs=True):
1657
+ """
1658
+ mutate list multiple times
1659
+
1660
+ Parameters
1661
+ values : list value
1662
+ numMutate : num of mutations
1663
+ vmin : minimum of value range
1664
+ vmax : maximum of value range
1665
+ rabs : True if min max range is absolute otherwise relative
1666
+ """
1667
+ mutations = set()
1668
+ count = 0
1669
+ while count < numMutate:
1670
+ j = randint(0, len(values)-1)
1671
+ if j not in mutations:
1672
+ s = np.random.uniform(vmin, vmax)
1673
+ values[j] = s if rabs else values[j] * s
1674
+ count += 1
1675
+ mutations.add(j)
1676
+ return values
1677
+
1678
+
1679
+ def swap(values, first, second):
1680
+ """
1681
+ swap two elements
1682
+
1683
+ Parameters
1684
+ values : list value
1685
+ first : first swap position
1686
+ second : second swap position
1687
+ """
1688
+ t = values[first]
1689
+ values[first] = values[second]
1690
+ values[second] = t
1691
+
1692
+ def swapBetweenLists(values1, values2):
1693
+ """
1694
+ swap two elements between 2 lists
1695
+
1696
+ Parameters
1697
+ values1 : first list of values
1698
+ values2 : second list of values
1699
+ """
1700
+ p1 = randint(0, len(values1)-1)
1701
+ p2 = randint(0, len(values2)-1)
1702
+ tmp = values1[p1]
1703
+ values1[p1] = values2[p2]
1704
+ values2[p2] = tmp
1705
+
1706
+ def safeAppend(values, value):
1707
+ """
1708
+ append only if not None
1709
+
1710
+ Parameters
1711
+ values : list value
1712
+ value : value to append
1713
+ """
1714
+ if value is not None:
1715
+ values.append(value)
1716
+
1717
+ def getAllIndex(ldata, fldata):
1718
+ """
1719
+ get ALL indexes of list elements
1720
+
1721
+ Parameters
1722
+ ldata : list data to find index in
1723
+ fldata : list data for values for index look up
1724
+ """
1725
+ return list(map(lambda e : fldata.index(e), ldata))
1726
+
1727
+ def findIntersection(lOne, lTwo):
1728
+ """
1729
+ find intersection elements between 2 lists
1730
+
1731
+ Parameters
1732
+ lOne : first list of data
1733
+ lTwo : second list of data
1734
+ """
1735
+ sOne = set(lOne)
1736
+ sTwo = set(lTwo)
1737
+ sInt = sOne.intersection(sTwo)
1738
+ return list(sInt)
1739
+
1740
+ def isIntvOverlapped(rOne, rTwo):
1741
+ """
1742
+ checks overlap between 2 intervals
1743
+
1744
+ Parameters
1745
+ rOne : first interval boundaries
1746
+ rTwo : second interval boundaries
1747
+ """
1748
+ clear = rOne[1] <= rTwo[0] or rOne[0] >= rTwo[1]
1749
+ return not clear
1750
+
1751
+ def isIntvLess(rOne, rTwo):
1752
+ """
1753
+ checks if first interval is less than second
1754
+
1755
+ Parameters
1756
+ rOne : first interval boundaries
1757
+ rTwo : second interval boundaries
1758
+ """
1759
+ less = rOne[1] <= rTwo[0]
1760
+ return less
1761
+
1762
+ def findRank(e, values):
1763
+ """
1764
+ find rank of value in a list
1765
+
1766
+ Parameters
1767
+ e : value to compare with
1768
+ values : list data
1769
+ """
1770
+ count = 1
1771
+ for ve in values:
1772
+ if ve < e:
1773
+ count += 1
1774
+ return count
1775
+
1776
+ def findRanks(toBeRanked, values):
1777
+ """
1778
+ find ranks of values in one list in another list
1779
+
1780
+ Parameters
1781
+ toBeRanked : list of values for which ranks are found
1782
+ values : list in which rank is found
1783
+ """
1784
+ return list(map(lambda e: findRank(e, values), toBeRanked))
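+ # examples (comment only): findRank(3, [1, 2, 5]) -> 3 (two smaller values,
+ # rank counting starts at 1); findRanks([3, 0], [1, 2, 5]) -> [3, 1]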
1785
+
1786
+ def formatFloat(prec, value, label = None):
1787
+ """
1788
+ formats a float with optional label
1789
+
1790
+ Parameters
1791
+ prec : precision
1792
+ value : data value
1793
+ label : label for data
1794
+ """
1795
+ st = (label + " ") if label else ""
1796
+ formatter = "{:." + str(prec) + "f}"
1797
+ return st + formatter.format(value)
1798
+
1799
+ def formatAny(value, label = None):
1800
+ """
1801
+ formats any object with optional label
1802
+
1803
+ Parameters
1804
+ value : data value
1805
+ label : label for data
1806
+ """
1807
+ st = (label + " ") if label else ""
1808
+ return st + str(value)
1809
+
1810
+ def printList(values):
1811
+ """
1812
+ pretty print list
1813
+
1814
+ Parameters
1815
+ values : list of values
1816
+ """
1817
+ for v in values:
1818
+ print(v)
1819
+
1820
+ def printMap(values, klab, vlab, precision, offset=16):
1821
+ """
1822
+ pretty print hash map
1823
+
1824
+ Parameters
1825
+ values : dictionary of values
1826
+ klab : label for key
1827
+ vlab : label for value
1828
+ precision : precision
1829
+ offset : left justify offset
1830
+ """
1831
+ print(klab.ljust(offset, " ") + vlab)
1832
+ for k in values.keys():
1833
+ v = values[k]
1834
+ ks = toStr(k, precision).ljust(offset, " ")
1835
+ vs = toStr(v, precision)
1836
+ print(ks + vs)
1837
+
1838
+ def printPairList(values, lab1, lab2, precision, offset=16):
1839
+ """
1840
+ pretty print list of pairs
1841
+
1842
+ Parameters
1843
+ values : list of pairs
1844
+ lab1 : first label
1845
+ lab2 : second label
1846
+ precision : precision
1847
+ offset : left justify offset
1848
+ """
1849
+ print(lab1.ljust(offset, " ") + lab2)
1850
+ for (v1, v2) in values:
1851
+ sv1 = toStr(v1, precision).ljust(offset, " ")
1852
+ sv2 = toStr(v2, precision)
1853
+ print(sv1 + sv2)
1854
+
1855
+ def createMap(*values):
1856
+ """
1857
+ creates dictionary from key value pairs
1858
+
1859
+ Parameters
1860
+ values : sequence of key value pairs
1861
+ """
1862
+ result = dict()
1863
+ for i in range(0, len(values), 2):
1864
+ result[values[i]] = values[i+1]
1865
+ return result
1866
+
1867
+ def getColMinMax(table, col):
1868
+ """
1869
+ return min, max values of a column
1870
+
1871
+ Parameters
1872
+ table : tabular data
1873
+ col : column index
1874
+ """
1875
+ vmin = None
1876
+ vmax = None
1877
+ for rec in table:
1878
+ value = rec[col]
1879
+ if vmin is None:
1880
+ vmin = value
1881
+ vmax = value
1882
+ else:
1883
+ if value < vmin:
1884
+ vmin = value
1885
+ elif value > vmax:
1886
+ vmax = value
1887
+ return (vmin, vmax, vmax - vmin)
1888
+
1889
+ def createLogger(name, logFilePath, logLevName):
1890
+ """
1891
+ creates logger
1892
+
1893
+ Parameters
1894
+ name : logger name
1895
+ logFilePath : log file path
1896
+ logLevName : log level
1897
+ """
1898
+ logger = logging.getLogger(name)
1899
+ fHandler = logging.handlers.RotatingFileHandler(logFilePath, maxBytes=1048576, backupCount=4)
1900
+ logLev = logLevName.lower()
1901
+ if logLev == "debug":
1902
+ logLevel = logging.DEBUG
1903
+ elif logLev == "info":
1904
+ logLevel = logging.INFO
1905
+ elif logLev == "warning":
1906
+ logLevel = logging.WARNING
1907
+ elif logLev == "error":
1908
+ logLevel = logging.ERROR
1909
+ elif logLev == "critical":
1910
+ logLevel = logging.CRITICAL
1911
+ else:
1912
+ raise ValueError("invalid log level name " + logLevelName)
1913
+ fHandler.setLevel(logLevel)
1914
+ fFormat = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
1915
+ fHandler.setFormatter(fFormat)
1916
+ logger.addHandler(fHandler)
1917
+ logger.setLevel(logLevel)
1918
+ return logger
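+ # usage sketch (comment only; the name and path are made up):
+ # logger = createLogger("myapp", "myapp.log", "info")
+ # logger.info("started")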
1919
+
1920
+ @contextmanager
1921
+ def suppressStdout():
1922
+ """
1923
+ suppress stdout
1924
+
1925
+
1928
+ with open(os.devnull, "w") as devnull:
1929
+ oldStdout = sys.stdout
1930
+ sys.stdout = devnull
1931
+ try:
1932
+ yield
1933
+ finally:
1934
+ sys.stdout = oldStdout
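+ # usage sketch (comment only):
+ # with suppressStdout():
+ # 	print("this goes to devnull")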
1935
+
1936
+ def exitWithMsg(msg):
1937
+ """
1938
+ print message and exit
1939
+
1940
+ Parameters
1941
+ msg : message
1942
+ """
1943
+ print(msg + " -- quitting")
1944
+ sys.exit(0)
1945
+
1946
+ def drawLine(data, yscale=None):
1947
+ """
1948
+ line plot
1949
+
1950
+ Parameters
1951
+ data : list data
1952
+ yscale : y axis scale
1953
+ """
1954
+ plt.plot(data)
1955
+ if yscale:
1956
+ step = int(yscale / 10)
1957
+ step = int(step / 10) * 10
1958
+ plt.yticks(range(0, yscale, step))
1959
+ plt.show()
1960
+
1961
+ def drawPlot(x, y, xlabel, ylabel):
1962
+ """
1963
+ line plot
1964
+
1965
+ Parameters
1966
+ x : x values
1967
+ y : y values
1968
+ xlabel : x axis label
1969
+ ylabel : y axis label
1970
+ """
1971
+ if x is None:
1972
+ x = list(range(len(y)))
1973
+ plt.plot(x,y)
1974
+ plt.xlabel(xlabel)
1975
+ plt.ylabel(ylabel)
1976
+ plt.show()
1977
+
1978
+ def drawPairPlot(x, y1, y2, xlabel,ylabel, y1label, y2label):
1979
+ """
1980
+ line plot of 2 lines
1981
+
1982
+ Parameters
1983
+ x : x values
1984
+ y1 : first y values
1985
+ y2 : second y values
1986
+ xlabel : x label
1987
+ ylabel : y label
1988
+ y1label : first plot label
1989
+ y2label : second plot label
1990
+ """
1991
+ plt.plot(x, y1, label = y1label)
1992
+ plt.plot(x, y2, label = y2label)
1993
+ plt.xlabel(xlabel)
1994
+ plt.ylabel(ylabel)
1995
+ plt.legend()
1996
+ plt.show()
1997
+
1998
+ def drawHist(ldata, myTitle, myXlabel, myYlabel, nbins=10):
1999
+ """
2000
+ draw histogram
2001
+
2002
+ Parameters
2003
+ ldata : list data
2004
+ myTitle : title
2005
+ myXlabel : x label
2006
+ myYlabel : y label
2007
+ nbins : num of bins
2008
+ """
2009
+ plt.hist(ldata, bins=nbins, density=True)
2010
+ plt.title(myTitle)
2011
+ plt.xlabel(myXlabel)
2012
+ plt.ylabel(myYlabel)
2013
+ plt.show()
2014
+
2015
+ def saveObject(obj, filePath):
2016
+ """
2017
+ saves an object
2018
+
2019
+ Parameters
2020
+ obj : object
2021
+ filePath : file path for saved object
2022
+ """
2023
+ with open(filePath, "wb") as outfile:
2024
+ pickle.dump(obj,outfile)
2025
+ 
+ def restoreObject(filePath):
+ 	"""
+ 	restores an object
+ 
+ 	Parameters
+ 	filePath : file path to restore object from
+ 	"""
+ 	with open(filePath, "rb") as infile:
+ 		obj = pickle.load(infile)
+ 	return obj
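+ #illustrative pickle round trip (editor's sketch; the path is hypothetical)
+ #saveObject({"a" : 1}, "./obj.sav")
+ #obj = restoreObject("./obj.sav")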
+ 
+ def isNumeric(data):
+ 	"""
+ 	true if all elements int or float
+ 
+ 	Parameters
+ 	data : numeric data list
+ 	"""
+ 	if type(data) == list or type(data) == np.ndarray:
+ 		col = pd.Series(data)
+ 	else:
+ 		col = data
+ 	return col.dtype == np.int32 or col.dtype == np.int64 or col.dtype == np.float32 or col.dtype == np.float64
+ 
+ def isInteger(data):
+ 	"""
+ 	true if all elements int
+ 
+ 	Parameters
+ 	data : numeric data list
+ 	"""
+ 	if type(data) == list or type(data) == np.ndarray:
+ 		col = pd.Series(data)
+ 	else:
+ 		col = data
+ 	return col.dtype == np.int32 or col.dtype == np.int64
+ 
+ def isFloat(data):
+ 	"""
+ 	true if all elements float
+ 
+ 	Parameters
+ 	data : numeric data list
+ 	"""
+ 	if type(data) == list or type(data) == np.ndarray:
+ 		col = pd.Series(data)
+ 	else:
+ 		col = data
+ 	return col.dtype == np.float32 or col.dtype == np.float64
+ 
+ def isBinary(data):
+ 	"""
+ 	true if all elements either 0 or 1
+ 
+ 	Parameters
+ 	data : binary data
+ 	"""
+ 	re = next((d for d in data if not (type(d) == int and (d == 0 or d == 1))), None)
+ 	return (re is None)
+ 
+ def isCategorical(data):
+ 	"""
+ 	true if all elements int or string
+ 
+ 	Parameters
+ 	data : data value
+ 	"""
+ 	re = next((d for d in data if not (type(d) == int or type(d) == str)), None)
+ 	return (re is None)
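+ #illustrative checks (editor's sketch)
+ #isNumeric([1, 2.5, 3]) is True, isBinary([0, 1, 1, 0]) is True,
+ #isCategorical([1, "red"]) is True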
+ 
+ def assertEqual(value, veq, msg):
+ 	"""
+ 	assert equal to
+ 
+ 	Parameters
+ 	value : value
+ 	veq : value to be equated with
+ 	msg : error msg
+ 	"""
+ 	assert value == veq, msg
+ 
+ def assertGreater(value, vmin, msg):
+ 	"""
+ 	assert greater than
+ 
+ 	Parameters
+ 	value : value
+ 	vmin : minimum value
+ 	msg : error msg
+ 	"""
+ 	assert value > vmin, msg
+ 
+ def assertGreaterEqual(value, vmin, msg):
+ 	"""
+ 	assert greater than or equal to
+ 
+ 	Parameters
+ 	value : value
+ 	vmin : minimum value
+ 	msg : error msg
+ 	"""
+ 	assert value >= vmin, msg
+ 
+ def assertLesser(value, vmax, msg):
+ 	"""
+ 	assert less than
+ 
+ 	Parameters
+ 	value : value
+ 	vmax : maximum value
+ 	msg : error msg
+ 	"""
+ 	assert value < vmax, msg
+ 
+ def assertLesserEqual(value, vmax, msg):
+ 	"""
+ 	assert less than or equal to
+ 
+ 	Parameters
+ 	value : value
+ 	vmax : maximum value
+ 	msg : error msg
+ 	"""
+ 	assert value <= vmax, msg
+ 
+ def assertWithinRange(value, vmin, vmax, msg):
+ 	"""
+ 	assert within range
+ 
+ 	Parameters
+ 	value : value
+ 	vmin : minimum value
+ 	vmax : maximum value
+ 	msg : error msg
+ 	"""
+ 	assert vmin <= value <= vmax, msg
+ 
+ def assertInList(value, values, msg):
+ 	"""
+ 	assert membership in a list
+ 
+ 	Parameters
+ 	value : value to check for inclusion
+ 	values : list data
+ 	msg : error msg
+ 	"""
+ 	assert value in values, msg
+ 
+ def maxListDist(l1, l2):
+ 	"""
+ 	maximum list element difference between 2 lists
+ 
+ 	Parameters
+ 	l1 : first list data
+ 	l2 : second list data
+ 	"""
+ 	dist = max(list(map(lambda v : abs(v[0] - v[1]), zip(l1, l2))))
+ 	return dist
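+ #illustrative call (editor's sketch): element wise differences are 1, 2, 0
+ #maxListDist([1, 5, 9], [2, 3, 9]) returns 2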
+ 
+ def fileLineCount(fPath):
+ 	"""
+ 	number of lines in a file
+ 
+ 	Parameters
+ 	fPath : file path
+ 	"""
+ 	#i stays at -1 for an empty file, so the count is 0 instead of an unbound variable error
+ 	i = -1
+ 	with open(fPath) as f:
+ 		for i, li in enumerate(f):
+ 			pass
+ 	return (i + 1)
+ 
+ def getAlphaNumCharCount(sdata):
+ 	"""
+ 	number of alphabetic and numeric characters in a string
+ 
+ 	Parameters
+ 	sdata : string data
+ 	"""
+ 	acount = 0
+ 	ncount = 0
+ 	scount = 0
+ 	ocount = 0
+ 	assertEqual(type(sdata), str, "input must be string")
+ 	for c in sdata:
+ 		if c.isnumeric():
+ 			ncount += 1
+ 		elif c.isalpha():
+ 			acount += 1
+ 		elif c.isspace():
+ 			scount += 1
+ 		else:
+ 			ocount += 1
+ 	#returns (alpha, numeric, other); the whitespace count scount is tracked but not returned
+ 	r = (acount, ncount, ocount)
+ 	return r
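+ #illustrative call (editor's sketch): 4 letters, 1 digit, 1 other character
+ #getAlphaNumCharCount("ab3 cd!") returns (4, 1, 1)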
+ 
+ def genPowerSet(cvalues, incEmpty=False):
+ 	"""
+ 	generates power set i.e. all possible subsets
+ 
+ 	Parameters
+ 	cvalues : list of categorical values
+ 	incEmpty : include empty set if True
+ 	"""
+ 	ps = list()
+ 	for cv in cvalues:
+ 		pse = list()
+ 		for s in ps:
+ 			sc = s.copy()
+ 			sc.add(cv)
+ 			pse.append(sc)
+ 		ps.extend(pse)
+ 		es = set()
+ 		es.add(cv)
+ 		ps.append(es)
+ 
+ 	if incEmpty:
+ 		#an empty set; the literal {} would create an empty dict instead
+ 		ps.append(set())
+ 	return ps
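+ #illustrative call (editor's sketch)
+ #genPowerSet(["a", "b"]) returns [{"a"}, {"a", "b"}, {"b"}]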
+ 
+ class StepFunction:
+ 	"""
+ 	step function
+ 
+ 	Parameters
+ 
+ 	"""
+ 	def __init__(self, *values):
+ 		"""
+ 		initializer
+ 
+ 		Parameters
+ 		values : list of tuples, with each tuple containing 2 x values and the corresponding y value
+ 		"""
+ 		self.points = values
+ 
+ 	def find(self, x):
+ 		"""
+ 		finds step function value
+ 
+ 		Parameters
+ 		x : x value
+ 		"""
+ 		found = False
+ 		y = 0
+ 		for p in self.points:
+ 			if (x >= p[0] and x < p[1]):
+ 				y = p[2]
+ 				found = True
+ 				break
+ 
+ 		if not found:
+ 			l = len(self.points)
+ 			if (x < self.points[0][0]):
+ 				y = self.points[0][2]
+ 			#>= so that x exactly at the last upper boundary gets the last y value instead of 0
+ 			elif (x >= self.points[l-1][1]):
+ 				y = self.points[l-1][2]
+ 		return y
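+ #illustrative usage (editor's sketch): two steps, [0, 5) -> 1 and [5, 10) -> 2
+ #sf = StepFunction((0, 5, 1), (5, 10, 2))
+ #sf.find(7) returns 2, sf.find(-3) returns 1, sf.find(12) returns 2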
+ 
+ 
+ class DummyVarGenerator:
+ 	"""
+ 	dummy variable generator for categorical variable
+ 	"""
+ 	def __init__(self, rowSize, catValues, trueVal, falseVal, delim=None):
+ 		"""
+ 		initializer
+ 
+ 		Parameters
+ 		rowSize : row size
+ 		catValues : dictionary with field index as key and list of categorical values as value
+ 		trueVal : true value, typically "1"
+ 		falseVal : false value, typically "0"
+ 		delim : field delimiter
+ 		"""
+ 		self.rowSize = rowSize
+ 		self.catValues = catValues
+ 		numCatVar = len(catValues)
+ 		colCount = 0
+ 		for v in self.catValues.values():
+ 			colCount += len(v)
+ 		#each categorical column is replaced by one column per category value
+ 		self.newRowSize = rowSize - numCatVar + colCount
+ 		self.trueVal = trueVal
+ 		self.falseVal = falseVal
+ 		self.delim = delim
+ 
+ 	def processRow(self, row):
+ 		"""
+ 		encodes categorical variables, returning a delimiter separated string or a list
+ 
+ 		Parameters
+ 		row : row, either delimiter separated string or list
+ 		"""
+ 		if self.delim is not None:
+ 			rowArr = row.split(self.delim)
+ 			msg = "row does not have expected number of columns found " + str(len(rowArr)) + " expected " + str(self.rowSize)
+ 			assert len(rowArr) == self.rowSize, msg
+ 		else:
+ 			rowArr = row
+ 
+ 		newRowArr = []
+ 		for i in range(len(rowArr)):
+ 			curVal = rowArr[i]
+ 			if (i in self.catValues):
+ 				values = self.catValues[i]
+ 				for val in values:
+ 					if val == curVal:
+ 						newVal = self.trueVal
+ 					else:
+ 						newVal = self.falseVal
+ 					newRowArr.append(newVal)
+ 			else:
+ 				newRowArr.append(curVal)
+ 		assert len(newRowArr) == self.newRowSize, "invalid new row size " + str(len(newRowArr)) + " expected " + str(self.newRowSize)
+ 		encRow = self.delim.join(newRowArr) if self.delim is not None else newRowArr
+ 		return encRow
+ 
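+ #illustrative usage (editor's sketch): a 3 column row whose second field is
+ #categorical with values "red", "green", "blue"
+ #dvg = DummyVarGenerator(3, {1 : ["red", "green", "blue"]}, "1", "0", ",")
+ #dvg.processRow("5,green,7") returns "5,0,1,0,7"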