Priyanka-Kumavat-At-TE committed on
Commit
8c99283
1 Parent(s): 59ade4b

Delete matumizi

matumizi/LICENSE DELETED
@@ -1,202 +0,0 @@
-
- Apache License
- Version 2.0, January 2004
- http://www.apache.org/licenses/
-
- TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
-
- 1. Definitions.
-
- "License" shall mean the terms and conditions for use, reproduction,
- and distribution as defined by Sections 1 through 9 of this document.
-
- "Licensor" shall mean the copyright owner or entity authorized by
- the copyright owner that is granting the License.
-
- "Legal Entity" shall mean the union of the acting entity and all
- other entities that control, are controlled by, or are under common
- control with that entity. For the purposes of this definition,
- "control" means (i) the power, direct or indirect, to cause the
- direction or management of such entity, whether by contract or
- otherwise, or (ii) ownership of fifty percent (50%) or more of the
- outstanding shares, or (iii) beneficial ownership of such entity.
-
- "You" (or "Your") shall mean an individual or Legal Entity
- exercising permissions granted by this License.
-
- "Source" form shall mean the preferred form for making modifications,
- including but not limited to software source code, documentation
- source, and configuration files.
-
- "Object" form shall mean any form resulting from mechanical
- transformation or translation of a Source form, including but
- not limited to compiled object code, generated documentation,
- and conversions to other media types.
-
- "Work" shall mean the work of authorship, whether in Source or
- Object form, made available under the License, as indicated by a
- copyright notice that is included in or attached to the work
- (an example is provided in the Appendix below).
-
- "Derivative Works" shall mean any work, whether in Source or Object
- form, that is based on (or derived from) the Work and for which the
- editorial revisions, annotations, elaborations, or other modifications
- represent, as a whole, an original work of authorship. For the purposes
- of this License, Derivative Works shall not include works that remain
- separable from, or merely link (or bind by name) to the interfaces of,
- the Work and Derivative Works thereof.
-
- "Contribution" shall mean any work of authorship, including
- the original version of the Work and any modifications or additions
- to that Work or Derivative Works thereof, that is intentionally
- submitted to Licensor for inclusion in the Work by the copyright owner
- or by an individual or Legal Entity authorized to submit on behalf of
- the copyright owner. For the purposes of this definition, "submitted"
- means any form of electronic, verbal, or written communication sent
- to the Licensor or its representatives, including but not limited to
- communication on electronic mailing lists, source code control systems,
- and issue tracking systems that are managed by, or on behalf of, the
- Licensor for the purpose of discussing and improving the Work, but
- excluding communication that is conspicuously marked or otherwise
- designated in writing by the copyright owner as "Not a Contribution."
-
- "Contributor" shall mean Licensor and any individual or Legal Entity
- on behalf of whom a Contribution has been received by Licensor and
- subsequently incorporated within the Work.
-
- 2. Grant of Copyright License. Subject to the terms and conditions of
- this License, each Contributor hereby grants to You a perpetual,
- worldwide, non-exclusive, no-charge, royalty-free, irrevocable
- copyright license to reproduce, prepare Derivative Works of,
- publicly display, publicly perform, sublicense, and distribute the
- Work and such Derivative Works in Source or Object form.
-
- 3. Grant of Patent License. Subject to the terms and conditions of
- this License, each Contributor hereby grants to You a perpetual,
- worldwide, non-exclusive, no-charge, royalty-free, irrevocable
- (except as stated in this section) patent license to make, have made,
- use, offer to sell, sell, import, and otherwise transfer the Work,
- where such license applies only to those patent claims licensable
- by such Contributor that are necessarily infringed by their
- Contribution(s) alone or by combination of their Contribution(s)
- with the Work to which such Contribution(s) was submitted. If You
- institute patent litigation against any entity (including a
- cross-claim or counterclaim in a lawsuit) alleging that the Work
- or a Contribution incorporated within the Work constitutes direct
- or contributory patent infringement, then any patent licenses
- granted to You under this License for that Work shall terminate
- as of the date such litigation is filed.
-
- 4. Redistribution. You may reproduce and distribute copies of the
- Work or Derivative Works thereof in any medium, with or without
- modifications, and in Source or Object form, provided that You
- meet the following conditions:
-
- (a) You must give any other recipients of the Work or
- Derivative Works a copy of this License; and
-
- (b) You must cause any modified files to carry prominent notices
- stating that You changed the files; and
-
- (c) You must retain, in the Source form of any Derivative Works
- that You distribute, all copyright, patent, trademark, and
- attribution notices from the Source form of the Work,
- excluding those notices that do not pertain to any part of
- the Derivative Works; and
-
- (d) If the Work includes a "NOTICE" text file as part of its
- distribution, then any Derivative Works that You distribute must
- include a readable copy of the attribution notices contained
- within such NOTICE file, excluding those notices that do not
- pertain to any part of the Derivative Works, in at least one
- of the following places: within a NOTICE text file distributed
- as part of the Derivative Works; within the Source form or
- documentation, if provided along with the Derivative Works; or,
- within a display generated by the Derivative Works, if and
- wherever such third-party notices normally appear. The contents
- of the NOTICE file are for informational purposes only and
- do not modify the License. You may add Your own attribution
- notices within Derivative Works that You distribute, alongside
- or as an addendum to the NOTICE text from the Work, provided
- that such additional attribution notices cannot be construed
- as modifying the License.
-
- You may add Your own copyright statement to Your modifications and
- may provide additional or different license terms and conditions
- for use, reproduction, or distribution of Your modifications, or
- for any such Derivative Works as a whole, provided Your use,
- reproduction, and distribution of the Work otherwise complies with
- the conditions stated in this License.
-
- 5. Submission of Contributions. Unless You explicitly state otherwise,
- any Contribution intentionally submitted for inclusion in the Work
- by You to the Licensor shall be under the terms and conditions of
- this License, without any additional terms or conditions.
- Notwithstanding the above, nothing herein shall supersede or modify
- the terms of any separate license agreement you may have executed
- with Licensor regarding such Contributions.
-
- 6. Trademarks. This License does not grant permission to use the trade
- names, trademarks, service marks, or product names of the Licensor,
- except as required for reasonable and customary use in describing the
- origin of the Work and reproducing the content of the NOTICE file.
-
- 7. Disclaimer of Warranty. Unless required by applicable law or
- agreed to in writing, Licensor provides the Work (and each
- Contributor provides its Contributions) on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
- implied, including, without limitation, any warranties or conditions
- of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
- PARTICULAR PURPOSE. You are solely responsible for determining the
- appropriateness of using or redistributing the Work and assume any
- risks associated with Your exercise of permissions under this License.
-
- 8. Limitation of Liability. In no event and under no legal theory,
- whether in tort (including negligence), contract, or otherwise,
- unless required by applicable law (such as deliberate and grossly
- negligent acts) or agreed to in writing, shall any Contributor be
- liable to You for damages, including any direct, indirect, special,
- incidental, or consequential damages of any character arising as a
- result of this License or out of the use or inability to use the
- Work (including but not limited to damages for loss of goodwill,
- work stoppage, computer failure or malfunction, or any and all
- other commercial damages or losses), even if such Contributor
- has been advised of the possibility of such damages.
-
- 9. Accepting Warranty or Additional Liability. While redistributing
- the Work or Derivative Works thereof, You may choose to offer,
- and charge a fee for, acceptance of support, warranty, indemnity,
- or other liability obligations and/or rights consistent with this
- License. However, in accepting such obligations, You may act only
- on Your own behalf and on Your sole responsibility, not on behalf
- of any other Contributor, and only if You agree to indemnify,
- defend, and hold each Contributor harmless for any liability
- incurred by, or claims asserted against, such Contributor by reason
- of your accepting any such warranty or additional liability.
-
- END OF TERMS AND CONDITIONS
-
- APPENDIX: How to apply the Apache License to your work.
-
- To apply the Apache License to your work, attach the following
- boilerplate notice, with the fields enclosed by brackets "[]"
- replaced with your own identifying information. (Don't include
- the brackets!) The text should be enclosed in the appropriate
- comment syntax for the file format. We also recommend that a
- file or class name and description of purpose be included on the
- same "printed page" as the copyright notice for easier
- identification within third-party archives.
-
- Copyright [yyyy] [name of copyright owner]
-
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
matumizi/MANIFEST.in DELETED
File without changes
matumizi/README.md DELETED
@@ -1,98 +0,0 @@
- # matumizi
-
- Data Science utilities including the following modules
- * util : misc utility functions
- * mlutil : machine learning related utilities including a type aware configuration class
- * stats : various stats classes and functions
- * sampler : sampling from various statistical distributions
- * daexp : many data exploration functions consolidating numpy, scipy, statsmodels and scikit
- * mcsim : Monte Carlo simulation
-
- ## Instructions
-
- 1. Install:
-
- Run
-     pip3 install -i https://test.pypi.org/simple/ matumizi==0.0.7
-
- For installing the latest, clone the repo and run this at the project root directory
-     pip3 install .
-
-
- 2. Project page in testpypi
-
- https://test.pypi.org/project/matumizi/0.0.7/
-
-
- 3. Blog posts
-
- * [Data exploration module overview including usage examples](https://pkghosh.wordpress.com/2020/07/13/learn-about-your-data-with-about-seventy-data-exploration-functions-all-in-one-python-class/)
- * [Monte Carlo simulation for project cost estimation](https://pkghosh.wordpress.com/2020/05/11/monte-carlo-simulation-library-in-python-with-project-cost-estimation-as-an-example/)
- * [Information theory based feature selection](https://pkghosh.wordpress.com/2022/05/29/feature-selection-with-information-theory-based-techniques-in-python/)
- * [Stock Portfolio Balancing with Monte Carlo Simulation](https://pkghosh.wordpress.com/2022/08/23/stock-portfolio-balancing-with-monte-carlo-simulation/)
- * [Synthetic Regression Data Generation in Python](https://pkghosh.wordpress.com/2023/01/22/synthetic-regression-data-generation-in-python/)
-
- 4. Code usage example
-
- Here is some example code that uses five of the modules. You can find lots of examples in
- [another repo](https://github.com/pranab/avenir/tree/master/python/app) of mine. There the
- imports are direct and not through the package matumizi. The example directory also has example code
-
-
-     import sys
-     import math
-     from matumizi import util as ut
-     from matumizi import mlutil as ml
-     from matumizi import sampler as sa
-     from matumizi import stats as st
-     from matumizi import daexp as de
-
-     #generate some random strings
-     ldata = ut.genIdList(10, 6)
-     print("random strings")
-     print(ldata)
-
-     #select random sublist from a list
-     sldata = ut.selectRandomSubListFromList(ldata, 4)
-     print("\nselected random strings")
-     print(sldata)
-
-     #random walk
-     print("\nrandom walk")
-     for pos in ml.randomWalk(20, 10, -2, 2):
-         print(pos)
-
-     #sample from a non parametric sampler
-     print("\nsampling from a non parametric sampler")
-     sampler = sa.NonParamRejectSampler(10, 4, 1, 4, 8, 16, 14, 12, 8, 4, 2)
-     for _ in range(8):
-         d = sampler.sample()
-         print(ut.formatFloat(3, d))
-
-     #statistics from a sliding window
-     print("\nstats from sliding window")
-     wsize = 30
-     win = st.SlidingWindowStat.createEmpty(wsize)
-     mean = 10
-     sd = 2
-     ns = sa.NormalSampler(mean, sd)
-     for _ in range(40):
-         #gaussian with some noise
-         d = ns.sample() + sa.randomFloat(-1, 1)
-         win.add(d)
-     re = win.getStat()
-     print(re)
-
-     #get time series components
-     print("\ntime series components")
-     expl = de.DataExplorer(False)
-     mean = 100
-     sd = 5
-     period = 7
-     trdelta = .1
-     cycle = list(map(lambda v : 10 * math.sin(2 * math.pi * v / period), range(period)))
-     sampler = sa.NormalSamplerWithTrendCycle(mean, sd, trdelta, cycle)
-     ldata = list(map(lambda i : sampler.sample(), range(200)))
-     expl.addListNumericData(ldata, "test")
-     re = expl.getTimeSeriesComponents("test", "additive", period, True)
-     print(re)
matumizi/config/mcamp.properties DELETED
@@ -1,10 +0,0 @@
- common.pvar.samplers=1:3:1:30:50:20:discrete:int,100:20:normal:float,1:8:1:10:20:50:70:85:100:60:30:discrete:int,1:7:1:60:40:30:50:70:95:120:discrete:int,0.5:0:1:bernauli:int
- common.pvar.ranges=1,3,30,200,1,8,1,7,0,1
- common.linear.weights=1.2,1.4,1.0,1.2,1.5
- common.square.weights=1,0.15
- common.crterm.weights=2,3,0.1
- common.corr.params=0:1:40.0:30.0:.08:false
- common.bias=20
- common.noise=normal,.05
- common.tvar.range=50,300
- common.weight.niter=200
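
Each entry in common.pvar.samplers is a colon-delimited sampler spec whose last two fields are a distribution name and a data type, with the preceding fields as distribution parameters; the actual parsing happens inside matumizi's RegressionDataGenerator. A minimal sketch of how such a spec could be split up (the function name and dict layout here are hypothetical illustrations, not part of matumizi):

    # hypothetical illustration of the colon-delimited sampler spec format
    def parse_sampler_spec(spec):
        fields = spec.split(":")
        # last two fields are distribution name and data type, the rest are parameters
        return {"params": fields[:-2], "distr": fields[-2], "type": fields[-1]}

    for spec in "100:20:normal:float,0.5:0:1:bernauli:int".split(","):
        print(parse_sampler_spec(spec))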
matumizi/docs/info_theory_based_feat_sel_tutorial.txt DELETED
@@ -1,36 +0,0 @@
- This tutorial is for information theory based feature selection on a loan application data set. The
- implementation is in the python package matumizi
-
- Setup
- =====
- Install matumizi as follows
- pip3 install -i https://test.pypi.org/simple/ matumizi==0.0.5
-
- Install requirements
- pip3 install -r requirements.txt
-
- Generate loan application data
- ==============================
- python3 fesel.py --op gen --nloan 2000 --noise .05 --klen 10 > lo.txt
-
- where
- op = operation to perform
- nloan = num of loans
- noise = noise level
- klen = loan ID length
-
- Options for "algo" (feature selection techniques)
- mrmr - Max relevance min redundancy
- jmi - Joint mutual information
- cmim - Conditional mutual information maximization
- icap - Interaction capping
- infg - Information gain
-
- Feature selection
- =================
- python3 fesel.py --op fsel --fpath lo.txt --algo mrmr
-
- where
- op = operation to perform
- fpath = path to file containing loan data
- algo = feature selection algorithm (mrmr, jmi, cmim, icap, infg)
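
For reference, the "fsel" operation in examples/fesel.py (also deleted in this commit) reduces to a few DataExplorer calls; a minimal sketch, assuming loan data generated as above into lo.txt:

    from matumizi.daexp import DataExplorer

    expl = DataExplorer(False)
    # column indexes followed by data set names, as in examples/fesel.py
    expl.addFileNumericData("lo.txt", 5, 8, 11, 12, "income", "debt", "crscore", "saving")
    expl.addFileCatData("lo.txt", 3, 4, 15, "education", "selfemp", "target")
    fdt = ["education", "cat", "selfemp", "cat", "income", "num", "debt", "num", "crscore", "num"]
    tdt = ["target", "cat"]
    # max relevance min redundancy, selecting 3 features
    print(expl.getMaxRelMinRedFeatures(fdt, tdt, 3))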
matumizi/docs/stock_portfolio_balancing_with_mc_simulation_tutorial.txt DELETED
@@ -1,59 +0,0 @@
- This tutorial is for financial portfolio balancing with Monte Carlo simulation and the Sharpe ratio
-
-
- Setup
- =====
- Install matumizi, which is a package for data exploration and various other utilities
- pip3 install -i https://test.pypi.org/simple/ matumizi==0.0.3
-
- Portfolio data
- ==============
- Decide what stocks to have in the portfolio and create a portfolio data file, with one row
- per stock, with each row as below containing 3 fields
- stock_symbol,num_stocks,value_at_beginning_of_time_window
-
- Stock historical data
- =====================
- Choose a time window (e.g. 6 months) and download historical stock data for all the stocks in the portfolio
- from this web site
- https://www.nasdaq.com/market-activity/quotes/historical
-
- Store all files in the directory specified by the command line arg "sdfpath". Change each file name so that
- the file name begins with "SS_", where SS is the stock symbol
-
-
- Run simulator
- =============
- python3 pobal.py --op simu --niter 100 --sdfpath ./sdata --spdpath spdata.txt --exfac 0.9 --rfret 0.01
-
- niter = num of iterations
- sdfpath = path of directory containing stock data files. The file names should start with <SS>_ where SS
- is the stock symbol
- spdpath = path of file containing current holdings. Each row has 3 comma separated fields: stock symbol,
- num of stocks and the value at the beginning of the historic data time window (spdata.txt in the resource directory)
- exfac = factor for the exponential forecast of stock price
- rfret = risk free investment return in the time window
-
- The command line argument values are examples. Change them as needed
-
- Output
- ======
- The end of the output will look as below
- best score 8.839
- weights [0.10270294837929556, 0.11041322597243025, 0.000652404909398755, 0.11668341692081166, 0.018728111576860603, 0.12688306074193234, 0.016674345483451796, 0.1310681987561672, 0.020349302455518792, 0.15131254832113178, 0.07228010995988338, 0.13225232652311789]
- buy and sell recommendations
- ('WMT', 27)
- ('PFE', 358)
- ('NFLX', -212)
- ('AMD', 93)
- ('TSLA', -58)
- ('AMZN', 155)
- ('META', -120)
- ('QCOM', 129)
- ('CSCO', -17)
- ('MSFT', 73)
- ('SBUX', 62)
- ('AAPL', 129)
-
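
The score being maximized is a Sharpe-style ratio: the weighted excess return over the risk free rate, divided by the weighted return covariance. A minimal standalone sketch of that metric, mirroring the balance() callback in examples/pobal.py (function name here is illustrative):

    import numpy as np

    def sharpe_score(weights, excess_returns, rcovar):
        # excess_returns: per stock return minus the risk free return
        wr = float(np.dot(weights, excess_returns))
        # weighted return covariance, a proxy for portfolio risk
        wrcv = float(np.dot(weights, np.dot(rcovar, weights)))
        return wr / wrcv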
matumizi/examples/fesel.py DELETED
@@ -1,264 +0,0 @@
- #!/usr/local/bin/python3
-
- # Author: Pranab Ghosh
- #
- # Licensed under the Apache License, Version 2.0 (the "License"); you
- # may not use this file except in compliance with the License. You may
- # obtain a copy of the License at
- #
- # http://www.apache.org/licenses/LICENSE-2.0
- #
- # Unless required by applicable law or agreed to in writing, software
- # distributed under the License is distributed on an "AS IS" BASIS,
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
- # implied. See the License for the specific language governing
- # permissions and limitations under the License.
-
- # Package imports
- import os
- import sys
- import random
- import statistics
- import matplotlib.pyplot as plt
- import argparse
- from matumizi.util import *
- from matumizi.mlutil import *
- from matumizi.daexp import *
- from matumizi.sampler import *
-
- NFEAT = 11
- NFEAT_EXT = 14
-
- class LoanApprove:
-     def __init__(self, numLoans=None):
-         self.numLoans = numLoans
-         self.marStatus = ["married", "single", "divorced"]
-         self.loanTerm = ["7", "15", "30"]
-         self.addExtra = False
-
-
-     def initTwo(self):
-         """
-         initialize samplers
-         """
-         self.approvDistr = CategoricalRejectSampler(("1", 60), ("0", 40))
-         self.featCondDister = {}
-
-         #marital status
-         key = ("1", 0)
-         distr = CategoricalRejectSampler(("married", 100), ("single", 60), ("divorced", 40))
-         self.featCondDister[key] = distr
-         key = ("0", 0)
-         distr = CategoricalRejectSampler(("married", 40), ("single", 100), ("divorced", 40))
-         self.featCondDister[key] = distr
-
-
-         # num of children
-         key = ("1", 1)
-         distr = CategoricalRejectSampler(("1", 100), ("2", 90), ("3", 40))
-         self.featCondDister[key] = distr
-         key = ("0", 1)
-         distr = CategoricalRejectSampler(("1", 50), ("2", 70), ("3", 100))
-         self.featCondDister[key] = distr
-
-         # education
-         key = ("1", 2)
-         distr = CategoricalRejectSampler(("1", 30), ("2", 80), ("3", 100))
-         self.featCondDister[key] = distr
-         key = ("0", 2)
-         distr = CategoricalRejectSampler(("1", 100), ("2", 40), ("3", 30))
-         self.featCondDister[key] = distr
-
-         #self employed
-         key = ("1", 3)
-         distr = CategoricalRejectSampler(("1", 40), ("0", 100))
-         self.featCondDister[key] = distr
-         key = ("0", 3)
-         distr = CategoricalRejectSampler(("1", 100), ("0", 30))
-         self.featCondDister[key] = distr
-
-         # income
-         key = ("1", 4)
-         distr = GaussianRejectSampler(120,15)
-         self.featCondDister[key] = distr
-         key = ("0", 4)
-         distr = GaussianRejectSampler(50,10)
-         self.featCondDister[key] = distr
-
-         # years of experience
-         key = ("1", 5)
-         distr = GaussianRejectSampler(15,3)
-         self.featCondDister[key] = distr
-         key = ("0", 5)
-         distr = GaussianRejectSampler(5,1)
-         self.featCondDister[key] = distr
-
-         # number of years in current job
-         key = ("1", 6)
-         distr = GaussianRejectSampler(3,.5)
-         self.featCondDister[key] = distr
-         key = ("0", 6)
-         distr = GaussianRejectSampler(1,.2)
-         self.featCondDister[key] = distr
-
-         # outstanding debt
-         key = ("1", 7)
-         distr = GaussianRejectSampler(20,5)
-         self.featCondDister[key] = distr
-         key = ("0", 7)
-         distr = GaussianRejectSampler(60,10)
-         self.featCondDister[key] = distr
-
-         # loan amount
-         key = ("1", 8)
-         distr = GaussianRejectSampler(300,50)
-         self.featCondDister[key] = distr
-         key = ("0", 8)
-         distr = GaussianRejectSampler(600,50)
-         self.featCondDister[key] = distr
-
-         # loan term
-         key = ("1", 9)
-         distr = CategoricalRejectSampler(("7", 100), ("15", 40), ("30", 60))
-         self.featCondDister[key] = distr
-         key = ("0", 9)
-         distr = CategoricalRejectSampler(("7", 30), ("15", 100), ("30", 60))
-         self.featCondDister[key] = distr
-
-         # credit score
-         key = ("1", 10)
-         distr = GaussianRejectSampler(700,20)
-         self.featCondDister[key] = distr
-         key = ("0", 10)
-         distr = GaussianRejectSampler(500,50)
-         self.featCondDister[key] = distr
-
-         if self.addExtra:
-             # saving
-             key = ("1", 11)
-             distr = NormalSampler(80,10)
-             self.featCondDister[key] = distr
-             key = ("0", 11)
-             distr = NormalSampler(60,8)
-             self.featCondDister[key] = distr
-
-             # retirement
-             zDistr = NormalSampler(0, 0)
-             key = ("1", 12)
-             sDistr = DiscreteRejectSampler(0,1,1,20,80)
-             nzDistr = NormalSampler(100,20)
-             distr = DistrMixtureSampler(sDistr, zDistr, nzDistr)
-             self.featCondDister[key] = distr
-             key = ("0", 12)
-             sDistr = DiscreteRejectSampler(0,1,1,50,50)
-             nzDistr = NormalSampler(40,10)
-             distr = DistrMixtureSampler(sDistr, zDistr, nzDistr)
-             self.featCondDister[key] = distr
-
-             #num of prior mortgage loans
-             key = ("1", 13)
-             distr = DiscreteRejectSampler(0,3,1,20,60,40,15)
-             self.featCondDister[key] = distr
-             key = ("0", 13)
-             distr = DiscreteRejectSampler(0,1,1,70,30)
-             self.featCondDister[key] = distr
-
-
-     def generateTwo(self, noise, keyLen, addExtra):
-         """
-         ancestral sampling
-         """
-         self.addExtra = addExtra
-         self.initTwo()
-
-         #error
-         erDistr = GaussianRejectSampler(0, noise)
-
-         #sampler
-         numChildren = NFEAT_EXT if self.addExtra else NFEAT
-         sampler = AncestralSampler(self.approvDistr, self.featCondDister, numChildren)
-
-         for i in range(self.numLoans):
-             (claz, features) = sampler.sample()
-
-             # convert numeric features to int
-             features[4] = int(features[4])
-             features[7] = int(features[7])
-             features[8] = int(features[8])
-             features[10] = int(features[10])
-             if self.addExtra:
-                 features[11] = int(features[11])
-                 features[12] = int(features[12])
-
-             # add noise to class label
-             claz = addNoiseCat(claz, ["0", "1"], noise)
-
-             strFeatures = list(map(lambda f: toStr(f, 2), features))
-             rec = genID(keyLen) + "," + ",".join(strFeatures) + "," + claz
-             print(rec)
-
-     def encodeDummy(self, fileName, extra):
-         """
-         dummy var encoding
-         """
-         catVars = {}
-         catVars[1] = self.marStatus
-         catVars[10] = self.loanTerm
-         rSize = NFEAT_EXT if extra else NFEAT
-         rSize += 2
-         dummyVarGen = DummyVarGenerator(rSize, catVars, "1", "0", ",")
-         for row in fileRecGen(fileName, None):
-             newRow = dummyVarGen.processRow(row)
-             print(newRow)
-
- if __name__ == "__main__":
-     parser = argparse.ArgumentParser()
-     parser.add_argument('--op', type=str, default = "none", help = "operation")
-     parser.add_argument('--nloan', type=int, default = 1000, help = "num of loans")
-     parser.add_argument('--noise', type=float, default = 0.1, help = "noise level")
-     parser.add_argument('--klen', type=int, default = 10, help = "key length")
-     parser.add_argument('--fpath', type=str, default = "none", help = "source file path")
-     parser.add_argument('--algo', type=str, default = "none", help = "feature selection algorithm")
-     args = parser.parse_args()
-     op = args.op
-
-     if op == "gen":
-         """ generate data """
-         numLoans = args.nloan
-         loan = LoanApprove(numLoans)
-         noise = args.noise
-         keyLen = args.klen
-         addExtra = True
-         loan.generateTwo(noise, keyLen, addExtra)
-
-     elif op == "encd":
-         """ encode binary """
-         fileName = args.fpath
-         extra = True
-         loan = LoanApprove()
-         loan.encodeDummy(fileName, extra)
-
-
-     elif op == "fsel":
-         """ feature select """
-         fpath = args.fpath
-         algo = args.algo
-         expl = DataExplorer(False)
-         expl.addFileNumericData(fpath, 5, 8, 11, 12, "income", "debt", "crscore", "saving")
-         expl.addFileCatData(fpath, 3, 4, 15, "education", "selfemp", "target")
-
-         fdt = ["education", "cat", "selfemp", "cat", "income", "num", "debt", "num", "crscore", "num"]
-         tdt = ["target", "cat"]
-         if args.algo == "mrmr":
-             res = expl.getMaxRelMinRedFeatures(fdt, tdt, 3)
-         elif args.algo == "jmi":
-             res = expl.getJointMutInfoFeatures(fdt, tdt, 3)
-         elif args.algo == "cmim":
-             res = expl.getCondMutInfoMaxFeatures(fdt, tdt, 3)
-         elif args.algo == "icap":
-             res = expl.getInteractCapFeatures(fdt, tdt, 3)
-         elif args.algo == "infg":
-             res = expl.getInfoGainFeatures(fdt, tdt, 3, 8)
-         else:
-             exitWithMsg("invalid feature selection algorithm")
-
-         print(res)
-     else:
-         exitWithMsg("invalid command")
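
The "encd" operation (dummy encoding of the marital status and loan term fields) is not covered in the tutorial above; going by the argparse options, it would presumably be invoked as

    python3 fesel.py --op encd --fpath lo.txt > loe.txt

with the output file name being arbitrary.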
matumizi/examples/mcamp.py DELETED
@@ -1,50 +0,0 @@
- #!/usr/local/bin/python3
-
- # Author: Pranab Ghosh
- #
- # Licensed under the Apache License, Version 2.0 (the "License"); you
- # may not use this file except in compliance with the License. You may
- # obtain a copy of the License at
- #
- # http://www.apache.org/licenses/LICENSE-2.0
- #
- # Unless required by applicable law or agreed to in writing, software
- # distributed under the License is distributed on an "AS IS" BASIS,
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
- # implied. See the License for the specific language governing
- # permissions and limitations under the License.
-
- # Package imports
- import os
- import sys
- import random
- import statistics
- import matplotlib.pyplot as plt
- import argparse
- from matumizi.util import *
- from matumizi.mlutil import *
- from matumizi.daexp import *
- from matumizi.sampler import *
-
- """
- Synthetic regression data generation, driven by a sampler configuration file
- (see config/mcamp.properties)
- """
-
- if __name__ == "__main__":
-     parser = argparse.ArgumentParser()
-     parser.add_argument('--op', type=str, default = "none", help = "operation")
-     parser.add_argument('--genconf', type=str, default = "", help = "data generator config file")
-     parser.add_argument('--nsamp', type=int, default = 1000, help = "no of samples to generate")
-     args = parser.parse_args()
-     op = args.op
-
-     if op == "gen":
-         """ generate data """
-         dgen = RegressionDataGenerator(args.genconf)
-         for _ in range(args.nsamp):
-             s = dgen.sample()
-             pv = toStrFromList(s[0], 2)
-             pv = pv + "," + toStr(s[1], 2)
-             print(pv)
-
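
Going by the argparse options, a typical invocation would presumably be

    python3 mcamp.py --op gen --genconf ./config/mcamp.properties --nsamp 1000 > reg.txt

which writes one comma separated record per sample, predictor values followed by the target value.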
matumizi/examples/pobal.py DELETED
@@ -1,193 +0,0 @@
- #!/usr/local/bin/python3
-
- # Author: Pranab Ghosh
- #
- # Licensed under the Apache License, Version 2.0 (the "License"); you
- # may not use this file except in compliance with the License. You may
- # obtain a copy of the License at
- #
- # http://www.apache.org/licenses/LICENSE-2.0
- #
- # Unless required by applicable law or agreed to in writing, software
- # distributed under the License is distributed on an "AS IS" BASIS,
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
- # implied. See the License for the specific language governing
- # permissions and limitations under the License.
-
- # Package imports
- import os
- import sys
- import random
- import statistics
- import numpy as np
- import matplotlib.pyplot as plt
- import argparse
- from matumizi.util import *
- from matumizi.sampler import *
- from matumizi.mcsim import *
-
- """
- Balances a portfolio with Monte Carlo simulation and the Sharpe ratio
- """
-
- class PortFolio():
-     """
-     portfolio
-     """
-     def __init__(self):
-         """
-         initialize
-         """
-         self.stocks = list()
-         self.srets = list()
-         self.rcovar = None
-         self.nstock = None
-         self.weights = None
-         self.metric = -sys.float_info.max
-         self.rfret = None
-         self.spred = list()
-
-
-     def loadStData(self, sdfPath, exfac):
-         """
-         load and process stock data
-         """
-         e1 = 1 - exfac
-         e2 = e1 * e1
-         files = getAllFiles(sdfPath)
-         print(files)
-
-         returns = list()
-         for ss, qn, pp in self.stocks:
-             print("next stock ", ss)
-             for fp in files:
-                 fname = os.path.basename(fp)
-                 stname = fname.split("_")[0]
-                 #print("stock name from file name ", stname)
-
-                 if stname == ss:
-                     #daily prices
-                     print("loading ", ss)
-                     prices = getFileColumnAsString(fp, 1)
-                     prices = prices[1:]
-                     prices = list(map(lambda p : float(p[1:]), prices))
-
-                     #predicted price and return
-                     sppred = exfac * prices[0] + exfac * e1 * prices[1] + exfac * e2 * prices[2]
-                     self.spred.append(sppred)
-                     up = pp / qn
-                     sret = (sppred - up) / up
-                     r = (ss, sret)
-                     self.srets.append(r)
-
-                     #daily returns
-                     bp = prices[-1]
-                     sdret = list(map(lambda p : (p - bp) / bp, prices))
-                     #print("daily return size ", len(sdret))
-                     returns.append(sdret)
-                     break
-
-         returns = np.array(returns)
-         print("daily returns shape ", returns.shape)
-         self.rcovar = np.cov(returns)
-         print("covar shape ", self.rcovar.shape)
-
-
-     def optimize(self):
-         """
-         balance, i.e. make buy and sell recommendations
-         """
-         tamount = 0
-         amounts = list()
-         for ss, qn, pp in self.stocks:
-             amnt = pp
-             amounts.append(amnt)
-             tamount += amnt
-
-         namounts = list(map(lambda w : w * tamount, self.weights))
-         quantities = list()
-         for am, nam, ppr in zip(amounts, namounts, self.spred):
-             #no of stocks to buy or sell for each
-             tamount = nam - am
-             qnt = int(tamount / ppr)
-             quantities.append(qnt)
-
-         trans = list()
-         for s, q in zip(self.stocks, quantities):
-             tr = (s[0], q)
-             trans.append(tr)
-
-         return trans
-
- # portfolio object
- pfolio = PortFolio()
-
- def balance(args):
-     """
-     callback for portfolio weights
-     """
-     weights = args[:pfolio.nstock]
-     #print("weights ", weights)
-     weights = scaleBySum(weights)
-     #print("scaled weights ", weights)
-
-     #weighted return
-     wr = 0
-     for r, w in zip(pfolio.srets, weights):
-         wr += (r[1] - pfolio.rfret) * w
-
-     wrcv = 0
-     for i in range(pfolio.nstock):
-         for j in range(pfolio.nstock):
-             wrcv += pfolio.rcovar[i][j] * weights[i] * weights[j]
-
-     metric = wr / wrcv
-     print("score {:.3f}".format(metric))
-     if metric > pfolio.metric:
-         pfolio.metric = metric
-         pfolio.weights = weights
-
-
- if __name__ == "__main__":
-     parser = argparse.ArgumentParser()
-     parser.add_argument('--op', type=str, default = "none", help = "operation")
-     parser.add_argument('--niter', type=int, default = 100, help = "num of iterations")
-     parser.add_argument('--sdfpath', type=str, default = "none", help = "stock data file directory path")
-     parser.add_argument('--spdpath', type=str, default = "none", help = "path of file containing purchase data")
-     parser.add_argument('--exfac', type=float, default = 0.9, help = "exponential factor for prediction")
-     parser.add_argument('--rfret', type=float, default = 0.2, help = "risk free return")
-     args = parser.parse_args()
-     op = args.op
-
-     if op == "simu":
-         tdata = getFileLines(args.spdpath)
-         for rec in tdata:
-             #stock symbol, quantity, purchase price
-             sname = rec[0]
-             quant = int(rec[1])
-             pcost = float(rec[2])
-             t = (sname, quant, pcost)
-             pfolio.stocks.append(t)
-
-         #create and run simulator
-         numIter = args.niter
-         lfp = "./log/mcsim.log"
-         simulator = MonteCarloSimulator(numIter, balance, lfp, "info")
-         nstock = len(pfolio.stocks)
-         for _ in range(nstock):
-             simulator.registerUniformSampler(0.0, 1.0)
-         pfolio.nstock = nstock
-         pfolio.rfret = args.rfret
-         pfolio.loadStData(args.sdfpath, args.exfac)
-         simulator.run()
-
-         print("best score {:.3f}".format(pfolio.metric))
-         print("weights ", pfolio.weights)
-         print("buy and sell recommendations")
-         trans = pfolio.optimize()
-         for tr in trans:
-             print(tr)
-
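
The price forecast in loadStData is a truncated exponentially weighted average of the three most recent daily prices: with smoothing factor f = exfac and the most recent price first, pred = f*p0 + f*(1-f)*p1 + f*(1-f)^2*p2. A minimal standalone sketch (function name here is illustrative):

    def ewma_forecast(prices, exfac=0.9):
        # prices[0] is the most recent price, as in loadStData above
        e1 = 1.0 - exfac
        return exfac * prices[0] + exfac * e1 * prices[1] + exfac * e1 * e1 * prices[2]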
matumizi/matumizi/__init__.py DELETED
File without changes
matumizi/matumizi/daexp.py DELETED
@@ -1,3121 +0,0 @@
1
- #!/usr/local/bin/python3
2
-
3
- # Author: Pranab Ghosh
4
- #
5
- # Licensed under the Apache License, Version 2.0 (the "License"); you
6
- # may not use this file except in compliance with the License. You may
7
- # obtain a copy of the License at
8
- #
9
- # http://www.apache.org/licenses/LICENSE-2.0
10
- #
11
- # Unless required by applicable law or agreed to in writing, software
12
- # distributed under the License is distributed on an "AS IS" BASIS,
13
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
14
- # implied. See the License for the specific language governing
15
- # permissions and limitations under the License.
16
-
17
- # Package imports
18
- import os
19
- import sys
20
- import numpy as np
21
- import pandas as pd
22
- import sklearn as sk
23
- from sklearn import preprocessing
24
- from sklearn import metrics
25
- import random
26
- from math import *
27
- from decimal import Decimal
28
- import pprint
29
- from statsmodels.graphics import tsaplots
30
- from statsmodels.tsa import stattools as stt
31
- from statsmodels.stats import stattools as sstt
32
- from sklearn.linear_model import LinearRegression
33
- from matplotlib import pyplot as plt
34
- from scipy import stats as sta
35
- from statsmodels.tsa.seasonal import seasonal_decompose
36
- import statsmodels.api as sm
37
- from sklearn.ensemble import IsolationForest
38
- from sklearn.neighbors import LocalOutlierFactor
39
- from sklearn.svm import OneClassSVM
40
- from sklearn.covariance import EllipticEnvelope
41
- from sklearn.mixture import GaussianMixture
42
- from sklearn.cluster import KMeans
43
- from sklearn.decomposition import PCA
44
- import hurst
45
- from .util import *
46
- from .mlutil import *
47
- from .sampler import *
48
- from .stats import *
49
-
50
- """
51
- Load data from a CSV file, data frame, numpy array or list
52
- Each data set (array like) is given a name while loading
53
- Perform various data exploration operation refering to the data sets by name
54
- Save and restore workspace if needed
55
- """
56
- class DataSetMetaData:
57
- """
58
- data set meta data
59
- """
60
- dtypeNum = 1
61
- dtypeCat = 2
62
- dtypeBin = 3
63
- def __init__(self, dtype):
64
- self.notes = list()
65
- self.dtype = dtype
66
-
67
- def addNote(self, note):
68
- """
69
- add note
70
- """
71
- self.notes.append(note)
72
-
73
-
74
- class DataExplorer:
75
- """
76
- various data exploration functions
77
- """
78
- def __init__(self, verbose=True):
79
- """
80
- initialize
81
-
82
- Parameters
83
- verbose : True for verbosity
84
- """
85
- self.dataSets = dict()
86
- self.metaData = dict()
87
- self.pp = pprint.PrettyPrinter(indent=4)
88
- self.verbose = verbose
89
-
90
- def setVerbose(self, verbose):
91
- """
92
- sets verbose
93
-
94
- Parameters
95
- verbose : True for verbosity
96
- """
97
- self.verbose = verbose
98
-
99
- def save(self, filePath):
100
- """
101
- save checkpoint
102
-
103
- Parameters
104
- filePath : path of file where saved
105
- """
106
- self.__printBanner("saving workspace")
107
- ws = dict()
108
- ws["data"] = self.dataSets
109
- ws["metaData"] = self.metaData
110
- saveObject(ws, filePath)
111
- self.__printDone()
112
-
113
- def restore(self, filePath):
114
- """
115
- restore checkpoint
116
-
117
- Parameters
118
- filePath : path of file from where to store
119
- """
120
- self.__printBanner("restoring workspace")
121
- ws = restoreObject(filePath)
122
- self.dataSets = ws["data"]
123
- self.metaData = ws["metaData"]
124
- self.__printDone()
125
-
126
-
127
- def queryFileData(self, filePath, *columns):
128
- """
129
- query column data type from a data file
130
-
131
- Parameters
132
- filePath : path of file with data
133
- columns : indexes followed by column names or column names
134
- """
135
- self.__printBanner("querying column data type from a data frame")
136
- lcolumns = list(columns)
137
- noHeader = type(lcolumns[0]) == int
138
- if noHeader:
139
- df = pd.read_csv(filePath, header=None)
140
- else:
141
- df = pd.read_csv(filePath, header=0)
142
- return self.queryDataFrameData(df, *columns)
143
-
144
- def queryDataFrameData(self, df, *columns):
145
- """
146
- query column data type from a data frame
147
-
148
- Parameters
149
- df : data frame with data
150
- columns : indexes followed by column name or column names
151
- """
152
- self.__printBanner("querying column data type from a data frame")
153
- columns = list(columns)
154
- noHeader = type(columns[0]) == int
155
- dtypes = list()
156
- if noHeader:
157
- nCols = int(len(columns) / 2)
158
- colIndexes = columns[:nCols]
159
- cnames = columns[nCols:]
160
- nColsDf = len(df.columns)
161
- for i in range(nCols):
162
- ci = colIndexes[i]
163
- assert ci < nColsDf, "col index {} outside range".format(ci)
164
- col = df.loc[ : , ci]
165
- dtypes.append(self.getDataType(col))
166
- else:
167
- cnames = columns
168
- for c in columns:
169
- col = df[c]
170
- dtypes.append(self.getDataType(col))
171
-
172
- nt = list(zip(cnames, dtypes))
173
- result = self.__printResult("columns and data types", nt)
174
- return result
175
-
176
- def getDataType(self, col):
177
- """
178
- get data type
179
-
180
- Parameters
181
- col : contains data array like
182
- """
183
- if isBinary(col):
184
- dtype = "binary"
185
- elif isInteger(col):
186
- dtype = "integer"
187
- elif isFloat(col):
188
- dtype = "float"
189
- elif isCategorical(col):
190
- dtype = "categorical"
191
- else:
192
- dtype = "mixed"
193
- return dtype
194
-
195
-
196
- def addFileNumericData(self,filePath, *columns):
197
- """
198
- add numeric columns from a file
199
-
200
- Parameters
201
- filePath : path of file with data
202
- columns : indexes followed by column names or column names
203
- """
204
- self.__printBanner("adding numeric columns from a file")
205
- self.addFileData(filePath, True, *columns)
206
- self.__printDone()
207
-
208
-
209
- def addFileBinaryData(self,filePath, *columns):
210
- """
211
- add binary columns from a file
212
-
213
- Parameters
214
- filePath : path of file with data
215
- columns : indexes followed by column names or column names
216
- """
217
- self.__printBanner("adding binary columns from a file")
218
- self.addFileData(filePath, False, *columns)
219
- self.__printDone()
220
-
221
- def addFileData(self, filePath, numeric, *columns):
222
- """
223
- add columns from a file
224
-
225
- Parameters
226
- filePath : path of file with data
227
- numeric : True if numeric False in binary
228
- columns : indexes followed by column names or column names
229
- """
230
- columns = list(columns)
231
- noHeader = type(columns[0]) == int
232
- if noHeader:
233
- df = pd.read_csv(filePath, header=None)
234
- else:
235
- df = pd.read_csv(filePath, header=0)
236
- self.addDataFrameData(df, numeric, *columns)
237
-
238
- def addDataFrameNumericData(self,filePath, *columns):
239
- """
240
- add numeric columns from a data frame
241
-
242
- Parameters
243
- filePath : path of file with data
244
- columns : indexes followed by column names or column names
245
- """
246
- self.__printBanner("adding numeric columns from a data frame")
247
- self.addDataFrameData(filePath, True, *columns)
248
-
249
-
250
- def addDataFrameBinaryData(self,filePath, *columns):
251
- """
252
- add binary columns from a data frame
253
-
254
- Parameters
255
- filePath : path of file with data
256
- columns : indexes followed by column names or column names
257
- """
258
- self.__printBanner("adding binary columns from a data frame")
259
- self.addDataFrameData(filePath, False, *columns)
260
-
261
-
262
- def addDataFrameData(self, df, numeric, *columns):
263
- """
264
- add columns from a data frame
265
-
266
- Parameters
267
- df : data frame with data
268
- numeric : True if numeric False in binary
269
- columns : indexes followed by column names or column names
270
- """
271
- columns = list(columns)
272
- noHeader = type(columns[0]) == int
273
- if noHeader:
274
- nCols = int(len(columns) / 2)
275
- colIndexes = columns[:nCols]
276
- nColsDf = len(df.columns)
277
- for i in range(nCols):
278
- ci = colIndexes[i]
279
- assert ci < nColsDf, "col index {} outside range".format(ci)
280
- col = df.loc[ : , ci]
281
- if numeric:
282
- assert isNumeric(col), "data is not numeric"
283
- else:
284
- assert isBinary(col), "data is not binary"
285
- col = col.to_numpy()
286
- cn = columns[i + nCols]
287
- dtype = DataSetMetaData.dtypeNum if numeric else DataSetMetaData.dtypeBin
288
- self.__addDataSet(cn, col, dtype)
289
- else:
290
- for c in columns:
291
- col = df[c]
292
- if numeric:
293
- assert isNumeric(col), "data is not numeric"
294
- else:
295
- assert isBinary(col), "data is not binary"
296
- col = col.to_numpy()
297
- dtype = DataSetMetaData.dtypeNum if numeric else DataSetMetaData.dtypeBin
298
- self.__addDataSet(c, col, dtype)
299
-
300
- def __addDataSet(self, dsn, data, dtype):
301
- """
302
- add dada set
303
-
304
- Parameters
305
- dsn: data set name
306
- data : numpy array data
307
- """
308
- self.dataSets[dsn] = data
309
- self.metaData[dsn] = DataSetMetaData(dtype)
310
-
311
-
312
- def addListNumericData(self, ds, name):
313
- """
314
- add numeric data from a list
315
-
316
- Parameters
317
- ds : list with data
318
- name : name of data set
319
- """
320
- self.__printBanner("add numeric data from a list")
321
- self.addListData(ds, True, name)
322
- self.__printDone()
323
-
324
-
325
- def addListBinaryData(self, ds, name):
326
- """
327
- add binary data from a list
328
-
329
- Parameters
330
- ds : list with data
331
- name : name of data set
332
- """
333
- self.__printBanner("adding binary data from a list")
334
- self.addListData(ds, False, name)
335
- self.__printDone()
336
-
337
- def addListData(self, ds, numeric, name):
338
- """
339
- adds list data
340
-
341
- Parameters
342
- ds : list with data
343
- numeric : True if numeric False in binary
344
- name : name of data set
345
- """
346
- assert type(ds) == list, "data not a list"
347
- if numeric:
348
- assert isNumeric(ds), "data is not numeric"
349
- else:
350
- assert isBinary(ds), "data is not binary"
351
- dtype = DataSetMetaData.dtypeNum if numeric else DataSetMetaData.dtypeBin
352
- self.dataSets[name] = np.array(ds)
353
- self.metaData[name] = DataSetMetaData(dtype)
354
-
355
-
356
- def addFileCatData(self, filePath, *columns):
357
- """
358
- add categorical columns from a file
359
-
360
- Parameters
361
- filePath : path of file with data
362
- columns : indexes followed by column names or column names
363
- """
364
- self.__printBanner("adding categorical columns from a file")
365
- columns = list(columns)
366
- noHeader = type(columns[0]) == int
367
- if noHeader:
368
- df = pd.read_csv(filePath, header=None)
369
- else:
370
- df = pd.read_csv(filePath, header=0)
371
-
372
- self.addDataFrameCatData(df, *columns)
373
- self.__printDone()
374
-
375
- def addDataFrameCatData(self, df, *columns):
376
- """
377
- add categorical columns from a data frame
378
-
379
- Parameters
380
- df : data frame with data
381
- columns : indexes followed by column names or column names
382
- """
383
- self.__printBanner("adding categorical columns from a data frame")
384
- columns = list(columns)
385
- noHeader = type(columns[0]) == int
386
- if noHeader:
387
- nCols = int(len(columns) / 2)
388
- colIndexes = columns[:nCols]
389
- nColsDf = len(df.columns)
390
- for i in range(nCols):
391
- ci = colIndexes[i]
392
- assert ci < nColsDf, "col index {} outside range".format(ci)
393
- col = df.loc[ : , ci]
394
- assert isCategorical(col), "data is not categorical"
395
- col = col.tolist()
396
- cn = columns[i + nCols]
397
- self.__addDataSet(cn, col, DataSetMetaData.dtypeCat)
398
- else:
399
- for c in columns:
400
- col = df[c].tolist()
401
- self.__addDataSet(c, col, DataSetMetaData.dtypeCat)
402
-
403
- def addListCatData(self, ds, name):
404
- """
405
- add categorical list data
406
-
407
- Parameters
408
- ds : list with data
409
- name : name of data set
410
- """
411
- self.__printBanner("adding categorical list data")
412
- assert type(ds) == list, "data not a list"
413
- assert isCategorical(ds), "data is not categorical"
414
- self.__addDataSet(name, ds, DataSetMetaData.dtypeCat)
415
- self.__printDone()
416
-
417
- def remData(self, ds):
418
- """
419
- removes data set
420
-
421
- Parameters
422
- ds : data set name
423
- """
424
- self.__printBanner("removing data set", ds)
425
- assert ds in self.dataSets, "data set {} does not exist, please add it first".format(ds)
426
- self.dataSets.pop(ds)
427
- self.metaData.pop(ds)
428
- names = self.showNames()
429
- self.__printDone()
430
- return names
431
-
432
- def addNote(self, ds, note):
433
- """
434
- get data
435
-
436
- Parameters
437
- ds : data set name or list or numpy array with data
438
- note: note text
439
- """
440
- self.__printBanner("adding note")
441
- assert ds in self.dataSets, "data set {} does not exist, please add it first".format(ds)
442
- mdata = self.metaData[ds]
443
- mdata.addNote(note)
444
- self.__printDone()
445
-
446
- def getNotes(self, ds):
447
- """
448
- get data
449
-
450
- Parameters
451
- ds : data set name or list or numpy array with data
452
- """
453
- self.__printBanner("getting notes")
454
- assert ds in self.dataSets, "data set {} does not exist, please add it first".format(ds)
455
- mdata = self.metaData[ds]
456
- dnotes = mdata.notes
457
- if self.verbose:
458
- for dn in dnotes:
459
- print(dn)
460
- return dnotes
461
-
462
- def getNumericData(self, ds):
463
- """
464
- get numeric data
465
-
466
- Parameters
467
- ds : data set name or list or numpy array with data
468
- """
469
- if type(ds) == str:
470
- assert ds in self.dataSets, "data set {} does not exist, please add it first".format(ds)
471
- assert self.metaData[ds].dtype == DataSetMetaData.dtypeNum, "data set {} is expected to be numerical type for this operation".format(ds)
472
- data = self.dataSets[ds]
473
- elif type(ds) == list:
474
- assert isNumeric(ds), "data is not numeric"
475
- data = np.array(ds)
476
- elif type(ds) == np.ndarray:
477
- data = ds
478
- else:
479
- raise "invalid type, expecting data set name, list or ndarray"
480
- return data
481
-
482
-
483
- def getCatData(self, ds):
484
- """
485
- get categorical data
486
-
487
- Parameters
488
- ds : data set name or list with data
489
- """
490
- if type(ds) == str:
491
- assert ds in self.dataSets, "data set {} does not exist, please add it first".format(ds)
492
- assert self.metaData[ds].dtype == DataSetMetaData.dtypeCat, "data set {} is expected to be categorical type for this operation".format(ds)
493
- data = self.dataSets[ds]
494
- elif type(ds) == list:
495
- assert isCategorical(ds), "data is not categorical"
496
- data = ds
497
- else:
498
- raise "invalid type, expecting data set name or list"
499
- return data
500
-
501
- def getAnyData(self, ds):
502
- """
503
- get any data
504
-
505
- Parameters
506
- ds : data set name or list with data
507
- """
508
- if type(ds) == str:
509
- assert ds in self.dataSets, "data set {} does not exist, please add it first".format(ds)
510
- data = self.dataSets[ds]
511
- elif type(ds) == list:
512
- data = ds
513
- else:
514
- raise "invalid type, expecting data set name or list"
515
- return data
516
-
517
- def loadCatFloatDataFrame(self, ds1, ds2):
518
- """
519
- loads float and cat data into data frame
520
-
521
- Parameters
522
- ds1: data set name or list
523
- ds2: data set name or list or numpy array
524
- """
525
- data1 = self.getCatData(ds1)
526
- data2 = self.getNumericData(ds2)
527
- self.ensureSameSize([data1, data2])
528
- df1 = pd.DataFrame(data=data1)
529
- df2 = pd.DataFrame(data=data2)
530
- df = pd.concat([df1,df2], axis=1)
531
- df.columns = range(df.shape[1])
532
- return df
533
-
534
- def showNames(self):
535
- """
536
- lists data set names
537
- """
538
- self.__printBanner("listing data set names")
539
- names = self.dataSets.keys()
540
- if self.verbose:
541
- print("data sets")
542
- for ds in names:
543
- print(ds)
544
- self.__printDone()
545
- return names
546
-
547
- def plot(self, ds, yscale=None):
548
- """
549
- plots data
550
-
551
- Parameters
552
- ds: data set name or list or numpy array
553
- yscale: y scale
554
- """
555
- self.__printBanner("plotting data", ds)
556
- data = self.getNumericData(ds)
557
- drawLine(data, yscale)
558
-
559
- def plotZoomed(self, ds, beg, end, yscale=None):
560
- """
561
- plots zoomed data
562
-
563
- Parameters
564
- ds: data set name or list or numpy array
565
- beg: begin offset
566
- end: end offset
567
- yscale: y scale
568
- """
569
- self.__printBanner("plotting data", ds)
570
- data = self.getNumericData(ds)
571
- drawLine(data[beg:end], yscale)
572
-
573
- def scatterPlot(self, ds1, ds2):
574
- """
575
- scatter plots data
576
-
577
- Parameters
578
- ds1: data set name or list or numpy array
579
- ds2: data set name or list or numpy array
580
- """
581
- self.__printBanner("scatter plotting data", ds1, ds2)
582
- data1 = self.getNumericData(ds1)
583
- data2 = self.getNumericData(ds2)
584
- self.ensureSameSize([data1, data2])
585
- x = np.arange(1, len(data1)+1, 1)
586
- plt.scatter(x, data1 ,color="red")
587
- plt.scatter(x, data2 ,color="blue")
588
- plt.show()
589
-
590
- def print(self, ds):
591
- """
592
- prunt data
593
-
594
- Parameters
595
- ds: data set name or list or numpy array
596
- """
597
- self.__printBanner("printing data", ds)
598
- assert ds in self.dataSets, "data set {} does not exist, please add it first".format(ds)
599
- data = self.dataSets[ds]
600
- if self.verbore:
601
- print(formatAny(len(data), "size"))
602
- print("showing first 50 elements" )
603
- print(data[:50])
604
-
605
-    def plotHist(self, ds, cumulative, density, nbins=20):
-        """
-        plots histogram
-
-        Parameters
-        ds: data set name or list or numpy array
-        cumulative: True if cumulative
-        density: True to normalize for probability density
-        nbins: num of bins
-        """
-        self.__printBanner("plotting histogram", ds)
-        data = self.getNumericData(ds)
-        plt.hist(data, bins=nbins, cumulative=cumulative, density=density)
-        plt.show()
-
-    def isMonotonicallyChanging(self, ds):
-        """
-        checks if monotonically increasing or decreasing
-
-        Parameters
-        ds: data set name or list or numpy array
-        """
-        self.__printBanner("checking monotonic change", ds)
-        data = self.getNumericData(ds)
-        monoIncreasing = all(list(map(lambda i: data[i] >= data[i-1], range(1, len(data), 1))))
-        monoDecreasing = all(list(map(lambda i: data[i] <= data[i-1], range(1, len(data), 1))))
-        result = self.__printResult("monoIncreasing", monoIncreasing, "monoDecreasing", monoDecreasing)
-        return result
-
-    def getFreqDistr(self, ds, nbins=20):
-        """
-        gets histogram
-
-        Parameters
-        ds: data set name or list or numpy array
-        nbins: num of bins
-        """
-        self.__printBanner("getting histogram", ds)
-        data = self.getNumericData(ds)
-        frequency, lowLimit, binsize, extraPoints = sta.relfreq(data, numbins=nbins)
-        result = self.__printResult("frequency", frequency, "lowLimit", lowLimit, "binsize", binsize, "extraPoints", extraPoints)
-        return result
-
-    def getCumFreqDistr(self, ds, nbins=20):
-        """
-        gets cumulative freq distribution
-
-        Parameters
-        ds: data set name or list or numpy array
-        nbins: num of bins
-        """
-        self.__printBanner("getting cumulative freq distribution", ds)
-        data = self.getNumericData(ds)
-        cumFrequency, lowLimit, binsize, extraPoints = sta.cumfreq(data, numbins=nbins)
-        result = self.__printResult("cumFrequency", cumFrequency, "lowLimit", lowLimit, "binsize", binsize, "extraPoints", extraPoints)
-        return result
-
-    def getExtremeValue(self, ds, ensamp, nsamp, polarity, doPlotDistr, nbins=20):
-        """
-        gets extreme values
-
-        Parameters
-        ds: data set name or list or numpy array
-        ensamp: num of samples for extreme values
-        nsamp: num of samples
-        polarity: max or min
-        doPlotDistr: True to plot the distribution of extreme values
-        nbins: num of bins
-        """
-        self.__printBanner("getting extreme values", ds)
-        data = self.getNumericData(ds)
-        evalues = list()
-        for _ in range(ensamp):
-            values = selectRandomSubListFromListWithRepl(data, nsamp)
-            if polarity == "max":
-                evalues.append(max(values))
-            else:
-                evalues.append(min(values))
-        if doPlotDistr:
-            plt.hist(evalues, bins=nbins, cumulative=False, density=True)
-            plt.show()
-        result = self.__printResult("extremeValues", evalues)
-        return result
-
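getExtremeValue estimates extreme-value behavior by bootstrap: it draws nsamp values with replacement ensamp times and keeps the max (or min) of each draw. A minimal standalone sketch of the same idea with numpy; all names and constants below are illustrative, not part of this module:

import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=10.0, scale=2.0, size=1000)

# 500 bootstrap resamples of size 100, keeping the max of each resample
emax = [rng.choice(data, size=100, replace=True).max() for _ in range(500)]
print(np.mean(emax), np.std(emax))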
-    def getEntropy(self, ds, nbins=20):
-        """
-        gets entropy
-
-        Parameters
-        ds: data set name or list or numpy array
-        nbins: num of bins
-        """
-        self.__printBanner("getting entropy", ds)
-        data = self.getNumericData(ds)
-        result = self.getFreqDistr(data, nbins)
-        entropy = sta.entropy(result["frequency"])
-        result = self.__printResult("entropy", entropy)
-        return result
-
-    def getRelEntropy(self, ds1, ds2, nbins=20):
-        """
-        gets relative entropy or KL divergence with both data sets numeric
-
-        Parameters
-        ds1: data set name or list or numpy array
-        ds2: data set name or list or numpy array
-        nbins: num of bins
-        """
-        self.__printBanner("getting relative entropy or KL divergence", ds1, ds2)
-        data1 = self.getNumericData(ds1)
-        data2 = self.getNumericData(ds2)
-        result1 = self.getFreqDistr(data1, nbins)
-        freq1 = result1["frequency"]
-        result2 = self.getFreqDistr(data2, nbins)
-        freq2 = result2["frequency"]
-        entropy = sta.entropy(freq1, freq2)
-        result = self.__printResult("relEntropy", entropy)
-        return result
-
-    def getAnyEntropy(self, ds, dt, nbins=20):
-        """
-        gets entropy of any data type, numeric or categorical
-
-        Parameters
-        ds: data set name or list or numpy array
-        dt: data type num or cat
-        nbins: num of bins
-        """
-        entropy = self.getEntropy(ds, nbins)["entropy"] if dt == "num" else self.getStatsCat(ds)["entropy"]
-        result = self.__printResult("entropy", entropy)
-        return result
-
-    def getJointEntropy(self, ds1, ds2, nbins=20):
-        """
-        gets joint entropy with both data sets numeric
-
-        Parameters
-        ds1: data set name or list or numpy array
-        ds2: data set name or list or numpy array
-        nbins: num of bins
-        """
-        self.__printBanner("getting joint entropy", ds1, ds2)
-        data1 = self.getNumericData(ds1)
-        data2 = self.getNumericData(ds2)
-        self.ensureSameSize([data1, data2])
-        hist, xedges, yedges = np.histogram2d(data1, data2, bins=nbins)
-        hist = hist.flatten()
-        ssize = len(data1)
-        hist = hist / ssize
-        entropy = sta.entropy(hist)
-        result = self.__printResult("jointEntropy", entropy)
-        return result
-
-    def getAllNumMutualInfo(self, ds1, ds2, nbins=20):
-        """
-        gets mutual information for both numeric data
-
-        Parameters
-        ds1: data set name or list or numpy array
-        ds2: data set name or list or numpy array
-        nbins: num of bins
-        """
-        self.__printBanner("getting mutual information", ds1, ds2)
-        en1 = self.getEntropy(ds1, nbins)
-        en2 = self.getEntropy(ds2, nbins)
-        en = self.getJointEntropy(ds1, ds2, nbins)
-        mutInfo = en1["entropy"] + en2["entropy"] - en["jointEntropy"]
-        result = self.__printResult("mutInfo", mutInfo)
-        return result
-
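getAllNumMutualInfo relies on the identity I(X;Y) = H(X) + H(Y) - H(X,Y), with each entropy estimated from histogram bin frequencies exactly as the entropy methods above do. A standalone sketch of that computation on synthetic data (illustrative only):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y = 0.7 * x + 0.3 * rng.normal(size=2000)    # y depends on x

nbins = 20
hx = stats.entropy(np.histogram(x, bins=nbins)[0] / len(x))
hy = stats.entropy(np.histogram(y, bins=nbins)[0] / len(y))
hxy = stats.entropy(np.histogram2d(x, y, bins=nbins)[0].flatten() / len(x))
print("mutual information", hx + hy - hxy)   # clearly positive for dependent x, y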
-    def getNumCatMutualInfo(self, nds, cds, nbins=20):
-        """
-        gets mutual information between numeric and categorical data
-
-        Parameters
-        nds: numeric data set name or list or numpy array
-        cds: categoric data set name or list
-        nbins: num of bins
-        """
-        self.__printBanner("getting mutual information of numerical and categorical data", nds, cds)
-        ndata = self.getNumericData(nds)
-        cdata = self.getCatData(cds)
-        nentr = self.getEntropy(nds)["entropy"]
-
-        #conditional entropy
-        cdistr = self.getStatsCat(cdata)["distr"]
-        grdata = self.getGroupByData(nds, cdata, True)["groupedData"]
-        cnentr = 0
-        for gr, data in grdata.items():
-            self.addListNumericData(data, "grdata")
-            gnentr = self.getEntropy("grdata")["entropy"]
-            cnentr += gnentr * cdistr[gr]
-
-        mutInfo = nentr - cnentr
-        result = self.__printResult("mutInfo", mutInfo, "entropy", nentr, "condEntropy", cnentr)
-        return result
-
-    def getTwoCatMutualInfo(self, cds1, cds2):
-        """
-        gets mutual information between 2 categorical data sets
-
-        Parameters
-        cds1: categoric data set name or list
-        cds2: categoric data set name or list
-        """
-        self.__printBanner("getting mutual information of two categorical data sets", cds1, cds2)
-        cdata1 = self.getCatData(cds1)
-        cdata2 = self.getCatData(cds2)
-        centr = self.getStatsCat(cds1)["entropy"]
-
-        #conditional entropy
-        cdistr = self.getStatsCat(cds2)["distr"]
-        grdata = self.getGroupByData(cds1, cds2, True)["groupedData"]
-        ccentr = 0
-        for gr, data in grdata.items():
-            self.addListCatData(data, "grdata")
-            gcentr = self.getStatsCat("grdata")["entropy"]
-            ccentr += gcentr * cdistr[gr]
-
-        mutInfo = centr - ccentr
-        result = self.__printResult("mutInfo", mutInfo, "entropy", centr, "condEntropy", ccentr)
-        return result
-
-    def getMutualInfo(self, dst, nbins=20):
-        """
-        gets mutual information between 2 data sets, any combination of numerical and categorical
-
-        Parameters
-        dst: data source, data type, data source, data type
-        nbins: num of bins
-        """
-        assertEqual(len(dst), 4, "invalid data source and data type list size")
-        dtypes = ["num", "cat"]
-        assertInList(dst[1], dtypes, "invalid data type")
-        assertInList(dst[3], dtypes, "invalid data type")
-        self.__printBanner("getting mutual information of any mix numerical and categorical data", dst[0], dst[2])
-
-        if dst[1] == "num":
-            mutInfo = self.getAllNumMutualInfo(dst[0], dst[2], nbins)["mutInfo"] if dst[3] == "num" \
-                else self.getNumCatMutualInfo(dst[0], dst[2], nbins)["mutInfo"]
-        else:
-            mutInfo = self.getNumCatMutualInfo(dst[2], dst[0], nbins)["mutInfo"] if dst[3] == "num" \
-                else self.getTwoCatMutualInfo(dst[2], dst[0])["mutInfo"]
-
-        result = self.__printResult("mutInfo", mutInfo)
-        return result
-
-    def getCondMutualInfo(self, dst, nbins=20):
-        """
-        gets conditional mutual information between 2 data sets, any combination of numerical and categorical
-
-        Parameters
-        dst: data source, data type, data source, data type, data source, data type
-        nbins: num of bins
-        """
-        assertEqual(len(dst), 6, "invalid data source and data type list size")
-        dtypes = ["num", "cat"]
-        assertInList(dst[1], dtypes, "invalid data type")
-        assertInList(dst[3], dtypes, "invalid data type")
-        assertInList(dst[5], dtypes, "invalid data type")
-        self.__printBanner("getting conditional mutual information of any mix numerical and categorical data", dst[0], dst[2])
-
-        if dst[5] == "cat":
-            cdistr = self.getStatsCat(dst[4])["distr"]
-            grdata1 = self.getGroupByData(dst[0], dst[4], True)["groupedData"]
-            grdata2 = self.getGroupByData(dst[2], dst[4], True)["groupedData"]
-        else:
-            gdata = self.getNumericData(dst[4])
-            hist = Histogram.createWithNumBins(gdata, nbins)
-            cdistr = hist.distr()
-            grdata1 = self.getGroupByData(dst[0], dst[4], False)["groupedData"]
-            grdata2 = self.getGroupByData(dst[2], dst[4], False)["groupedData"]
-
-        cminfo = 0
-        for gr in grdata1.keys():
-            data1 = grdata1[gr]
-            data2 = grdata2[gr]
-            if dst[1] == "num":
-                self.addListNumericData(data1, "grdata1")
-            else:
-                self.addListCatData(data1, "grdata1")
-            if dst[3] == "num":
-                self.addListNumericData(data2, "grdata2")
-            else:
-                self.addListCatData(data2, "grdata2")
-            gdst = ["grdata1", dst[1], "grdata2", dst[3]]
-            minfo = self.getMutualInfo(gdst, nbins)["mutInfo"]
-            cminfo += minfo * cdistr[gr]
-
-        result = self.__printResult("condMutInfo", cminfo)
-        return result
-
-    def getPercentile(self, ds, value):
-        """
-        gets percentile
-
-        Parameters
-        ds: data set name or list or numpy array
-        value: the value
-        """
-        self.__printBanner("getting percentile", ds)
-        data = self.getNumericData(ds)
-        percent = sta.percentileofscore(data, value)
-        result = self.__printResult("value", value, "percentile", percent)
-        return result
-
-    def getValueRangePercentile(self, ds, value1, value2):
-        """
-        gets percentile difference for a value range
-
-        Parameters
-        ds: data set name or list or numpy array
-        value1: first value
-        value2: second value
-        """
-        self.__printBanner("getting percentile difference for value range", ds)
-        if value1 < value2:
-            v1 = value1
-            v2 = value2
-        else:
-            v1 = value2
-            v2 = value1
-        data = self.getNumericData(ds)
-        per1 = sta.percentileofscore(data, v1)
-        per2 = sta.percentileofscore(data, v2)
-        result = self.__printResult("valueFirst", value1, "valueSecond", value2, "percentileDiff", per2 - per1)
-        return result
-
-    def getValueAtPercentile(self, ds, percent):
-        """
-        gets value at percentile
-
-        Parameters
-        ds: data set name or list or numpy array
-        percent: percentile
-        """
-        self.__printBanner("getting value at percentile", ds)
-        data = self.getNumericData(ds)
-        assert isInRange(percent, 0, 100), "percent should be between 0 and 100"
-        value = sta.scoreatpercentile(data, percent)
-        result = self.__printResult("value", value, "percentile", percent)
-        return result
-
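The two scipy calls used above are inverses of each other: percentileofscore maps a value to its percentile rank, and scoreatpercentile maps a rank back to a value. A small illustrative round trip:

import numpy as np
from scipy import stats

data = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 12.0])
p = stats.percentileofscore(data, 7.0)    # percentile rank of the value 7
v = stats.scoreatpercentile(data, p)      # maps the rank back to a value
print(p, v)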
-    def getLessThanValues(self, ds, cvalue):
-        """
-        gets values less than given value
-
-        Parameters
-        ds: data set name or list or numpy array
-        cvalue: condition value
-        """
-        self.__printBanner("getting values less than", ds)
-        fdata = self.__getCondValues(ds, cvalue, "lt")
-        result = self.__printResult("count", len(fdata), "lessThanValues", fdata)
-        return result
-
-    def getGreaterThanValues(self, ds, cvalue):
-        """
-        gets values greater than given value
-
-        Parameters
-        ds: data set name or list or numpy array
-        cvalue: condition value
-        """
-        self.__printBanner("getting values greater than", ds)
-        fdata = self.__getCondValues(ds, cvalue, "gt")
-        result = self.__printResult("count", len(fdata), "greaterThanValues", fdata)
-        return result
-
-    def __getCondValues(self, ds, cvalue, cond):
-        """
-        gets conditional values
-
-        Parameters
-        ds: data set name or list or numpy array
-        cvalue: condition value
-        cond: condition
-        """
-        data = self.getNumericData(ds)
-        if cond == "lt":
-            ind = np.where(data < cvalue)
-        else:
-            ind = np.where(data > cvalue)
-        fdata = data[ind]
-        return fdata
-
-    def getUniqueValueCounts(self, ds, maxCnt=10):
-        """
-        gets unique values and counts
-
-        Parameters
-        ds: data set name or list or numpy array
-        maxCnt: max value count pairs to return
-        """
-        self.__printBanner("getting unique values and counts", ds)
-        data = self.getNumericData(ds)
-        values, counts = sta.find_repeats(data)
-        cardinality = len(values)
-        vc = list(zip(values, counts))
-        vc.sort(key=lambda v: v[1], reverse=True)
-        result = self.__printResult("cardinality", cardinality, "unique values and repeat counts", vc[:maxCnt])
-        return result
-
-    def getCatUniqueValueCounts(self, ds, maxCnt=10):
-        """
-        gets unique categorical values and counts
-
-        Parameters
-        ds: data set name or list or numpy array
-        maxCnt: max value count pairs to return
-        """
-        self.__printBanner("getting unique categorical values and counts", ds)
-        data = self.getCatData(ds)
-        series = pd.Series(data)
-        uvalues = series.value_counts()
-        values = uvalues.index.tolist()
-        counts = uvalues.tolist()
-        vc = list(zip(values, counts))
-        vc.sort(key=lambda v: v[1], reverse=True)
-        result = self.__printResult("cardinality", len(values), "unique values and repeat counts", vc[:maxCnt])
-        return result
-
-    def getCatAlphaValueCounts(self, ds):
-        """
-        gets alphabetic value count
-
-        Parameters
-        ds: data set name or list or numpy array
-        """
-        self.__printBanner("getting alphabetic value counts", ds)
-        data = self.getCatData(ds)
-        series = pd.Series(data)
-        flags = series.str.isalpha().tolist()
-        count = sum(flags)
-        result = self.__printResult("alphabeticValueCount", count)
-        return result
-
-    def getCatNumValueCounts(self, ds):
-        """
-        gets numeric value count
-
-        Parameters
-        ds: data set name or list or numpy array
-        """
-        self.__printBanner("getting numeric value counts", ds)
-        data = self.getCatData(ds)
-        series = pd.Series(data)
-        flags = series.str.isnumeric().tolist()
-        count = sum(flags)
-        result = self.__printResult("numericValueCount", count)
-        return result
-
-    def getCatAlphaNumValueCounts(self, ds):
-        """
-        gets alpha numeric value count
-
-        Parameters
-        ds: data set name or list or numpy array
-        """
-        self.__printBanner("getting alpha numeric value counts", ds)
-        data = self.getCatData(ds)
-        series = pd.Series(data)
-        flags = series.str.isalnum().tolist()
-        count = sum(flags)
-        result = self.__printResult("alphaNumericValueCount", count)
-        return result
-
-    def getCatAllCharCounts(self, ds):
-        """
-        gets alphabetic, numeric and special char count list
-
-        Parameters
-        ds: data set name or list or numpy array
-        """
-        self.__printBanner("getting alphabetic, numeric and special char counts", ds)
-        data = self.getCatData(ds)
-        counts = list()
-        for d in data:
-            r = getAlphaNumCharCount(d)
-            counts.append(r)
-        result = self.__printResult("allTypeCharCounts", counts)
-        return result
-
-    def getCatAlphaCharCounts(self, ds):
-        """
-        gets alphabetic char count list
-
-        Parameters
-        ds: data set name or list or numpy array
-        """
-        self.__printBanner("getting alphabetic char counts", ds)
-        data = self.getCatData(ds)
-        counts = self.getCatAllCharCounts(ds)["allTypeCharCounts"]
-        counts = list(map(lambda r: r[0], counts))
-        result = self.__printResult("alphaCharCounts", counts)
-        return result
-
-    def getCatNumCharCounts(self, ds):
-        """
-        gets numeric char count list
-
-        Parameters
-        ds: data set name or list or numpy array
-        """
-        self.__printBanner("getting numeric char counts", ds)
-        data = self.getCatData(ds)
-        counts = self.getCatAllCharCounts(ds)["allTypeCharCounts"]
-        counts = list(map(lambda r: r[1], counts))
-        result = self.__printResult("numCharCounts", counts)
-        return result
-
-    def getCatSpecialCharCounts(self, ds):
-        """
-        gets special char count list
-
-        Parameters
-        ds: data set name or list or numpy array
-        """
-        self.__printBanner("getting special char counts", ds)
-        counts = self.getCatAllCharCounts(ds)["allTypeCharCounts"]
-        counts = list(map(lambda r: r[2], counts))
-        result = self.__printResult("specialCharCounts", counts)
-        return result
-
-    def getCatAlphaCharCountStats(self, ds):
-        """
-        gets alphabetic char count stats
-
-        Parameters
-        ds: data set name or list or numpy array
-        """
-        self.__printBanner("getting alphabetic char count stats", ds)
-        counts = self.getCatAlphaCharCounts(ds)["alphaCharCounts"]
-        nz = counts.count(0)
-        st = self.__getBasicStats(np.array(counts))
-        result = self.__printResult("mean", st[0], "std dev", st[1], "max", st[2], "min", st[3], "zeroCount", nz)
-        return result
-
-    def getCatNumCharCountStats(self, ds):
-        """
-        gets numeric char count stats
-
-        Parameters
-        ds: data set name or list or numpy array
-        """
-        self.__printBanner("getting numeric char count stats", ds)
-        counts = self.getCatNumCharCounts(ds)["numCharCounts"]
-        nz = counts.count(0)
-        st = self.__getBasicStats(np.array(counts))
-        result = self.__printResult("mean", st[0], "std dev", st[1], "max", st[2], "min", st[3], "zeroCount", nz)
-        return result
-
-    def getCatSpecialCharCountStats(self, ds):
-        """
-        gets special char count stats
-
-        Parameters
-        ds: data set name or list or numpy array
-        """
-        self.__printBanner("getting special char count stats", ds)
-        counts = self.getCatSpecialCharCounts(ds)["specialCharCounts"]
-        nz = counts.count(0)
-        st = self.__getBasicStats(np.array(counts))
-        result = self.__printResult("mean", st[0], "std dev", st[1], "max", st[2], "min", st[3], "zeroCount", nz)
-        return result
-
-    def getCatFldLenStats(self, ds):
-        """
-        gets field length stats
-
-        Parameters
-        ds: data set name or list or numpy array
-        """
-        self.__printBanner("getting field length stats", ds)
-        data = self.getCatData(ds)
-        le = list(map(lambda d: len(d), data))
-        st = self.__getBasicStats(np.array(le))
-        result = self.__printResult("mean", st[0], "std dev", st[1], "max", st[2], "min", st[3])
-        return result
-
-    def getCatCharCountStats(self, ds, ch):
-        """
-        gets occurrence count stats for a specified char
-
-        Parameters
-        ds: data set name or list or numpy array
-        ch: character
-        """
-        self.__printBanner("getting char occurrence count stats", ds)
-        data = self.getCatData(ds)
-        counts = list(map(lambda d: d.count(ch), data))
-        nz = counts.count(0)
-        st = self.__getBasicStats(np.array(counts))
-        result = self.__printResult("mean", st[0], "std dev", st[1], "max", st[2], "min", st[3], "zeroCount", nz)
-        return result
-
-    def getStats(self, ds, nextreme=5):
-        """
-        gets summary statistics
-
-        Parameters
-        ds: data set name or list or numpy array
-        nextreme: num of extreme values
-        """
-        self.__printBanner("getting summary statistics", ds)
-        data = self.getNumericData(ds)
-        stat = dict()
-        stat["length"] = len(data)
-        stat["min"] = data.min()
-        stat["max"] = data.max()
-        series = pd.Series(data)
-        stat["n smallest"] = series.nsmallest(n=nextreme).tolist()
-        stat["n largest"] = series.nlargest(n=nextreme).tolist()
-        stat["mean"] = data.mean()
-        stat["median"] = np.median(data)
-        mode, modeCnt = sta.mode(data)
-        stat["mode"] = mode[0]
-        stat["mode count"] = modeCnt[0]
-        stat["std"] = np.std(data)
-        stat["skew"] = sta.skew(data)
-        stat["kurtosis"] = sta.kurtosis(data)
-        stat["mad"] = sta.median_absolute_deviation(data)
-        self.pp.pprint(stat)
-        return stat
-
-    def getStatsCat(self, ds):
-        """
-        gets summary statistics for categorical data
-
-        Parameters
-        ds: data set name or list or numpy array
-        """
-        self.__printBanner("getting summary statistics for categorical data", ds)
-        data = self.getCatData(ds)
-        ch = CatHistogram()
-        for d in data:
-            ch.add(d)
-        mode = ch.getMode()
-        entr = ch.getEntropy()
-        uvalues = ch.getUniqueValues()
-        distr = ch.getDistr()
-        result = self.__printResult("entropy", entr, "mode", mode, "uniqueValues", uvalues, "distr", distr)
-        return result
-
-    def getGroupByData(self, ds, gds, gdtypeCat, numBins=20):
-        """
-        groups data by the group by data set
-
-        Parameters
-        ds: data set name or list or numpy array
-        gds: group by data set name or list or numpy array
-        gdtypeCat: True if group by data is categorical
-        numBins: num of bins for numeric group by data
-        """
-        self.__printBanner("getting group by data", ds)
-        data = self.getAnyData(ds)
-        if gdtypeCat:
-            gdata = self.getCatData(gds)
-        else:
-            gdata = self.getNumericData(gds)
-            hist = Histogram.createWithNumBins(gdata, numBins)
-            gdata = list(map(lambda d: hist.bin(d), gdata))
-
-        self.ensureSameSize([data, gdata])
-        groups = dict()
-        for g, d in zip(gdata, data):
-            appendKeyedList(groups, g, d)
-
-        ve = self.verbose
-        self.verbose = False
-        result = self.__printResult("groupedData", groups)
-        self.verbose = ve
-        return result
-
-    def getDifference(self, ds, order, doPlot=False):
-        """
-        gets difference of given order
-
-        Parameters
-        ds: data set name or list or numpy array
-        order: order of difference
-        doPlot: True for plot
-        """
-        self.__printBanner("getting difference of given order", ds)
-        data = self.getNumericData(ds)
-        diff = difference(data, order)
-        if doPlot:
-            drawLine(diff)
-        return diff
-
-    def getTrend(self, ds, doPlot=False):
-        """
-        gets trend by linear regression on the time index
-
-        Parameters
-        ds: data set name or list or numpy array
-        doPlot: True if plotting needed
-        """
-        self.__printBanner("getting trend")
-        data = self.getNumericData(ds)
-        sz = len(data)
-        X = list(range(0, sz))
-        X = np.reshape(X, (sz, 1))
-        model = LinearRegression()
-        model.fit(X, data)
-        trend = model.predict(X)
-        sc = model.score(X, data)
-        coef = model.coef_
-        intc = model.intercept_
-        result = self.__printResult("coeff", coef, "intercept", intc, "r square error", sc, "trend", trend)
-
-        if doPlot:
-            plt.plot(data)
-            plt.plot(trend)
-            plt.show()
-        return result
-
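getTrend regresses the series against its time index, so the fitted slope is the per-step trend and score() gives the R squared of the linear fit. A standalone sketch with synthetic data (all numbers illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
y = 0.5 * np.arange(200) + rng.normal(scale=5.0, size=200)   # noisy upward trend

X = np.arange(200).reshape(-1, 1)
model = LinearRegression().fit(X, y)
print("slope", model.coef_[0], "intercept", model.intercept_, "r2", model.score(X, y))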
-    def getDiffSdNoisiness(self, ds):
-        """
-        gets noisiness based on std dev of first order difference
-
-        Parameters
-        ds: data set name or list or numpy array
-        """
-        diff = self.getDifference(ds, 1)
-        noise = np.std(np.array(diff))
-        result = self.__printResult("noisiness", noise)
-        return result
-
-    def getMaRmseNoisiness(self, ds, wsize=5):
-        """
-        gets noisiness based on RMSE with moving average
-
-        Parameters
-        ds: data set name or list or numpy array
-        wsize: window size
-        """
-        assert wsize % 2 == 1, "window size must be odd"
-        data = self.getNumericData(ds)
-        wind = data[:wsize]
-        wstat = SlidingWindowStat.initialize(wind.tolist())
-
-        whsize = int(wsize / 2)
-        beg = whsize
-        end = len(data) - whsize - 1
-        sumSq = 0.0
-        mean = wstat.getStat()[0]
-        diff = data[beg] - mean
-        sumSq += diff * diff
-        for i in range(beg + 1, end, 1):
-            mean = wstat.addGetStat(data[i + whsize])[0]
-            diff = data[i] - mean
-            sumSq += (diff * diff)
-
-        noise = math.sqrt(sumSq / (len(data) - 2 * whsize))
-        result = self.__printResult("noisiness", noise)
-        return result
-
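getMaRmseNoisiness scores noisiness as the RMSE of each point against the centered moving average around it. An equivalent vectorized sketch with numpy convolution, assuming the window includes the point itself as the method above does:

import numpy as np

def ma_rmse_noisiness(data, wsize=5):
    # RMSE of each point against the centered moving average around it
    data = np.asarray(data)
    half = wsize // 2
    ma = np.convolve(data, np.ones(wsize) / wsize, mode="valid")   # centered means
    center = data[half : len(data) - half]
    return float(np.sqrt(np.mean((center - ma) ** 2)))

rng = np.random.default_rng(2)
print(ma_rmse_noisiness(np.sin(np.linspace(0, 6, 300)) + rng.normal(scale=0.1, size=300)))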
-    def deTrend(self, ds, trend, doPlot=False):
-        """
-        de trends the data
-
-        Parameters
-        ds: data set name or list or numpy array
-        trend: trend data
-        doPlot: True if plotting needed
-        """
-        self.__printBanner("doing de trend", ds)
-        data = self.getNumericData(ds)
-        sz = len(data)
-        detrended = list(map(lambda i: data[i] - trend[i], range(sz)))
-        if doPlot:
-            drawLine(detrended)
-        return detrended
-
-    def getTimeSeriesComponents(self, ds, model, freq, summaryOnly, doPlot=False):
-        """
-        extracts trend, cycle and residue components of time series
-
-        Parameters
-        ds: data set name or list or numpy array
-        model: model type
-        freq: seasonality period
-        summaryOnly: True if only summary needed in output
-        doPlot: True if plotting needed
-        """
-        self.__printBanner("extracting trend, cycle and residue components of time series", ds)
-        assert model == "additive" or model == "multiplicative", "model must be additive or multiplicative"
-        data = self.getNumericData(ds)
-        res = seasonal_decompose(data, model=model, period=freq)
-        if doPlot:
-            res.plot()
-            plt.show()
-
-        #summary of components
-        trend = np.array(removeNan(res.trend))
-        trendMean = trend.mean()
-        trendSlope = (trend[-1] - trend[0]) / (len(trend) - 1)
-        seasonal = np.array(removeNan(res.seasonal))
-        seasonalAmp = (seasonal.max() - seasonal.min()) / 2
-        resid = np.array(removeNan(res.resid))
-        residueMean = resid.mean()
-        residueStdDev = np.std(resid)
-
-        if summaryOnly:
-            result = self.__printResult("trendMean", trendMean, "trendSlope", trendSlope, "seasonalAmp", seasonalAmp,
-            "residueMean", residueMean, "residueStdDev", residueStdDev)
-        else:
-            result = self.__printResult("trendMean", trendMean, "trendSlope", trendSlope, "seasonalAmp", seasonalAmp,
-            "residueMean", residueMean, "residueStdDev", residueStdDev, "trend", res.trend, "seasonal", res.seasonal,
-            "residual", res.resid)
-        return result
-
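getTimeSeriesComponents wraps statsmodels seasonal_decompose and then summarizes each component. A minimal sketch of the underlying call on a synthetic monthly-style series (period and amplitudes are illustrative):

import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose

rng = np.random.default_rng(1)
t = np.arange(240)
series = 0.05 * t + 2.0 * np.sin(2 * np.pi * t / 12) + rng.normal(scale=0.5, size=240)

res = seasonal_decompose(series, model="additive", period=12)
print(np.nanmean(res.trend), np.nanstd(res.resid))    # trend level and residual spread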
-    def getGausianMixture(self, ncomp, cvType, ninit, *dsl):
-        """
-        finds gaussian mixture parameters
-
-        Parameters
-        ncomp: num of gaussian components
-        cvType: covariance type
-        ninit: num of initializations
-        dsl: list of data set name or list or numpy array
-        """
-        self.__printBanner("getting gaussian mixture parameters", *dsl)
-        assertInList(cvType, ["full", "tied", "diag", "spherical"], "invalid covariance type")
-        dmat = self.__stackData(*dsl)
-
-        gm = GaussianMixture(n_components=ncomp, covariance_type=cvType, n_init=ninit)
-        gm.fit(dmat)
-        weights = gm.weights_
-        means = gm.means_
-        covars = gm.covariances_
-        converged = gm.converged_
-        niter = gm.n_iter_
-        aic = gm.aic(dmat)
-        result = self.__printResult("weights", weights, "mean", means, "covariance", covars, "converged", converged, "num iterations", niter, "aic", aic)
-        return result
-
-    def getKmeansCluster(self, nclust, ninit, *dsl):
-        """
-        gets cluster parameters
-
-        Parameters
-        nclust: num of clusters
-        ninit: num of initializations
-        dsl: list of data set name or list or numpy array
-        """
-        self.__printBanner("getting kmeans cluster parameters", *dsl)
-        dmat = self.__stackData(*dsl)
-        nsamp = dmat.shape[0]
-
-        km = KMeans(n_clusters=nclust, n_init=ninit)
-        km.fit(dmat)
-        centers = km.cluster_centers_
-        avdist = math.sqrt(km.inertia_ / nsamp)
-        niter = km.n_iter_
-        score = km.score(dmat)
-        result = self.__printResult("centers", centers, "average distance", avdist, "num iterations", niter, "score", score)
-        return result
-
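A quick standalone check of the GaussianMixture call used above, on two well-separated 1-D clusters stacked as a single-column feature matrix (synthetic, illustrative):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# two 1-D gaussian clusters stacked as a (n, 1) feature matrix
dmat = np.concatenate([rng.normal(0, 1, 300), rng.normal(6, 1, 300)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, covariance_type="full", n_init=3).fit(dmat)
print(gm.weights_, gm.means_.ravel(), gm.converged_)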
-    def getPrincComp(self, ncomp, *dsl):
-        """
-        finds principal component parameters
-
-        Parameters
-        ncomp: num of principal components
-        dsl: list of data set name or list or numpy array
-        """
-        self.__printBanner("getting principal component parameters", *dsl)
-        dmat = self.__stackData(*dsl)
-        nfeat = dmat.shape[1]
-        assertGreater(nfeat, 1, "requires multiple features")
-        assertLesserEqual(ncomp, nfeat, "num of components greater than num of features")
-
-        pca = PCA(n_components=ncomp)
-        pca.fit(dmat)
-        comps = pca.components_
-        var = pca.explained_variance_
-        varr = pca.explained_variance_ratio_
-        svalues = pca.singular_values_
-        result = self.__printResult("components", comps, "variance", var, "variance ratio", varr, "singular values", svalues)
-        return result
-
-    def getOutliersWithIsoForest(self, contamination, *dsl):
-        """
-        finds outliers using isolation forest
-
-        Parameters
-        contamination: proportion of outliers in the data set
-        dsl: list of data set name or list or numpy array
-        """
-        self.__printBanner("getting outliers using isolation forest", *dsl)
-        assert contamination >= 0 and contamination <= 0.5, "contamination outside valid range"
-        dmat = self.__stackData(*dsl)
-
-        isf = IsolationForest(contamination=contamination)
-        ypred = isf.fit_predict(dmat)
-        mask = ypred == -1
-        doul = dmat[mask, :]
-        mask = ypred != -1
-        dwoul = dmat[mask, :]
-        result = self.__printResult("numOutliers", doul.shape[0], "outliers", doul, "dataWithoutOutliers", dwoul)
-        return result
-
-    def getOutliersWithLocalFactor(self, contamination, *dsl):
-        """
-        gets outliers using local outlier factor
-
-        Parameters
-        contamination: proportion of outliers in the data set
-        dsl: list of data set name or list or numpy array
-        """
-        self.__printBanner("getting outliers using local outlier factor", *dsl)
-        assert contamination >= 0 and contamination <= 0.5, "contamination outside valid range"
-        dmat = self.__stackData(*dsl)
-
-        lof = LocalOutlierFactor(contamination=contamination)
-        ypred = lof.fit_predict(dmat)
-        mask = ypred == -1
-        doul = dmat[mask, :]
-        mask = ypred != -1
-        dwoul = dmat[mask, :]
-        result = self.__printResult("numOutliers", doul.shape[0], "outliers", doul, "dataWithoutOutliers", dwoul)
-        return result
-
-    def getOutliersWithSupVecMach(self, nu, *dsl):
-        """
-        gets outliers using one class svm
-
-        Parameters
-        nu: upper bound on the fraction of training errors and a lower bound of the fraction of support vectors
-        dsl: list of data set name or list or numpy array
-        """
-        self.__printBanner("getting outliers using one class svm", *dsl)
-        assert nu >= 0 and nu <= 0.5, "error upper bound outside valid range"
-        dmat = self.__stackData(*dsl)
-
-        svm = OneClassSVM(nu=nu)
-        ypred = svm.fit_predict(dmat)
-        mask = ypred == -1
-        doul = dmat[mask, :]
-        mask = ypred != -1
-        dwoul = dmat[mask, :]
-        result = self.__printResult("numOutliers", doul.shape[0], "outliers", doul, "dataWithoutOutliers", dwoul)
-        return result
-
-    def getOutliersWithCovarDeterminant(self, contamination, *dsl):
-        """
-        gets outliers using covariance determinant
-
-        Parameters
-        contamination: proportion of outliers in the data set
-        dsl: list of data set name or list or numpy array
-        """
-        self.__printBanner("getting outliers using covariance determinant", *dsl)
-        assert contamination >= 0 and contamination <= 0.5, "contamination outside valid range"
-        dmat = self.__stackData(*dsl)
-
-        ee = EllipticEnvelope(contamination=contamination)
-        ypred = ee.fit_predict(dmat)
-        mask = ypred == -1
-        doul = dmat[mask, :]
-        mask = ypred != -1
-        dwoul = dmat[mask, :]
-        result = self.__printResult("numOutliers", doul.shape[0], "outliers", doul, "dataWithoutOutliers", dwoul)
-        return result
-
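All four outlier methods above share the same scikit-learn convention: fit_predict returns -1 for outliers and 1 for inliers, and a boolean mask splits the matrix. A compact sketch with IsolationForest and two planted outliers (data and parameters illustrative):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)
dmat = np.vstack([rng.normal(0, 1, (200, 2)), np.array([[8.0, 8.0], [-7.0, 9.0]])])

ypred = IsolationForest(contamination=0.01, random_state=0).fit_predict(dmat)
outliers = dmat[ypred == -1]          # -1 marks outliers, 1 marks inliers
print(len(outliers), outliers)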
-    def getOutliersWithZscore(self, ds, zthreshold, stats=None):
-        """
-        gets outliers using zscore
-
-        Parameters
-        ds: data set name or list or numpy array
-        zthreshold: z score threshold
-        stats: tuple containing mean and std dev
-        """
-        self.__printBanner("getting outliers using zscore", ds)
-        data = self.getNumericData(ds)
-        if stats is None:
-            mean = data.mean()
-            sd = np.std(data)
-        else:
-            mean = stats[0]
-            sd = stats[1]
-
-        zs = list(map(lambda d: abs((d - mean) / sd), data))
-        outliers = list(filter(lambda r: r[1] > zthreshold, enumerate(zs)))
-        result = self.__printResult("outliers", outliers)
-        return result
-
-    def getOutliersWithRobustZscore(self, ds, zthreshold, stats=None):
-        """
-        gets outliers using robust zscore
-
-        Parameters
-        ds: data set name or list or numpy array
-        zthreshold: z score threshold
-        stats: tuple containing median and median absolute deviation
-        """
-        self.__printBanner("getting outliers using robust zscore", ds)
-        data = self.getNumericData(ds)
-        if stats is None:
-            med = np.median(data)
-            dev = np.array(list(map(lambda d: abs(d - med), data)))
-            #1.4826 scales the MAD to a consistent std dev estimate for gaussian data
-            mad = 1.4826 * np.median(dev)
-        else:
-            med = stats[0]
-            mad = stats[1]
-
-        rzs = list(map(lambda d: abs((d - med) / mad), data))
-        outliers = list(filter(lambda r: r[1] > zthreshold, enumerate(rzs)))
-        result = self.__printResult("outliers", outliers)
-        return result
-
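A standalone sketch of the robust z-score used above: deviations from the median are scaled by the MAD, with the 1.4826 factor making the MAD a consistent estimate of the standard deviation for gaussian data:

import numpy as np

def robust_zscores(data):
    data = np.asarray(data)
    med = np.median(data)
    mad = 1.4826 * np.median(np.abs(data - med))   # MAD scaled to match std dev
    return np.abs(data - med) / mad

scores = robust_zscores([10.0, 10.2, 9.9, 10.1, 10.0, 25.0])
print(np.nonzero(scores > 3.0)[0])                 # index of the outlier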
-    def getSubsequenceOutliersWithDissimilarity(self, subSeqSize, ds):
-        """
-        gets subsequence outlier with subsequence pairwise dissimilarity
-
-        Parameters
-        subSeqSize: sub sequence size
-        ds: data set name or list or numpy array
-        """
-        self.__printBanner("doing sub sequence anomaly detection with dissimilarity", ds)
-        data = self.getNumericData(ds)
-        sz = len(data)
-        dist = dict()
-        minDist = dict()
-        for i in range(sz - subSeqSize):
-            #first window
-            w1 = data[i : i + subSeqSize]
-            dmin = None
-            for j in range(sz - subSeqSize):
-                #second window not overlapping with the first
-                if j + subSeqSize <= i or j >= i + subSeqSize:
-                    w2 = data[j : j + subSeqSize]
-                    k = (j, i)
-                    if k in dist:
-                        d = dist[k]
-                    else:
-                        d = euclideanDistance(w1, w2)
-                        k = (i, j)
-                        dist[k] = d
-                    if dmin is None:
-                        dmin = d
-                    else:
-                        dmin = d if d < dmin else dmin
-            minDist[i] = dmin
-
-        #find max of min distances
-        dmax = None
-        offset = None
-        for k in minDist.keys():
-            d = minDist[k]
-            if dmax is None:
-                dmax = d
-                offset = k
-            else:
-                if d > dmax:
-                    dmax = d
-                    offset = k
-        result = self.__printResult("subSeqOffset", offset, "outlierScore", dmax)
-        return result
-
-    def getNullCount(self, ds):
-        """
-        gets count of null fields
-
-        Parameters
-        ds: data set name or list or numpy array with data
-        """
-        self.__printBanner("getting null value count", ds)
-        if type(ds) == str:
-            assert ds in self.dataSets, "data set {} does not exist, please add it first".format(ds)
-            data = self.dataSets[ds]
-            ser = pd.Series(data)
-        elif type(ds) == list or type(ds) == np.ndarray:
-            ser = pd.Series(ds)
-            data = ds
-        else:
-            raise ValueError("invalid data type")
-        nv = ser.isnull().tolist()
-        nullCount = nv.count(True)
-        nullFraction = nullCount / len(data)
-        result = self.__printResult("nullFraction", nullFraction, "nullCount", nullCount)
-        return result
-
-    def fitLinearReg(self, dsx, ds, doPlot=False):
-        """
-        fits linear regression
-
-        Parameters
-        dsx: x data set name or None
-        ds: data set name or list or numpy array
-        doPlot: True if plotting needed
-        """
-        self.__printBanner("fitting linear regression", ds)
-        data = self.getNumericData(ds)
-        if dsx is None:
-            x = np.arange(len(data))
-        else:
-            x = self.getNumericData(dsx)
-        slope, intercept, rvalue, pvalue, stderr = sta.linregress(x, data)
-        result = self.__printResult("slope", slope, "intercept", intercept, "rvalue", rvalue, "pvalue", pvalue, "stderr", stderr)
-        if doPlot:
-            self.plotRegFit(x, data, slope, intercept)
-        return result
-
-    def fitSiegelRobustLinearReg(self, ds, doPlot=False):
-        """
-        Siegel robust linear regression fit based on median
-
-        Parameters
-        ds: data set name or list or numpy array
-        doPlot: True if plotting needed
-        """
-        self.__printBanner("fitting siegel robust linear regression based on median", ds)
-        data = self.getNumericData(ds)
-        slope, intercept = sta.siegelslopes(data)
-        result = self.__printResult("slope", slope, "intercept", intercept)
-        if doPlot:
-            x = np.arange(len(data))
-            self.plotRegFit(x, data, slope, intercept)
-        return result
-
-    def fitTheilSenRobustLinearReg(self, ds, doPlot=False):
-        """
-        Theil-Sen robust linear regression fit based on median
-
-        Parameters
-        ds: data set name or list or numpy array
-        doPlot: True if plotting needed
-        """
-        self.__printBanner("fitting theil sen robust linear regression based on median", ds)
-        data = self.getNumericData(ds)
-        slope, intercept, loSlope, upSlope = sta.theilslopes(data)
-        result = self.__printResult("slope", slope, "intercept", intercept, "lower slope", loSlope, "upper slope", upSlope)
-        if doPlot:
-            x = np.arange(len(data))
-            self.plotRegFit(x, data, slope, intercept)
-        return result
-
-    def plotRegFit(self, x, y, slope, intercept):
-        """
-        plots linear regression fit line
-
-        Parameters
-        x: x values
-        y: y values
-        slope: slope
-        intercept: intercept
-        """
-        self.__printBanner("plotting linear regression fit line")
-        fig = plt.figure()
-        ax = fig.add_subplot(111)
-        ax.plot(x, y, "b.")
-        ax.plot(x, intercept + slope * x, "r-")
-        plt.show()
-
-    def getRegFit(self, xvalues, yvalues, slope, intercept):
-        """
-        gets fitted line and residue
-
-        Parameters
-        xvalues: x values
-        yvalues: y values
-        slope: regression slope
-        intercept: regression intercept
-        """
-        yfit = list()
-        residue = list()
-        for x, y in zip(xvalues, yvalues):
-            yf = x * slope + intercept
-            yfit.append(yf)
-            r = y - yf
-            residue.append(r)
-        result = self.__printResult("fitted line", yfit, "residue", residue)
-        return result
-
-    def getInfluentialPoints(self, dsx, dsy):
-        """
-        gets influential points in regression model with Cook's distance
-
-        Parameters
-        dsx: data set name or list or numpy array for x
-        dsy: data set name or list or numpy array for y
-        """
-        self.__printBanner("finding influential points for linear regression", dsx, dsy)
-        y = self.getNumericData(dsy)
-        x = np.arange(len(y)) if dsx is None else self.getNumericData(dsx)
-        model = sm.OLS(y, x).fit()
-        np.set_printoptions(suppress=True)
-        influence = model.get_influence()
-        cooks = influence.cooks_distance
-        result = self.__printResult("Cook distance", cooks)
-        return result
-
-    def getCovar(self, *dsl):
-        """
-        gets covariance
-
-        Parameters
-        dsl: list of data set name or list or numpy array
-        """
-        self.__printBanner("getting covariance", *dsl)
-        data = list(map(lambda ds: self.getNumericData(ds), dsl))
-        self.ensureSameSize(data)
-        data = np.vstack(data)
-        cv = np.cov(data)
-        print(cv)
-        return cv
-
-    def getPearsonCorr(self, ds1, ds2, sigLev=.05):
-        """
-        gets pearson correlation coefficient
-
-        Parameters
-        ds1: data set name or list or numpy array
-        ds2: data set name or list or numpy array
-        sigLev: statistical significance level
-        """
-        self.__printBanner("getting pearson correlation coefficient", ds1, ds2)
-        data1 = self.getNumericData(ds1)
-        data2 = self.getNumericData(ds2)
-        self.ensureSameSize([data1, data2])
-        stat, pvalue = sta.pearsonr(data1, data2)
-        result = self.__printResult("stat", stat, "pvalue", pvalue)
-        self.__printStat(stat, pvalue, "probably uncorrelated", "probably correlated", sigLev)
-        return result
-
-    def getSpearmanRankCorr(self, ds1, ds2, sigLev=.05):
-        """
-        gets spearman correlation coefficient
-
-        Parameters
-        ds1: data set name or list or numpy array
-        ds2: data set name or list or numpy array
-        sigLev: statistical significance level
-        """
-        self.__printBanner("getting spearman correlation coefficient", ds1, ds2)
-        data1 = self.getNumericData(ds1)
-        data2 = self.getNumericData(ds2)
-        self.ensureSameSize([data1, data2])
-        stat, pvalue = sta.spearmanr(data1, data2)
-        result = self.__printResult("stat", stat, "pvalue", pvalue)
-        self.__printStat(stat, pvalue, "probably uncorrelated", "probably correlated", sigLev)
-        return result
-
-    def getKendalRankCorr(self, ds1, ds2, sigLev=.05):
-        """
-        kendall’s tau, a correlation measure for ordinal data
-
-        Parameters
-        ds1: data set name or list or numpy array
-        ds2: data set name or list or numpy array
-        sigLev: statistical significance level
-        """
-        self.__printBanner("getting kendall’s tau, a correlation measure for ordinal data", ds1, ds2)
-        data1 = self.getNumericData(ds1)
-        data2 = self.getNumericData(ds2)
-        self.ensureSameSize([data1, data2])
-        stat, pvalue = sta.kendalltau(data1, data2)
-        result = self.__printResult("stat", stat, "pvalue", pvalue)
-        self.__printStat(stat, pvalue, "probably uncorrelated", "probably correlated", sigLev)
-        return result
-
-    def getPointBiserialCorr(self, ds1, ds2, sigLev=.05):
-        """
-        point biserial correlation between binary and numeric
-
-        Parameters
-        ds1: data set name or list or numpy array
-        ds2: data set name or list or numpy array
-        sigLev: statistical significance level
-        """
-        self.__printBanner("getting point biserial correlation between binary and numeric", ds1, ds2)
-        data1 = self.getNumericData(ds1)
-        data2 = self.getNumericData(ds2)
-        assert isBinary(data1), "first data set is not binary"
-        self.ensureSameSize([data1, data2])
-        stat, pvalue = sta.pointbiserialr(data1, data2)
-        result = self.__printResult("stat", stat, "pvalue", pvalue)
-        self.__printStat(stat, pvalue, "probably uncorrelated", "probably correlated", sigLev)
-        return result
-
-    def getConTab(self, ds1, ds2):
-        """
-        gets contingency table for categorical data pair
-
-        Parameters
-        ds1: data set name or list or numpy array
-        ds2: data set name or list or numpy array
-        """
-        self.__printBanner("getting contingency table for categorical data", ds1, ds2)
-        data1 = self.getCatData(ds1)
-        data2 = self.getCatData(ds2)
-        self.ensureSameSize([data1, data2])
-        crosstab = pd.crosstab(pd.Series(data1), pd.Series(data2), margins=False)
-        ctab = crosstab.values
-        print("contingency table")
-        print(ctab)
-        return ctab
-
-    def getChiSqCorr(self, ds1, ds2, sigLev=.05):
-        """
-        chi square correlation for categorical data pair
-
-        Parameters
-        ds1: data set name or list or numpy array
-        ds2: data set name or list or numpy array
-        sigLev: statistical significance level
-        """
-        self.__printBanner("getting chi square correlation for two categorical", ds1, ds2)
-        ctab = self.getConTab(ds1, ds2)
-        stat, pvalue, dof, expctd = sta.chi2_contingency(ctab)
-        result = self.__printResult("stat", stat, "pvalue", pvalue, "dof", dof, "expected", expctd)
-        self.__printStat(stat, pvalue, "probably uncorrelated", "probably correlated", sigLev)
-        return result
-
-    def getSizeCorrectChiSqCorr(self, ds1, ds2, chisq):
-        """
-        Cramer's V size corrected chi square correlation for categorical data pair
-
-        Parameters
-        ds1: data set name or list or numpy array
-        ds2: data set name or list or numpy array
-        chisq: chi square stat
-        """
-        self.__printBanner("getting size corrected chi square correlation for two categorical", ds1, ds2)
-        c1 = self.getCatUniqueValueCounts(ds1)["cardinality"]
-        c2 = self.getCatUniqueValueCounts(ds2)["cardinality"]
-        c = min(c1, c2)
-        assertGreater(c, 1, "min cardinality should be greater than 1")
-        l = len(self.getCatData(ds1))
-        t = l * (c - 1)
-        stat = math.sqrt(chisq / t)
-        result = self.__printResult("stat", stat)
-        return result
-
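getSizeCorrectChiSqCorr computes Cramer's V: the chi square statistic normalized by sample size times (min cardinality - 1), giving a value in [0, 1]. A standalone sketch on a small contingency table (numbers illustrative):

import math
import numpy as np
from scipy import stats

ctab = np.array([[30, 10], [15, 45]])              # 2 x 2 contingency table
chisq = stats.chi2_contingency(ctab)[0]
n = ctab.sum()
c = min(ctab.shape)                                # smaller cardinality
v = math.sqrt(chisq / (n * (c - 1)))               # Cramer's V in [0, 1]
print(v)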
-    def getAnovaCorr(self, ds1, ds2, grByCol, sigLev=.05):
-        """
-        anova correlation for numerical categorical
-
-        Parameters
-        ds1: data set name or list or numpy array
-        ds2: data set name or list or numpy array
-        grByCol: group by column
-        sigLev: statistical significance level
-        """
-        self.__printBanner("anova correlation for numerical categorical", ds1, ds2)
-        df = self.loadCatFloatDataFrame(ds1, ds2) if grByCol == 0 else self.loadCatFloatDataFrame(ds2, ds1)
-        grByCol = 0
-        dCol = 1
-        grouped = df.groupby([grByCol])
-        dlist = list(map(lambda v: v[1].loc[:, dCol].values, grouped))
-        stat, pvalue = sta.f_oneway(*dlist)
-        result = self.__printResult("stat", stat, "pvalue", pvalue)
-        self.__printStat(stat, pvalue, "probably uncorrelated", "probably correlated", sigLev)
-        return result
-
-    def plotAutoCorr(self, ds, lags, alpha, diffOrder=0):
-        """
-        plots auto correlation
-
-        Parameters
-        ds: data set name or list or numpy array
-        lags: num of lags
-        alpha: confidence level
-        diffOrder: order of differencing applied before plotting
-        """
-        self.__printBanner("plotting auto correlation", ds)
-        data = self.getNumericData(ds)
-        ddata = difference(data, diffOrder) if diffOrder > 0 else data
-        tsaplots.plot_acf(ddata, lags=lags, alpha=alpha)
-        plt.show()
-
-    def getAutoCorr(self, ds, lags, alpha=.05):
-        """
-        gets auto correlation
-
-        Parameters
-        ds: data set name or list or numpy array
-        lags: num of lags
-        alpha: confidence level
-        """
-        self.__printBanner("getting auto correlation", ds)
-        data = self.getNumericData(ds)
-        autoCorr, confIntv = stt.acf(data, nlags=lags, fft=False, alpha=alpha)
-        result = self.__printResult("autoCorr", autoCorr, "confIntv", confIntv)
-        return result
-
-    def plotParAcf(self, ds, lags, alpha):
-        """
-        plots partial auto correlation
-
-        Parameters
-        ds: data set name or list or numpy array
-        lags: num of lags
-        alpha: confidence level
-        """
-        self.__printBanner("plotting partial auto correlation", ds)
-        data = self.getNumericData(ds)
-        tsaplots.plot_pacf(data, lags=lags, alpha=alpha)
-        plt.show()
-
-    def getParAutoCorr(self, ds, lags, alpha=.05):
-        """
-        gets partial auto correlation
-
-        Parameters
-        ds: data set name or list or numpy array
-        lags: num of lags
-        alpha: confidence level
-        """
-        self.__printBanner("getting partial auto correlation", ds)
-        data = self.getNumericData(ds)
-        partAutoCorr, confIntv = stt.pacf(data, nlags=lags, alpha=alpha)
-        result = self.__printResult("partAutoCorr", partAutoCorr, "confIntv", confIntv)
-        return result
-
-    def getHurstExp(self, ds, kind, doPlot=True):
-        """
-        gets Hurst exponent of time series
-
-        Parameters
-        ds: data set name or list or numpy array
-        kind: kind of data change, random_walk, price
-        doPlot: True for plot
-        """
-        self.__printBanner("getting Hurst exponent", ds)
-        data = self.getNumericData(ds)
-        h, c, odata = hurst.compute_Hc(data, kind=kind, simplified=False)
-        if doPlot:
-            f, ax = plt.subplots()
-            ax.plot(odata[0], c * odata[0] ** h, color="deepskyblue")
-            ax.scatter(odata[0], odata[1], color="purple")
-            ax.set_xscale("log")
-            ax.set_yscale("log")
-            ax.set_xlabel("time interval")
-            ax.set_ylabel("cum dev range and std dev ratio")
-            ax.grid(True)
-            plt.show()
-
-        result = self.__printResult("hurstExponent", h, "hurstConstant", c)
-        return result
-
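A minimal sketch of the hurst package call used above; a pure random walk should come out near H = 0.5 (treat the exact API details here as an assumption of this sketch):

import numpy as np
from hurst import compute_Hc

rng = np.random.default_rng(2)
walk = np.cumsum(rng.normal(size=2000)) + 1000.0    # brownian-like series

H, c, _ = compute_Hc(walk, kind="random_walk", simplified=True)
print("Hurst exponent", H)    # close to 0.5 for a pure random walk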
-    def approxEntropy(self, ds, m, r):
-        """
-        gets approximate entropy of time series (ref: wikipedia)
-
-        Parameters
-        ds: data set name or list or numpy array
-        m: length of compared run of data
-        r: filtering level
-        """
-        self.__printBanner("getting approximate entropy", ds)
-        ldata = self.getNumericData(ds)
-        aent = abs(self.__phi(ldata, m + 1, r) - self.__phi(ldata, m, r))
-        result = self.__printResult("approxEntropy", aent)
-        return result
-
-    def __phi(self, ldata, m, r):
-        """
-        phi function for approximate entropy
-
-        Parameters
-        ldata: data array
-        m: length of compared run of data
-        r: filtering level
-        """
-        le = len(ldata)
-        x = [[ldata[j] for j in range(i, i + m)] for i in range(le - m + 1)]
-        lex = len(x)
-        c = list()
-        for i in range(lex):
-            cnt = 0
-            for j in range(lex):
-                cnt += (1 if maxListDist(x[i], x[j]) <= r else 0)
-            cnt /= (le - m + 1.0)
-            c.append(cnt)
-        return sum(np.log(c)) / (le - m + 1.0)
-
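The two methods above implement approximate entropy as |phi(m+1) - phi(m)|, where phi counts, for each length-m template, the fraction of templates within Chebyshev distance r. A compact vectorized sketch of the same formulation; a perfectly regular series should score near zero:

import numpy as np

def approx_entropy(u, m, r):
    def phi(m):
        n = len(u) - m + 1
        x = np.array([u[i : i + m] for i in range(n)])
        # fraction of templates within chebyshev distance r of each template
        counts = [np.sum(np.max(np.abs(x - xi), axis=1) <= r) / n for xi in x]
        return np.sum(np.log(counts)) / n
    return abs(phi(m + 1) - phi(m))

print(approx_entropy(np.tile([1.0, 2.0], 50), m=2, r=0.5))    # ~0 for a regular series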
-    def oneSpaceEntropy(self, ds, scaMethod="zscale"):
-        """
-        gets one space entropy (ref: Estimating mutual information by Kraskov)
-
-        Parameters
-        ds: data set name or list or numpy array
-        scaMethod: scaling method
-        """
-        self.__printBanner("getting one space entropy", ds)
-        data = self.getNumericData(ds)
-        sdata = sorted(data)
-        sdata = scaleData(sdata, scaMethod)
-        su = 0
-        n = len(sdata)
-        for i in range(1, n, 1):
-            t = abs(sdata[i] - sdata[i-1])
-            if t > 0:
-                su += math.log(t)
-        su /= (n - 1)
-        ose = digammaFun(n) - digammaFun(1) + su
-        result = self.__printResult("entropy", ose)
-        return result
-
-    def plotCrossCorr(self, ds1, ds2, normed, lags):
-        """
-        plots cross correlation
-
-        Parameters
-        ds1: data set name or list or numpy array
-        ds2: data set name or list or numpy array
-        normed: if True, input vectors are normalised to unit length
-        lags: num of lags
-        """
-        self.__printBanner("plotting cross correlation between two numeric", ds1, ds2)
-        data1 = self.getNumericData(ds1)
-        data2 = self.getNumericData(ds2)
-        self.ensureSameSize([data1, data2])
-        plt.xcorr(data1, data2, normed=normed, maxlags=lags)
-        plt.show()
-
-    def getCrossCorr(self, ds1, ds2):
-        """
-        gets cross correlation
-
-        Parameters
-        ds1: data set name or list or numpy array
-        ds2: data set name or list or numpy array
-        """
-        self.__printBanner("getting cross correlation", ds1, ds2)
-        data1 = self.getNumericData(ds1)
-        data2 = self.getNumericData(ds2)
-        self.ensureSameSize([data1, data2])
-        crossCorr = stt.ccf(data1, data2)
-        result = self.__printResult("crossCorr", crossCorr)
-        return result
-
-    def getFourierTransform(self, ds):
-        """
-        gets fast fourier transform
-
-        Parameters
-        ds: data set name or list or numpy array
-        """
-        self.__printBanner("getting fourier transform", ds)
-        data = self.getNumericData(ds)
-        ft = np.fft.rfft(data)
-        result = self.__printResult("fourierTransform", ft)
-        return result
-
-    def testStationaryAdf(self, ds, regression, autolag, sigLev=.05):
-        """
-        ADF stationarity test, null hypothesis: not stationary
-
-        Parameters
-        ds: data set name or list or numpy array
-        regression: constant and trend order to include in regression
-        autolag: method to use when automatically determining the lag
-        sigLev: statistical significance level
-        """
-        self.__printBanner("doing ADF stationary test", ds)
-        relist = ["c", "ct", "ctt", "nc"]
-        assert regression in relist, "invalid regression value"
-        alList = ["AIC", "BIC", "t-stat", None]
-        assert autolag in alList, "invalid autolag value"
-
-        data = self.getNumericData(ds)
-        re = stt.adfuller(data, regression=regression, autolag=autolag)
-        result = self.__printResult("stat", re[0], "pvalue", re[1], "num lags", re[2], "num observation for regression", re[3],
-        "critical values", re[4])
-        self.__printStat(re[0], re[1], "probably not stationary", "probably stationary", sigLev)
-        return result
-
-    def testStationaryKpss(self, ds, regression, nlags, sigLev=.05):
-        """
-        KPSS stationarity test, null hypothesis: stationary
-
-        Parameters
-        ds: data set name or list or numpy array
-        regression: constant and trend order to include in regression
-        nlags: num of lags
-        sigLev: statistical significance level
-        """
-        self.__printBanner("doing KPSS stationary test", ds)
-        relist = ["c", "ct"]
-        assert regression in relist, "invalid regression value"
-        nlList = [None, "auto", "legacy"]
-        assert nlags in nlList or type(nlags) == int, "invalid nlags value"
-
-        data = self.getNumericData(ds)
-        stat, pvalue, nLags, criticalValues = stt.kpss(data, regression=regression, nlags=nlags)
-        result = self.__printResult("stat", stat, "pvalue", pvalue, "num lags", nLags, "critical values", criticalValues)
-        self.__printStat(stat, pvalue, "probably stationary", "probably not stationary", sigLev)
-        return result
-
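A standalone sketch of the two statsmodels calls used above: ADF (null: unit root, not stationary) on a random walk, and KPSS (null: stationary) on its first difference. Parameter values are illustrative:

import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

rng = np.random.default_rng(3)
walk = np.cumsum(rng.normal(size=500))    # random walk, not stationary

stat, pvalue = adfuller(walk, regression="c", autolag="AIC")[:2]
print("adf pvalue", pvalue)               # large, fails to reject the unit root

stat, pvalue = kpss(np.diff(walk), regression="c", nlags="auto")[:2]
print("kpss pvalue", pvalue)              # large, consistent with stationarity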
-    def testNormalJarqBera(self, ds, sigLev=.05):
-        """
-        Jarque-Bera normalcy test
-
-        Parameters
-        ds: data set name or list or numpy array
-        sigLev: statistical significance level
-        """
-        self.__printBanner("doing jarque bera normalcy test", ds)
-        data = self.getNumericData(ds)
-        jb, jbpv, skew, kurtosis = sstt.jarque_bera(data)
-        result = self.__printResult("stat", jb, "pvalue", jbpv, "skew", skew, "kurtosis", kurtosis)
-        self.__printStat(jb, jbpv, "probably gaussian", "probably not gaussian", sigLev)
-        return result
-
-    def testNormalShapWilk(self, ds, sigLev=.05):
-        """
-        Shapiro-Wilk normalcy test
-
-        Parameters
-        ds: data set name or list or numpy array
-        sigLev: statistical significance level
-        """
-        self.__printBanner("doing shapiro wilks normalcy test", ds)
-        data = self.getNumericData(ds)
-        stat, pvalue = sta.shapiro(data)
-        result = self.__printResult("stat", stat, "pvalue", pvalue)
-        self.__printStat(stat, pvalue, "probably gaussian", "probably not gaussian", sigLev)
-        return result
-
-    def testNormalDagast(self, ds, sigLev=.05):
-        """
-        D’Agostino’s K square normalcy test
-
-        Parameters
-        ds: data set name or list or numpy array
-        sigLev: statistical significance level
-        """
-        self.__printBanner("doing D’Agostino’s K square normalcy test", ds)
-        data = self.getNumericData(ds)
-        stat, pvalue = sta.normaltest(data)
-        result = self.__printResult("stat", stat, "pvalue", pvalue)
-        self.__printStat(stat, pvalue, "probably gaussian", "probably not gaussian", sigLev)
-        return result
-
- def testDistrAnderson(self, ds, dist, sigLev=.05):
2267
- """
2268
- Anderson test for normal, expon, logistic, gumbel, gumbel_l, gumbel_r
2269
-
2270
- Parameters
2271
- ds: data set name or list or numpy array
2272
- dist: type of distribution
2273
- sigLev: statistical significance level
2274
- """
2275
- self.__printBanner("doing Anderson test for for various distributions", ds)
2276
- diList = ["norm", "expon", "logistic", "gumbel", "gumbel_l", "gumbel_r", "extreme1"]
2277
- assert dist in diList, "invalid distribution"
2278
-
2279
- data = self.getNumericData(ds)
2280
- re = sta.anderson(data)
2281
- slAlpha = int(100 * sigLev)
2282
- msg = "significnt value not found"
2283
- for i in range(len(re.critical_values)):
2284
- sl, cv = re.significance_level[i], re.critical_values[i]
2285
- if int(sl) == slAlpha:
2286
- if re.statistic < cv:
2287
- msg = "probably {} at the {:.3f} siginificance level".format(dist, sl)
2288
- else:
2289
- msg = "probably not {} at the {:.3f} siginificance level".format(dist, sl)
2290
- result = self.__printResult("stat", re.statistic, "test", msg)
2291
- print(msg)
2292
- return result
2293
-
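For orientation, the normality tests above look like this when scipy is called directly on a deliberately non gaussian sample (an illustrative sketch; the exponential data is an assumption):

import numpy as np
from scipy import stats as sta

x = np.random.exponential(size=300)
swStat, swPvalue = sta.shapiro(x)
daStat, daPvalue = sta.normaltest(x)
adResult = sta.anderson(x, dist="norm")
# expect tiny pvalues and an Anderson statistic above its critical values
print(swPvalue, daPvalue, adResult.statistic, adResult.critical_values)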
- 	def testSkew(self, ds, sigLev=.05):
- 		"""
- 		test skew wrt normal distr
- 
- 		Parameters
- 		ds: data set name or list or numpy array
- 		sigLev: statistical significance level
- 		"""
- 		self.__printBanner("testing skew wrt normal distr", ds)
- 		data = self.getNumericData(ds)
- 		stat, pvalue = sta.skewtest(data)
- 		result = self.__printResult("stat", stat, "pvalue", pvalue)
- 		self.__printStat(stat, pvalue, "probably same skew as normal distribution", "probably not same skew as normal distribution", sigLev)
- 		return result
- 
- 	def testTwoSampleStudent(self, ds1, ds2, sigLev=.05):
- 		"""
- 		Student t 2 sample test
- 
- 		Parameters
- 		ds1: data set name or list or numpy array
- 		ds2: data set name or list or numpy array
- 		sigLev: statistical significance level
- 		"""
- 		self.__printBanner("doing Student t 2 sample test", ds1, ds2)
- 		data1 = self.getNumericData(ds1)
- 		data2 = self.getNumericData(ds2)
- 		stat, pvalue = sta.ttest_ind(data1, data2)
- 		result = self.__printResult("stat", stat, "pvalue", pvalue)
- 		self.__printStat(stat, pvalue, "probably same distribution", "probably not same distribution", sigLev)
- 		return result
- 
- 	def testTwoSampleKs(self, ds1, ds2, sigLev=.05):
- 		"""
- 		Kolmogorov-Smirnov 2 sample statistic
- 
- 		Parameters
- 		ds1: data set name or list or numpy array
- 		ds2: data set name or list or numpy array
- 		sigLev: statistical significance level
- 		"""
- 		self.__printBanner("doing Kolmogorov-Smirnov 2 sample test", ds1, ds2)
- 		data1 = self.getNumericData(ds1)
- 		data2 = self.getNumericData(ds2)
- 		stat, pvalue = sta.ks_2samp(data1, data2)
- 		result = self.__printResult("stat", stat, "pvalue", pvalue)
- 		self.__printStat(stat, pvalue, "probably same distribution", "probably not same distribution", sigLev)
- 		return result
- 
- 	def testTwoSampleMw(self, ds1, ds2, sigLev=.05):
- 		"""
- 		Mann-Whitney 2 sample statistic
- 
- 		Parameters
- 		ds1: data set name or list or numpy array
- 		ds2: data set name or list or numpy array
- 		sigLev: statistical significance level
- 		"""
- 		self.__printBanner("doing Mann-Whitney 2 sample test", ds1, ds2)
- 		data1 = self.getNumericData(ds1)
- 		data2 = self.getNumericData(ds2)
- 		stat, pvalue = sta.mannwhitneyu(data1, data2)
- 		result = self.__printResult("stat", stat, "pvalue", pvalue)
- 		self.__printStat(stat, pvalue, "probably same distribution", "probably not same distribution", sigLev)
- 		return result
- 
- 	def testTwoSampleWilcox(self, ds1, ds2, sigLev=.05):
- 		"""
- 		Wilcoxon Signed-Rank 2 sample statistic
- 
- 		Parameters
- 		ds1: data set name or list or numpy array
- 		ds2: data set name or list or numpy array
- 		sigLev: statistical significance level
- 		"""
- 		self.__printBanner("doing Wilcoxon Signed-Rank 2 sample test", ds1, ds2)
- 		data1 = self.getNumericData(ds1)
- 		data2 = self.getNumericData(ds2)
- 		stat, pvalue = sta.wilcoxon(data1, data2)
- 		result = self.__printResult("stat", stat, "pvalue", pvalue)
- 		self.__printStat(stat, pvalue, "probably same distribution", "probably not same distribution", sigLev)
- 		return result
- 
- 	def testTwoSampleKw(self, ds1, ds2, sigLev=.05):
- 		"""
- 		Kruskal-Wallis 2 sample statistic
- 
- 		Parameters
- 		ds1: data set name or list or numpy array
- 		ds2: data set name or list or numpy array
- 		sigLev: statistical significance level
- 		"""
- 		self.__printBanner("doing Kruskal-Wallis 2 sample test", ds1, ds2)
- 		data1 = self.getNumericData(ds1)
- 		data2 = self.getNumericData(ds2)
- 		stat, pvalue = sta.kruskal(data1, data2)
- 		result = self.__printResult("stat", stat, "pvalue", pvalue)
- 		self.__printStat(stat, pvalue, "probably same distribution", "probably not same distribution", sigLev)
- 		return result
- 
- 	def testTwoSampleFriedman(self, ds1, ds2, ds3, sigLev=.05):
- 		"""
- 		Friedman statistic for 3 samples
- 
- 		Parameters
- 		ds1: data set name or list or numpy array
- 		ds2: data set name or list or numpy array
- 		ds3: data set name or list or numpy array
- 		sigLev: statistical significance level
- 		"""
- 		self.__printBanner("doing Friedman 3 sample test", ds1, ds2, ds3)
- 		data1 = self.getNumericData(ds1)
- 		data2 = self.getNumericData(ds2)
- 		data3 = self.getNumericData(ds3)
- 		stat, pvalue = sta.friedmanchisquare(data1, data2, data3)
- 		result = self.__printResult("stat", stat, "pvalue", pvalue)
- 		self.__printStat(stat, pvalue, "probably same distribution", "probably not same distribution", sigLev)
- 		return result
- 
- 	def testTwoSampleEs(self, ds1, ds2, sigLev=.05):
- 		"""
- 		Epps-Singleton 2 sample statistic
- 
- 		Parameters
- 		ds1: data set name or list or numpy array
- 		ds2: data set name or list or numpy array
- 		sigLev: statistical significance level
- 		"""
- 		self.__printBanner("doing Epps-Singleton 2 sample test", ds1, ds2)
- 		data1 = self.getNumericData(ds1)
- 		data2 = self.getNumericData(ds2)
- 		stat, pvalue = sta.epps_singleton_2samp(data1, data2)
- 		result = self.__printResult("stat", stat, "pvalue", pvalue)
- 		self.__printStat(stat, pvalue, "probably same distribution", "probably not same distribution", sigLev)
- 		return result
- 
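A compact usage sketch for the two sample tests above, with scipy called directly (the synthetic shifted samples are an assumption):

import numpy as np
from scipy import stats as sta

a = np.random.normal(0.0, 1.0, 200)
b = np.random.normal(0.5, 1.0, 200)
for test in (sta.ttest_ind, sta.ks_2samp, sta.mannwhitneyu):
    stat, pvalue = test(a, b)
    print(test.__name__, stat, pvalue)    # expect small pvalues for the 0.5 shift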
- 	def testTwoSampleAnderson(self, ds1, ds2, sigLev=.05):
- 		"""
- 		Anderson-Darling 2 sample statistic
- 
- 		Parameters
- 		ds1: data set name or list or numpy array
- 		ds2: data set name or list or numpy array
- 		sigLev: statistical significance level
- 		"""
- 		self.__printBanner("doing Anderson-Darling 2 sample test", ds1, ds2)
- 		data1 = self.getNumericData(ds1)
- 		data2 = self.getNumericData(ds2)
- 		dseq = (data1, data2)
- 		stat, critValues, sLev = sta.anderson_ksamp(dseq)
- 		slAlpha = 100 * sigLev
- 
- 		if slAlpha == 10:
- 			cv = critValues[1]
- 		elif slAlpha == 5:
- 			cv = critValues[2]
- 		elif slAlpha == 2.5:
- 			cv = critValues[3]
- 		elif slAlpha == 1:
- 			cv = critValues[4]
- 		else:
- 			cv = None
- 
- 		result = self.__printResult("stat", stat, "critValues", critValues, "critValue", cv, "significanceLevel", sLev)
- 		print("stat: {:.3f}".format(stat))
- 		if cv is None:
- 			msg = "critical value not found for the given significance level"
- 		else:
- 			if stat < cv:
- 				msg = "probably same distribution at the {:.3f} significance level".format(sigLev)
- 			else:
- 				msg = "probably not same distribution at the {:.3f} significance level".format(sigLev)
- 		print(msg)
- 		return result
- 
- 	def testTwoSampleScaleAb(self, ds1, ds2, sigLev=.05):
- 		"""
- 		Ansari-Bradley 2 sample scale statistic
- 
- 		Parameters
- 		ds1: data set name or list or numpy array
- 		ds2: data set name or list or numpy array
- 		sigLev: statistical significance level
- 		"""
- 		self.__printBanner("doing Ansari-Bradley 2 sample scale test", ds1, ds2)
- 		data1 = self.getNumericData(ds1)
- 		data2 = self.getNumericData(ds2)
- 		stat, pvalue = sta.ansari(data1, data2)
- 		result = self.__printResult("stat", stat, "pvalue", pvalue)
- 		self.__printStat(stat, pvalue, "probably same scale", "probably not same scale", sigLev)
- 		return result
- 
- 	def testTwoSampleScaleMood(self, ds1, ds2, sigLev=.05):
- 		"""
- 		Mood 2 sample scale statistic
- 
- 		Parameters
- 		ds1: data set name or list or numpy array
- 		ds2: data set name or list or numpy array
- 		sigLev: statistical significance level
- 		"""
- 		self.__printBanner("doing Mood 2 sample scale test", ds1, ds2)
- 		data1 = self.getNumericData(ds1)
- 		data2 = self.getNumericData(ds2)
- 		stat, pvalue = sta.mood(data1, data2)
- 		result = self.__printResult("stat", stat, "pvalue", pvalue)
- 		self.__printStat(stat, pvalue, "probably same scale", "probably not same scale", sigLev)
- 		return result
- 
- 	def testTwoSampleVarBartlet(self, ds1, ds2, sigLev=.05):
- 		"""
- 		Bartlett 2 sample variance statistic
- 
- 		Parameters
- 		ds1: data set name or list or numpy array
- 		ds2: data set name or list or numpy array
- 		sigLev: statistical significance level
- 		"""
- 		self.__printBanner("doing Bartlett 2 sample variance test", ds1, ds2)
- 		data1 = self.getNumericData(ds1)
- 		data2 = self.getNumericData(ds2)
- 		stat, pvalue = sta.bartlett(data1, data2)
- 		result = self.__printResult("stat", stat, "pvalue", pvalue)
- 		self.__printStat(stat, pvalue, "probably same variance", "probably not same variance", sigLev)
- 		return result
- 
- 	def testTwoSampleVarLevene(self, ds1, ds2, sigLev=.05):
- 		"""
- 		Levene 2 sample variance statistic
- 
- 		Parameters
- 		ds1: data set name or list or numpy array
- 		ds2: data set name or list or numpy array
- 		sigLev: statistical significance level
- 		"""
- 		self.__printBanner("doing Levene 2 sample variance test", ds1, ds2)
- 		data1 = self.getNumericData(ds1)
- 		data2 = self.getNumericData(ds2)
- 		stat, pvalue = sta.levene(data1, data2)
- 		result = self.__printResult("stat", stat, "pvalue", pvalue)
- 		self.__printStat(stat, pvalue, "probably same variance", "probably not same variance", sigLev)
- 		return result
- 
- 	def testTwoSampleVarFk(self, ds1, ds2, sigLev=.05):
- 		"""
- 		Fligner-Killeen 2 sample variance statistic
- 
- 		Parameters
- 		ds1: data set name or list or numpy array
- 		ds2: data set name or list or numpy array
- 		sigLev: statistical significance level
- 		"""
- 		self.__printBanner("doing Fligner-Killeen 2 sample variance test", ds1, ds2)
- 		data1 = self.getNumericData(ds1)
- 		data2 = self.getNumericData(ds2)
- 		stat, pvalue = sta.fligner(data1, data2)
- 		result = self.__printResult("stat", stat, "pvalue", pvalue)
- 		self.__printStat(stat, pvalue, "probably same variance", "probably not same variance", sigLev)
- 		return result
- 
- 	def testTwoSampleMedMood(self, ds1, ds2, sigLev=.05):
- 		"""
- 		Mood 2 sample median statistic
- 
- 		Parameters
- 		ds1: data set name or list or numpy array
- 		ds2: data set name or list or numpy array
- 		sigLev: statistical significance level
- 		"""
- 		self.__printBanner("doing Mood 2 sample median test", ds1, ds2)
- 		data1 = self.getNumericData(ds1)
- 		data2 = self.getNumericData(ds2)
- 		stat, pvalue, median, ctable = sta.median_test(data1, data2)
- 		result = self.__printResult("stat", stat, "pvalue", pvalue, "median", median, "contingencyTable", ctable)
- 		self.__printStat(stat, pvalue, "probably same median", "probably not same median", sigLev)
- 		return result
- 
- 	def testTwoSampleZc(self, ds1, ds2, sigLev=.05):
- 		"""
- 		Zhang-C 2 sample statistic
- 
- 		Parameters
- 		ds1: data set name or list or numpy array
- 		ds2: data set name or list or numpy array
- 		sigLev: statistical significance level
- 		"""
- 		self.__printBanner("doing Zhang-C 2 sample test", ds1, ds2)
- 		data1 = self.getNumericData(ds1)
- 		data2 = self.getNumericData(ds2)
- 		l1 = len(data1)
- 		l2 = len(data2)
- 		l = l1 + l2
- 
- 		#find ranks of each sample within the pooled sample
- 		pooled = np.concatenate([data1, data2])
- 		ranks = findRanks(data1, pooled)
- 		ranks.extend(findRanks(data2, pooled))
- 
- 		s1 = 0.0
- 		for i in range(1, l1 + 1):
- 			s1 += math.log(l1 / (i - 0.5) - 1.0) * math.log(l / (ranks[i-1] - 0.5) - 1.0)
- 
- 		s2 = 0.0
- 		for i in range(1, l2 + 1):
- 			s2 += math.log(l2 / (i - 0.5) - 1.0) * math.log(l / (ranks[l1 + i - 1] - 0.5) - 1.0)
- 		stat = (s1 + s2) / l
- 		print(formatFloat(3, stat, "stat:"))
- 		return stat
- 
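The Zhang-C statistic above leans on the package's findRanks helper; as a self-contained cross-check, here is a hedged restatement using only numpy and scipy (the sorting of per-sample ranks and the sample data are assumptions):

import numpy as np
from scipy.stats import rankdata

def zhangC(x, y):
    pooled = np.concatenate([x, y])
    r = rankdata(pooled)                          # pooled ranks 1..N
    n, m, N = len(x), len(y), len(pooled)
    rx, ry = np.sort(r[:n]), np.sort(r[n:])       # each sample's ranks, ascending
    i, j = np.arange(1, n + 1), np.arange(1, m + 1)
    s = np.sum(np.log(n / (i - 0.5) - 1.0) * np.log(N / (rx - 0.5) - 1.0))
    s += np.sum(np.log(m / (j - 0.5) - 1.0) * np.log(N / (ry - 0.5) - 1.0))
    return s / N

print(zhangC(np.random.normal(0, 1, 100), np.random.normal(0, 1, 100)))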
- 	def testTwoSampleZa(self, ds1, ds2, sigLev=.05):
- 		"""
- 		Zhang-A 2 sample statistic
- 
- 		Parameters
- 		ds1: data set name or list or numpy array
- 		ds2: data set name or list or numpy array
- 		sigLev: statistical significance level
- 		"""
- 		self.__printBanner("doing Zhang-A 2 sample test", ds1, ds2)
- 		data1 = self.getNumericData(ds1)
- 		data2 = self.getNumericData(ds2)
- 		l1 = len(data1)
- 		l2 = len(data2)
- 		l = l1 + l2
- 		pooled = np.concatenate([data1, data2])
- 		cd1 = CumDistr(data1)
- 		cd2 = CumDistr(data2)
- 		s = 0.0
- 		for i in range(1, l + 1):
- 			v = pooled[i-1]
- 			f1 = cd1.getDistr(v)
- 			f2 = cd2.getDistr(v)
- 
- 			t1 = 0 if f1 == 0 else f1 * math.log(f1)
- 			t2 = 0 if f1 == 1.0 else (1.0 - f1) * math.log(1.0 - f1)
- 			s += l1 * (t1 + t2) / ((i - 0.5) * (l - i + 0.5))
- 			t1 = 0 if f2 == 0 else f2 * math.log(f2)
- 			t2 = 0 if f2 == 1.0 else (1.0 - f2) * math.log(1.0 - f2)
- 			s += l2 * (t1 + t2) / ((i - 0.5) * (l - i + 0.5))
- 		stat = -s
- 		print(formatFloat(3, stat, "stat:"))
- 		return stat
- 
- 	def testTwoSampleZk(self, ds1, ds2, sigLev=.05):
- 		"""
- 		Zhang-K 2 sample statistic
- 
- 		Parameters
- 		ds1: data set name or list or numpy array
- 		ds2: data set name or list or numpy array
- 		sigLev: statistical significance level
- 		"""
- 		self.__printBanner("doing Zhang-K 2 sample test", ds1, ds2)
- 		data1 = self.getNumericData(ds1)
- 		data2 = self.getNumericData(ds2)
- 		l1 = len(data1)
- 		l2 = len(data2)
- 		l = l1 + l2
- 		pooled = np.concatenate([data1, data2])
- 		cd1 = CumDistr(data1)
- 		cd2 = CumDistr(data2)
- 		cd = CumDistr(pooled)
- 
- 		maxStat = None
- 		for i in range(1, l + 1):
- 			v = pooled[i-1]
- 			f1 = cd1.getDistr(v)
- 			f2 = cd2.getDistr(v)
- 			f = cd.getDistr(v)
- 
- 			t1 = 0 if f1 == 0 else f1 * math.log(f1 / f)
- 			t2 = 0 if f1 == 1.0 else (1.0 - f1) * math.log((1.0 - f1) / (1.0 - f))
- 			stat = l1 * (t1 + t2)
- 			t1 = 0 if f2 == 0 else f2 * math.log(f2 / f)
- 			t2 = 0 if f2 == 1.0 else (1.0 - f2) * math.log((1.0 - f2) / (1.0 - f))
- 			stat += l2 * (t1 + t2)
- 			if maxStat is None or stat > maxStat:
- 				maxStat = stat
- 		print(formatFloat(3, maxStat, "stat:"))
- 		return maxStat
- 
- 	def testTwoSampleCvm(self, ds1, ds2, sigLev=.05):
- 		"""
- 		Cramer-von Mises 2 sample statistic
- 
- 		Parameters
- 		ds1: data set name or list or numpy array
- 		ds2: data set name or list or numpy array
- 		sigLev: statistical significance level
- 		"""
- 		self.__printBanner("doing 2 sample CVM test", ds1, ds2)
- 		data1 = self.getNumericData(ds1)
- 		data2 = self.getNumericData(ds2)
- 		data = np.concatenate((data1, data2))
- 		rdata = sta.rankdata(data)
- 		n = len(data1)
- 		m = len(data2)
- 		l = n + m
- 
- 		#each sample's pooled ranks, sorted ascending as the formula requires
- 		srdata1 = np.sort(rdata[:n])
- 		srdata2 = np.sort(rdata[n:])
- 
- 		s1 = 0
- 		for i in range(n):
- 			t = srdata1[i] - (i + 1)
- 			s1 += (t * t)
- 		s1 *= n
- 
- 		s2 = 0
- 		for i in range(m):
- 			t = srdata2[i] - (i + 1)
- 			s2 += (t * t)
- 		s2 *= m
- 
- 		u = s1 + s2
- 		stat = u / (n * m * l) - (4 * m * n - 1) / (6 * l)
- 		result = self.__printResult("stat", stat)
- 		return result
- 
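For environments with scipy 1.7 or later, scipy ships its own two sample Cramer-von Mises test, which can be used to sanity check the hand rolled statistic above (a sketch; the data is an assumption):

import numpy as np
from scipy import stats as sta

a = np.random.normal(0, 1, 150)
b = np.random.normal(0, 1, 150)
res = sta.cramervonmises_2samp(a, b)
print(res.statistic, res.pvalue)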
- 	def ensureSameSize(self, dlist):
- 		"""
- 		ensures all data sets are of same size
- 
- 		Parameters
- 		dlist : data source list
- 		"""
- 		le = None
- 		for d in dlist:
- 			cle = len(d)
- 			if le is None:
- 				le = cle
- 			else:
- 				assert cle == le, "all data sets need to be of same size"
- 
- 	def testTwoSampleWasserstein(self, ds1, ds2):
- 		"""
- 		Wasserstein 2 sample statistic
- 
- 		Parameters
- 		ds1: data set name or list or numpy array
- 		ds2: data set name or list or numpy array
- 		"""
- 		self.__printBanner("doing Wasserstein distance 2 sample test", ds1, ds2)
- 		data1 = self.getNumericData(ds1)
- 		data2 = self.getNumericData(ds2)
- 		stat = sta.wasserstein_distance(data1, data2)
- 		sd = np.std(np.concatenate([data1, data2]))
- 		nstat = stat / sd
- 		result = self.__printResult("stat", stat, "normalizedStat", nstat)
- 		return result
- 
- 	def getMaxRelMinRedFeatures(self, fdst, tdst, nfeatures, nbins=20):
- 		"""
- 		get top n features based on the max relevance min redundancy (MRMR) algorithm
- 
- 		Parameters
- 		fdst: list of pairs of data set name or list or numpy array and data type
- 		tdst: target data set name or list or numpy array and data type (cat for classification num for regression)
- 		nfeatures : desired no of features
- 		nbins : no of bins for numerical data
- 		"""
- 		self.__printBanner("doing max relevance min redundancy feature selection")
- 		return self.getMutInfoFeatures(fdst, tdst, nfeatures, "mrmr", nbins)
- 
- 	def getJointMutInfoFeatures(self, fdst, tdst, nfeatures, nbins=20):
- 		"""
- 		get top n features based on the joint mutual information (JMI) algorithm
- 
- 		Parameters
- 		fdst: list of pairs of data set name or list or numpy array and data type
- 		tdst: target data set name or list or numpy array and data type (cat for classification num for regression)
- 		nfeatures : desired no of features
- 		nbins : no of bins for numerical data
- 		"""
- 		self.__printBanner("doing joint mutual info feature selection")
- 		return self.getMutInfoFeatures(fdst, tdst, nfeatures, "jmi", nbins)
- 
- 	def getCondMutInfoMaxFeatures(self, fdst, tdst, nfeatures, nbins=20):
- 		"""
- 		get top n features based on the conditional mutual information maximization (CMIM) algorithm
- 
- 		Parameters
- 		fdst: list of pairs of data set name or list or numpy array and data type
- 		tdst: target data set name or list or numpy array and data type (cat for classification num for regression)
- 		nfeatures : desired no of features
- 		nbins : no of bins for numerical data
- 		"""
- 		self.__printBanner("doing conditional mutual info max feature selection")
- 		return self.getMutInfoFeatures(fdst, tdst, nfeatures, "cmim", nbins)
- 
- 	def getInteractCapFeatures(self, fdst, tdst, nfeatures, nbins=20):
- 		"""
- 		get top n features based on the interaction capping (ICAP) algorithm
- 
- 		Parameters
- 		fdst: list of pairs of data set name or list or numpy array and data type
- 		tdst: target data set name or list or numpy array and data type (cat for classification num for regression)
- 		nfeatures : desired no of features
- 		nbins : no of bins for numerical data
- 		"""
- 		self.__printBanner("doing interaction capped feature selection")
- 		return self.getMutInfoFeatures(fdst, tdst, nfeatures, "icap", nbins)
- 
- 	def getMutInfoFeatures(self, fdst, tdst, nfeatures, algo, nbins=20):
- 		"""
- 		get top n features based on various mutual information based algorithms
- 		ref: Conditional likelihood maximisation : A unifying framework for information
- 		theoretic feature selection, Gavin Brown
- 
- 		Parameters
- 		fdst: list of pairs of data set name or list or numpy array and data type
- 		tdst: target data set name or list or numpy array and data type (cat for classification num for regression)
- 		nfeatures : desired no of features
- 		algo: mi based feature selection algorithm
- 		nbins : no of bins for numerical data
- 		"""
- 		#verify data source types
- 		le = len(fdst)
- 		nfeatGiven = int(le / 2)
- 		assertGreater(nfeatGiven, nfeatures, "no of available features should be greater than no of features to be selected")
- 		fds = list()
- 		types = ["num", "cat"]
- 		for i in range(0, le, 2):
- 			ds = fdst[i]
- 			dt = fdst[i+1]
- 			assertInList(dt, types, "invalid type for data source " + dt)
- 			data = self.getNumericData(ds) if dt == "num" else self.getCatData(ds)
- 			p = (ds, dt)
- 			fds.append(p)
- 		algos = ["mrmr", "jmi", "cmim", "icap"]
- 		assertInList(algo, algos, "invalid feature selection algo " + algo)
- 
- 		assertInList(tdst[1], types, "invalid type for data source " + tdst[1])
- 		data = self.getNumericData(tdst[0]) if tdst[1] == "num" else self.getCatData(tdst[0])
- 
- 		sfds = list()
- 		selected = set()
- 		relevancies = dict()
- 		for i in range(nfeatures):
- 			scorem = None
- 			dsm = None
- 			dsmt = None
- 			for ds, dt in fds:
- 				if ds not in selected:
- 					#relevancy of candidate feature wrt target
- 					if ds in relevancies:
- 						mutInfo = relevancies[ds]
- 					else:
- 						mutInfo = self.getMutualInfo([ds, dt, tdst[0], tdst[1]], nbins)["mutInfo"]
- 						relevancies[ds] = mutInfo
- 					relev = mutInfo
- 
- 					#redundancy wrt features selected so far
- 					reds = list()
- 					for sds, sdt, _ in sfds:
- 						mutInfo = self.getMutualInfo([ds, dt, sds, sdt], nbins)["mutInfo"]
- 						mutInfoCnd = self.getCondMutualInfo([ds, dt, sds, sdt, tdst[0], tdst[1]], nbins)["condMutInfo"] \
- 						if algo != "mrmr" else 0
- 						red = mutInfo - mutInfoCnd
- 						reds.append(red)
- 
- 					if algo == "mrmr" or algo == "jmi":
- 						redun = sum(reds) / len(sfds) if len(sfds) > 0 else 0
- 					elif algo == "cmim" or algo == "icap":
- 						redun = max(reds) if len(sfds) > 0 else 0
- 						if algo == "icap":
- 							redun = max(0, redun)
- 					score = relev - redun
- 					if scorem is None or score > scorem:
- 						scorem = score
- 						dsm = ds
- 						dsmt = dt
- 
- 			pa = (dsm, dsmt, scorem)
- 			sfds.append(pa)
- 			selected.add(dsm)
- 
- 		selFeatures = list(map(lambda r: (r[0], r[2]), sfds))
- 		result = self.__printResult("selFeatures", selFeatures)
- 		return result
- 
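The greedy loop above depends on the package's getMutualInfo and getCondMutualInfo helpers; for readers who want the MRMR variant in isolation, here is a minimal sketch built on scikit-learn estimators (the function and variable names are hypothetical):

import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmrSelect(X, y, nfeatures):
    relevance = mutual_info_classif(X, y, random_state=0)
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < nfeatures:
        scores = {}
        for j in remaining:
            # mean pairwise MI with already selected features as redundancy
            red = np.mean([mutual_info_regression(X[:, [j]], X[:, s], random_state=0)[0]
                for s in selected]) if selected else 0.0
            scores[j] = relevance[j] - red
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected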
- 	def getFastCorrFeatures(self, fdst, tdst, delta, nbins=20):
- 		"""
- 		get top features based on the Fast Correlation Based Filter (FCBF)
- 		ref: Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution,
- 		Lei Yu
- 
- 		Parameters
- 		fdst: list of pairs of data set name or list or numpy array and data type
- 		tdst: target data set name or list or numpy array and data type (cat for classification num for regression)
- 		delta : feature, target correlation threshold
- 		nbins : no of bins for numerical data
- 		"""
- 		le = len(fdst)
- 		nfeatGiven = int(le / 2)
- 		fds = list()
- 		types = ["num", "cat"]
- 		for i in range(0, le, 2):
- 			ds = fdst[i]
- 			dt = fdst[i+1]
- 			assertInList(dt, types, "invalid type for data source " + dt)
- 			data = self.getNumericData(ds) if dt == "num" else self.getCatData(ds)
- 			p = (ds, dt)
- 			fds.append(p)
- 
- 		assertInList(tdst[1], types, "invalid type for data source " + tdst[1])
- 		data = self.getNumericData(tdst[0]) if tdst[1] == "num" else self.getCatData(tdst[0])
- 
- 		#get features with symmetric uncertainty above threshold
- 		tentr = self.getAnyEntropy(tdst[0], tdst[1], nbins)["entropy"]
- 		rfeatures = list()
- 		fentrs = dict()
- 		for ds, dt in fds:
- 			mutInfo = self.getMutualInfo([ds, dt, tdst[0], tdst[1]], nbins)["mutInfo"]
- 			fentr = self.getAnyEntropy(ds, dt, nbins)["entropy"]
- 			sunc = 2 * mutInfo / (tentr + fentr)
- 			if sunc >= delta:
- 				f = [ds, dt, sunc, False]
- 				rfeatures.append(f)
- 				fentrs[ds] = fentr
- 
- 		#sort descending by symmetric uncertainty
- 		rfeatures.sort(key=lambda e: e[2], reverse=True)
- 
- 		#discard redundant features
- 		le = len(rfeatures)
- 		for i in range(le):
- 			if rfeatures[i][3]:
- 				continue
- 			for j in range(i+1, le, 1):
- 				if rfeatures[j][3]:
- 					continue
- 				mutInfo = self.getMutualInfo([rfeatures[i][0], rfeatures[i][1], rfeatures[j][0], rfeatures[j][1]], nbins)["mutInfo"]
- 				sunc = 2 * mutInfo / (fentrs[rfeatures[i][0]] + fentrs[rfeatures[j][0]])
- 				if sunc >= rfeatures[j][2]:
- 					rfeatures[j][3] = True
- 
- 		frfeatures = list(filter(lambda f: not f[3], rfeatures))
- 		selFeatures = list(map(lambda f: [f[0], f[2]], frfeatures))
- 		result = self.__printResult("selFeatures", selFeatures)
- 		return result
- 
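FCBF hinges on symmetric uncertainty, SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)); a tiny self-contained sketch of that quantity for two discrete variables (the helper name is hypothetical, and the estimate is histogram based):

import numpy as np
from scipy.stats import entropy

def symmetricUncertainty(x, y):
    xc = np.unique(x, return_counts=True)[1]
    yc = np.unique(y, return_counts=True)[1]
    hx, hy = entropy(xc), entropy(yc)
    jointCounts = np.unique(np.stack([x, y], axis=1), axis=0, return_counts=True)[1]
    hxy = entropy(jointCounts)
    # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return 2.0 * (hx + hy - hxy) / (hx + hy)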
- 	def getInfoGainFeatures(self, fdst, tdst, nfeatures, nsplit, nbins=20):
- 		"""
- 		get top n features based on information gain or entropy loss
- 
- 		Parameters
- 		fdst: list of pairs of data set name or list or numpy array and data type
- 		tdst: target data set name or list or numpy array and data type (cat for classification num for regression)
- 		nfeatures : desired no of features
- 		nsplit : no of splits
- 		nbins : no of bins for numerical data
- 		"""
- 		le = len(fdst)
- 		nfeatGiven = int(le / 2)
- 		assertGreater(nfeatGiven, nfeatures, "available features should be greater than desired")
- 		fds = list()
- 		types = ["num", "cat"]
- 		for i in range(0, le, 2):
- 			ds = fdst[i]
- 			dt = fdst[i+1]
- 			assertInList(dt, types, "invalid type for data source " + dt)
- 			data = self.getNumericData(ds) if dt == "num" else self.getCatData(ds)
- 			p = (ds, dt)
- 			fds.append(p)
- 
- 		assertInList(tdst[1], types, "invalid type for data source " + tdst[1])
- 		assertGreater(nsplit, 3, "minimum 4 splits necessary")
- 		tdata = self.getNumericData(tdst[0]) if tdst[1] == "num" else self.getCatData(tdst[0])
- 		tentr = self.getAnyEntropy(tdst[0], tdst[1], nbins)["entropy"]
- 		sz = len(tdata)
- 
- 		sfds = list()
- 		for ds, dt in fds:
- 			if dt == "num":
- 				fd = self.getNumericData(ds)
- 				_, _, vmax, vmin = self.__getBasicStats(fd)
- 				intv = (vmax - vmin) / nsplit
- 				maxig = None
- 				spmin = vmin + intv
- 				spmax = vmax - 0.9 * intv
- 
- 				#iterate all splits
- 				for sp in np.arange(spmin, spmax, intv):
- 					ltvals = list()
- 					gevals = list()
- 					for i in range(len(fd)):
- 						if fd[i] < sp:
- 							ltvals.append(tdata[i])
- 						else:
- 							gevals.append(tdata[i])
- 
- 					self.addListNumericData(ltvals, "spds") if tdst[1] == "num" else self.addListCatData(ltvals, "spds")
- 					lten = self.getAnyEntropy("spds", tdst[1], nbins)["entropy"]
- 					self.addListNumericData(gevals, "spds") if tdst[1] == "num" else self.addListCatData(gevals, "spds")
- 					geen = self.getAnyEntropy("spds", tdst[1], nbins)["entropy"]
- 
- 					#info gain
- 					ig = tentr - (len(ltvals) * lten / sz + len(gevals) * geen / sz)
- 					if maxig is None or ig > maxig:
- 						maxig = ig
- 
- 				pa = (ds, maxig)
- 				sfds.append(pa)
- 			else:
- 				fd = self.getCatData(ds)
- 				fvals = set(fd)
- 				fdps = genPowerSet(fvals)
- 				maxig = None
- 
- 				#iterate all subsets
- 				for s in fdps:
- 					if len(s) == len(fvals):
- 						continue
- 					invals = list()
- 					exvals = list()
- 					for i in range(len(fd)):
- 						if fd[i] in s:
- 							invals.append(tdata[i])
- 						else:
- 							exvals.append(tdata[i])
- 
- 					self.addListNumericData(invals, "spds") if tdst[1] == "num" else self.addListCatData(invals, "spds")
- 					inen = self.getAnyEntropy("spds", tdst[1], nbins)["entropy"]
- 					self.addListNumericData(exvals, "spds") if tdst[1] == "num" else self.addListCatData(exvals, "spds")
- 					exen = self.getAnyEntropy("spds", tdst[1], nbins)["entropy"]
- 
- 					ig = tentr - (len(invals) * inen / sz + len(exvals) * exen / sz)
- 					if maxig is None or ig > maxig:
- 						maxig = ig
- 
- 				pa = (ds, maxig)
- 				sfds.append(pa)
- 
- 		#sort by info gain
- 		sfds.sort(key=lambda v: v[1], reverse=True)
- 
- 		result = self.__printResult("selFeatures", sfds[:nfeatures])
- 		return result
- 
- 	def __stackData(self, *dsl):
- 		"""
- 		stacks columns to create a matrix
- 
- 		Parameters
- 		dsl: data source list
- 		"""
- 		dlist = tuple(map(lambda ds: self.getNumericData(ds), dsl))
- 		self.ensureSameSize(dlist)
- 		dmat = np.column_stack(dlist)
- 		return dmat
- 
- 	def __printBanner(self, msg, *dsl):
- 		"""
- 		print banner for any function
- 
- 		Parameters
- 		msg: message
- 		dsl: list of data set name or list or numpy array
- 		"""
- 		tags = list(map(lambda ds: ds if type(ds) == str else "anonymous", dsl))
- 		forData = " for data sets " if tags else ""
- 		msg = msg + forData + " ".join(tags)
- 		if self.verbose:
- 			print("\n== " + msg + " ==")
- 
- 	def __printDone(self):
- 		"""
- 		print done message
- 		"""
- 		if self.verbose:
- 			print("done")
- 
- 	def __printStat(self, stat, pvalue, nhMsg, ahMsg, sigLev=.05):
- 		"""
- 		generic stat and pvalue output
- 
- 		Parameters
- 		stat : stat value
- 		pvalue : p value
- 		nhMsg : message for failure to reject the null hypothesis
- 		ahMsg : message for rejection of the null hypothesis
- 		sigLev : significance level
- 		"""
- 		if self.verbose:
- 			print("\ntest result:")
- 			print("stat: {:.3f}".format(stat))
- 			print("pvalue: {:.3f}".format(pvalue))
- 			print("significance level: {:.3f}".format(sigLev))
- 			print(nhMsg if pvalue > sigLev else ahMsg)
- 
- 	def __printResult(self, *values):
- 		"""
- 		print results
- 
- 		Parameters
- 		values : flattened key and value pairs
- 		"""
- 		result = dict()
- 		assert len(values) % 2 == 0, "key value list should have even number of items"
- 		for i in range(0, len(values), 2):
- 			result[values[i]] = values[i+1]
- 		if self.verbose:
- 			print("result details:")
- 			self.pp.pprint(result)
- 		return result
- 
- 	def __getBasicStats(self, data):
- 		"""
- 		get mean, std dev, max and min
- 
- 		Parameters
- 		data : numpy array
- 		"""
- 		mean = np.average(data)
- 		sd = np.std(data)
- 		r = (mean, sd, np.max(data), np.min(data))
- 		return r
matumizi/matumizi/mcsim.py DELETED
@@ -1,552 +0,0 @@
- #!/usr/local/bin/python3
- 
- # avenir-python: Machine Learning
- # Author: Pranab Ghosh
- #
- # Licensed under the Apache License, Version 2.0 (the "License"); you
- # may not use this file except in compliance with the License. You may
- # obtain a copy of the License at
- #
- # http://www.apache.org/licenses/LICENSE-2.0
- #
- # Unless required by applicable law or agreed to in writing, software
- # distributed under the License is distributed on an "AS IS" BASIS,
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
- # implied. See the License for the specific language governing
- # permissions and limitations under the License.
- 
- # Package imports
- import os
- import sys
- import matplotlib.pyplot as plt
- import numpy as np
- import matplotlib
- import random
- import jprops
- import statistics
- from matplotlib import pyplot
- from .util import *
- from .mlutil import *
- from .sampler import *
- 
- class MonteCarloSimulator(object):
- 	"""
- 	Monte Carlo simulator for integration and various statistics of complex functions
- 	"""
- 	def __init__(self, numIter, callback, logFilePath, logLevName):
- 		"""
- 		constructor
- 
- 		Parameters
- 		numIter : num of iterations
- 		callback : call back method
- 		logFilePath : log file path
- 		logLevName : log level
- 		"""
- 		self.samplers = list()
- 		self.numIter = numIter
- 		self.callback = callback
- 		self.extraArgs = None
- 		self.output = list()
- 		self.sum = None
- 		self.mean = None
- 		self.sd = None
- 		self.replSamplers = dict()
- 		self.prSamples = None
- 
- 		self.logger = None
- 		if logFilePath is not None:
- 			self.logger = createLogger(__name__, logFilePath, logLevName)
- 			self.logger.info("******** starting new session of MonteCarloSimulator")
- 
- 	def registerBernoulliTrialSampler(self, pr):
- 		"""
- 		Bernoulli trial sampler
- 
- 		Parameters
- 		pr : probability
- 		"""
- 		self.samplers.append(BernoulliTrialSampler(pr))
- 
- 	def registerPoissonSampler(self, rateOccur, maxSamp):
- 		"""
- 		Poisson sampler
- 
- 		Parameters
- 		rateOccur : rate of occurence
- 		maxSamp : max limit on no of samples
- 		"""
- 		self.samplers.append(PoissonSampler(rateOccur, maxSamp))
- 
- 	def registerUniformSampler(self, minv, maxv):
- 		"""
- 		uniform sampler
- 
- 		Parameters
- 		minv : min value
- 		maxv : max value
- 		"""
- 		self.samplers.append(UniformNumericSampler(minv, maxv))
- 
- 	def registerTriangularSampler(self, min, max, vertexValue, vertexPos=None):
- 		"""
- 		triangular sampler
- 
- 		Parameters
- 		min : min value
- 		max : max value
- 		vertexValue : distr value at vertex
- 		vertexPos : vertex position
- 		"""
- 		self.samplers.append(TriangularRejectSampler(min, max, vertexValue, vertexPos))
- 
- 	def registerGaussianSampler(self, mean, sd):
- 		"""
- 		gaussian sampler
- 
- 		Parameters
- 		mean : mean
- 		sd : std deviation
- 		"""
- 		self.samplers.append(GaussianRejectSampler(mean, sd))
- 
- 	def registerNormalSampler(self, mean, sd):
- 		"""
- 		gaussian sampler using numpy
- 
- 		Parameters
- 		mean : mean
- 		sd : std deviation
- 		"""
- 		self.samplers.append(NormalSampler(mean, sd))
- 
- 	def registerLogNormalSampler(self, mean, sd):
- 		"""
- 		log normal sampler using numpy
- 
- 		Parameters
- 		mean : mean
- 		sd : std deviation
- 		"""
- 		self.samplers.append(LogNormalSampler(mean, sd))
- 
- 	def registerParetoSampler(self, mode, shape):
- 		"""
- 		pareto sampler using numpy
- 
- 		Parameters
- 		mode : mode
- 		shape : shape
- 		"""
- 		self.samplers.append(ParetoSampler(mode, shape))
- 
- 	def registerGammaSampler(self, shape, scale):
- 		"""
- 		gamma sampler using numpy
- 
- 		Parameters
- 		shape : shape
- 		scale : scale
- 		"""
- 		self.samplers.append(GammaSampler(shape, scale))
- 
- 	def registerDiscreteRejectSampler(self, xmin, xmax, step, *values):
- 		"""
- 		discrete int sampler
- 
- 		Parameters
- 		xmin : min value
- 		xmax : max value
- 		step : discrete step
- 		values : distr values
- 		"""
- 		self.samplers.append(DiscreteRejectSampler(xmin, xmax, step, *values))
- 
- 	def registerNonParametricSampler(self, minv, binWidth, *values):
- 		"""
- 		nonparametric sampler
- 
- 		Parameters
- 		minv : min value
- 		binWidth : bin width
- 		values : distr values
- 		"""
- 		sampler = NonParamRejectSampler(minv, binWidth, *values)
- 		sampler.sampleAsFloat()
- 		self.samplers.append(sampler)
- 
- 	def registerMultiVarNormalSampler(self, numVar, *values):
- 		"""
- 		multi var gaussian sampler using numpy
- 
- 		Parameters
- 		numVar : no of variables
- 		values : numVar mean values followed by numVar x numVar values for covar matrix
- 		"""
- 		self.samplers.append(MultiVarNormalSampler(numVar, *values))
- 
- 	def registerJointNonParamRejectSampler(self, xmin, xbinWidth, xnbin, ymin, ybinWidth, ynbin, *values):
- 		"""
- 		joint nonparametric sampler
- 
- 		Parameters
- 		xmin : min value for x
- 		xbinWidth : bin width for x
- 		xnbin : no of bins for x
- 		ymin : min value for y
- 		ybinWidth : bin width for y
- 		ynbin : no of bins for y
- 		values : distr values
- 		"""
- 		self.samplers.append(JointNonParamRejectSampler(xmin, xbinWidth, xnbin, ymin, ybinWidth, ynbin, *values))
- 
- 	def registerRangePermutationSampler(self, minv, maxv, *numShuffles):
- 		"""
- 		permutation sampler with range
- 
- 		Parameters
- 		minv : min of range
- 		maxv : max of range
- 		numShuffles : no of shuffles or range of no of shuffles
- 		"""
- 		self.samplers.append(PermutationSampler.createSamplerWithRange(minv, maxv, *numShuffles))
- 
- 	def registerValuesPermutationSampler(self, values, *numShuffles):
- 		"""
- 		permutation sampler with values
- 
- 		Parameters
- 		values : list data
- 		numShuffles : no of shuffles or range of no of shuffles
- 		"""
- 		self.samplers.append(PermutationSampler.createSamplerWithValues(values, *numShuffles))
- 
- 	def registerNormalSamplerWithTrendCycle(self, mean, stdDev, trend, cycle, step=1):
- 		"""
- 		normal sampler with trend and cycle
- 
- 		Parameters
- 		mean : mean
- 		stdDev : std deviation
- 		trend : trend delta
- 		cycle : cycle values wrt base mean
- 		step : adjustment step for cycle and trend
- 		"""
- 		self.samplers.append(NormalSamplerWithTrendCycle(mean, stdDev, trend, cycle, step))
- 
- 	def registerCustomSampler(self, sampler):
- 		"""
- 		custom sampler
- 
- 		Parameters
- 		sampler : sampler with sample() method
- 		"""
- 		self.samplers.append(sampler)
- 
- 	def registerEventSampler(self, intvSampler, valSampler=None):
- 		"""
- 		event sampler
- 
- 		Parameters
- 		intvSampler : interval sampler
- 		valSampler : value sampler
- 		"""
- 		self.samplers.append(EventSampler(intvSampler, valSampler))
- 
- 	def registerMetropolitanSampler(self, propStdDev, minv, binWidth, values):
- 		"""
- 		metropolitan sampler
- 
- 		Parameters
- 		propStdDev : proposal distr std dev
- 		minv : min domain value for target distr
- 		binWidth : bin width
- 		values : target distr values
- 		"""
- 		self.samplers.append(MetropolitanSampler(propStdDev, minv, binWidth, values))
- 
- 	def setSampler(self, var, iter, sampler):
- 		"""
- 		set sampler for some variable when iteration reaches certain point
- 
- 		Parameters
- 		var : sampler index
- 		iter : iteration count
- 		sampler : new sampler
- 		"""
- 		key = (var, iter)
- 		self.replSamplers[key] = sampler
- 
- 	def registerExtraArgs(self, *args):
- 		"""
- 		extra args
- 
- 		Parameters
- 		args : extra argument list
- 		"""
- 		self.extraArgs = args
- 
- 	def replSampler(self, iter):
- 		"""
- 		replace sampler for this iteration
- 
- 		Parameters
- 		iter : iteration number
- 		"""
- 		if len(self.replSamplers) > 0:
- 			for v in range(self.numVars):
- 				key = (v, iter)
- 				if key in self.replSamplers:
- 					sampler = self.replSamplers[key]
- 					self.samplers[v] = sampler
- 
- 	def run(self):
- 		"""
- 		run simulator
- 		"""
- 		self.sum = None
- 		self.mean = None
- 		self.sd = None
- 		self.numVars = len(self.samplers)
- 		vOut = 0
- 
- 		for i in range(self.numIter):
- 			self.replSampler(i)
- 			args = list()
- 			for s in self.samplers:
- 				arg = s.sample()
- 				if type(arg) is list:
- 					args.extend(arg)
- 				else:
- 					args.append(arg)
- 
- 			slen = len(args)
- 			if self.extraArgs:
- 				args.extend(self.extraArgs)
- 			args.append(self)
- 			args.append(i)
- 			vOut = self.callback(args)
- 			self.output.append(vOut)
- 			self.prSamples = args[:slen]
- 
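To make the control flow of run() concrete, here is a hedged usage sketch estimating pi from the hit rate of a quarter circle (the import path is inferred from the file location, and the callback signature follows the args list built above: sampled values first, then the simulator and the iteration index):

from matumizi.mcsim import MonteCarloSimulator

def inQuarterCircle(args):
    x, y = args[0], args[1]
    return 1.0 if x * x + y * y <= 1.0 else 0.0

sim = MonteCarloSimulator(100000, inQuarterCircle, None, None)
sim.registerUniformSampler(0.0, 1.0)
sim.registerUniformSampler(0.0, 1.0)
sim.run()
print(4.0 * sim.getMean())    # roughly pi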
- 	def getOutput(self):
- 		"""
- 		get raw output
- 		"""
- 		return self.output
- 
- 	def setOutput(self, values):
- 		"""
- 		set raw output
- 
- 		Parameters
- 		values : output values
- 		"""
- 		self.output = values
- 		self.numIter = len(values)
- 
- 	def drawHist(self, myTitle, myXlabel, myYlabel):
- 		"""
- 		draw histogram
- 
- 		Parameters
- 		myTitle : title
- 		myXlabel : label for x
- 		myYlabel : label for y
- 		"""
- 		pyplot.hist(self.output, density=True)
- 		pyplot.title(myTitle)
- 		pyplot.xlabel(myXlabel)
- 		pyplot.ylabel(myYlabel)
- 		pyplot.show()
- 
- 	def getSum(self):
- 		"""
- 		get sum
- 		"""
- 		if not self.sum:
- 			self.sum = sum(self.output)
- 		return self.sum
- 
- 	def getMean(self):
- 		"""
- 		get average
- 		"""
- 		if self.mean is None:
- 			self.mean = statistics.mean(self.output)
- 		return self.mean
- 
- 	def getStdDev(self):
- 		"""
- 		get std dev
- 		"""
- 		if self.sd is None:
- 			self.sd = statistics.stdev(self.output, xbar=self.mean) if self.mean else statistics.stdev(self.output)
- 		return self.sd
- 
- 	def getMedian(self):
- 		"""
- 		get median
- 		"""
- 		med = statistics.median(self.output)
- 		return med
- 
- 	def getMax(self):
- 		"""
- 		get max
- 		"""
- 		return max(self.output)
- 
- 	def getMin(self):
- 		"""
- 		get min
- 		"""
- 		return min(self.output)
- 
- 	def getIntegral(self, bounds):
- 		"""
- 		integral, i.e. mean output scaled by the size of the sampled domain
- 
- 		Parameters
- 		bounds : size of the integration domain
- 		"""
- 		if not self.sum:
- 			self.sum = sum(self.output)
- 		return self.sum * bounds / self.numIter
- 
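A quick check of getIntegral() on a known integral, under the same import path assumption as the earlier sketch:

import math
from matumizi.mcsim import MonteCarloSimulator

sim = MonteCarloSimulator(200000, lambda args: math.sin(args[0]), None, None)
sim.registerUniformSampler(0.0, math.pi)
sim.run()
print(sim.getIntegral(math.pi))    # integral of sin over [0, pi], roughly 2.0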
- 	def getLowerTailStat(self, zvalue, numIntPoints=50):
- 		"""
- 		get lower tail stat
- 
- 		Parameters
- 		zvalue : zscore upper bound
- 		numIntPoints : no of interpolation points for cum distribution
- 		"""
- 		mean = self.getMean()
- 		sd = self.getStdDev()
- 		tailStart = self.getMin()
- 		tailEnd = mean - zvalue * sd
- 		cvaCounts = self.cumDistr(tailStart, tailEnd, numIntPoints)
- 
- 		reqConf = floatRange(0.0, 0.150, .01)
- 		msg = "p value outside interpolation range, reduce zvalue and try again {:.5f} {:.5f}".format(reqConf[-1], cvaCounts[-1][1])
- 		assert reqConf[-1] < cvaCounts[-1][1], msg
- 		critValues = self.interpolateCritValues(reqConf, cvaCounts, True, tailStart, tailEnd)
- 		return critValues
- 
- 	def getPercentile(self, cvalue):
- 		"""
- 		percentile
- 
- 		Parameters
- 		cvalue : value for percentile
- 		"""
- 		count = 0
- 		for v in self.output:
- 			if v < cvalue:
- 				count += 1
- 		percent = int(count * 100.0 / self.numIter)
- 		return percent
- 
- 	def getCritValue(self, pvalue):
- 		"""
- 		critical value for probability threshold
- 
- 		Parameters
- 		pvalue : pvalue
- 		"""
- 		assertWithinRange(pvalue, 0.0, 1.0, "invalid probability value")
- 		svalues = sorted(self.output)
- 		ppval = None
- 		cval = None
- 		for i in range(self.numIter - 1):
- 			cpval = (i + 1) / self.numIter
- 			if cpval > pvalue:
- 				sl = svalues[i] - svalues[i-1]
- 				cval = svalues[i-1] + sl * (pvalue - ppval)
- 				break
- 			ppval = cpval
- 		return cval
- 
- 	def getUpperTailStat(self, zvalue, numIntPoints=50):
- 		"""
- 		upper tail stat
- 
- 		Parameters
- 		zvalue : zscore upper bound
- 		numIntPoints : no of interpolation points for cum distribution
- 		"""
- 		mean = self.getMean()
- 		sd = self.getStdDev()
- 		tailStart = mean + zvalue * sd
- 		tailEnd = self.getMax()
- 		cvaCounts = self.cumDistr(tailStart, tailEnd, numIntPoints)
- 
- 		reqConf = floatRange(0.85, 1.0, .01)
- 		msg = "p value outside interpolation range, reduce zvalue and try again {:.5f} {:.5f}".format(reqConf[0], cvaCounts[0][1])
- 		assert reqConf[0] > cvaCounts[0][1], msg
- 		critValues = self.interpolateCritValues(reqConf, cvaCounts, False, tailStart, tailEnd)
- 		return critValues
- 
- 	def cumDistr(self, tailStart, tailEnd, numIntPoints):
- 		"""
- 		cumulative distribution at tail
- 
- 		Parameters
- 		tailStart : tail start
- 		tailEnd : tail end
- 		numIntPoints : no of interpolation points
- 		"""
- 		delta = (tailEnd - tailStart) / numIntPoints
- 		cvalues = floatRange(tailStart, tailEnd, delta)
- 		cvaCounts = list()
- 		for cv in cvalues:
- 			count = 0
- 			for v in self.output:
- 				if v < cv:
- 					count += 1
- 			p = (cv, count / self.numIter)
- 			if self.logger is not None:
- 				self.logger.info("{:.3f} {:.3f}".format(p[0], p[1]))
- 			cvaCounts.append(p)
- 		return cvaCounts
- 
- 	def interpolateCritValues(self, reqConf, cvaCounts, lowertTail, tailStart, tailEnd):
- 		"""
- 		interpolate for specific confidence limits
- 
- 		Parameters
- 		reqConf : confidence level values
- 		cvaCounts : cum values
- 		lowertTail : True if lower tail
- 		tailStart : tail start
- 		tailEnd : tail end
- 		"""
- 		critValues = list()
- 		if self.logger is not None:
- 			self.logger.info("target conf limit " + str(reqConf))
- 		reqConfSub = reqConf[1:] if lowertTail else reqConf[:-1]
- 		for rc in reqConfSub:
- 			for i in range(len(cvaCounts) - 1):
- 				if rc >= cvaCounts[i][1] and rc < cvaCounts[i+1][1]:
- 					slope = (cvaCounts[i+1][0] - cvaCounts[i][0]) / (cvaCounts[i+1][1] - cvaCounts[i][1])
- 					cval = cvaCounts[i][0] + slope * (rc - cvaCounts[i][1])
- 					p = (rc, cval)
- 					if self.logger is not None:
- 						self.logger.debug("interpolated crit values {:.3f} {:.3f}".format(p[0], p[1]))
- 					critValues.append(p)
- 					break
- 		if lowertTail:
- 			p = (0.0, tailStart)
- 			critValues.insert(0, p)
- 		else:
- 			p = (1.0, tailEnd)
- 			critValues.append(p)
- 		return critValues
matumizi/matumizi/mlutil.py DELETED
@@ -1,1500 +0,0 @@
1
- #!/usr/local/bin/python3
2
-
3
- # avenir-python: Machine Learning
4
- # Author: Pranab Ghosh
5
- #
6
- # Licensed under the Apache License, Version 2.0 (the "License"); you
7
- # may not use this file except in compliance with the License. You may
8
- # obtain a copy of the License at
9
- #
10
- # http://www.apache.org/licenses/LICENSE-2.0
11
- #
12
- # Unless required by applicable law or agreed to in writing, software
13
- # distributed under the License is distributed on an "AS IS" BASIS,
14
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
15
- # implied. See the License for the specific language governing
16
- # permissions and limitations under the License.
17
-
18
- # Package imports
19
- import os
20
- import sys
21
- import numpy as np
22
- from sklearn import preprocessing
23
- from sklearn import metrics
24
- from sklearn.datasets import make_blobs
25
- from sklearn.datasets import make_classification
26
- import random
27
- from math import *
28
- from decimal import Decimal
29
- import statistics
30
- import jprops
31
- from Levenshtein import distance as ld
32
- from .util import *
33
- from .sampler import *
34
-
35
- class Configuration:
36
- """
37
- Configuration management. Supports default value, mandatory value and typed value.
38
- """
39
- def __init__(self, configFile, defValues, verbose=False):
40
- """
41
- initializer
42
-
43
- Parameters
44
- configFile : config file path
45
- defValues : dictionary of default values
46
- verbose : verbosity flag
47
- """
48
- configs = {}
49
- with open(configFile) as fp:
50
- for key, value in jprops.iter_properties(fp):
51
- configs[key] = value
52
- self.configs = configs
53
- self.defValues = defValues
54
- self.verbose = verbose
55
-
56
- def override(self, configFile):
57
- """
58
- over ride configuration from file
59
-
60
- Parameters
61
- configFile : override config file path
62
- """
63
- with open(configFile) as fp:
64
- for key, value in jprops.iter_properties(fp):
65
- self.configs[key] = value
66
-
67
-
68
- def setParam(self, name, value):
69
- """
70
- override individual configuration
71
-
72
- Parameters
73
- name : config param name
74
- value : config param value
75
- """
76
- self.configs[name] = value
77
-
78
-
79
- def getStringConfig(self, name):
80
- """
81
- get string param
82
-
83
- Parameters
84
- name : config param name
85
- """
86
- if self.isNone(name):
87
- val = (None, False)
88
- elif self.isDefault(name):
89
- val = (self.handleDefault(name), True)
90
- else:
91
- val = (self.configs[name], False)
92
- if self.verbose:
93
- print( "{} {} {}".format(name, self.configs[name], val[0]))
94
- return val
95
-
96
-
97
- def getIntConfig(self, name):
98
- """
99
- get int param
100
-
101
- Parameters
102
- name : config param name
103
- """
104
- #print "%s %s" %(name,self.configs[name])
105
- if self.isNone(name):
106
- val = (None, False)
107
- elif self.isDefault(name):
108
- val = (self.handleDefault(name), True)
109
- else:
110
- val = (int(self.configs[name]), False)
111
- if self.verbose:
112
- print( "{} {} {}".format(name, self.configs[name], val[0]))
113
- return val
114
-
115
-
116
- def getFloatConfig(self, name):
117
- """
118
- get float param
119
-
120
- Parameters
121
- name : config param name
122
- """
123
- #print "%s %s" %(name,self.configs[name])
124
- if self.isNone(name):
125
- val = (None, False)
126
- elif self.isDefault(name):
127
- val = (self.handleDefault(name), True)
128
- else:
129
- val = (float(self.configs[name]), False)
130
- if self.verbose:
131
- print( "{} {} {:06.3f}".format(name, self.configs[name], val[0]))
132
- return val
133
-
134
-
135
- def getBooleanConfig(self, name):
136
- """
137
- get boolean param
138
-
139
- Parameters
140
- name : config param name
141
- """
142
- if self.isNone(name):
143
- val = (None, False)
144
- elif self.isDefault(name):
145
- val = (self.handleDefault(name), True)
146
- else:
147
- bVal = self.configs[name].lower() == "true"
148
- val = (bVal, False)
149
- if self.verbose:
150
- print( "{} {} {}".format(name, self.configs[name], val[0]))
151
- return val
152
-
153
-
154
- def getIntListConfig(self, name, delim=","):
155
- """
156
- get int list param
157
-
158
- Parameters
159
- name : config param name
160
- delim : delimiter
161
- """
162
- if self.isNone(name):
163
- val = (None, False)
164
- elif self.isDefault(name):
165
- val = (self.handleDefault(name), True)
166
- else:
167
- delSepStr = self.getStringConfig(name)
168
-
169
- #specified as range
170
- intList = strListOrRangeToIntArray(delSepStr[0])
171
- val =(intList, delSepStr[1])
172
- return val
173
-
174
- def getFloatListConfig(self, name, delim=","):
175
- """
176
- get float list param
177
-
178
- Parameters
179
- name : config param name
180
- delim : delimiter
181
- """
182
- delSepStr = self.getStringConfig(name)
183
- if self.isNone(name):
184
- val = (None, False)
185
- elif self.isDefault(name):
186
- val = (self.handleDefault(name), True)
187
- else:
188
- flList = strToFloatArray(delSepStr[0], delim)
189
- val =(flList, delSepStr[1])
190
- return val
191
-
192
-
193
- def getStringListConfig(self, name, delim=","):
194
- """
195
- get string list param
196
-
197
- Parameters
198
- name : config param name
199
- delim : delimiter
200
- """
201
- delSepStr = self.getStringConfig(name)
202
- if self.isNone(name):
203
- val = (None, False)
204
- elif self.isDefault(name):
205
- val = (self.handleDefault(name), True)
206
- else:
207
- strList = delSepStr[0].split(delim)
208
- val = (strList, delSepStr[1])
209
- return val
210
-
211
- def handleDefault(self, name):
212
- """
213
- handles default
214
-
215
- Parameters
216
- name : config param name
217
- """
218
- dVal = self.defValues[name]
219
- if (dVal[1] is None):
220
- val = dVal[0]
221
- else:
222
- raise ValueError(dVal[1])
223
- return val
224
-
225
-
226
- def isNone(self, name):
227
- """
228
- true if value is None
229
-
230
- Parameters
231
- name : config param name
232
- """
233
- return self.configs[name].lower() == "none"
234
-
235
-
236
- def isDefault(self, name):
237
- """
238
- true if the value is default
239
-
240
- Parameters
241
- name : config param name
242
- """
243
- de = self.configs[name] == "_"
244
- #print de
245
- return de
246
-
247
-
248
- def eitherOrStringConfig(self, firstName, secondName):
249
- """
250
- returns one of two string parameters
251
-
252
- Parameters
253
- firstName : first parameter name
254
- secondName : second parameter name
255
- """
256
- if not self.isNone(firstName):
257
- first = self.getStringConfig(firstName)[0]
258
- second = None
259
- if not self.isNone(secondName):
260
- raise ValueError("only one of the two parameters should be set and not both " + firstName + " " + secondName)
261
- else:
262
- if not self.isNone(secondName):
263
- second = self.getStringConfig(secondName)[0]
264
- first = None
265
- else:
266
- raise ValueError("at least one of the two parameters should be set " + firstName + " " + secondName)
267
- return (first, second)
268
-
269
-
270
- def eitherOrIntConfig(self, firstName, secondName):
271
- """
272
- returns one of two int parameters
273
-
274
- Parameters
275
- firstName : first parameter name
276
- secondName : second parameter name
277
- """
278
- if not self.isNone(firstName):
279
- first = self.getIntConfig(firstName)[0]
280
- second = None
281
- if not self.isNone(secondName):
282
- raise ValueError("only one of the two parameters should be set and not both " + firstName + " " + secondName)
283
- else:
284
- if not self.isNone(secondName):
285
- second = self.getIntConfig(secondName)[0]
286
- first = None
287
- else:
288
- raise ValueError("at least one of the two parameters should be set " + firstName + " " + secondName)
289
- return (first, second)
290
-
291
-
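
A minimal usage sketch for the Configuration class above (hedged: the import path, file name and property keys below are hypothetical; in the property file "_" means fall back to the default and "none" means absent):

from matumizi.mlutil import Configuration

#app.properties contains: train.num.iter=100 and train.learning.rate=_
defValues = {"train.num.iter" : (50, None), "train.learning.rate" : (0.01, None)}
config = Configuration("app.properties", defValues)
print(config.getIntConfig("train.num.iter")[0])        #100, taken from the file
print(config.getFloatConfig("train.learning.rate")[0]) #0.01, default applied
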
292
- class CatLabelGenerator:
293
- """
294
- label generator for categorical variables
295
- """
296
- def __init__(self, catValues, delim):
297
- """
298
- initializer
299
-
300
- Parameters
301
- catValues : dictionary of categorical values
302
- delim : delimiter
303
- """
304
- self.encoders = {}
305
- self.catValues = catValues
306
- self.delim = delim
307
- for k in self.catValues.keys():
308
- le = preprocessing.LabelEncoder()
309
- le.fit(self.catValues[k])
310
- self.encoders[k] = le
311
-
312
- def processRow(self, row):
313
- """
314
- encode row categorical values
315
-
316
- Parameters:
317
- row : data row
318
- """
319
- #print row
320
- rowArr = row.split(self.delim)
321
- for i in range(len(rowArr)):
322
- if (i in self.catValues):
323
- curVal = rowArr[i]
324
- assert curVal in self.catValues[i], "categorical value invalid"
325
- encVal = self.encoders[i].transform([curVal])
326
- rowArr[i] = str(encVal[0])
327
- return self.delim.join(rowArr)
328
-
329
- def getOrigLabels(self, indx):
330
- """
331
- get original labels
332
-
333
- Parameters:
334
- indx : column index
335
- """
336
- return self.encoders[indx].classes_
337
-
338
-
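
A usage sketch for CatLabelGenerator (same assumed import path); column 1 of the row is categorical, and LabelEncoder assigns labels in sorted class order:

from matumizi.mlutil import CatLabelGenerator

catValues = {1 : ["red", "green", "blue"]}
gen = CatLabelGenerator(catValues, ",")
print(gen.processRow("4.2,red,7"))   #4.2,2,7 since classes sort as blue, green, red
print(gen.getOrigLabels(1))          #['blue' 'green' 'red']
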
339
- class SupvLearningDataGenerator:
340
- """
341
- data generator for supervised learning
342
- """
343
- def __init__(self, configFile):
344
- """
345
- initializer
346
-
347
- Parameters
348
- configFile : config file path
349
- """
350
- defValues = dict()
351
- defValues["common.num.samp"] = (100, None)
352
- defValues["common.num.feat"] = (5, None)
353
- defValues["common.feat.trans"] = (None, None)
354
- defValues["common.feat.types"] = (None, "missing feature types")
355
- defValues["common.cat.feat.distr"] = (None, None)
356
- defValues["common.output.precision"] = (3, None)
357
- defValues["common.error"] = (0.01, None)
358
- defValues["class.gen.technique"] = ("blob", None)
359
- defValues["class.num.feat.informative"] = (2, None)
360
- defValues["class.num.feat.redundant"] = (2, None)
361
- defValues["class.num.feat.repeated"] = (0, None)
362
- defValues["class.num.feat.cat"] = (0, None)
363
- defValues["class.num.class"] = (2, None)
364
-
365
- self.config = Configuration(configFile, defValues)
366
-
367
- def genClassifierData(self):
368
- """
369
- generates classifier data
370
- """
371
- nsamp = self.config.getIntConfig("common.num.samp")[0]
372
- nfeat = self.config.getIntConfig("common.num.feat")[0]
373
- nclass = self.config.getIntConfig("class.num.class")[0]
374
- #transform with shift and scale
375
- ftrans = self.config.getFloatListConfig("common.feat.trans")[0]
376
- feTrans = dict()
377
- for i in range(0, len(ftrans), 2):
378
- tr = (ftrans[i], ftrans[i+1])
379
- indx = int(i/2)
380
- feTrans[indx] = tr
381
-
382
- ftypes = self.config.getStringListConfig("common.feat.types")[0]
383
-
384
- # categorical feature distribution
385
- feCatDist = dict()
386
- fcatdl = self.config.getStringListConfig("common.cat.feat.distr")[0]
387
- for fcatds in (fcatdl if fcatdl is not None else []):	#tolerate missing categorical feature distr
388
- fcatd = fcatds.split(":")
389
- feInd = int(fcatd[0])
390
- clVal = int(fcatd[1])
391
- key = (feInd, clVal) #feature index and class value
392
- dist = list(map(lambda i : (fcatd[i], float(fcatd[i+1])), range(2, len(fcatd), 2)))
393
- feCatDist[key] = CategoricalRejectSampler(*dist)
394
-
395
- #generation technique and label error
396
- genTechnique = self.config.getStringConfig("class.gen.technique")[0]
397
- error = self.config.getFloatConfig("common.error")[0]
398
- if genTechnique == "blob":
399
- features, claz = make_blobs(n_samples=nsamp, centers=nclass, n_features=nfeat)
400
- for i in range(nsamp): #shift and scale
401
- for j in range(nfeat):
402
- tr = feTrans[j]
403
- features[i,j] = (features[i,j] + tr[0]) * tr[1]
404
- claz = np.array(list(map(lambda c : random.randint(0, nclass-1) if random.random() < error else c, claz)))
405
- elif genTechnique == "classify":
406
- nfeatInfo = self.config.getIntConfig("class.num.feat.informative")[0]
407
- nfeatRed = self.config.getIntConfig("class.num.feat.redundant")[0]
408
- nfeatRep = self.config.getIntConfig("class.num.feat.repeated")[0]
409
- shifts = list(map(lambda i : feTrans[i][0], range(nfeat)))
410
- scales = list(map(lambda i : feTrans[i][1], range(nfeat)))
411
- features, claz = make_classification(n_samples=nsamp, n_features=nfeat, n_informative=nfeatInfo, n_redundant=nfeatRed,
412
- n_repeated=nfeatRep, n_classes=nclass, flip_y=error, shift=shifts, scale=scales)
413
- else:
414
- raise "invalid genaration technique"
415
-
416
- # add categorical features and format
417
- nCatFeat = self.config.getIntConfig("class.num.feat.cat")[0]
418
- prec = self.config.getIntConfig("common.output.precision")[0]
419
- for f , c in zip(features, claz):
420
- nfs = list(map(lambda i : self.numFeToStr(f[i], ftypes[i], prec), range(nfeat)))
421
- if nCatFeat > 0:
422
- cfs = list(map(lambda i : self.catFe(i, c, ftypes[i], feCatDist), range(nfeat, nfeat + nCatFeat, 1)))
423
- rec = ",".join(nfs) + "," + ",".join(cfs) + "," + str(c)
424
- else:
425
- rec = ",".join(nfs) + "," + str(c)
426
- yield rec
427
-
428
- def numFeToStr(self, fv, ft, prec):
429
- """
430
- numeric feature value to string
431
-
432
- Parameters
433
- fv : field value
434
- ft : field data type
435
- prec : precision
436
- """
437
- if ft == "float":
438
- s = formatFloat(prec, fv)
439
- elif ft =="int":
440
- s = str(int(fv))
441
- else:
442
- raise "invalid type expecting float or int"
443
- return s
444
-
445
- def catFe(self, i, cv, ft, feCatDist):
446
- """
447
- generate categorical feature
448
-
449
- Parameters
450
- i : col index
451
- cv : class value
452
- ft : field data type
453
- feCatDist : cat value distribution
454
- """
455
- if ft == "cat":
456
- key = (i, cv)
457
- s = feCatDist[key].sample()
458
- else:
459
- raise "invalid type expecting categorical"
460
- return s
461
-
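
A sketch of driving genClassifierData with the default blob technique (hedged: the property file below is hypothetical; keys set to "_" fall back to the defaults listed in the initializer, and "none" disables categorical features):

from matumizi.mlutil import SupvLearningDataGenerator

#class.properties contains:
#common.num.samp=10
#common.num.feat=2
#common.feat.trans=0,1.0,0,1.0
#common.feat.types=float,float
#common.cat.feat.distr=none
#common.output.precision=_
#common.error=_
#class.gen.technique=_
#class.num.class=_
#class.num.feat.cat=_
gen = SupvLearningDataGenerator("class.properties")
for rec in gen.genClassifierData():
	print(rec)		#e.g. 1.352,-0.214,0
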
462
- class RegressionDataGenerator:
463
- """
464
- data generator for regression, including square terms, cross terms, bias, noise, correlated variables
465
- and user defined function
466
- """
467
- def __init__(self, configFile, callback=None):
468
- """
469
- initializer
470
-
471
- Parameters
472
- configFile : config file path
473
- callback : user defined function
474
- """
475
- defValues = dict()
476
- defValues["common.pvar.samplers"] = (None, None)
477
- defValues["common.pvar.ranges"] = (None, None)
478
- defValues["common.linear.weights"] = (None, None)
479
- defValues["common.square.weights"] = (None, None)
480
- defValues["common.crterm.weights"] = (None, None)
481
- defValues["common.corr.params"] = (None, None)
482
- defValues["common.bias"] = (0, None)
483
- defValues["common.noise"] = (None, None)
484
- defValues["common.tvar.range"] = (None, None)
485
- defValues["common.weight.niter"] = (20, None)
486
- self.config = Configuration(configFile, defValues)
487
- self.callback = callback
488
-
489
- #samplers for predictor variables
490
- items = self.config.getStringListConfig("common.pvar.samplers")[0]
491
- self.samplers = list(map(lambda s : createSampler(s), items))
492
- self.npvar = len(self.samplers)
493
-
494
- #values range for predictor variables
495
- items = self.config.getStringListConfig("common.pvar.ranges")[0]
496
- self.pvranges = list()
497
- for i in range(0, len(items), 2):
498
- if items[i] =="none":
499
- r = None
500
- else:
501
- vmin = float(items[i])
502
- vmax = float(items[i+1])
503
- r = (vmin, vmax, vmax-vmin)
504
- self.pvranges.append(r)
505
- assertEqual(len(self.pvranges), self.npvar, "no of predictor var ranges provided is invalid")
506
-
507
-
508
- #linear weights for predictor variables
509
- self.lweights = self.config.getFloatListConfig("common.linear.weights")[0]
510
- assertEqual(len(self.lweights), self.npvar, "no of linear weights provided is invalid")
511
-
512
-
513
- #square weights for predictor variables
514
- items = self.config.getStringListConfig("common.square.weights")[0]
515
- self.sqweight = dict()
516
- for i in range(0, len(items), 2):
517
- vi = int(items[i])
518
- assertLesser(vi, self.npvar, "invalid predictor var index")
519
- wt = float(items[i+1])
520
- self.sqweight[vi] = wt
521
-
522
- #crossterm weights for predictor variables
523
- items = self.config.getStringListConfig("common.crterm.weights")[0]
524
- self.crweight = dict()
525
- for i in range(0, len(items), 3):
526
- vi = int(items[i])
527
- assertLesser(vi, self.npvar, "invalid predictor var index")
528
- vj = int(items[i+1])
529
- assertLesser(vj, self.npvar, "invalid predictor var index")
530
- wt = float(items[i+2])
531
- vp = (vi, vj)
532
- self.crweight[vp] = wt
533
-
534
- #correlated variables
535
- items = self.config.getStringListConfig("common.corr.params")[0]
536
- self.corrparams = dict()
537
- for co in items:
538
- cparam = co.split(":")
539
- vi = int(cparam[0])
540
- vj = int(cparam[1])
541
- k = (vi,vj)
542
- bias = float(cparam[2])
543
- wt = float(cparam[3])
544
- noise = float(cparam[4])
545
- roundoff = cparam[5] == "true"
546
- v = (bias, wt, noise, roundoff)
547
- self.corrparams[k] = v
548
-
549
-
550
- #bias, noise and target range values
551
- self.bias = self.config.getFloatConfig("common.bias")[0]
552
- noise = self.config.getStringListConfig("common.noise")[0]
553
- self.ndistr = noise[0]
554
- self.noise = float(noise[1])
555
- self.tvarlim = self.config.getFloatListConfig("common.tvar.range")[0]
556
-
557
- #sample
558
- niter = self.config.getIntConfig("common.weight.niter")[0]
559
- yvals = list()
560
- for i in range(niter):
561
- y = self.sample()[1]
562
- yvals.append(y)
563
-
564
- #scale weights by sampled mean and target mean
565
- my = statistics.mean(yvals)
566
- myt = (self.tvarlim[0] + self.tvarlim[1]) / 2	#mid point of target range
567
- sc = (myt - self.bias) / (my - self.bias)
568
- #print("weight scale {:.3f}".format(sc))
569
- self.lweights = list(map(lambda w : w * sc, self.lweights))
570
- #print("weights {}".format(toStrFromList(self.lweights, 3)))
571
-
572
- for k in self.sqweight.keys():
573
- self.sqweight[k] *= sc
574
-
575
- for k in self.crweight.keys():
576
- self.crweight[k] *= sc
577
-
578
-
579
- def sample(self):
580
- """
581
- sample predictor variables and target variable
582
-
583
- """
584
- pvd = list(map(lambda s : s.sample(), self.samplers))
585
-
586
- #correct for correlated variables
587
- for k in self.corrparams.keys():
588
- vi = k[0]
589
- vj = k[1]
590
- v = self.corrparams[k]
591
- bias = v[0]
592
- wt = v[1]
593
- noise = v[2]
594
- roundoff = v[3]
595
- nv = bias + wt * pvd[vi]
596
- pvd[vj] = preturbScalar(nv, noise, "normal")
597
- if roundoff:
598
- pvd[vj] = round(pvd[vj])
599
-
600
- spvd = list()
601
- lsum = self.bias
602
- for i in range(self.npvar):
603
- #range limit
604
- if self.pvranges[i] is not None:
605
- pvd[i] = rangeLimit(pvd[i], self.pvranges[i][0], self.pvranges[i][1])
606
- spvd.append(pvd[i])
607
-
608
- #scale
609
- pvd[i] = scaleMinMaxScaData(pvd[i], self.pvranges[i])
610
- lsum += self.lweights[i] * pvd[i]
611
-
612
- #square terms
613
- ssum = 0
614
- for k in self.sqweight.keys():
615
- ssum += self.sqweight[k] * pvd[k] * pvd[k]
616
-
617
- #cross terms
618
- crsum = 0
619
- for k in self.crweight.keys():
620
- vi = k[0]
621
- vj = k[1]
622
- crsum += self.crweight[k] * pvd[vi] * pvd[vj]
623
-
624
- y = lsum + ssum + crsum
625
- y = preturbScalar(y, self.noise, self.ndistr)
626
- if self.callback is not None:
627
- ufy = self.callback(spvd)
628
- y += ufy
629
- r = (spvd, y)
630
- return r
631
-
632
-
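
The target built in sample() reduces to y = bias + sum(wi * xi) + sum(sk * xk^2) + sum(cij * xi * xj) plus noise; a self-contained numpy illustration of that formula (not the class API, which needs a full config file) follows:

import numpy as np

rng = np.random.default_rng(42)
bias = 10.0
lweights = np.array([2.0, -1.5, 0.7])	#linear weights
sqweight = {0 : 0.5}			#square term on variable 0
crweight = {(0, 2) : 0.3}		#cross term between variables 0 and 2

def target(x, noiseSd=0.1):
	y = bias + lweights.dot(x)
	y += sum(w * x[k] * x[k] for k, w in sqweight.items())
	y += sum(w * x[i] * x[j] for (i, j), w in crweight.items())
	return y + rng.normal(0.0, noiseSd)

x = rng.uniform(0.0, 1.0, size=3)
print(x, target(x))
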
633
- def loadDataFile(file, delim, cols, colIndices):
634
- """
635
- loads delim separated file and extracts columns
636
-
637
- Parameters
638
- file : file path
639
- delim : delimiter
640
- cols : columns to use from file
641
- colIndices : columns to extract
642
- """
643
- data = np.loadtxt(file, delimiter=delim, usecols=cols)
644
- extrData = data[:,colIndices]
645
- return (data, extrData)
646
-
647
- def loadFeatDataFile(file, delim, cols):
648
- """
649
- loads delim separated file and extracts columns
650
-
651
- Parameters
652
- file : file path
653
- delim : delimiter
654
- cols : columns to use from file
655
- """
656
- data = np.loadtxt(file, delimiter=delim, usecols=cols)
657
- return data
658
-
659
- def extrColumns(arr, columns):
660
- """
661
- extracts columns
662
-
663
- Parameters
664
- arr : 2D array
665
- columns : columns
666
- """
667
- return arr[:, columns]
668
-
669
- def subSample(featData, clsData, subSampleRate, withReplacement):
670
- """
671
- subsample feature and class label data
672
-
673
- Parameters
674
- featData : 2D array of feature data
675
- clsData : arrray of class labels
676
- subSampleRate : fraction to be sampled
677
- withReplacement : true if sampling with replacement
678
- """
679
- sampSize = int(featData.shape[0] * subSampleRate)
680
- sampledIndx = np.random.choice(featData.shape[0],sampSize, replace=withReplacement)
681
- sampFeat = featData[sampledIndx]
682
- sampCls = clsData[sampledIndx]
683
- return(sampFeat, sampCls)
684
-
685
- def euclideanDistance(x,y):
686
- """
687
- euclidean distance
688
-
689
- Parameters
690
- x : first vector
691
- y : second vector
692
- """
693
- return sqrt(sum(pow(a-b, 2) for a, b in zip(x, y)))
694
-
695
- def squareRooted(x):
696
- """
697
- square root of sum square
698
-
699
- Parameters
700
- x : data vector
701
- """
702
- return round(sqrt(sum([a*a for a in x])),3)
703
-
704
- def cosineSimilarity(x,y):
705
- """
706
- cosine similarity
707
-
708
- Parameters
709
- x : first vector
710
- y : second vector
711
- """
712
- numerator = sum(a*b for a,b in zip(x,y))
713
- denominator = squareRooted(x) * squareRooted(y)
714
- return round(numerator / float(denominator), 3)
715
-
716
- def cosineDistance(x,y):
717
- """
718
- cosine distance
719
-
720
- Parameters
721
- x : first vector
722
- y : second vector
723
- """
724
- return 1.0 - cosineSimilarity(x,y)
725
-
726
- def manhattanDistance(x,y):
727
- """
728
- manhattan distance
729
-
730
- Parameters
731
- x : first vector
732
- y : second vector
733
- """
734
- return sum(abs(a-b) for a,b in zip(x,y))
735
-
736
- def nthRoot(value, nRoot):
737
- """
738
- nth root
739
-
740
- Parameters
741
- value : data value
742
- nRoot : root
743
- """
744
- rootValue = 1/float(nRoot)
745
- return round (Decimal(value) ** Decimal(rootValue),3)
746
-
747
- def minkowskiDistance(x,y,pValue):
748
- """
749
- minkowski distance
750
-
751
- Parameters
752
- x : first vector
753
- y : second vector
754
- pValue : power factor
755
- """
756
- return nthRoot(sum(pow(abs(a-b),pValue) for a,b in zip(x, y)), pValue)
757
-
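
Quick checks for the distance functions above (assumed import path):

from matumizi.mlutil import euclideanDistance, cosineSimilarity, manhattanDistance, minkowskiDistance

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(euclideanDistance(x, y))		#3.7417, sqrt(1 + 4 + 9)
print(cosineSimilarity(x, y))		#1.0, the vectors are parallel
print(manhattanDistance(x, y))		#6.0
print(minkowskiDistance(x, y, 3))	#3.302, cube root of 36
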
758
- def jaccardSimilarityX(x,y):
759
- """
760
- jaccard similarity
761
-
762
- Parameters
763
- x : first vector
764
- y : second vector
765
- """
766
- intersectionCardinality = len(set.intersection(*[set(x), set(y)]))
767
- unionCardinality = len(set.union(*[set(x), set(y)]))
768
- return intersectionCardinality/float(unionCardinality)
769
-
770
- def jaccardSimilarity(x,y,wx=1.0,wy=1.0):
771
- """
772
- jaccard similarity
773
-
774
- Parameters
775
- x : first vector
776
- y : second vector
777
- wx : weight for x
778
- wy : weight for y
779
- """
780
- sx = set(x)
781
- sy = set(y)
782
- sxyInt = sx.intersection(sy)
783
- intCardinality = len(sxyInt)
784
- sxIntDiff = sx.difference(sxyInt)
785
- syIntDiff = sy.difference(sxyInt)
786
- unionCardinality = len(sx.union(sy))
787
- return intCardinality/float(intCardinality + wx * len(sxIntDiff) + wy * len(syIntDiff))
788
-
789
- def levenshteinSimilarity(s1, s2):
790
- """
791
- Levenshtein similarity for strings
792
-
793
- Parameters
794
- s1 : first string
795
- s2 : second string
796
- """
797
- assert type(s1) == str and type(s2) == str, "Levenshtein similarity is for string only"
798
- d = ld(s1,s2)
799
- #print(d)
800
- l = max(len(s1),len(s2))
801
- d = 1.0 - min(d/l, 1.0)
802
- return d
803
-
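
Similar checks for the set and string similarity functions:

from matumizi.mlutil import jaccardSimilarity, levenshteinSimilarity

print(jaccardSimilarity([1, 2, 3], [2, 3, 4]))		#0.5, 2 shared out of 4 distinct values
print(levenshteinSimilarity("kitten", "sitting"))	#about 0.571, 1 - 3/7 with edit distance 3
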
804
- def norm(values, po=2):
805
- """
806
- norm
807
-
808
- Parameters
809
- values : list of values
810
- po : power
811
- """
812
- no = sum(list(map(lambda v: pow(v,po), values)))
813
- no = pow(no,1.0/po)
814
- return list(map(lambda v: v/no, values))
815
-
816
- def createOneHotVec(size, indx = -1):
817
- """
818
- random one hot vector
819
-
820
- Parameters
821
- size : vector size
822
- indx : one hot position
823
- """
824
- vec = [0] * size
825
- s = random.randint(0, size - 1) if indx < 0 else indx
826
- vec[s] = 1
827
- return vec
828
-
829
- def createAllOneHotVec(size):
830
- """
831
- create all one hot vectors
832
-
833
- Parameters
834
- size : vector size and no of vectors
835
- """
836
- vecs = list()
837
- for i in range(size):
838
- vec = [0] * size
839
- vec[i] = 1
840
- vecs.append(vec)
841
- return vecs
842
-
843
- def blockShuffle(data, blockSize):
844
- """
845
- block shuffle
846
-
847
- Parameters
848
- data : list data
849
- blockSize : block size
850
- """
851
- numBlock = int(len(data) / blockSize)
852
- remain = len(data) % blockSize
853
- numBlock += (1 if remain > 0 else 0)
854
- shuffled = list()
855
- for i in range(numBlock):
856
- b = random.randint(0, numBlock-1)
857
- beg = b * blockSize
858
- if (b < numBlock-1):
859
- end = beg + blockSize
860
- shuffled.extend(data[beg:end])
861
- else:
862
- shuffled.extend(data[beg:])
863
- return shuffled
864
-
865
- def shuffle(data, numShuffle):
866
- """
867
- shuffle data by random swapping
868
-
869
- Parameters
870
- data : list data
871
- numShuffle : no of pairwise swaps
872
- """
873
- sz = len(data)
874
- if numShuffle is None:
875
- numShuffle = int(sz / 2)
876
- for i in range(numShuffle):
877
- fi = random.randint(0, sz -1)
878
- se = random.randint(0, sz -1)
879
- tmp = data[fi]
880
- data[fi] = data[se]
881
- data[se] = tmp
882
-
883
- def randomWalk(size, start, lowStep, highStep):
884
- """
885
- random walk
886
-
887
- Parameters
888
- size : no of steps in the walk
889
- start : initial position
890
- lowStep : step min
891
- highStep : step max
892
- """
893
- cur = start
894
- for i in range(size):
895
- yield cur
896
- cur += randomFloat(lowStep, highStep)
897
-
898
- def binaryEcodeCategorical(values, value):
899
- """
900
- one hot binary encoding
901
-
902
- Parameters
903
- values : list of values
904
- value : value to be replaced with 1
905
- """
906
- size = len(values)
907
- vec = [0] * size
908
- for i in range(size):
909
- if (values[i] == value):
910
- vec[i] = 1
911
- return vec
912
-
913
- def createLabeledSeq(inputData, tw):
914
- """
915
- Creates feature, label pair from sequence data, where we have tw number of features followed by output
916
-
917
- Parameters
918
- inputData : list containing features and labels
919
- tw : no of features
920
- """
921
- features = list()
922
- labels = list()
923
- l = len(inputData)
924
- for i in range(l - tw):
925
- trainSeq = inputData[i:i+tw]
926
- trainLabel = inputData[i+tw]
927
- features.append(trainSeq)
928
- labels.append(trainLabel)
929
- return (features, labels)
930
-
931
- def createLabeledSeqFromFile(filePath, delim, index, tw):	#renamed so it does not shadow the list based version above
932
- """
933
- Creates feature, label pair from 1D sequence data in file
934
-
935
- Parameters
936
- filePath : file path
937
- delim : delimiter
938
- index : column index
939
- tw : no of features
940
- """
941
- seqData = getFileColumnAsFloat(filePath, delim, index)
942
- return createLabeledSeq(seqData, tw)
943
-
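
A sliding window example for createLabeledSeq with tw = 3:

from matumizi.mlutil import createLabeledSeq

feats, labels = createLabeledSeq([1, 2, 3, 4, 5, 6], 3)
print(feats)	#[[1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(labels)	#[4, 5, 6]
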
944
- def fromMultDimSeqToTabular(data, inpSize, seqLen):
945
- """
946
- Input shape (nrow, inpSize * seqLen) output shape(nrow * seqLen, inpSize)
947
-
948
- Parameters
949
- data : 2D array
950
- inpSize : each input size in sequence
951
- seqLen : sequence length
952
- """
953
- nrow = data.shape[0]
954
- assert data.shape[1] == inpSize * seqLen, "invalid input size or sequence length"
955
- return data.reshape(nrow * seqLen, inpSize)
956
-
957
- def fromTabularToMultDimSeq(data, inpSize, seqLen):
958
- """
959
- Input shape (nrow * seqLen, inpSize) output shape (nrow, inpSize * seqLen)
960
-
961
- Parameters
962
- data : 2D array
963
- inpSize : each input size in sequence
964
- seqLen : sequence length
965
- """
966
- nrow = int(data.shape[0] / seqLen)
967
- assert data.shape[1] == inpSize, "invalid input size"
968
- return data.reshape(nrow, seqLen * inpSize)
969
-
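
The two reshapers are inverses of each other, as this round trip shows:

import numpy as np
from matumizi.mlutil import fromMultDimSeqToTabular, fromTabularToMultDimSeq

data = np.arange(12).reshape(2, 6)		#2 rows, inpSize 2, seqLen 3
tab = fromMultDimSeqToTabular(data, 2, 3)	#shape (6, 2)
back = fromTabularToMultDimSeq(tab, 2, 3)	#shape (2, 6)
print(np.array_equal(data, back))		#True
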
970
- def difference(data, interval=1):
971
- """
972
- takes difference in time series data
973
-
974
- Parameters
975
- data : list data
976
- interval : interval for difference
977
- """
978
- diff = list()
979
- for i in range(interval, len(data)):
980
- value = data[i] - data[i - interval]
981
- diff.append(value)
982
- return diff
983
-
984
- def normalizeMatrix(data, norm, axis=1):
985
- """
986
- normalized each row of the matrix
987
-
988
- Parameters
989
- data : 2D data
990
- norm : normalization method
991
- axis : row or column
992
- """
993
- normalized = preprocessing.normalize(data,norm=norm, axis=axis)
994
- return normalized
995
-
996
- def standardizeMatrix(data, axis=0):
997
- """
998
- standardizes each column of the matrix with mean and std deviation
999
-
1000
- Parameters
1001
- data : 2D data
1002
- axis : row or column
1003
- """
1004
- standardized = preprocessing.scale(data, axis=axis)
1005
- return standardized
1006
-
1007
- def asNumpyArray(data):
1008
- """
1009
- converts to numpy array
1010
-
1011
- Parameters
1012
- data : array
1013
- """
1014
- return np.array(data)
1015
-
1016
- def perfMetric(metric, yActual, yPred, clabels=None):
1017
- """
1018
- predictive model accuracy metric
1019
-
1020
- Parameters
1021
- metric : accuracy metric
1022
- yActual : actual values array
1023
- yPred : predicted values array
1024
- clabels : class labels
1025
- """
1026
- if metric == "rsquare":
1027
- score = metrics.r2_score(yActual, yPred)
1028
- elif metric == "mae":
1029
- score = metrics.mean_absolute_error(yActual, yPred)
1030
- elif metric == "mse":
1031
- score = metrics.mean_squared_error(yActual, yPred)
1032
- elif metric == "acc":
1033
- yPred = np.rint(yPred)
1034
- score = metrics.accuracy_score(yActual, yPred)
1035
- elif metric == "mlAcc":
1036
- yPred = np.argmax(yPred, axis=1)
1037
- score = metrics.accuracy_score(yActual, yPred)
1038
- elif metric == "prec":
1039
- yPred = np.argmax(yPred, axis=1)
1040
- score = metrics.precision_score(yActual, yPred)
1041
- elif metric == "rec":
1042
- yPred = np.argmax(yPred, axis=1)
1043
- score = metrics.recall_score(yActual, yPred)
1044
- elif metric == "fone":
1045
- yPred = np.argmax(yPred, axis=1)
1046
- score = metrics.f1_score(yActual, yPred)
1047
- elif metric == "confm":
1048
- yPred = np.argmax(yPred, axis=1)
1049
- score = metrics.confusion_matrix(yActual, yPred)
1050
- elif metric == "clarep":
1051
- yPred = np.argmax(yPred, axis=1)
1052
- score = metrics.classification_report(yActual, yPred)
1053
- elif metric == "bce":
1054
- if clabels is None:
1055
- clabels = [0, 1]
1056
- score = metrics.log_loss(yActual, yPred, labels=clabels)
1057
- elif metric == "ce":
1058
- assert clabels is not None, "labels must be provided"
1059
- score = metrics.log_loss(yActual, yPred, labels=clabels)
1060
- else:
1061
- exitWithMsg("invalid prediction performance metric " + metric)
1062
- return score
1063
-
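
A sketch of perfMetric on a two class problem; the "mlAcc" metric takes argmax over per class probabilities before scoring:

import numpy as np
from matumizi.mlutil import perfMetric

yActual = np.array([0, 1, 1, 0])
yPred = np.array([[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.7, 0.3]])
print(perfMetric("mlAcc", yActual, yPred))	#1.0, argmax matches the labels
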
1064
- def scaleData(data, method):
1065
- """
1066
- scales feature data column wise
1067
-
1068
- Parameters
1069
- data : 2D array
1070
- method : scaling method
1071
- """
1072
- if method == "minmax":
1073
- scaler = preprocessing.MinMaxScaler()
1074
- data = scaler.fit_transform(data)
1075
- elif method == "zscale":
1076
- data = preprocessing.scale(data)
1077
- else:
1078
- raise ValueError("invalid scaling method")
1079
- return data
1080
-
1081
- def scaleDataWithParams(data, method, scParams):
1082
- """
1083
- scales feature data column wise
1084
-
1085
- Parameters
1086
- data : 2D array
1087
- method : scaling method
1088
- scParams : scaling parameters
1089
- """
1090
- if method == "minmax":
1091
- data = scaleMinMaxTabData(data, scParams)
1092
- elif method == "zscale":
1093
- raise ValueError("invalid scaling method")
1094
- else:
1095
- raise ValueError("invalid scaling method")
1096
- return data
1097
-
1098
- def scaleMinMaxScaData(data, minMax):
1099
- """
1100
- minmax scales scalar data
1101
-
1102
- Parameters
1103
- data : scalar data
1104
- minMax : min, max and range for each column
1105
- """
1106
- sd = (data - minMax[0]) / minMax[2]
1107
- return sd
1108
-
1109
-
1110
- def scaleMinMaxTabData(tdata, minMax):
1111
- """
1112
- for tabular scales feature data column wise using min max values for each field
1113
-
1114
- Parameters
1115
- tdata : 2D array
1116
- minMax : min, max and range for each column
1117
- """
1118
- stdata = list()
1119
- for r in tdata:
1120
- srdata = list()
1121
- for i, c in enumerate(r):
1122
- sd = (c - minMax[i][0]) / minMax[i][2]
1123
- srdata.append(sd)
1124
- stdata.append(srdata)
1125
- return stdata
1126
-
1127
- def scaleMinMax(rdata, minMax):
1128
- """
1129
- scales feature data column wise using min max values for each field
1130
-
1131
- Parameters
1132
- rdata : data array
1133
- minMax : min, max and range for each column
1134
- """
1135
- srdata = list()
1136
- for i in range(len(rdata)):
1137
- d = rdata[i]
1138
- sd = (d - minMax[i][0]) / minMax[i][2]
1139
- srdata.append(sd)
1140
- return srdata
1141
-
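
A sketch of scaleMinMax; each minMax entry is a (min, max, range) triple for one column:

from matumizi.mlutil import scaleMinMax

minMax = [(0.0, 10.0, 10.0), (100.0, 200.0, 100.0)]
print(scaleMinMax([2.5, 150.0], minMax))	#[0.25, 0.5]
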
1142
- def harmonicNum(n):
1143
- """
1144
- harmonic number
1145
-
1146
- Parameters
1147
- n : number
1148
- """
1149
- h = 0
1150
- for i in range(1, n+1, 1):
1151
- h += 1.0 / i
1152
- return h
1153
-
1154
- def digammaFun(n):
1155
- """
1156
- digamma function
1157
-
1158
- Parameters
1159
- n : number
1160
- """
1161
- #Euler Mascheroni constant
1162
- ec = 0.577216
1163
- return harmonicNum(n - 1) - ec
1164
-
1165
- def getDataPartitions(tdata, types, columns = None):
1166
- """
1167
- partitions data with the given columns and random split point defined with predicates
1168
-
1169
- Parameters
1170
- tdata : 2D array
1171
- types : data types
1172
- columns : column indexes
1173
- """
1174
- (dtypes, cvalues) = extractTypesFromString(types)
1175
- if columns is None:
1176
- ncol = len(tdata[0])
1177
- columns = list(range(ncol))
1178
- ncol = len(columns)
1179
- #print(columns)
1180
-
1181
- # partition predicates
1182
- partitions = None
1183
- for c in columns:
1184
- #print(c)
1185
- dtype = dtypes[c]
1186
- pred = list()
1187
- if dtype == "int" or dtype == "float":
1188
- (vmin, vmax) = getColMinMax(tdata, c)
1189
- r = vmax - vmin
1190
- rmin = vmin + .2 * r
1191
- rmax = vmax - .2 * r
1192
- sp = randomFloat(rmin, rmax)
1193
- if dtype == "int":
1194
- sp = int(sp)
1195
- else:
1196
- sp = "{:.3f}".format(sp)
1197
- sp = float(sp)
1198
- pred.append([c, "LT", sp])
1199
- pred.append([c, "GE", sp])
1200
- elif dtype == "cat":
1201
- cv = cvalues[c]
1202
- card = len(cv)
1203
- if card < 3:
1204
- num = 1
1205
- else:
1206
- num = randomInt(1, card - 1)
1207
- sp = selectRandomSubListFromList(cv, num)
1208
- sp = " ".join(sp)
1209
- pred.append([c, "IN", sp])
1210
- pred.append([c, "NOTIN", sp])
1211
-
1212
- #print(pred)
1213
- if partitions is None:
1214
- partitions = pred.copy()
1215
- #print("initial")
1216
- #print(partitions)
1217
- else:
1218
- #print("extension")
1219
- tparts = list()
1220
- for p in partitions:
1221
- #print(p)
1222
- l1 = p.copy()
1223
- l1.extend(pred[0])
1224
- l2 = p.copy()
1225
- l2.extend(pred[1])
1226
- #print("after extension")
1227
- #print(l1)
1228
- #print(l2)
1229
- tparts.append(l1)
1230
- tparts.append(l2)
1231
- partitions = tparts
1232
- #print("extending")
1233
- #print(partitions)
1234
-
1235
- #for p in partitions:
1236
- #print(p)
1237
- return partitions
1238
-
1239
- def genAlmostUniformDistr(size, nswap=50):
1240
- """
1241
- generate probability distribution
1242
-
1243
- Parameters
1244
- size : distr size
1245
- nswap : no of mass swaps
1246
- """
1247
- un = 1.0 / size
1248
- distr = [un] * size
1249
- distr = mutDistr(distr, 0.1 * un, nswap)
1250
- return distr
1251
-
1252
- def mutDistr(distr, shift, nswap=50):
1253
- """
1254
- mutates a probability distribution
1255
-
1256
- Parameters
1257
- distr distribution
1258
- shift : amount of shift for swap
1259
- nswap : no of mass swaps
1260
- """
1261
- size = len(distr)
1262
- for _ in range(nswap):
1263
- fi = randomInt(0, size -1)
1264
- si = randomInt(0, size -1)
1265
- while fi == si:
1266
- fi = randomInt(0, size -1)
1267
- si = randomInt(0, size -1)
1268
-
1269
- shift = randomFloat(0, shift)
1270
- t = distr[fi]
1271
- distr[fi] -= shift
1272
- if (distr[fi] < 0):
1273
- distr[fi] = 0.0
1274
- shift = t
1275
- distr[si] += shift
1276
- return distr
1277
-
1278
- def generateBinDistribution(size, ntrue):
1279
- """
1280
- generate binary array with some elements set to 1
1281
-
1282
- Parameters
1283
- size : distr size
1284
- ntrue : no of true values
1285
- """
1286
- distr = [0] * size
1287
- idxs = selectRandomSubListFromList(list(range(size)), ntrue)
1288
- for i in idxs:
1289
- distr[i] = 1
1290
- return distr
1291
-
1292
- def mutBinaryDistr(distr, nmut):
1293
- """
1294
- mutate binary distribution
1295
-
1296
- Parameters
1297
- distr : distr
1298
- nmut : no of mutations
1299
- """
1300
- idxs = selectRandomSubListFromList(list(range(len(distr))), nmut)
1301
- for i in idxs:
1302
- distr[i] = distr[i] ^ 1
1303
- return distr
1304
-
1305
- def fileSelFieldSubSeqModifierGen(filePath, column, offset, seqLen, modifier, precision, delim=","):
1306
- """
1307
- file record generator that superimposes given data in the specified segment of a column
1308
-
1309
- Parameters
1310
- filePath : file path
1311
- column : column index
1312
- offset : offset into column values
1313
- seqLen : length of subseq
1314
- modifier : data to be superimposed either list or a sampler object
1315
- precision : floating point precision
1316
- delim : delimiter
1317
- """
1318
- beg = offset
1319
- end = beg + seqLen
1320
- isList = type(modifier) == list
1321
- i = 0
1322
- for rec in fileRecGen(filePath, delim):
1323
- if i >= beg and i < end:
1324
- va = float(rec[column])
1325
- if isList:
1326
- va += modifier[i - beg]
1327
- else:
1328
- va += modifier.sample()
1329
- rec[column] = formatFloat(precision, va)
1330
- yield delim.join(rec)
1331
- i += 1
1332
-
1333
- class ShiftedDataGenerator:
1334
- """
1335
- transforms data for distribution shift
1336
- """
1337
- def __init__(self, types, tdata, addFact, multFact):
1338
- """
1339
- initializer
1340
-
1341
- Parameters
1342
- types : data types
1343
- tdata : 2D array
1344
- addFact : factor for data shift
1345
- multFact : factor for data scaling
1346
- """
1347
- (self.dtypes, self.cvalues) = extractTypesFromString(types)
1348
-
1349
- self.limits = dict()
1350
- for k,v in self.dtypes.items():
1351
- if v == "int" or v == "false":
1352
- (vmax, vmin) = getColMinMax(tdata, k)
1353
- self.limits[k] = vmax - vmin
1354
- self.addMin = - addFact / 2
1355
- self.addMax = addFact / 2
1356
- self.multMin = 1.0 - multFact / 2
1357
- self.multMax = 1.0 + multFact / 2
1358
-
1359
-
1360
-
1361
-
1362
- def transform(self, tdata):
1363
- """
1364
- linear transforms data to create distribution shift with random shift and scale
1365
-
1366
- Parameters
1367
- tdata : 2D array
1368
- """
1369
- transforms = dict()
1370
- for k,v in self.dtypes.items():
1371
- if v == "int" or v == "false":
1372
- shift = randomFloat(self.addMin, self.addMax) * self.limits[k]
1373
- scale = randomFloat(self.multMin, self.multMax)
1374
- trns = (shift, scale)
1375
- transforms[k] = trns
1376
- elif v == "cat":
1377
- transforms[k] = isEventSampled(50)
1378
-
1379
- ttdata = list()
1380
- for rec in tdata:
1381
- nrec = rec.copy()
1382
- for c in range(len(rec)):
1383
- if c in self.dtypes:
1384
- dtype = self.dtypes[c]
1385
- if dtype == "int" or dtype == "float":
1386
- (shift, scale) = transforms[c]
1387
- nval = shift + rec[c] * scale
1388
- if dtype == "int":
1389
- nrec[c] = int(nval)
1390
- else:
1391
- nrec[c] = nval
1392
- elif dtype == "cat":
1393
- cv = self.cvalues[c]
1394
- if transforms[c]:
1395
- nval = selectOtherRandomFromList(cv, rec[c])
1396
- nrec[c] = nval
1397
-
1398
- ttdata.append(nrec)
1399
-
1400
- return ttdata
1401
-
1402
- def transformSpecified(self, tdata, sshift, scale):
1403
- """
1404
- linear transforms data to create distribution shift with specified shift and scale
1405
-
1406
- Parameters
1407
- tdata : 2D array
1408
- sshift : shift factor
1409
- scale : scale factor
1410
- """
1411
- transforms = dict()
1412
- for k,v in self.dtypes.items():
1413
- if v == "int" or v == "false":
1414
- shift = sshift * self.limits[k]
1415
- trns = (shift, scale)
1416
- transforms[k] = trns
1417
- elif v == "cat":
1418
- transforms[k] = isEventSampled(50)
1419
-
1420
- ttdata = self.__scaleShift(tdata, transforms)
1421
- return ttdata
1422
-
1423
- def __scaleShift(self, tdata, transforms):
1424
- """
1425
- shifts and scales tabular data
1426
-
1427
- Parameters
1428
- tdata : 2D array
1429
- transforms : transforms to apply
1430
- """
1431
- ttdata = list()
1432
- for rec in tdata:
1433
- nrec = rec.copy()
1434
- for c in range(len(rec)):
1435
- if c in self.dtypes:
1436
- dtype = self.dtypes[c]
1437
- if dtype == "int" or dtype == "float":
1438
- (shift, scale) = transforms[c]
1439
- nval = shift + rec[c] * scale
1440
- if dtype == "int":
1441
- nrec[c] = int(nval)
1442
- else:
1443
- nrec[c] = nval
1444
- elif dtype == "cat":
1445
- cv = self.cvalues[c]
1446
- if transforms[c]:
1447
- #nval = selectOtherRandomFromList(cv, rec[c])
1448
- #nrec[c] = nval
1449
- pass
1450
-
1451
- ttdata.append(nrec)
1452
- return ttdata
1453
-
1454
- class RollingStat(object):
1455
- """
1456
- stats for rolling window
1457
- """
1458
- def __init__(self, wsize):
1459
- """
1460
- initializer
1461
-
1462
- Parameters
1463
- wsize : window size
1464
- """
1465
- self.window = list()
1466
- self.wsize = wsize
1467
- self.mean = None
1468
- self.sd = None
1469
-
1470
- def add(self, value):
1471
- """
1472
- add a value
1473
-
1474
- Parameters
1475
- value : value to add
1476
- """
1477
- self.window.append(value)
1478
- if len(self.window) > self.wsize:
1479
- self.window = self.window[1:]
1480
-
1481
- def getStat(self):
1482
- """
1483
- get rolling window mean and std deviation
1484
- """
1485
- assertGreater(len(self.window), 0, "window is empty")
1486
- if len(self.window) == 1:
1487
- self.mean = self.window[0]
1488
- self.sd = 0
1489
- else:
1490
- self.mean = statistics.mean(self.window)
1491
- self.sd = statistics.stdev(self.window, xbar=self.mean)
1492
- re = (self.mean, self.sd)
1493
- return re
1494
-
1495
- def getSize(self):
1496
- """
1497
- return window size
1498
- """
1499
- return len(self.window)
1500
-
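
A usage sketch for RollingStat with a window of size 3:

from matumizi.mlutil import RollingStat

rs = RollingStat(3)
for v in [10, 12, 11, 50]:
	rs.add(v)
print(rs.getStat())	#mean and std dev of the last 3 values [12, 11, 50]
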
matumizi/matumizi/sampler.py DELETED
@@ -1,1455 +0,0 @@
1
- #!/usr/local/bin/python3
2
-
3
- # avenir-python: Machine Learning
4
- # Author: Pranab Ghosh
5
- #
6
- # Licensed under the Apache License, Version 2.0 (the "License"); you
7
- # may not use this file except in compliance with the License. You may
8
- # obtain a copy of the License at
9
- #
10
- # http://www.apache.org/licenses/LICENSE-2.0
11
- #
12
- # Unless required by applicable law or agreed to in writing, software
13
- # distributed under the License is distributed on an "AS IS" BASIS,
14
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
15
- # implied. See the License for the specific language governing
16
- # permissions and limitations under the License.
17
-
18
- import sys
19
- import random
20
- import time
21
- import math
22
- import random
23
- import numpy as np
24
- from scipy import stats
25
- from random import randint
26
- from .util import *
27
- from .stats import Histogram
28
-
29
- def randomFloat(low, high):
30
- """
31
- sample float within range
32
-
33
- Parameters
34
- low : low value
35
- high : high value
36
- """
37
- return random.random() * (high-low) + low
38
-
39
- def randomInt(minv, maxv):
40
- """
41
- sample int within range
42
-
43
- Parameters
44
- minv : min value
45
- maxv : max value
46
- """
47
- return randint(minv, maxv)
48
-
49
- def randIndex(lData):
50
- """
51
- random index of a list
52
-
53
- Parameters
54
- lData : list data
55
- """
56
- return randint(0, len(lData)-1)
57
-
58
- def randomUniformSampled(low, high):
59
- """
60
- sample float within range
61
-
62
- Parameters
63
- low : low value
64
- high : high value
65
- """
66
- return np.random.uniform(low, high)
67
-
68
- def randomUniformSampledList(low, high, size):
69
- """
70
- sample floats within range to create list
71
-
72
- Parameters
73
- low : low value
74
- high : high value
75
- size : size of list to be returned
76
- """
77
- return np.random.uniform(low, high, size)
78
-
79
- def randomNormSampled(mean, sd):
80
- """
81
- sample float from normal
82
-
83
- Parameters
84
- mean : mean
85
- sd : std deviation
86
- """
87
- return np.random.normal(mean, sd)
88
-
89
- def randomNormSampledList(mean, sd, size):
90
- """
91
- sample float list from normal
92
-
93
- Parameters
94
- mean : mean
95
- sd : std deviation
96
- size : size of list to be returned
97
- """
98
- return np.random.normal(mean, sd, size)
99
-
100
- def randomSampledList(sampler, size):
101
- """
102
- sample list from given sampler
103
-
104
- Parameters
105
- sampler : sampler object
106
- size : size of list to be returned
107
- """
108
- return list(map(lambda i : sampler.sample(), range(size)))
109
-
110
-
111
- def minLimit(val, minv):
112
- """
113
- min limit
114
-
115
- Parameters
116
- val : value
117
- minv : min limit
118
- """
119
- if (val < minv):
120
- val = minv
121
- return val
122
-
123
-
124
- def rangeLimit(val, minv, maxv):
125
- """
126
- range limit
127
-
128
- Parameters
129
- val : value
130
- minv : min limit
131
- maxv : max limit
132
- """
133
- if (val < minv):
134
- val = minv
135
- elif (val > maxv):
136
- val = maxv
137
- return val
138
-
139
-
140
- def sampleUniform(minv, maxv):
141
- """
142
- sample int within range
143
-
144
- Parameters
145
- minv : int min limit
146
- maxv : int max limit
147
- """
148
- return randint(minv, maxv)
149
-
150
-
151
- def sampleFromBase(value, dev):
152
- """
153
- sample int wrt base
154
-
155
- Parameters
156
- value : base value
157
- dev : deviation
158
- """
159
- return randint(value - dev, value + dev)
160
-
161
-
162
- def sampleFloatFromBase(value, dev):
163
- """
164
- sample float wrt base
165
-
166
- Parameters
167
- value : base value
168
- dev : deviation
169
- """
170
- return randomFloat(value - dev, value + dev)
171
-
172
-
173
- def distrUniformWithRanndom(total, numItems, noiseLevel):
174
- """
175
- uniformly distribute with some randomness and preserves total
176
-
177
- Parameters
178
- total : total count
179
- numItems : no of bins
180
- noiseLevel : noise level fraction
181
- """
182
- perItem = total / numItems
183
- var = perItem * noiseLevel
184
- items = []
185
- for i in range(numItems):
186
- item = perItem + randomFloat(-var, var)
187
- items.append(item)
188
-
189
- #adjust last item
190
- sm = sum(items[:-1])
191
- items[-1] = total - sm
192
- return items
193
-
194
-
195
- def isEventSampled(threshold, maxv=100):
196
- """
197
- sample event which occurs if sampled below threshold
198
-
199
- Parameters
200
- threshold : threshold for sampling
201
- maxv : maximum values
202
- """
203
- return randint(0, maxv) < threshold
204
-
205
-
206
- def sampleBinaryEvents(events, probPercent):
207
- """
208
- sample binary events
209
-
210
- Parameters
211
- events : two events
212
- probPercent : probability as percentage
213
- """
214
- if (randint(0, 100) < probPercent):
215
- event = events[0]
216
- else:
217
- event = events[1]
218
- return event
219
-
220
-
221
- def addNoiseNum(value, sampler):
222
- """
223
- add noise to numeric value
224
-
225
- Parameters
226
- value : base value
227
- sampler : sampler for noise
228
- """
229
- return value * (1 + sampler.sample())
230
-
231
-
232
- def addNoiseCat(value, values, noise):
233
- """
234
- add noise to categorical value i.e with some probability change value
235
-
236
- Parameters
237
- value : cat value
238
- values : cat values
239
- noise : noise level fraction
240
- """
241
- newValue = value
242
- threshold = int(noise * 100)
243
- if (isEventSampled(threshold)):
244
- newValue = selectRandomFromList(values)
245
- while newValue == value:
246
- newValue = selectRandomFromList(values)
247
- return newValue
248
-
249
-
250
- def sampleWithReplace(data, sampSize):
251
- """
252
- sample with replacement
253
-
254
- Parameters
255
- data : array
256
- sampSize : sample size
257
- """
258
- sampled = list()
259
- le = len(data)
260
- if sampSize is None:
261
- sampSize = le
262
- for i in range(sampSize):
263
- j = random.randint(0, le - 1)
264
- sampled.append(data[j])
265
- return sampled
266
-
267
- class CumDistr:
268
- """
269
- cumulative distr
270
- """
271
-
272
- def __init__(self, data, numBins = None):
273
- """
274
- initializer
275
-
276
- Parameters
277
- data : array
278
- numBins : no of bins
279
- """
280
- if not numBins:
281
- numBins = int(len(data) / 5)
282
- res = stats.cumfreq(data, numbins=numBins)
283
- self.cdistr = res.cumcount / len(data)
284
- self.loLim = res.lowerlimit
285
- self.upLim = res.lowerlimit + res.binsize * res.cumcount.size
286
- self.binWidth = res.binsize
287
-
288
- def getDistr(self, value):
289
- """
290
- get cumulative distribution
291
-
292
- Parameters
293
- value : value
294
- """
295
- if value <= self.loLim:
296
- d = 0.0
297
- elif value >= self.upLim:
298
- d = 1.0
299
- else:
300
- bin = int((value - self.loLim) / self.binWidth)
301
- d = self.cdistr[bin]
302
- return d
303
-
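
A usage sketch for CumDistr (hedged: the import path is assumed); for a centered normal the cumulative distribution at 0 is near 0.5:

import numpy as np
from matumizi.sampler import CumDistr

data = np.random.normal(0.0, 1.0, 1000)
cd = CumDistr(data, numBins=50)
print(cd.getDistr(0.0))	#close to 0.5
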
304
- class BernoulliTrialSampler:
305
- """
306
- bernoulli trial sampler, returns True or False, or one of two given event values
307
- """
308
-
309
- def __init__(self, pr, events=None):
310
- """
311
- initializer
312
-
313
- Parameters
314
- pr : probability
315
- events : event values
316
- """
317
- self.pr = pr
318
- self.retEvent = False if events is None else True
319
- self.events = events
320
-
321
-
322
- def sample(self):
323
- """
324
- samples value
325
- """
326
- res = random.random() < self.pr
327
- if self.retEvent:
328
- res = self.events[0] if res else self.events[1]
329
- return res
330
-
331
- class PoissonSampler:
332
- """
333
- poisson sampler returns number of events
334
- """
335
- def __init__(self, rateOccur, maxSamp):
336
- """
337
- initializer
338
-
339
- Parameters
340
- rateOccur : rate of occurence
341
- maxSamp : max limit on no of samples
342
- """
343
- self.rateOccur = rateOccur
344
- self.maxSamp = int(maxSamp)
345
- self.pmax = self.calculatePr(int(self.rateOccur))	#prob near the mode, math.factorial needs an int
346
-
347
- def calculatePr(self, numOccur):
348
- """
349
- calculates probability
350
-
351
- Parameters
352
- numOccur : no of occurence
353
- """
354
- p = (self.rateOccur ** numOccur) * math.exp(-self.rateOccur) / math.factorial(numOccur)
355
- return p
356
-
357
- def sample(self):
358
- """
359
- samples value
360
- """
361
- done = False
362
- samp = 0
363
- while not done:
364
- no = randint(0, self.maxSamp)
365
- sp = randomFloat(0.0, self.pmax)
366
- ap = self.calculatePr(no)
367
- if sp < ap:
368
- done = True
369
- samp = no
370
- return samp
371
-
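
PoissonSampler draws event counts by rejection against the probability near the rate; a sketch with an integer rate:

from matumizi.sampler import PoissonSampler

ps = PoissonSampler(3, 20)		#rate 3 events per interval, samples capped at 20
print([ps.sample() for _ in range(5)])	#e.g. [2, 4, 3, 1, 5]
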
372
- class ExponentialSampler:
373
- """
374
- returns interval between events
375
- """
376
- def __init__(self, rateOccur, maxSamp = None):
377
- """
378
- initializer
379
-
380
- Parameters
381
- rateOccur : rate of occurence
382
- maxSamp : max limit on interval
383
- """
384
- self.interval = 1.0 / rateOccur
385
- self.maxSamp = int(maxSamp) if maxSamp is not None else None
386
-
387
- def sample(self):
388
- """
389
- samples value
390
- """
391
- sampled = np.random.exponential(scale=self.interval)
392
- if self.maxSamp is not None:
393
- while sampled > self.maxSamp:
394
- sampled = np.random.exponential(scale=self.interval)
395
- return sampled
396
-
397
- class UniformNumericSampler:
398
- """
399
- uniform sampler for numerical values
400
- """
401
- def __init__(self, minv, maxv):
402
- """
403
- initializer
404
-
405
- Parameters
406
- minv : min value
407
- maxv : max value
408
- """
409
- self.minv = minv
410
- self.maxv = maxv
411
-
412
- def isNumeric(self):
413
- """
414
- returns true
415
- """
416
- return True
417
-
418
- def sample(self):
419
- """
420
- samples value
421
- """
422
- samp = sampleUniform(self.minv, self.maxv) if isinstance(self.minv, int) else randomFloat(self.minv, self.maxv)
423
- return samp
424
-
425
- class UniformCategoricalSampler:
426
- """
427
- uniform sampler for categorical values
428
- """
429
- def __init__(self, cvalues):
430
- """
431
- initializer
432
-
433
- Parameters
434
- cvalues : categorical value list
435
- """
436
- self.cvalues = cvalues
437
-
438
- def isNumeric(self):
439
- return False
440
-
441
- def sample(self):
442
- """
443
- samples value
444
- """
445
- return selectRandomFromList(self.cvalues)
446
-
447
- class NormalSampler:
448
- """
449
- normal sampler
450
- """
451
- def __init__(self, mean, stdDev):
452
- """
453
- initializer
454
-
455
- Parameters
456
- mean : mean
457
- stdDev : std deviation
458
- """
459
- self.mean = mean
460
- self.stdDev = stdDev
461
- self.sampleAsInt = False
462
-
463
- def isNumeric(self):
464
- return True
465
-
466
- def sampleAsIntValue(self):
467
- """
468
- set True to sample as int
469
- """
470
- self.sampleAsInt = True
471
-
472
- def sample(self):
473
- """
474
- samples value
475
- """
476
- samp = np.random.normal(self.mean, self.stdDev)
477
- if self.sampleAsInt:
478
- samp = int(samp)
479
- return samp
480
-
481
- class LogNormalSampler:
482
- """
483
- log normal sampler
484
- """
485
- def __init__(self, mean, stdDev):
486
- """
487
- initializer
488
-
489
- Parameters
490
- mean : mean
491
- stdDev : std deviation
492
- """
493
- self.mean = mean
494
- self.stdDev = stdDev
495
-
496
- def isNumeric(self):
497
- return True
498
-
499
- def sample(self):
500
- """
501
- samples value
502
- """
503
- return np.random.lognormal(self.mean, self.stdDev)
504
-
505
- class NormalSamplerWithTrendCycle:
506
- """
507
- normal sampler with cycle and trend
508
- """
509
- def __init__(self, mean, stdDev, dmean, cycle, step=1):
510
- """
511
- initializer
512
-
513
- Parameters
514
- mean : mean
515
- stdDev : std deviation
516
- dmean : trend delta
517
- cycle : cycle values wrt base mean
518
- step : adjustment step for cycle and trend
519
- """
520
- self.mean = mean
521
- self.cmean = mean
522
- self.stdDev = stdDev
523
- self.dmean = dmean
524
- self.cycle = cycle
525
- self.clen = len(cycle) if cycle is not None else 0
526
- self.step = step
527
- self.count = 0
528
-
529
- def isNumeric(self):
530
- return True
531
-
532
- def sample(self):
533
- """
534
- samples value
535
- """
536
- s = np.random.normal(self.cmean, self.stdDev)
537
- self.count += 1
538
- if self.count % self.step == 0:
539
- cy = 0
540
- if self.clen > 1:
541
- coff = self.count % self.clen
542
- cy = self.cycle[coff]
543
- tr = self.count * self.dmean
544
- self.cmean = self.mean + tr + cy
545
- return s
546
-
547
-
548
- class ParetoSampler:
549
- """
550
- pareto sampler
551
- """
552
- def __init__(self, mode, shape):
553
- """
554
- initializer
555
-
556
- Parameters
557
- mode : mode
558
- shape : shape
559
- """
560
- self.mode = mode
561
- self.shape = shape
562
-
563
- def isNumeric(self):
564
- return True
565
-
566
- def sample(self):
567
- """
568
- samples value
569
- """
570
- return (np.random.pareto(self.shape) + 1) * self.mode
571
-
572
- class GammaSampler:
573
- """
574
- gamma sampler
575
- """
576
- def __init__(self, shape, scale):
577
- """
578
- initializer
579
-
580
- Parameters
581
- shape : shape
582
- scale : scale
583
- """
584
- self.shape = shape
585
- self.scale = scale
586
-
587
- def isNumeric(self):
588
- return True
589
-
590
- def sample(self):
591
- """
592
- samples value
593
- """
594
- return np.random.gamma(self.shape, self.scale)
595
-
596
- class GaussianRejectSampler:
597
- """
598
- gaussian sampling based on rejection sampling
599
- """
600
- def __init__(self, mean, stdDev):
601
- """
602
- initializer
603
-
604
- Parameters
605
- mean : mean
606
- stdDev : std deviation
607
- """
608
- self.mean = mean
609
- self.stdDev = stdDev
610
- self.xmin = mean - 3 * stdDev
611
- self.xmax = mean + 3 * stdDev
612
- self.ymin = 0.0
613
- self.fmax = 1.0 / (math.sqrt(2.0 * math.pi) * stdDev)
614
- self.ymax = 1.05 * self.fmax
615
- self.sampleAsInt = False
616
-
617
- def isNumeric(self):
618
- return True
619
-
620
- def sampleAsIntValue(self):
621
- """
622
- sample as int value
623
- """
624
- self.sampleAsInt = True
625
-
626
- def sample(self):
627
- """
628
- samples value
629
- """
630
- done = False
631
- samp = 0
632
- while not done:
633
- x = randomFloat(self.xmin, self.xmax)
634
- y = randomFloat(self.ymin, self.ymax)
635
- f = self.fmax * math.exp(-(x - self.mean) * (x - self.mean) / (2.0 * self.stdDev * self.stdDev))
636
- if (y < f):
637
- done = True
638
- samp = x
639
- if self.sampleAsInt:
640
- samp = int(samp)
641
- return samp
642
-
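The class above is plain rejection sampling: propose x uniformly in [mean - 3 sd, mean + 3 sd] and y uniformly in [0, ymax], and accept x when y falls under the density curve. The same generic pattern for any bounded density, as a self-contained sketch (illustrative only, not part of the module):

import random

def rejectSample(f, xmin, xmax, fmax):
    # propose uniformly in the bounding box, accept when y falls under f(x)
    while True:
        x = random.uniform(xmin, xmax)
        y = random.uniform(0.0, fmax)
        if y < f(x):
            return x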
643
- class DiscreteRejectSampler:
644
- """
645
- non parametric sampling for discrete values using given distribution based
646
- on rejection sampling
647
- """
648
- def __init__(self, xmin, xmax, step, *values):
649
- """
650
- initializer
651
-
652
- Parameters
653
- xmin : min value
654
- xmax : max value
655
- step : discrete step
656
- values : distr values
657
- """
658
- self.xmin = xmin
659
- self.xmax = xmax
660
- self.step = step
661
- self.distr = values
662
- if (len(self.distr) == 1):
663
- self.distr = self.distr[0]
664
- numSteps = int((self.xmax - self.xmin) / self.step)
665
- #print("{:.3f} {:.3f} {:.3f} {}".format(self.xmin, self.xmax, self.step, numSteps))
666
- assert len(self.distr) == numSteps + 1, "invalid number of distr values expected {}".format(numSteps + 1)
667
- self.ximin = 0
668
- self.ximax = numSteps
669
- self.pmax = float(max(self.distr))
670
-
671
- def isNumeric(self):
672
- return True
673
-
674
- def sample(self):
675
- """
676
- samples value
677
- """
678
- done = False
679
- samp = None
680
- while not done:
681
- xi = randint(self.ximin, self.ximax)
682
- #print(formatAny(xi, "xi"))
683
- ps = randomFloat(0.0, self.pmax)
684
- pa = self.distr[xi]
685
- if ps < pa:
686
- samp = self.xmin + xi * self.step
687
- done = True
688
- return samp
689
-
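Usage sketch for the discrete sampler (illustrative; assumes matumizi.sampler exposes the class): the constructor expects one distribution value per step from xmin to xmax inclusive, i.e. (xmax - xmin) / step + 1 weights.

from matumizi.sampler import DiscreteRejectSampler

# values 0, 2, 4, 6, 8, 10 with relative weights 1, 2, 4, 4, 2, 1
smp = DiscreteRejectSampler(0, 10, 2, 1, 2, 4, 4, 2, 1)
vals = [smp.sample() for _ in range(10)]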
690
-
691
- class TriangularRejectSampler:
692
- """
693
- non parametric sampling using triangular distribution based on rejection sampling
694
- """
695
- def __init__(self, xmin, xmax, vertexValue, vertexPos=None):
696
- """
697
- initializer
698
-
699
- Parameters
700
- xmin : min value
701
- xmax : max value
702
- vertexValue : distr value at vertex
703
- vertexPos : vertex position
704
- """
705
- self.xmin = xmin
706
- self.xmax = xmax
707
- self.vertexValue = vertexValue
708
- if vertexPos:
709
- assert vertexPos > xmin and vertexPos < xmax, "vertex position outside bound"
710
- self.vertexPos = vertexPos
711
- else:
712
- self.vertexPos = 0.5 * (xmin + xmax)
713
- self.s1 = vertexValue / (self.vertexPos - xmin)
714
- self.s2 = vertexValue / (xmax - self.vertexPos)
715
-
716
- def isNumeric(self):
717
- return True
718
-
719
- def sample(self):
720
- """
721
- samples value
722
- """
723
- done = False
724
- samp = None
725
- while not done:
726
- x = randomFloat(self.xmin, self.xmax)
727
- y = randomFloat(0.0, self.vertexValue)
728
- f = (x - self.xmin) * self.s1 if x < self.vertexPos else (self.xmax - x) * self.s2
729
- if (y < f):
730
- done = True
731
- samp = x
732
-
733
- return samp
734
-
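Usage sketch (illustrative): only the relative height of vertexValue matters for rejection, and the vertex defaults to the midpoint of [xmin, xmax] when vertexPos is not given.

from matumizi.sampler import TriangularRejectSampler

# triangular distribution on [2, 10] peaking at 7
smp = TriangularRejectSampler(2.0, 10.0, 1.0, vertexPos=7.0)
vals = [smp.sample() for _ in range(10)]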
735
- class NonParamRejectSampler:
736
- """
737
- non parametric sampling using given distribution based on rejection sampling
738
- """
739
- def __init__(self, xmin, binWidth, *values):
740
- """
741
- initializer
742
-
743
- Parameters
744
- xmin : min value
745
- binWidth : bin width
746
- values : distr values
747
- """
748
- self.values = values
749
- if (len(self.values) == 1):
750
- self.values = self.values[0]
751
- self.xmin = xmin
752
- self.xmax = xmin + binWidth * (len(self.values) - 1)
753
- #print(self.xmin, self.xmax, binWidth)
754
- self.binWidth = binWidth
755
- self.fmax = 0
756
- for v in self.values:
757
- if (v > self.fmax):
758
- self.fmax = v
759
- self.ymin = 0
760
- self.ymax = self.fmax
761
- self.sampleAsInt = True
762
-
763
- def isNumeric(self):
764
- return True
765
-
766
- def sampleAsFloat(self):
767
- self.sampleAsInt = False
768
-
769
- def sample(self):
770
- """
771
- samples value
772
- """
773
- done = False
774
- samp = 0
775
- while not done:
776
- if self.sampleAsInt:
777
- x = random.randint(self.xmin, self.xmax)
778
- y = random.randint(self.ymin, self.ymax)
779
- else:
780
- x = randomFloat(self.xmin, self.xmax)
781
- y = randomFloat(self.ymin, self.ymax)
782
- bin = int((x - self.xmin) / self.binWidth)
783
- f = self.values[bin]
784
- if (y < f):
785
- done = True
786
- samp = x
787
- return samp
788
-
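Usage sketch (illustrative): the varargs are per-bin relative frequencies; bins start at xmin with the given bin width, and sampling defaults to ints unless sampleAsFloat is called.

from matumizi.sampler import NonParamRejectSampler

# 5 bins of width 10 starting at 20, with relative frequencies
smp = NonParamRejectSampler(20, 10, 15, 28, 40, 25, 12)
smp.sampleAsFloat()
vals = [smp.sample() for _ in range(10)]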
789
- class JointNonParamRejectSampler:
790
- """
791
- non parametric sampling using given distribution based on rejection sampling
792
- """
793
- def __init__(self, xmin, xbinWidth, xnbin, ymin, ybinWidth, ynbin, *values):
794
- """
795
- initializer
796
-
797
- Parameters
798
- xmin : min value for x
799
- xbinWidth : bin width for x
800
- xnbin : no of bins for x
801
- ymin : min value for y
802
- ybinWidth : bin width for y
803
- ynbin : no of bins for y
804
- values : distr values
805
- """
806
- self.values = values
807
- if (len(self.values) == 1):
808
- self.values = self.values[0]
809
- assert len(self.values) == xnbin * ynbin, "wrong number of values for joint distr"
810
- self.xmin = xmin
811
- self.xmax = xmin + xbinWidth * xnbin
812
- self.xbinWidth = xbinWidth
813
- self.ymin = ymin
814
- self.ymax = ymin + ybinWidth * ynbin
815
- self.ybinWidth = ybinWidth
816
- self.pmax = max(self.values)
817
- self.values = np.array(self.values).reshape(xnbin, ynbin)
818
-
819
- def isNumeric(self):
820
- return True
821
-
822
- def sample(self):
823
- """
824
- samples value
825
- """
826
- done = False
827
- samp = 0
828
- while not done:
829
- x = randomFloat(self.xmin, self.xmax)
830
- y = randomFloat(self.ymin, self.ymax)
831
- xbin = int((x - self.xmin) / self.xbinWidth)
832
- ybin = int((y - self.ymin) / self.ybinWidth)
833
- ap = self.values[xbin][ybin]
834
- sp = randomFloat(0.0, self.pmax)
835
- if (sp < ap):
836
- done = True
837
- samp = [x,y]
838
- return samp
839
-
840
-
841
- class JointNormalSampler:
842
- """
843
- joint normal sampler
844
- """
845
- def __init__(self, *values):
846
- """
847
- initializer
848
-
849
- Parameters
850
- values : 2 mean values followed by 4 values for covar matrix
851
- """
852
- lvalues = list(values)
853
- assert len(lvalues) == 6, "incorrect number of arguments for joint normal sampler"
854
- mean = lvalues[:2]
855
- self.mean = np.array(mean)
856
- sd = lvalues[2:]
857
- self.sd = np.array(sd).reshape(2,2)
858
-
859
- def isNumeric(self):
860
- return True
861
-
862
- def sample(self):
863
- """
864
- samples value
865
- """
866
- return list(np.random.multivariate_normal(self.mean, self.sd))
867
-
868
-
869
- class MultiVarNormalSampler:
870
- """
871
- multivariate normal sampler
872
- """
873
- def __init__(self, numVar, *values):
874
- """
875
- initializer
876
-
877
- Parameters
878
- numVar : no of variables
879
- values : numVar mean values followed by numVar x numVar values for covar matrix
880
- """
881
- lvalues = list(values)
882
- assert len(lvalues) == numVar + numVar * numVar, "incorrect number of arguments for multi var normal sampler"
883
- mean = lvalues[:numVar]
884
- self.mean = np.array(mean)
885
- sd = lvalues[numVar:]
886
- self.sd = np.array(sd).reshape(numVar,numVar)
887
-
888
- def isNumeric(self):
889
- return True
890
-
891
- def sample(self):
892
- """
893
- samples value
894
- """
895
- return list(np.random.multivariate_normal(self.mean, self.sd))
896
-
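Usage sketch (illustrative): the varargs are numVar means followed by the flattened numVar x numVar covariance matrix.

from matumizi.sampler import MultiVarNormalSampler

# 2 variables: means (1.0, 2.0), then the flattened 2 x 2 covariance matrix
smp = MultiVarNormalSampler(2, 1.0, 2.0, 0.5, 0.1, 0.1, 0.3)
pair = smp.sample()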
897
- class CategoricalRejectSampler:
898
- """
899
- non parametric sampling for categorical attributes using given distribution based
900
- on rejection sampling
901
- """
902
- def __init__(self, *values):
903
- """
904
- initializer
905
-
906
- Parameters
907
- values : list of tuples, each containing a categorical value and the corresponding distr value
908
- """
909
- self.distr = values
910
- if (len(self.distr) == 1):
911
- self.distr = self.distr[0]
912
- maxv = 0
913
- for t in self.distr:
914
- if t[1] > maxv:
915
- maxv = t[1]
916
- self.maxv = maxv
917
-
918
- def sample(self):
919
- """
920
- samples value
921
- """
922
- done = False
923
- samp = ""
924
- while not done:
925
- t = self.distr[randint(0, len(self.distr)-1)]
926
- d = randomFloat(0, self.maxv)
927
- if (d <= t[1]):
928
- done = True
929
- samp = t[0]
930
- return samp
931
-
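Usage sketch (illustrative; assumes the class is importable from matumizi.sampler): the weights are relative and need not sum to 1.

from matumizi.sampler import CategoricalRejectSampler

smp = CategoricalRejectSampler(("low", 60), ("med", 30), ("high", 10))
vals = [smp.sample() for _ in range(10)]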
932
-
933
- class CategoricalSetSampler:
934
- """
935
- non parametric sampling for categorical attributes, using a uniform distribution to
936
- sample a set of values from all values
937
- """
938
- def __init__(self, *values):
939
- """
940
- initializer
941
-
942
- Parameters
943
- values : list which contains a categorical values
944
- """
945
- self.values = values
946
- if (len(self.values) == 1):
947
- self.values = self.values[0]
948
- self.sampled = list()
949
-
950
- def sample(self):
951
- """
952
- samples value only from previously unsampled values
953
- """
954
- samp = selectRandomFromList(self.values)
955
- while True:
956
- if samp in self.sampled:
957
- samp = selectRandomFromList(self.values)
958
- else:
959
- self.sampled.append(samp)
960
- break
961
- return samp
962
-
963
- def setSampled(self, sampled):
964
- """
965
- set already sampled
966
-
967
- Parameters
968
- sampled : already sampled list
969
- """
970
- self.sampled = sampled
971
-
972
- def unsample(self, sample=None):
973
- """
974
- remove from sample history
975
-
976
- Parameters
977
- sample : sample to be removed
978
- """
979
- if sample is None:
980
- self.sampled.clear()
981
- else:
982
- self.sampled.remove(sample)
983
-
984
- class DistrMixtureSampler:
985
- """
986
- distr mixture sampler
987
- """
988
- def __init__(self, mixtureWtDistr, *compDistr):
989
- """
990
- initializer
991
-
992
- Parameters
993
- mixtureWtDistr : sampler that returns index into sampler list
994
- compDistr : sampler list
995
- """
996
- self.mixtureWtDistr = mixtureWtDistr
997
- self.compDistr = compDistr
998
- if (len(self.compDistr) == 1):
999
- self.compDistr = self.compDistr[0]
1000
-
1001
- def isNumeric(self):
1002
- return True
1003
-
1004
- def sample(self):
1005
- """
1006
- samples value
1007
- """
1008
- comp = self.mixtureWtDistr.sample()
1009
-
1010
- #sample sampled comp distr
1011
- return self.compDistr[comp].sample()
1012
-
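Usage sketch (illustrative): mixtureWtDistr must return an integer index into the component list, e.g. a DiscreteRejectSampler over 0..numComp-1.

from matumizi.sampler import DistrMixtureSampler, DiscreteRejectSampler, NormalSampler

# component 0 with relative weight 70, component 1 with weight 30
wt = DiscreteRejectSampler(0, 1, 1, 70, 30)
mix = DistrMixtureSampler(wt, NormalSampler(10, 2), NormalSampler(50, 5))
vals = [mix.sample() for _ in range(10)]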
1013
- class AncestralSampler:
1014
- """
1015
- ancestral sampler using conditional distribution
1016
- """
1017
- def __init__(self, parentDistr, childDistr, numChildren):
1018
- """
1019
- initializer
1020
-
1021
- Parameters
1022
- parentDistr : parent distr
1023
- childDistr : children distribution dictionary
1024
- numChildren : no of children
1025
- """
1026
- self.parentDistr = parentDistr
1027
- self.childDistr = childDistr
1028
- self.numChildren = numChildren
1029
-
1030
- def sample(self):
1031
- """
1032
- samples value
1033
- """
1034
- parent = self.parentDistr.sample()
1035
-
1036
- #sample all children conditioned on parent
1037
- children = []
1038
- for i in range(self.numChildren):
1039
- key = (parent, i)
1040
- child = self.childDistr[key].sample()
1041
- children.append(child)
1042
- return (parent, children)
1043
-
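Usage sketch (illustrative): childDistr is keyed by (parent value, child index), so each child can have its own conditional distribution.

from matumizi.sampler import AncestralSampler, CategoricalRejectSampler, NormalSampler

parent = CategoricalRejectSampler(("a", 60), ("b", 40))
child = {("a", 0): NormalSampler(10, 1), ("b", 0): NormalSampler(20, 2)}
smp = AncestralSampler(parent, child, 1)
pv, cvs = smp.sample()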
1044
- class ClusterSampler:
1045
- """
1046
- sample cluster and then sample member of sampled cluster
1047
- """
1048
- def __init__(self, clusters, *clustDistr):
1049
- """
1050
- initializer
1051
-
1052
- Parameters
1053
- clusters : dictionary of clusters
1054
- clustDistr : distr for clusters
1055
- """
1056
- self.sampler = CategoricalRejectSampler(*clustDistr)
1057
- self.clusters = clusters
1058
-
1059
- def sample(self):
1060
- """
1061
- samples value
1062
- """
1063
- cluster = self.sampler.sample()
1064
- member = random.choice(self.clusters[cluster])
1065
- return (cluster, member)
1066
-
1067
-
1068
- class MetropolitanSampler:
1069
- """
1070
- Metropolis sampler
1071
- """
1072
- def __init__(self, propStdDev, min, binWidth, values):
1073
- """
1074
- initializer
1075
-
1076
- Parameters
1077
- propStdDev : proposal distr std dev
1078
- min : min domain value for target distr
1079
- binWidth : bin width
1080
- values : target distr values
1081
- """
1082
- self.targetDistr = Histogram.createInitialized(min, binWidth, values)
1083
- self.propsalDistr = GaussianRejectSampler(0, propStdDev)
1084
- self.proposalMixture = False
1085
-
1086
- # bootstrap sample
1087
- (minv, maxv) = self.targetDistr.getMinMax()
1088
- self.curSample = random.randint(minv, maxv)
1089
- self.curDistr = self.targetDistr.value(self.curSample)
1090
- self.transCount = 0
1091
-
1092
- def initialize(self):
1093
- """
1094
- initialize
1095
- """
1096
- (minv, maxv) = self.targetDistr.getMinMax()
1097
- self.curSample = random.randint(minv, maxv)
1098
- self.curDistr = self.targetDistr.value(self.curSample)
1099
- self.transCount = 0
1100
-
1101
- def setProposalDistr(self, propsalDistr):
1102
- """
1103
- set custom proposal distribution
1104
-
1105
- Parameters
1106
- propsalDistr : proposal distribution
1107
- """
1108
- self.propsalDistr = propsalDistr
1109
-
1110
-
1111
- def setGlobalProposalDistr(self, globPropStdDev, proposalChoiceThreshold):
1112
- """
1113
- set global proposal distribution used as part of a proposal mixture
1114
-
1115
- Parameters
1116
- globPropStdDev : global proposal distr std deviation
1117
- proposalChoiceThreshold : threshold for using global proposal distribution
1118
- """
1119
- self.globalProposalDistr = GaussianRejectSampler(0, globPropStdDev)
1120
- self.proposalChoiceThreshold = proposalChoiceThreshold
1121
- self.proposalMixture = True
1122
-
1123
- def sample(self):
1124
- """
1125
- samples value
1126
- """
1127
- nextSample = self.proposalSample(1)
1128
- self.targetSample(nextSample)
1129
- return self.curSample
1130
-
1131
- def proposalSample(self, skip):
1132
- """
1133
- sample from proposal distribution
1134
-
1135
- Parameters
1136
- skip : no of samples to skip
1137
- """
1138
- for i in range(skip):
1139
- if not self.proposalMixture:
1140
- #one proposal distr
1141
- nextSample = self.curSample + self.propsalDistr.sample()
1142
- nextSample = self.targetDistr.boundedValue(nextSample)
1143
- else:
1144
- #mixture of proposal distr
1145
- if random.random() < self.proposalChoiceThreshold:
1146
- nextSample = self.curSample + self.propsalDistr.sample()
1147
- else:
1148
- nextSample = self.curSample + self.globalProposalDistr.sample()
1149
- nextSample = self.targetDistr.boundedValue(nextSample)
1150
-
1151
- return nextSample
1152
-
1153
- def targetSample(self, nextSample):
1154
- """
1155
- target sample
1156
-
1157
- Parameters
1158
- nextSample : proposal distr sample
1159
- """
1160
- nextDistr = self.targetDistr.value(nextSample)
1161
-
1162
- transition = False
1163
- if nextDistr > self.curDistr:
1164
- transition = True
1165
- else:
1166
- distrRatio = float(nextDistr) / self.curDistr
1167
- if random.random() < distrRatio:
1168
- transition = True
1169
-
1170
- if transition:
1171
- self.curSample = nextSample
1172
- self.curDistr = nextDistr
1173
- self.transCount += 1
1174
-
1175
-
1176
- def subSample(self, skip):
1177
- """
1178
- sub sample
1179
-
1180
- Parameters
1181
- skip : no of samples to skip
1182
- """
1183
- nextSample = self.proposalSample(skip)
1184
- self.targetSample(nextSample)
1185
- return self.curSample
1186
-
1187
- def setMixtureProposal(self, globPropStdDev, mixtureThreshold):
1188
- """
1189
- mixture proposal
1190
-
1191
- Parameters
1192
- globPropStdDev : global proposal distr std deviation
1193
- mixtureThreshold : threshold for using global proposal distribution
1194
- """
1195
- self.globalProposalDistr = GaussianRejectSampler(0, globPropStdDev)
1196
- self.mixtureThreshold = mixtureThreshold
1197
-
1198
- def sampleProposal(self):
1199
- """
1200
- sample from proposal distr
1201
-
1202
- """
1203
- if self.globalProposalDistr is None:
1204
- proposal = self.propsalDistr.sample()
1205
- else:
1206
- if random.random() < self.mixtureThreshold:
1207
- proposal = self.propsalDistr.sample()
1208
- else:
1209
- proposal = self.globalProposalDistr.sample()
1210
-
1211
- return proposal
1212
-
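The class above implements a Metropolis chain over a histogram target: a Gaussian step is proposed from the current sample and accepted outright when the target density rises, otherwise with probability equal to the density ratio. The bare accept rule, as a self-contained sketch (illustrative only, not the module's API):

import math
import random

def metropolisStep(x, target, propStdDev):
    # one Metropolis update for a 1-D unnormalized target density
    xn = x + random.gauss(0.0, propStdDev)
    ratio = target(xn) / target(x)
    return xn if random.random() < min(1.0, ratio) else x

# example: unnormalized standard normal target
target = lambda x: math.exp(-0.5 * x * x)
x = 0.0
samples = []
for _ in range(1000):
    x = metropolisStep(x, target, 1.0)
    samples.append(x)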
1213
- class PermutationSampler:
1214
- """
1215
- permutation sampler by shuffling a list
1216
- """
1217
- def __init__(self):
1218
- """
1219
- initialize
1220
- """
1221
- self.values = None
1222
- self.numShuffles = None
1223
-
1224
- @staticmethod
1225
- def createSamplerWithValues(values, *numShuffles):
1226
- """
1227
- creator with values
1228
-
1229
- Parameters
1230
- values : list data
1231
- numShuffles : no of shuffles or range of no of shuffles
1232
- """
1233
- sampler = PermutationSampler()
1234
- sampler.values = values
1235
- sampler.numShuffles = numShuffles
1236
- return sampler
1237
-
1238
- @staticmethod
1239
- def createSamplerWithRange(minv, maxv, *numShuffles):
1240
- """
1241
- creator with range min and max
1242
-
1243
- Parameters
1244
- minv : min of range
1245
- maxv : max of range
1246
- numShuffles : no of shuffles or range of no of shuffles
1247
- """
1248
- sampler = PermutationSampler()
1249
- sampler.values = list(range(minv, maxv + 1))
1250
- sampler.numShuffles = numShuffles
1251
- return sampler
1252
-
1253
- def sample(self):
1254
- """
1255
- sample new permutation
1256
- """
1257
- cloned = self.values.copy()
1258
- shuffle(cloned, *self.numShuffles)
1259
- return cloned
1260
-
1261
- class SpikeyDataSampler:
1262
- """
1263
- samples spikey data
1264
- """
1265
- def __init__(self, intvMean, intvScale, distr, spikeValueMean, spikeValueStd, spikeMaxDuration, baseValue = 0):
1266
- """
1267
- initializer
1268
-
1269
- Parameters
1270
- intvMean : interval mean
1271
- intvScale : interval std dev
1272
- distr : type of distr for interval
1273
- spikeValueMean : spike value mean
1274
- spikeValueStd : spike value std dev
1275
- spikeMaxDuration : max duration for spike
1276
- baseValue : base or offset value
1277
- """
1278
- if distr == "norm":
1279
- self.intvSampler = NormalSampler(intvMean, intvScale)
1280
- elif distr == "expo":
1281
- rate = 1.0 / intvScale
1282
- self.intvSampler = ExponentialSampler(rate)
1283
- else:
1284
- raise ValueError("invalid distribution")
1285
-
1286
- self.spikeSampler = NormalSampler(spikeValueMean, spikeValueStd)
1287
- self.spikeMaxDuration = spikeMaxDuration
1288
- self.baseValue = baseValue
1289
- self.inSpike = False
1290
- self.spikeCount = 0
1291
- self.baseCount = 0
1292
- self.baseLength = int(self.intvSampler.sample())
1293
- self.spikeValues = list()
1294
- self.spikeLength = None
1295
-
1296
- def sample(self):
1297
- """
1298
- sample new value
1299
- """
1300
- if self.baseCount <= self.baseLength:
1301
- sampled = self.baseValue
1302
- self.baseCount += 1
1303
- else:
1304
- if not self.inSpike:
1305
- #starting spike
1306
- spikeVal = self.spikeSampler.sample()
1307
- self.spikeLength = sampleUniform(1, self.spikeMaxDuration)
1308
- spikeMaxPos = 0 if self.spikeLength == 1 else sampleUniform(0, self.spikeLength-1)
1309
- self.spikeValues.clear()
1310
- for i in range(self.spikeLength):
1311
- if i < spikeMaxPos:
1312
- frac = (i + 1) / (spikeMaxPos + 1)
1313
- frac = sampleFloatFromBase(frac, 0.1 * frac)
1314
- elif i > spikeMaxPos:
1315
- frac = (self.spikeLength - i) / (self.spikeLength - spikeMaxPos)
1316
- frac = sampleFloatFromBase(frac, 0.1 * frac)
1317
- else:
1318
- frac = 1.0
1319
- self.spikeValues.append(frac * spikeVal)
1320
- self.inSpike = True
1321
- self.spikeCount = 0
1322
-
1323
-
1324
- sampled = self.spikeValues[self.spikeCount]
1325
- self.spikeCount += 1
1326
-
1327
- if self.spikeCount == self.spikeLength:
1328
- #ending spike
1329
- self.baseCount = 0
1330
- self.baseLength = int(self.intvSampler.sample())
1331
- self.inSpike = False
1332
-
1333
- return sampled
1334
-
1335
-
1336
- class EventSampler:
1337
- """
1338
- sample event
1339
- """
1340
- def __init__(self, intvSampler, valSampler=None):
1341
- """
1342
- initializer
1343
-
1344
- Parameters
1345
- intvSampler : interval sampler
1346
- valSampler : value sampler
1347
- """
1348
- self.intvSampler = intvSampler
1349
- self.valSampler = valSampler
1350
- self.trigger = int(self.intvSampler.sample())
1351
- self.count = 0
1352
-
1353
- def reset(self):
1354
- """
1355
- reset trigger
1356
- """
1357
- self.trigger = int(self.intvSampler.sample())
1358
- self.count = 0
1359
-
1360
- def sample(self):
1361
- """
1362
- sample event
1363
- """
1364
- if self.count == self.trigger:
1365
- sampled = self.valSampler.sample() if self.valSampler is not None else 1.0
1366
- self.trigger = int(self.intvSampler.sample())
1367
- self.count = 0
1368
- else:
1369
- sampled = 0.0
1370
- self.count += 1
1371
- return sampled
1372
-
1373
-
1374
-
1375
-
1376
- def createSampler(data):
1377
- """
1378
- create sampler
1379
-
1380
- Parameters
1381
- data : sampler description
1382
- """
1383
- #print(data)
1384
- items = data.split(":")
1385
- size = len(items)
1386
- dtype = items[-1]
1387
- stype = items[-2]
1388
- #print("sampler data {}".format(data))
1389
- #print("sampler {}".format(stype))
1390
- sampler = None
1391
- if stype == "uniform":
1392
- if dtype == "int":
1393
- min = int(items[0])
1394
- max = int(items[1])
1395
- sampler = UniformNumericSampler(min, max)
1396
- elif dtype == "float":
1397
- min = float(items[0])
1398
- max = float(items[1])
1399
- sampler = UniformNumericSampler(min, max)
1400
- elif dtype == "categorical":
1401
- values = items[:-2]
1402
- sampler = UniformCategoricalSampler(values)
1403
- elif stype == "normal":
1404
- mean = float(items[0])
1405
- sd = float(items[1])
1406
- sampler = NormalSampler(mean, sd)
1407
- if dtype == "int":
1408
- sampler.sampleAsIntValue()
1409
- elif stype == "nonparam":
1410
- if dtype == "int" or dtype == "float":
1411
- min = int(items[0])
1412
- binWidth = int(items[1])
1413
- values = items[2:-2]
1414
- values = list(map(lambda v: int(v), values))
1415
- sampler = NonParamRejectSampler(min, binWidth, values)
1416
- if dtype == "float":
1417
- sampler.sampleAsFloat()
1418
- elif dtype == "categorical":
1419
- values = list()
1420
- for i in range(0, size-2, 2):
1421
- cval = items[i]
1422
- dist = int(items[i+1])
1423
- pair = (cval, dist)
1424
- values.append(pair)
1425
- sampler = CategoricalRejectSampler(values)
1426
- elif dtype == "scategorical":
1427
- vfpath = items[0]
1428
- values = getFileLines(vfpath, None)
1429
- sampler = CategoricalSetSampler(values)
1430
- elif stype == "discrete":
1431
- vmin = int(items[0])
1432
- vmax = int(items[1])
1433
- step = int(items[2])
1434
- values = list(map(lambda i : int(items[i]), range(3, len(items)-2)))
1435
- sampler = DiscreteRejectSampler(vmin, vmax, step, values)
1436
- elif stype == "bernauli":
1437
- pr = float(items[0])
1438
- events = None
1439
- if len(items) == 5:
1440
- events = list()
1441
- if dtype == "int":
1442
- events.append(int(items[1]))
1443
- events.append(int(items[2]))
1444
- elif dtype == "categorical":
1445
- events.append(items[1])
1446
- events.append(items[2])
1447
- sampler = BernoulliTrialSampler(pr, events)
1448
- else:
1449
- raise ValueError("invalid sampler type " + stype)
1450
- return sampler
1451
-
1452
-
1453
-
1454
-
1455
-
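createSampler parses a colon separated description whose last two fields are the sampler type and the data type; everything before them is sampler specific. A few illustrative descriptions (assuming createSampler is importable from matumizi.sampler):

from matumizi.sampler import createSampler

ns = createSampler("100:15:normal:float")    # normal, mean 100, sd 15
us = createSampler("1:6:uniform:int")        # uniform int in [1, 6]
cs = createSampler("low:60:med:30:high:10:nonparam:categorical")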
matumizi/matumizi/stats.py DELETED
@@ -1,496 +0,0 @@
1
- #!/usr/local/bin/python3
2
-
3
- # avenir-python: Machine Learning
4
- # Author: Pranab Ghosh
5
- #
6
- # Licensed under the Apache License, Version 2.0 (the "License"); you
7
- # may not use this file except in compliance with the License. You may
8
- # obtain a copy of the License at
9
- #
10
- # http://www.apache.org/licenses/LICENSE-2.0
11
- #
12
- # Unless required by applicable law or agreed to in writing, software
13
- # distributed under the License is distributed on an "AS IS" BASIS,
14
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
15
- # implied. See the License for the specific language governing
16
- # permissions and limitations under the License.
17
-
18
- import sys
19
- import random
20
- import time
21
- import math
22
- import numpy as np
23
- import statistics
24
- from .util import *
25
-
26
- """
27
- histogram class
28
- """
29
- class Histogram:
30
- def __init__(self, min, binWidth):
31
- """
32
- initializer
33
-
34
- Parameters
35
- min : min x
36
- binWidth : bin width
37
- """
38
- self.xmin = min
39
- self.binWidth = binWidth
40
- self.normalized = False
41
-
42
- @classmethod
43
- def createInitialized(cls, xmin, binWidth, values):
44
- """
45
- create histogram instance with min domain, bin width and values
46
-
47
- Parameters
48
- min : min x
49
- binWidth : bin width
50
- values : y values
51
- """
52
- instance = cls(xmin, binWidth)
53
- instance.xmax = xmin + binWidth * (len(values) - 1)
54
- instance.ymin = 0
55
- instance.bins = np.array(values)
56
- instance.fmax = 0
57
- for v in values:
58
- if (v > instance.fmax):
59
- instance.fmax = v
60
- instance.ymin = 0.0
61
- instance.ymax = instance.fmax
62
- return instance
63
-
64
- @classmethod
65
- def createWithNumBins(cls, values, numBins=20):
66
- """
67
- create histogram instance values and no of bins
68
-
69
- Parameters
70
- values : y values
71
- numBins : no of bins
72
- """
73
- xmin = min(values)
74
- xmax = max(values)
75
- binWidth = (xmax + .01 - (xmin - .01)) / numBins
76
- instance = cls(xmin, binWidth)
77
- instance.xmax = xmax
78
- instance.numBin = numBins
79
- instance.bins = np.zeros(instance.numBin)
80
- for v in values:
81
- instance.add(v)
82
- return instance
83
-
84
- @classmethod
85
- def createUninitialized(cls, xmin, xmax, binWidth):
86
- """
87
- create histogram instance with no y values using domain min , max and bin width
88
-
89
- Parameters
90
- min : min x
91
- max : max x
92
- binWidth : bin width
93
- """
94
- instance = cls(xmin, binWidth)
95
- instance.xmax = xmax
96
- instance.numBin = int((xmax - xmin) / binWidth) + 1
97
- instance.bins = np.zeros(instance.numBin)
98
- return instance
99
-
100
- def initialize(self):
101
- """
102
- set y values to 0
103
- """
104
- self.bins = np.zeros(self.numBin)
105
-
106
- def add(self, value):
107
- """
108
- adds a value to a bin
109
-
110
- Parameters
111
- value : value
112
- """
113
- bin = int((value - self.xmin) / self.binWidth)
114
- if (bin < 0 or bin > self.numBin - 1):
115
- print (bin)
116
- raise ValueError("outside histogram range")
117
- self.bins[bin] += 1.0
118
-
119
- def normalize(self):
120
- """
121
- normalize bin counts
122
- """
123
- if not self.normalized:
124
- total = self.bins.sum()
125
- self.bins = np.divide(self.bins, total)
126
- self.normalized = True
127
-
128
- def cumDistr(self):
129
- """
130
- cumulative dists
131
- """
132
- self.normalize()
133
- self.cbins = np.cumsum(self.bins)
134
- return self.cbins
135
-
136
- def distr(self):
137
- """
138
- distr
139
- """
140
- self.normalize()
141
- return self.bins
142
-
143
-
144
- def percentile(self, percent):
145
- """
146
- return value corresponding to a percentile
147
-
148
- Parameters
149
- percent : percentile value
150
- """
151
- if self.cbins is None:
152
- raise ValueError("cumulative distribution is not available")
153
-
154
- for i,cuml in enumerate(self.cbins):
155
- if percent <= cuml:
156
- pcuml = self.cbins[i-1] if i > 0 else 0.0
157
- value = self.xmin + (i * self.binWidth) - (self.binWidth / 2) + (percent - pcuml) * self.binWidth / (cuml - pcuml)
158
- break
159
- return value
160
-
161
- def max(self):
162
- """
163
- return max bin value
164
- """
165
- return self.bins.max()
166
-
167
- def value(self, x):
168
- """
169
- return a bin value
170
-
171
- Parameters
172
- x : x value
173
- """
174
- bin = int((x - self.xmin) / self.binWidth)
175
- f = self.bins[bin]
176
- return f
177
-
178
- def bin(self, x):
179
- """
180
- return a bin index
181
-
182
- Parameters
183
- x : x value
184
- """
185
- return int((x - self.xmin) / self.binWidth)
186
-
187
- def cumValue(self, x):
188
- """
189
- return a cumulative bin value
190
-
191
- Parameters
192
- x : x value
193
- """
194
- bin = int((x - self.xmin) / self.binWidth)
195
- c = self.cbins[bin]
196
- return c
197
-
198
-
199
- def getMinMax(self):
200
- """
201
- returns x min and x max
202
- """
203
- return (self.xmin, self.xmax)
204
-
205
- def boundedValue(self, x):
206
- """
207
- return x bounded by min and max
208
-
209
- Parameters
210
- x : x value
211
- """
212
- if x < self.xmin:
213
- x = self.xmin
214
- elif x > self.xmax:
215
- x = self.xmax
216
- return x
217
-
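A minimal usage sketch for the histogram (illustrative; assumes the class is importable from matumizi.stats):

import random
from matumizi.stats import Histogram

data = [random.gauss(50, 10) for _ in range(1000)]
hist = Histogram.createWithNumBins(data, 20)
distr = hist.distr()        # normalized bin frequencies
cdistr = hist.cumDistr()    # cumulative distribution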
218
- """
219
- categorical histogram class
220
- """
221
- class CatHistogram:
222
- def __init__(self):
223
- """
224
- initializer
225
- """
226
- self.binCounts = dict()
227
- self.counts = 0
228
- self.normalized = False
229
-
230
- def add(self, value):
231
- """
232
- adds a value to a bin
233
-
234
- Parameters
235
- value : value to add
236
- """
237
- addToKeyedCounter(self.binCounts, value)
238
- self.counts += 1
239
-
240
- def normalize(self):
241
- """
242
- normalize
243
- """
244
- if not self.normalized:
245
- self.binCounts = dict(map(lambda r : (r[0],r[1] / self.counts), self.binCounts.items()))
246
- self.normalized = True
247
-
248
- def getMode(self):
249
- """
250
- get mode
251
- """
252
- maxk = None
253
- maxv = 0
254
- #print(self.binCounts)
255
- for k,v in self.binCounts.items():
256
- if v > maxv:
257
- maxk = k
258
- maxv = v
259
- return (maxk, maxv)
260
-
261
- def getEntropy(self):
262
- """
263
- get entropy
264
- """
265
- self.normalize()
266
- entr = 0
267
- #print(self.binCounts)
268
- for k,v in self.binCounts.items():
269
- entr -= v * math.log(v)
270
- return entr
271
-
272
- def getUniqueValues(self):
273
- """
274
- get unique values
275
- """
276
- return list(self.binCounts.keys())
277
-
278
- def getDistr(self):
279
- """
280
- get distribution
281
- """
282
- self.normalize()
283
- return self.binCounts.copy()
284
-
285
- class RunningStat:
286
- """
287
- running stat class
288
- """
289
- def __init__(self):
290
- """
291
- initializer
292
- """
293
- self.sum = 0.0
294
- self.sumSq = 0.0
295
- self.count = 0
296
-
297
- @staticmethod
298
- def create(count, sum, sumSq):
299
- """
300
- creates instance
301
-
302
- Parameters
303
- count : count of values
- sum : sum of values
304
- sumSq : sum of values squared
305
- """
306
- rs = RunningStat()
307
- rs.sum = sum
308
- rs.sumSq = sumSq
309
- rs.count = count
310
- return rs
311
-
312
- def add(self, value):
313
- """
314
- adds new value
315
-
316
- Parameters
317
- value : value to add
318
- """
319
- self.sum += value
320
- self.sumSq += (value * value)
321
- self.count += 1
322
-
323
- def getStat(self):
324
- """
325
- return mean and std deviation
326
- """
327
- mean = self.sum / self.count
328
- t = self.sumSq / (self.count - 1) - mean * mean * self.count / (self.count - 1)
329
- sd = math.sqrt(t)
330
- re = (mean, sd)
331
- return re
332
-
333
- def addGetStat(self,value):
334
- """
335
- calculate mean and std deviation with new value added
336
-
337
- Parameters
338
- value : value to add
339
- """
340
- self.add(value)
341
- re = self.getStat()
342
- return re
343
-
344
- def getCount(self):
345
- """
346
- return count
347
- """
348
- return self.count
349
-
350
- def getState(self):
351
- """
352
- return state
353
- """
354
- s = (self.count, self.sum, self.sumSq)
355
- return s
356
-
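RunningStat keeps only count, sum and sum of squares, from which mean = sum / n and the unbiased variance sumSq / (n - 1) - mean^2 * n / (n - 1) are recovered. Usage sketch (illustrative; assumes the class is importable from matumizi.stats):

from matumizi.stats import RunningStat

rs = RunningStat()
for v in [12.0, 15.0, 11.0, 14.0]:
    rs.add(v)
mean, sd = rs.getStat()

# the state triple can be persisted and restored without the raw values
count, total, totalSq = rs.getState()
rs2 = RunningStat.create(count, total, totalSq)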
357
- class SlidingWindowStat:
358
- """
359
- sliding window stats
360
- """
361
- def __init__(self):
362
- """
363
- initializer
364
- """
365
- self.sum = 0.0
366
- self.sumSq = 0.0
367
- self.count = 0
368
- self.values = None
369
-
370
- @staticmethod
371
- def create(values, sum, sumSq):
372
- """
373
- creates instance
374
-
375
- Parameters
376
- sum : sum of values
377
- sumSq : sum of values squared
378
- """
379
- sws = SlidingWindowStat()
380
- sws.sum = sum
381
- sws.sumSq = sumSq
382
- sws.values = values.copy()
383
- sws.count = len(sws.values)
384
- return sws
385
-
386
- @staticmethod
387
- def initialize(values):
388
- """
389
- creates instance from a list of values
390
-
391
- Parameters
392
- values : list of values
393
- """
394
- sws = SlidingWindowStat()
395
- sws.values = values.copy()
396
- for v in sws.values:
397
- sws.sum += v
398
- sws.sumSq += v * v
399
- sws.count = len(sws.values)
400
- return sws
401
-
402
- @staticmethod
403
- def createEmpty(count):
404
- """
405
- creates empty instance with given window size
406
-
407
- Parameters
408
- count : count of values
409
- """
410
- sws = SlidingWindowStat()
411
- sws.count = count
412
- sws.values = list()
413
- return sws
414
-
415
- def add(self, value):
416
- """
417
- adds new value
418
-
419
- Parameters
420
- value : value to add
421
- """
422
- self.values.append(value)
423
- if len(self.values) > self.count:
424
- self.sum += value - self.values[0]
425
- self.sumSq += (value * value) - (self.values[0] * self.values[0])
426
- self.values.pop(0)
427
- else:
428
- self.sum += value
429
- self.sumSq += (value * value)
430
-
431
-
432
- def getStat(self):
433
- """
434
- calculate mean and std deviation
435
- """
436
- mean = self.sum / self.count
437
- t = self.sumSq / (self.count - 1) - mean * mean * self.count / (self.count - 1)
438
- sd = math.sqrt(t)
439
- re = (mean, sd)
440
- return re
441
-
442
- def addGetStat(self,value):
443
- """
444
- calculate mean and std deviation with new value added
445
- """
446
- self.add(value)
447
- re = self.getStat()
448
- return re
449
-
450
- def getCount(self):
451
- """
452
- return count
453
- """
454
- return self.count
455
-
456
- def getCurSize(self):
457
- """
458
- return current number of values in the window
459
- """
460
- return len(self.values)
461
-
462
- def getState(self):
463
- """
464
- return state
465
- """
466
- s = (self.count, self.sum, self.sumSq)
467
- return s
468
-
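SlidingWindowStat keeps the same sums over a fixed-size window, subtracting the oldest value as each new one arrives. Usage sketch (illustrative):

from matumizi.stats import SlidingWindowStat

sws = SlidingWindowStat.initialize([10.0, 12.0, 11.0])
mean, sd = sws.addGetStat(20.0)    # 10.0 drops out of the 3 value window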
469
-
470
- def basicStat(ldata):
471
- """
472
- mean and std dev
473
-
474
- Parameters
475
- ldata : list of values
476
- """
477
- m = statistics.mean(ldata)
478
- s = statistics.stdev(ldata, xbar=m)
479
- r = (m, s)
480
- return r
481
-
482
- def getFileColumnStat(filePath, col, delem=","):
483
- """
484
- gets stats for a file column
485
-
486
- Parameters
487
- filePath : file path
488
- col : col index
489
- delem : field delimiter
490
- """
491
- rs = RunningStat()
492
- for rec in fileRecGen(filePath, delem):
493
- va = float(rec[col])
494
- rs.add(va)
495
-
496
- return rs.getStat()
matumizi/matumizi/util.py DELETED
@@ -1,2345 +0,0 @@
1
- #!/usr/local/bin/python3
2
-
3
- # Author: Pranab Ghosh
4
- #
5
- # Licensed under the Apache License, Version 2.0 (the "License"); you
6
- # may not use this file except in compliance with the License. You may
7
- # obtain a copy of the License at
8
- #
9
- # http://www.apache.org/licenses/LICENSE-2.0
10
- #
11
- # Unless required by applicable law or agreed to in writing, software
12
- # distributed under the License is distributed on an "AS IS" BASIS,
13
- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
14
- # implied. See the License for the specific language governing
15
- # permissions and limitations under the License.
16
-
17
- import os
18
- import sys
19
- from random import randint
20
- import random
21
- import time
22
- import uuid
23
- from datetime import datetime
24
- import math
25
- import numpy as np
26
- import pandas as pd
27
- import matplotlib.pyplot as plt
28
- import numpy as np
29
- import logging
30
- import logging.handlers
31
- import pickle
32
- from contextlib import contextmanager
33
-
34
- tokens = ["0","1","2","3","4","5","6","7","8","9","A","B","C","D","E","F","G","H","I","J","K","L","M",
35
- "N","O","P","Q","R","S","T","U","V","W","X","Y","Z","0","1","2","3","4","5","6","7","8","9"]
36
- numTokens = tokens[:10]
37
- alphaTokens = tokens[10:36]
38
- loCaseChars = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k","l","m","n","o",
39
- "p","q","r","s","t","u","v","w","x","y","z"]
40
-
41
- typeInt = "int"
42
- typeFloat = "float"
43
- typeString = "string"
44
-
45
- secInMinute = 60
46
- secInHour = 60 * 60
47
- secInDay = 24 * secInHour
48
- secInWeek = 7 * secInDay
49
- secInYear = 365 * secInDay
50
- secInMonth = secInYear / 12
51
-
52
- minInHour = 60
53
- minInDay = 24 * minInHour
54
-
55
- ftPerYard = 3
56
- ftPerMile = ftPerYard * 1760
57
-
58
-
59
- def genID(size):
60
- """
61
- generates ID
62
-
63
- Parameters
64
- size : size of ID
65
- """
66
- id = ""
67
- for i in range(size):
68
- id = id + selectRandomFromList(tokens)
69
- return id
70
-
71
- def genIdList(numId, idSize):
72
- """
73
- generate list of IDs
74
-
75
- Parameters:
76
- numId: number of Ids
77
- idSize: ID size
78
- """
79
- iDs = []
80
- for i in range(numId):
81
- iDs.append(genID(idSize))
82
- return iDs
83
-
84
- def genNumID(size):
85
- """
86
- generates ID consisting of digits only
87
-
88
- Parameters
89
- size : size of ID
90
- """
91
- id = ""
92
- for i in range(size):
93
- id = id + selectRandomFromList(numTokens)
94
- return id
95
-
96
- def genLowCaseID(size):
97
- """
98
- generates ID consisting of lower case chars
99
-
100
- Parameters
101
- size : size of ID
102
- """
103
- id = ""
104
- for i in range(size):
105
- id = id + selectRandomFromList(loCaseChars)
106
- return id
107
-
108
- def genNumIdList(numId, idSize):
109
- """
110
- generate list of numeric IDs
111
-
112
- Parameters:
113
- numId: number of Ids
114
- idSize: ID size
115
- """
116
- iDs = []
117
- for i in range(numId):
118
- iDs.append(genNumID(idSize))
119
- return iDs
120
-
121
- def genNameInitial():
122
- """
123
- generate name initial
124
- """
125
- return selectRandomFromList(alphaTokens) + selectRandomFromList(alphaTokens)
126
-
127
- def genPhoneNum(arCode):
128
- """
129
- generates phone number
130
-
131
- Parameters
132
- arCode: area code
133
- """
134
- phNum = genNumID(7)
135
- return arCode + str(phNum)
136
-
137
- def selectRandomFromList(ldata):
138
- """
139
- select an element randomly from a list
140
-
141
- Parameters
142
- ldata : list data
143
- """
144
- return ldata[randint(0, len(ldata)-1)]
145
-
146
- def selectOtherRandomFromList(ldata, cval):
147
- """
148
- select an element randomly from a list excluding the given one
149
-
150
- Parameters
151
- ldata : list data
152
- cval : value to be excluded
153
- """
154
- nval = selectRandomFromList(ldata)
155
- while nval == cval:
156
- nval = selectRandomFromList(ldata)
157
- return nval
158
-
159
- def selectRandomSubListFromList(ldata, num):
160
- """
161
- generates random sublist from a list without replacement
162
-
163
- Parameters
164
- ldata : list data
165
- num : output list size
166
- """
167
- assertLesser(num, len(ldata), "size of sublist to be sampled greater than or equal to main list")
168
- i = randint(0, len(ldata)-1)
169
- sel = ldata[i]
170
- selSet = {i}
171
- selList = [sel]
172
- while (len(selSet) < num):
173
- i = randint(0, len(ldata)-1)
174
- if (i not in selSet):
175
- sel = ldata[i]
176
- selSet.add(i)
177
- selList.append(sel)
178
- return selList
179
-
180
- def selectRandomSubListFromListWithRepl(ldata, num):
181
- """
182
- generates random sublist from a list with replacement
183
-
184
- Parameters
185
- ldata : list data
186
- num : output list size
187
-
188
- """
189
- return list(map(lambda i : selectRandomFromList(ldata), range(num)))
190
-
191
- def selectRandomFromDict(ddata):
192
- """
193
- select an element randomly from a dictionary
194
-
195
- Parameters
196
- ddata : dictionary data
197
- """
198
- dkeys = list(ddata.keys())
199
- dk = selectRandomFromList(dkeys)
200
- el = (dk, ddata[dk])
201
- return el
202
-
203
- def setListRandomFromList(ldata, ldataRepl):
204
- """
205
- sets some elements in the first list randomly with elements from the second list
206
-
207
- Parameters
208
- ldata : list data
209
- ldataRepl : list with replacement data
210
- """
211
- l = len(ldata)
212
- selSet = set()
213
- for d in ldataRepl:
214
- i = randint(0, l-1)
215
- while i in selSet:
216
- i = randint(0, l-1)
217
- ldata[i] = d
218
- selSet.add(i)
219
-
220
- def genIpAddress():
221
- """
222
- generates IP address
223
- """
224
- i1 = randint(0,255)
225
- i2 = randint(0,255)
226
- i3 = randint(0,255)
227
- i4 = randint(0,255)
228
- ip = "%d.%d.%d.%d" %(i1,i2,i3,i4)
229
- return ip
230
-
231
- def curTimeMs():
232
- """
233
- current time in ms
234
- """
235
- return int((datetime.utcnow() - datetime(1970,1,1)).total_seconds() * 1000)
236
-
237
- def secDegPolyFit(x1, y1, x2, y2, x3, y3):
238
- """
239
- second deg polynomial
240
-
241
- Parameters
242
- x1 : 1st point x
243
- y1 : 1st point y
244
- x2 : 2nd point x
245
- y2 : 2nd point y
246
- x3 : 3rd point x
247
- y3 : 3rd point y
248
- """
249
- t = (y1 - y2) / (x1 - x2)
250
- a = t - (y2 - y3) / (x2 - x3)
251
- a = a / (x1 - x3)
252
- b = t - a * (x1 + x2)
253
- c = y1 - a * x1 * x1 - b * x1
254
- return (a, b, c)
255
-
256
- def range_limit(val, minv, maxv):
257
- """
258
- range limit a value
259
-
260
- Parameters
261
- val : data value
262
- minv : minimum
263
- maxv : maximum
264
- """
265
- if (val < minv):
266
- val = minv
267
- elif (val > maxv):
268
- val = maxv
269
- return val
270
-
271
- def rangeLimit(val, minv, maxv):
272
- """
273
- range limit a value
274
-
275
- Parameters
276
- val : data value
277
- minv : minimum
278
- maxv : maximum
279
- """
280
- return range_limit(val, minv, maxv)
281
-
282
- def isInRange(val, minv, maxv):
283
- """
284
- checks if within range
285
-
286
- Parameters
287
- val : data value
288
- minv : minimum
289
- maxv : maximum
290
- """
291
- return val >= minv and val <= maxv
292
-
293
- def stripFileLines(filePath, offset):
294
- """
295
- strips number of chars from both ends
296
-
297
- Parameters
298
- filePath : file path
299
- offset : offset from both ends of line
300
- """
301
- fp = open(filePath, "r")
302
- for line in fp:
303
- stripped = line[offset:len(line) - 1 - offset]
304
- print (stripped)
305
- fp.close()
306
-
307
- def genLatLong(lat1, long1, lat2, long2):
308
- """
309
- generate lat long within limits
310
-
311
- Parameters
312
- lat1 : lat of 1st point
313
- long1 : long of 1st point
314
- lat2 : lat of 2nd point
315
- long2 : long of 2nd point
316
- """
317
- lat = lat1 + (lat2 - lat1) * random.random()
318
- longg = long1 + (long2 - long1) * random.random()
319
- return (lat, longg)
320
-
321
- def geoDistance(lat1, long1, lat2, long2):
322
- """
323
- find geo distance in ft
324
-
325
- Parameters
326
- lat1 : lat of 1st point
327
- long1 : long of 1st point
328
- lat2 : lat of 2nd point
329
- long2 : long of 2nd point
330
- """
331
- latDiff = math.radians(lat1 - lat2)
332
- longDiff = math.radians(long1 - long2)
333
- l1 = math.sin(latDiff/2.0)
334
- l2 = math.sin(longDiff/2.0)
335
- l3 = math.cos(math.radians(lat1))
336
- l4 = math.cos(math.radians(lat2))
337
- a = l1 * l1 + l3 * l4 * l2 * l2
338
- l5 = math.sqrt(a)
339
- l6 = math.sqrt(1.0 - a)
340
- c = 2.0 * math.atan2(l5, l6)
341
- r = 6371008.8 * 3.280840
342
- return c * r
343
-
344
- def minLimit(val, limit):
345
- """
346
- min limit
347
- Parameters
348
- val : value
- limit : lower limit
349
- """
350
- if (val < limit):
351
- val = limit
352
- return val
353
-
354
- def maxLimit(val, limit):
355
- """
356
- max limit
357
- Parameters
358
- val : value
- limit : upper limit
359
- """
360
- if (val > limit):
361
- val = limit
362
- return val
363
-
364
- def rangeSample(val, minLim, maxLim):
365
- """
366
- if outside range, sample within range
367
-
368
- Parameters
369
- val : value
370
- minLim : minimum
371
- maxLim : maximum
372
- """
373
- if val < minLim or val > maxLim:
374
- val = randint(minLim, maxLim)
375
- return val
376
-
377
- def genRandomIntListWithinRange(size, minLim, maxLim):
378
- """
379
- random unique list of integers within range
380
-
381
- Parameters
382
- size : size of returned list
383
- minLim : minimum
384
- maxLim : maximum
385
- """
386
- values = set()
387
- while len(values) < size:
388
- val = randint(minLim, maxLim)
389
- if val not in values:
390
- values.add(val)
391
- return list(values)
392
-
393
- def preturbScalar(value, vrange, distr="uniform"):
394
- """
395
- perturbs a value multiplicatively within a range
396
-
397
- Parameters
398
- value : data value
399
- vrange : value delta fraction
400
- distr : noise distribution type
401
- """
402
- if distr == "uniform":
403
- scale = 1.0 - vrange + 2 * vrange * random.random()
404
- elif distr == "normal":
405
- scale = 1.0 + np.random.normal(0, vrange)
406
- else:
407
- exitWithMsg("unknown noise distr " + distr)
408
- return value * scale
409
-
410
- def preturbScalarAbs(value, vrange):
411
- """
412
- perturbs a value additively within an absolute range
413
-
414
- Parameters
415
- value : data value
416
- vrange : value delta absolute
417
-
418
- """
419
- delta = - vrange + 2.0 * vrange * random.random()
420
- return value + delta
421
-
422
- def preturbVector(values, vrange):
423
- """
424
- perturbs each element of a list within a range
425
-
426
- Parameters
427
- values : list data
428
- vrange : value delta fraction
429
- """
430
- nValues = list(map(lambda va: preturbScalar(va, vrange), values))
431
- return nValues
432
-
433
- def randomShiftVector(values, smin, smax):
434
- """
435
- shifts a list by a random quantity within a range
436
-
437
- Parameters
438
- values : list data
439
- smin : sampling minimum
440
- smax : sampling maximum
441
- """
442
- shift = np.random.uniform(smin, smax)
443
- return list(map(lambda va: va + shift, values))
444
-
445
- def floatRange(beg, end, incr):
446
- """
447
- generates float range
448
-
449
- Parameters
450
- beg : range begin
451
- end : range end
452
- incr : range increment
453
- """
454
- return list(np.arange(beg, end, incr))
455
-
456
- def shuffle(values, *numShuffles):
457
- """
458
- in place shuffling with swap of pairs
459
-
460
- Parameters
461
- values : list data
462
- numShuffles : parameter list for number of shuffles
463
- """
464
- size = len(values)
465
- if len(numShuffles) == 0:
466
- numShuffle = int(size / 2)
467
- elif len(numShuffles) == 1:
468
- numShuffle = numShuffles[0]
469
- else:
470
- numShuffle = randint(numShuffles[0], numShuffles[1])
471
- print("numShuffle {}".format(numShuffle))
472
- for i in range(numShuffle):
473
- first = random.randint(0, size - 1)
474
- second = random.randint(0, size - 1)
475
- while first == second:
476
- second = random.randint(0, size - 1)
477
- tmp = values[first]
478
- values[first] = values[second]
479
- values[second] = tmp
480
-
481
-
482
- def splitList(itms, numGr):
483
- """
484
- splits a list into sublists of approximately equal size, with items in sublists randomly chosen
485
-
486
- Parameters
487
- itms : list of values
488
- numGr : no of groups
489
- """
490
- tcount = len(itms)
491
- cItems = list(itms)
492
- sz = int(len(cItems) / numGr)
493
- groups = list()
494
- count = 0
495
- for i in range(numGr):
496
- if (i == numGr - 1):
497
- csz = tcount - count
498
- else:
499
- csz = sz + randint(-2, 2)
500
- count += csz
501
- gr = list()
502
- for j in range(csz):
503
- it = selectRandomFromList(cItems)
504
- gr.append(it)
505
- cItems.remove(it)
506
- groups.append(gr)
507
- return groups
508
-
509
- def multVector(values, vrange):
510
- """
511
- multiplies a list within value range
512
-
513
- Parameters
514
- values : list of values
515
- vrange : fraction of value used for scaling
516
- """
517
- scale = 1.0 - vrange + 2 * vrange * random.random()
518
- nValues = list(map(lambda va: va * scale, values))
519
- return nValues
520
-
521
- def weightedAverage(values, weights):
522
- """
523
- calculates weighted average
524
-
525
- Parameters
526
- values : list of values
527
- weights : list of weights
528
- """
529
- assert len(values) == len(weights), "values and weights should be same size"
530
- vw = zip(values, weights)
531
- wva = list(map(lambda e : e[0] * e[1], vw))
532
- #wa = sum(x * y for x, y in vw) / sum(weights)
533
- wav = sum(wva) / sum(weights)
534
- return wav
535
-
536
- def extractFields(line, delim, keepIndices):
537
- """
538
- breaks a line into fields, keeps only the specified fields and returns a new line
539
-
540
- Parameters
541
- line : delim separated string
542
- delim : delimiter
543
- keepIndices : list of indexes to fields to be retained
544
- """
545
- items = line.split(delim)
546
- newLine = []
547
- for i in keepIndices:
548
- newLine.append(items[i])
549
- return delim.join(newLine)
550
-
551
- def remFields(line, delim, remIndices):
552
- """
553
- removes fields from delim separated string
554
-
555
- Parameters
556
- line : delim separated string
557
- delim : delimiter
558
- remIndices : list of indexes to fields to be removed
559
- """
560
- items = line.split(delim)
561
- newLine = []
562
- for i in range(len(items)):
563
- if not arrayContains(remIndices, i):
564
- newLine.append(items[i])
565
- return delim.join(newLine)
566
-
567
- def extractList(data, indices):
568
- """
569
- extracts list from another list, given indices
570
-
571
- Parameters
572
- data : list data
573
- indices : list of indexes to fields to be retained
574
- """
575
- if areAllFieldsIncluded(data, indices):
576
- exList = data.copy()
577
- #print("all indices")
578
- else:
579
- exList = list()
580
- le = len(data)
581
- for i in indices:
582
- assert i < le , "index {} out of bound {}".format(i, le)
583
- exList.append(data[i])
584
-
585
- return exList
586
-
587
- def arrayContains(arr, item):
588
- """
589
- checks if array contains an item
590
-
591
- Parameters
592
- arr : list data
593
- item : item to search
594
- """
595
- contains = True
596
- try:
597
- arr.index(item)
598
- except ValueError:
599
- contains = False
600
- return contains
601
-
602
- def strToIntArray(line, delim=","):
603
- """
604
- int array from delim separated string
605
-
606
- Parameters
607
- line : delim separated string
608
- """
609
- arr = line.split(delim)
610
- return [int(a) for a in arr]
611
-
612
- def strToFloatArray(line, delim=","):
613
- """
614
- float array from delim separated string
615
-
616
- Parameters
617
- line : delim separated string
618
- """
619
- arr = line.split(delim)
620
- return [float(a) for a in arr]
621
-
622
- def strListOrRangeToIntArray(line):
623
- """
624
- int array from delim separated string or range
625
-
626
- Parameters
627
- line : delim separated string
628
- """
629
- varr = line.split(",")
630
- if (len(varr) > 1):
631
- iarr = list(map(lambda v: int(v), varr))
632
- else:
633
- vrange = line.split(":")
634
- if (len(vrange) == 2):
635
- lo = int(vrange[0])
636
- hi = int(vrange[1])
637
- iarr = list(range(lo, hi+1))
638
- else:
639
- iarr = [int(line)]
640
- return iarr
641
-
642
- def toStr(val, precision):
643
- """
644
- converts any type to string
645
-
646
- Parameters
647
- val : value
648
- precision : precision for float value
649
- """
650
- if type(val) == float or type(val) == np.float64 or type(val) == np.float32:
651
- format = "%" + ".%df" %(precision)
652
- sVal = format %(val)
653
- else:
654
- sVal = str(val)
655
- return sVal
656
-
657
- def toStrFromList(values, precision, delim=","):
658
- """
659
- converts list of any type to delim separated string
660
-
661
- Parameters
662
- values : list data
663
- precision : precision for float value
664
- delim : delimiter
665
- """
666
- sValues = list(map(lambda v: toStr(v, precision), values))
667
- return delim.join(sValues)
668
-
669
- def toIntList(values):
670
- """
671
- convert to int list
672
-
673
- Parameters
674
- values : list data
675
- """
676
- return list(map(lambda va: int(va), values))
677
-
678
- def toFloatList(values):
679
- """
680
- convert to float list
681
-
682
- Parameters
683
- values : list data
684
-
685
- """
686
- return list(map(lambda va: float(va), values))
687
-
688
- def toStrList(values, precision=None):
689
- """
690
- convert to string list
691
-
692
- Parameters
693
- values : list data
694
- precision : precision for float value
695
- """
696
- return list(map(lambda va: toStr(va, precision), values))
697
-
698
- def toIntFromBoolean(value):
699
- """
700
- convert to int
701
-
702
- Parameters
703
- value : boolean value
704
- """
705
- ival = 1 if value else 0
706
- return ival
707
-
708
- def scaleBySum(ldata):
709
- """
710
- scales so that sum is 1
711
-
712
- Parameters
713
- ldata : list data
714
- """
715
- s = sum(ldata)
716
- return list(map(lambda e : e/s, ldata))
717
-
718
- def scaleByMax(ldata):
719
- """
720
- scales so that max value is 1
721
-
722
- Parameters
723
- ldata : list data
724
- """
725
- m = max(ldata)
726
- return list(map(lambda e : e/m, ldata))
727
-
728
- def typedValue(val, dtype=None):
729
- """
730
- return typed value given string, discovers data type if not specified
731
-
732
- Parameters
733
- val : value
734
- dtype : data type
735
- """
736
- tVal = None
737
-
738
- if dtype is not None:
739
- if dtype == "num":
740
- dtype = "int" if dtype.find(".") == -1 else "float"
741
-
742
- if dtype == "int":
743
- tVal = int(val)
744
- elif dtype == "float":
745
- tVal = float(val)
746
- elif dtype == "bool":
747
- tVal = bool(val)
748
- else:
749
- tVal = val
750
- else:
751
- if type(val) == str:
752
- lVal = val.lower()
753
-
754
- #int
755
- done = True
756
- try:
757
- tVal = int(val)
758
- except ValueError:
759
- done = False
760
-
761
- #float
762
- if not done:
763
- done = True
764
- try:
765
- tVal = float(val)
766
- except ValueError:
767
- done = False
768
-
769
- #boolean
770
- if not done:
771
- done = True
772
- if lVal == "true":
773
- tVal = True
774
- elif lVal == "false":
775
- tVal = False
776
- else:
777
- done = False
778
- #None
779
- if not done:
780
- if lVal == "none":
781
- tVal = None
782
- else:
783
- tVal = val
784
- else:
785
- tVal = val
786
-
787
- return tVal
788
-
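A usage sketch for typedValue with type discovery (values hypothetical):

    typedValue("42")          # -> 42 (int discovered)
    typedValue("4.2")         # -> 4.2 (float discovered)
    typedValue("true")        # -> True
    typedValue("4.2", "num")  # -> 4.2, "num" resolved by the decimal point in the value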
789
- def isInt(val):
790
- """
791
- returns a pair: True if the string is an int, along with the typed value
792
-
793
- Parameters
794
- val : value
795
- """
796
- valInt = True
797
- try:
798
- tVal = int(val)
799
- except ValueError:
800
- valInt = False
801
- tVal = None
802
- r = (valInt, tVal)
803
- return r
804
-
805
- def isFloat(val):
806
- """
807
- returns a pair: True if the string is a float, along with the typed value
808
-
809
- Parameters
810
- val : value
811
- """
812
- valFloat = True
813
- try:
814
- tVal = float(val)
815
- except ValueError:
816
- valFloat = False
817
- tVal = None
818
- r = (valFloat, tVal)
819
- return r
820
-
821
- def getAllFiles(dirPath):
822
- """
823
- get all files recursively
824
-
825
- Parameters
826
- dirPath : directory path
827
- """
828
- filePaths = []
829
- for (thisDir, subDirs, fileNames) in os.walk(dirPath):
830
- for fileName in fileNames:
831
- filePaths.append(os.path.join(thisDir, fileName))
832
- filePaths.sort()
833
- return filePaths
834
-
835
- def getFileContent(fpath, verbose=False):
836
- """
837
- get file contents in directory
838
-
839
- Parameters
840
- fpath : directory path
841
- verbose : verbosity flag
842
- """
843
- # document list
844
- docComplete = []
845
- filePaths = getAllFiles(fpath)
846
-
847
- # read files
848
- for filePath in filePaths:
849
- if verbose:
850
- print("next file " + filePath)
851
- with open(filePath, 'r') as contentFile:
852
- content = contentFile.read()
853
- docComplete.append(content)
854
- return (docComplete, filePaths)
855
-
856
- def getOneFileContent(fpath):
857
- """
858
- get one file contents
859
-
860
- Parameters
861
- fpath : file path
862
- """
863
- with open(fpath, 'r') as contentFile:
864
- docStr = contentFile.read()
865
- return docStr
866
-
867
- def getFileLines(dirPath, delim=","):
868
- """
869
- get lines from a file
870
-
871
- Parameters
872
- dirPath : file path
873
- delim : delimiter
874
- """
875
- lines = list()
876
- for li in fileRecGen(dirPath, delim):
877
- lines.append(li)
878
- return lines
879
-
880
- def getFileSampleLines(dirPath, percen, delim=","):
881
- """
882
- get sampled lines from a file
883
-
884
- Parameters
885
- dirPath : file path
886
- percen : sampling percentage
887
- delim : delimiter
888
- """
889
- lines = list()
890
- for li in fileRecGen(dirPath, delim):
891
- if randint(0, 100) < percen:
892
- lines.append(li)
893
- return lines
894
-
895
- def getFileColumnAsString(dirPath, index, delim=","):
896
- """
897
- get string column from a file
898
-
899
- Parameters
900
- dirPath : file path
901
- index : index
902
- delim : delimiter
903
- """
904
- fields = list()
905
- for rec in fileRecGen(dirPath, delim):
906
- fields.append(rec[index])
907
- #print(fields)
908
- return fields
909
-
910
- def getFileColumnsAsString(dirPath, indexes, delim=","):
911
- """
912
- get multiple string columns from a file
913
-
914
- Parameters
915
- dirPath : file path
916
- indexes : indexes of columns
917
- delim : delimiter
918
-
919
- """
920
- nindex = len(indexes)
921
- columns = list(map(lambda i : list(), range(nindex)))
922
- for rec in fileRecGen(dirPath, delim):
923
- for i in range(nindex):
924
- columns[i].append(rec[indexes[i]])
925
- return columns
926
-
927
- def getFileColumnAsFloat(dirPath, index, delim=","):
928
- """
929
- get float fields from a file
930
-
931
- Parameters
932
- dirPath : file path
933
- index : index
934
- delim : delimiter
935
-
936
- """
937
- #print("{} {}".format(dirPath, index))
938
- fields = getFileColumnAsString(dirPath, index, delim)
939
- return list(map(lambda v:float(v), fields))
940
-
941
- def getFileColumnAsInt(dirPath, index, delim=","):
942
- """
943
- get int fields from a file
944
-
945
- Parameters
946
- dirPath : file path
947
- index : index
948
- delim : delimiter
949
- """
950
- fields = getFileColumnAsString(dirPath, index, delim)
951
- return list(map(lambda v:int(v), fields))
952
-
953
- def getFileAsIntMatrix(dirPath, columns, delim=","):
954
- """
955
- extracts int matrix from csv file given column indices with each row being concatenation of
956
- extracted column values (row size = num of columns)
957
-
958
- Parameters
959
- dirPath : file path
960
- columns : indexes of columns
961
- delim : delimiter
962
- """
963
- mat = list()
964
- for rec in fileSelFieldsRecGen(dirPath, columns, delim):
965
- mat.append(asIntList(rec))
966
- return mat
967
-
968
- def getFileAsFloatMatrix(dirPath, columns, delim=","):
969
- """
970
- extracts float matrix from csv file given column indices with each row being concatenation of
971
- extracted column values (row size = num of columns)
972
-
973
- Parameters
974
- dirPath : file path
975
- columns : indexes of columns
976
- delim : delimiter
977
- """
978
- mat = list()
979
- for rec in fileSelFieldsRecGen(dirPath, columns, delim):
980
- mat.append(asFloatList(rec))
981
- return mat
982
-
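A usage sketch for the matrix extractors (file name hypothetical):

    mat = getFileAsFloatMatrix("data.csv", [0, 2, 3])
    # mat[i] is [col 0, col 2, col 3] of row i, as floats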
983
- def getFileAsFloatColumn(dirPath):
984
- """
985
- get float list from a file with one float per row
986
-
987
- Parameters
988
- dirPath : file path
989
- """
990
- flist = list()
991
- for rec in fileRecGen(dirPath, None):
992
- flist.append(float(rec))
993
- return flist
994
-
995
- def getFileAsFiltFloatMatrix(dirPath, filt, columns, delim=","):
996
- """
997
- extracts float matrix from csv file given row filter and column indices with each row being
998
- concatenation of extracted column values (row size = num of columns)
999
-
1000
- Parameters
1001
- dirPath : file path
1002
- columns : indexes of columns
1003
- filt : row filter lambda
1004
- delim : delimiter
1005
-
1006
- """
1007
- mat = list()
1008
- for rec in fileFiltSelFieldsRecGen(dirPath, filt, columns, delim):
1009
- mat.append(asFloatList(rec))
1010
- return mat
1011
-
1012
- def getFileAsTypedRecords(dirPath, types, delim=","):
1013
- """
1014
- extracts typed records from csv file with each row being concatenation of
1015
- extracted column values
1016
-
1017
- Parameters
1018
- dirPath : file path
1019
- types : data types
1020
- delim : delimiter
1021
- """
1022
- (dtypes, cvalues) = extractTypesFromString(types)
1023
- tdata = list()
1024
- for rec in fileRecGen(dirPath, delim):
1025
- trec = list()
1026
- for index, value in enumerate(rec):
1027
- value = __convToTyped(index, value, dtypes)
1028
- trec.append(value)
1029
- tdata.append(trec)
1030
- return tdata
1031
-
1032
-
1033
- def getFileColsAsTypedRecords(dirPath, columns, types, delim=","):
1034
- """
1035
- extracts typed records from csv file given column indices with each row being concatenation of
1036
- extracted column values
1037
-
1038
- Parameters
1040
- dirPath : file path
1041
- columns : column indexes
1042
- types : data types
1043
- delim : delimiter
1044
- """
1045
- (dtypes, cvalues) = extractTypesFromString(types)
1046
- tdata = list()
1047
- for rec in fileSelFieldsRecGen(dirPath, columns, delim):
1048
- trec = list()
1049
- for indx, value in enumerate(rec):
1050
- tindx = columns[indx]
1051
- value = __convToTyped(tindx, value, dtypes)
1052
- trec.append(value)
1053
- tdata.append(trec)
1054
- return tdata
1055
-
1056
- def getFileColumnsMinMax(dirPath, columns, dtype, delim=","):
1057
- """
1058
- extracts numeric matrix from csv file given column indices. For each column return min and max
1059
-
1060
- Parameters
1061
- dirPath : file path
1062
- columns : column indexes
1063
- dtype : data type
1064
- delim : delimiter
1065
- """
1066
- dtypes = list(map(lambda c : str(c) + ":" + dtype, columns))
1067
- dtypes = ",".join(dtypes)
1068
- #print(dtypes)
1069
-
1070
- tdata = getFileColsAsTypedRecords(dirPath, columns, dtypes, delim)
1071
- minMax = list()
1072
- ncola = len(tdata[0])
1073
- ncole = len(columns)
1074
- assertEqual(ncola, ncole, "actual no of columns different from expected")
1075
-
1076
- for ci in range(ncole):
1077
- vmin = sys.float_info.max
1078
- vmax = -sys.float_info.max
1079
- for r in tdata:
1080
- cv = r[ci]
1081
- vmin = cv if cv < vmin else vmin
1082
- vmax = cv if cv > vmax else vmax
1083
- mm = (vmin, vmax, vmax - vmin)
1084
- minMax.append(mm)
1085
-
1086
- return minMax
1087
-
1088
-
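A usage sketch for getFileColumnsMinMax (file name hypothetical):

    mm = getFileColumnsMinMax("prices.csv", [1, 2], "float")
    # one (min, max, range) tuple per requested column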
1089
- def getRecAsTypedRecord(rec, types, delim=None):
1090
- """
1091
- converts record to typed records
1092
-
1093
- Parameters
1094
- rec : delimiter separated string or list of strings
1095
- types : field data types
1096
- delim : delimiter
1097
- """
1098
- if delim is not None:
1099
- rec = rec.split(delim)
1100
- (dtypes, cvalues) = extractTypesFromString(types)
1101
- #print(types)
1102
- #print(dtypes)
1103
- trec = list()
1104
- for ind, value in enumerate(rec):
1105
- tvalue = __convToTyped(ind, value, dtypes)
1106
- trec.append(tvalue)
1107
- return trec
1108
-
1109
- def __convToTyped(index, value, dtypes):
1110
- """
1111
- convert to typed value
1112
-
1113
- Parameters
1114
- index : index in type list
1115
- value : data value
1116
- dtypes : data type list
1117
- """
1118
- #print(index, value)
1119
- dtype = dtypes[index]
1120
- tvalue = value
1121
- if dtype == "int":
1122
- tvalue = int(value)
1123
- elif dtype == "float":
1124
- tvalue = float(value)
1125
- return tvalue
1126
-
1127
-
1128
-
1129
- def extractTypesFromString(types):
1130
- """
1131
- extracts column data types and set values for categorical variables
1132
-
1133
- Parameters
1134
- types : encoded type information
1135
- """
1136
- ftypes = types.split(",")
1137
- dtypes = dict()
1138
- cvalues = dict()
1139
- for ftype in ftypes:
1140
- items = ftype.split(":")
1141
- cindex = int(items[0])
1142
- dtype = items[1]
1143
- dtypes[cindex] = dtype
1144
- if len(items) == 3:
1145
- sitems = items[2].split()
1146
- cvalues[cindex] = sitems
1147
- return (dtypes, cvalues)
1148
-
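The encoded type string pairs a column index with a data type, optionally followed by space separated categorical values; a sketch (spec hypothetical):

    dtypes, cvalues = extractTypesFromString("0:int,1:float,2:cat male female")
    # dtypes  -> {0: "int", 1: "float", 2: "cat"}
    # cvalues -> {2: ["male", "female"]}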
1149
- def getMultipleFileAsInttMatrix(dirPathWithCol, delim=","):
1150
- """
1151
- extracts int matrix from csv files given column index for each file.
1152
- num of columns = number of rows in each file and num of rows = number of files
1153
-
1154
- Parameters
1155
- dirPathWithCol : list of (file path, column index) pairs
1156
- delim : delimiter
1157
- """
1158
- mat = list()
1159
- minLen = -1
1160
- for path, col in dirPathWithCol:
1161
- colVals = getFileColumnAsInt(path, col, delim)
1162
- if minLen < 0 or len(colVals) < minLen:
1163
- minLen = len(colVals)
1164
- mat.append(colVals)
1165
-
1166
- #make all same length
1167
- mat = list(map(lambda li:li[:minLen], mat))
1168
- return mat
1169
-
1170
- def getMultipleFileAsFloatMatrix(dirPathWithCol, delim=","):
1171
- """
1172
- extracts float matrix from csv files given column index for each file.
1173
- num of columns = number of rows in each file and num of rows = number of files
1174
-
1175
- Parameters
1176
- dirPathWithCol : list of (file path, column index) pairs
1177
- delim : delimiter
1178
- """
1179
- mat = list()
1180
- minLen = -1
1181
- for path, col in dirPathWithCol:
1182
- colVals = getFileColumnAsFloat(path, col, delim)
1183
- if minLen < 0 or len(colVals) < minLen:
1184
- minLen = len(colVals)
1185
- mat.append(colVals)
1186
-
1187
- #make all same length
1188
- mat = list(map(lambda li:li[:minLen], mat))
1189
- return mat
1190
-
1191
- def writeStrListToFile(ldata, filePath, delem=","):
1192
- """
1193
- writes a list of delimiter separated strings, or a list of lists of strings, to a file
1194
-
1195
- Parameters
1196
- ldata : list data
1197
- filePath : file path
1198
- delem : delimiter
1199
- """
1200
- with open(filePath, "w") as fh:
1201
- for r in ldata:
1202
- if type(r) == list:
1203
- r = delem.join(r)
1204
- fh.write(r + "\n")
1205
-
1206
- def writeFloatListToFile(ldata, prec, filePath):
1207
- """
1208
- writes float list to file, one value per line
1209
-
1210
- Parameters
1211
- ldata : list data
1212
- prec : precision
1213
- filePath : file path
1214
- """
1215
- with open(filePath, "w") as fh:
1216
- for d in ldata:
1217
- fh.write(formatFloat(prec, d) + "\n")
1218
-
1219
- def mutateFileLines(dirPath, mutator, marg, delim=","):
1220
- """
1221
- mutates lines from a file
1222
-
1223
- Parameters
1224
- dirPath : file path
1225
- mutator : mutation callback
1226
- marg : argument for mutation callback
1227
- delim : delimiter
1228
- """
1229
- lines = list()
1230
- for li in fileRecGen(dirPath, delim):
1231
- li = mutator(li) if marg is None else mutator(li, marg)
1232
- lines.append(li)
1233
- return lines
1234
-
1235
- def takeFirst(elems):
1236
- """
1237
- returns first item
1238
-
1239
- Parameters
1240
- elems : list of data
1241
- """
1242
- return elems[0]
1243
-
1244
- def takeSecond(elems):
1245
- """
1246
- return 2nd element
1247
-
1248
- Parameters
1249
- elems : list of data
1250
- """
1251
- return elems[1]
1252
-
1253
- def takeThird(elems):
1254
- """
1255
- returns 3rd element
1256
-
1257
- Parameters
1258
- elems : list of data
1259
- """
1260
- return elems[2]
1261
-
1262
- def addToKeyedCounter(dCounter, key, count=1):
1263
- """
1264
- adds to keyed counter
1265
-
1266
- Parameters
1267
- dCounter : dictionary of counters
1268
- key : dictionary key
1269
- count : count to add
1270
- """
1271
- curCount = dCounter.get(key, 0)
1272
- dCounter[key] = curCount + count
1273
-
1274
- def incrKeyedCounter(dCounter, key):
1275
- """
1276
- increment keyed counter
1277
-
1278
- Parameters
1279
- dCounter : dictionary of counters
1280
- key : dictionary key
1281
- """
1282
- addToKeyedCounter(dCounter, key, 1)
1283
-
1284
- def appendKeyedList(dList, key, elem):
1285
- """
1286
- appends to a keyed list
1287
-
1288
- Parameters
1289
- dList : dictionary of lists
1290
- key : dictionary key
1291
- elem : value to append
1292
- """
1293
- curList = dList.get(key, [])
1294
- curList.append(elem)
1295
- dList[key] = curList
1296
-
1297
- def isNumber(st):
1298
- """
1299
- Returns True if string is a number
1300
-
1301
- Parameters
1302
- st : string value
1303
- """
1304
- return st.replace('.','',1).isdigit()
1305
-
1306
- def removeNan(values):
1307
- """
1308
- removes nan from list
1309
-
1310
- Parameters
1311
- values : list data
1312
- """
1313
- return list(filter(lambda v: not math.isnan(v), values))
1314
-
1315
- def fileRecGen(filePath, delim = ","):
1316
- """
1317
- file record generator
1318
-
1319
- Parameters
1320
- filePath : file path
1321
- delim : delimiter
1322
- """
1323
- with open(filePath, "r") as fp:
1324
- for line in fp:
1325
- line = line.rstrip("\n")
1326
- if delim is not None:
1327
- line = line.split(delim)
1328
- yield line
1329
-
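A usage sketch for fileRecGen (file names hypothetical):

    for rec in fileRecGen("data.csv"):
        print(rec)    # rec is a list of string fields
    for line in fileRecGen("notes.txt", None):
        print(line)   # with delim=None, whole lines are yielded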
1330
- def fileSelFieldsRecGen(dirPath, columns, delim=","):
1331
- """
1332
- file record generator given column indices
1333
-
1334
- Parameters
1335
- dirPath : file path
1336
- columns : column indexes as int array or comma separated string
1337
- delim : delimiter
1338
- """
1339
- if type(columns) == str:
1340
- columns = strToIntArray(columns, delim)
1341
- for rec in fileRecGen(dirPath, delim):
1342
- extracted = extractList(rec, columns)
1343
- yield extracted
1344
-
1345
- def fileSelFieldValueGen(dirPath, column, delim=","):
1346
- """
1347
- file record generator for a given column
1348
-
1349
- Parameters
1350
- dirPath : file path
1351
- column : column index
1352
- delim : delimiter
1353
- """
1354
- for rec in fileRecGen(dirPath, delim):
1355
- yield rec[column]
1356
-
1357
- def fileFiltRecGen(filePath, filt, delim = ","):
1358
- """
1359
- file record generator with row filter applied
1360
-
1361
- Parameters
1362
- filePath : file path
1363
- filt : row filter
1364
- delim : delimiter
1365
- """
1366
- with open(filePath, "r") as fp:
1367
- for line in fp:
1368
- line = line.rstrip("\n")
1369
- if delim is not None:
1370
- line = line.split(delim)
1371
- if filt(line):
1372
- yield line
1373
-
1374
- def fileFiltSelFieldsRecGen(filePath, filt, columns, delim = ","):
1375
- """
1376
- file record generator with row and column filter applied
1377
-
1378
- Parameters
1379
- filePath : file path
1380
- filt : row filter
1381
- columns : column indexes as int array or comma separated string
1382
- delim : delimiter
1383
- """
1384
- if type(columns) == str:
- columns = strToIntArray(columns, delim)
1385
- with open(filePath, "r") as fp:
1386
- for line in fp:
1387
- line = line.rstrip("\n")
1388
- if delim is not None:
1389
- line = line.split(delim)
1390
- if filt(line):
1391
- selected = extractList(line, columns)
1392
- yield selected
1393
-
1394
- def fileTypedRecGen(filePath, ftypes, delim = ","):
1395
- """
1396
- file typed record generator
1397
-
1398
- Parameters
1399
- filePath : file path
1399
- ftypes : flattened list of column index and data type pairs
1400
- delim : delimiter
1402
- """
1403
- with open(filePath, "r") as fp:
1404
- for line in fp:
1405
- line = line.rstrip("\n")
1406
- line = line.split(delim)
1407
- for i in range(0, len(ftypes), 2):
1408
- ci = ftypes[i]
1409
- dtype = ftypes[i+1]
1410
- assertLesser(ci, len(line), "index out of bound")
1411
- if dtype == "int":
1412
- line[ci] = int(line[ci])
1413
- elif dtype == "float":
1414
- line[ci] = float(line[ci])
1415
- else:
1416
- exitWithMsg("invalid data type")
1417
- yield line
1418
-
1419
- def fileMutatedFieldsRecGen(dirPath, mutator, delim=","):
1420
- """
1421
- file record generator with some columns mutated
1422
-
1423
- Parameters
1424
- dirPath : file path
1425
- mutator : row field mutator
1426
- delim : delimiter
1427
- """
1428
- for rec in fileRecGen(dirPath, delim):
1429
- mutated = mutator(rec)
1430
- yield mutated
1431
-
1432
- def tableSelFieldsFilter(tdata, columns):
1433
- """
1434
- gets tabular data for selected columns
1435
-
1436
- Parameters
1437
- tdata : tabular data
1438
- columns : column indexes
1439
- """
1440
- if areAllFieldsIncluded(tdata[0], columns):
1441
- ntdata = tdata
1442
- else:
1443
- ntdata = list()
1444
- for rec in tdata:
1445
- #print(rec)
1446
- #print(columns)
1447
- nrec = extractList(rec, columns)
1448
- ntdata.append(nrec)
1449
- return ntdata
1450
-
1451
-
1452
- def areAllFieldsIncluded(ldata, columns):
1453
- """
1454
- return True if all indexes are in the columns
1455
-
1456
- Parameters
1457
- ldata : list data
1458
- columns : column indexes
1459
- """
1460
- return list(range(len(ldata))) == columns
1461
-
1462
- def asIntList(items):
1463
- """
1464
- returns int list
1465
-
1466
- Parameters
1467
- items : list data
1468
- """
1469
- return [int(i) for i in items]
1470
-
1471
- def asFloatList(items):
1472
- """
1473
- returns float list
1474
-
1475
- Parameters
1476
- items : list data
1477
- """
1478
- return [float(i) for i in items]
1479
-
1480
- def pastTime(interval, unit):
1481
- """
1482
- current and past time
1483
-
1484
- Parameters
1485
- interval : time interval
1486
- unit: time unit
1487
- """
1488
- curTime = int(time.time())
1489
- if unit == "d":
1490
- pastTime = curTime - interval * secInDay
1491
- elif unit == "h":
1492
- pastTime = curTime - interval * secInHour
1493
- elif unit == "m":
1494
- pastTime = curTime - interval * secInMinute
1495
- else:
1496
- raise ValueError("invalid time unit " + unit)
1497
- return (curTime, pastTime)
1498
-
1499
- def minuteAlign(ts):
1500
- """
1501
- minute aligned time
1502
-
1503
- Parameters
1504
- ts : time stamp in sec
1505
- """
1506
- return int((ts / secInMinute)) * secInMinute
1507
-
1508
- def multMinuteAlign(ts, min):
1509
- """
1510
- multi minute aligned time
1511
-
1512
- Parameters
1513
- ts : time stamp in sec
1514
- min : minute value
1515
- """
1516
- intv = secInMinute * min
1517
- return int((ts / intv)) * intv
1518
-
1519
- def hourAlign(ts):
1520
- """
1521
- hour aligned time
1522
-
1523
- Parameters
1524
- ts : time stamp in sec
1525
- """
1526
- return int((ts / secInHour)) * secInHour
1527
-
1528
- def hourOfDayAlign(ts, hour):
1529
- """
1530
- hour of day aligned time
1531
-
1532
- Parameters
1533
- ts : time stamp in sec
1534
- hour : hour of day
1535
- """
1536
- day = int(ts / secInDay)
1537
- return (24 * day + hour) * secInHour
1538
-
1539
- def dayAlign(ts):
1540
- """
1541
- day aligned time
1542
-
1543
- Parameters
1544
- ts : time stamp in sec
1545
- """
1546
- return int(ts / secInDay) * secInDay
1547
-
1548
- def timeAlign(ts, unit):
1549
- """
1550
- boundary alignment of time
1551
-
1552
- Parameters
1553
- ts : time stamp in sec
1554
- unit : unit of time
1555
- """
1556
- alignedTs = 0
1557
- if unit == "s":
1558
- alignedTs = ts
1559
- elif unit == "m":
1560
- alignedTs = minuteAlign(ts)
1561
- elif unit == "h":
1562
- alignedTs = hourAlign(ts)
1563
- elif unit == "d":
1564
- alignedTs = dayAlign(ts)
1565
- else:
1566
- raise ValueError("invalid time unit")
1567
- return alignedTs
1568
-
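A sketch of the alignment helpers, assuming the secInMinute / secInHour constants defined earlier in the module (timestamp hypothetical):

    ts = 1652987654
    timeAlign(ts, "h")        # floors ts to the start of its hour
    multMinuteAlign(ts, 15)   # floors ts to a 15 minute boundary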
1569
- def monthOfYear(ts):
1570
- """
1571
- month of year
1572
-
1573
- Parameters
1574
- ts : time stamp in sec
1575
- """
1576
- rem = ts % secInYear
1577
- moy = int(rem / secInMonth)
1578
- return moy
1579
-
1580
- def dayOfWeek(ts):
1581
- """
1582
- day of week
1583
-
1584
- Parameters
1585
- ts : time stamp in sec
1586
- """
1587
- rem = ts % secInWeek
1588
- dow = int(rem / secInDay)
1589
- return dow
1590
-
1591
- def hourOfDay(ts):
1592
- """
1593
- hour of day
1594
-
1595
- Parameters
1596
- ts : time stamp in sec
1597
- """
1598
- rem = ts % secInDay
1599
- hod = int(rem / secInHour)
1600
- return hod
1601
-
1602
- def processCmdLineArgs(expectedTypes, usage):
1603
- """
1604
- process command line args and returns args as typed values
1605
-
1606
- Parameters
1607
- expectedTypes : expected data types of arguments
1608
- usage : usage message string
1609
- """
1610
- args = []
1611
- numComLineArgs = len(sys.argv)
1612
- numExpected = len(expectedTypes)
1613
- if (numComLineArgs - 1 == len(expectedTypes)):
1614
- try:
1615
- for i in range(0, numExpected):
1616
- if (expectedTypes[i] == typeInt):
1617
- args.append(int(sys.argv[i+1]))
1618
- elif (expectedTypes[i] == typeFloat):
1619
- args.append(float(sys.argv[i+1]))
1620
- elif (expectedTypes[i] == typeString):
1621
- args.append(sys.argv[i+1])
1622
- except ValueError:
1623
- print ("expected number of command line arguments found but there is type mis match")
1624
- sys.exit(1)
1625
- else:
1626
- print ("expected number of command line arguments not found")
1627
- print (usage)
1628
- sys.exit(1)
1629
- return args
1630
-
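A usage sketch, assuming the typeString / typeInt / typeFloat constants defined elsewhere in the module (script and arguments hypothetical):

    # invoked as: python myscript.py data.csv 3 0.25
    fpath, col, frac = processCmdLineArgs([typeString, typeInt, typeFloat],
        "usage: myscript.py <file> <column> <fraction>")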
1631
- def mutateString(val, numMutate, ctype):
1632
- """
1633
- mutate string multiple times
1634
-
1635
- Parameters
1636
- val : string value
1637
- numMutate : num of mutations
1638
- ctype : type of character to mutate with
1639
- """
1640
- mutations = set()
1641
- count = 0
1642
- while count < numMutate:
1643
- j = randint(0, len(val)-1)
1644
- if j not in mutations:
1645
- if ctype == "alpha":
1646
- ch = selectRandomFromList(alphaTokens)
1647
- elif ctype == "num":
1648
- ch = selectRandomFromList(numTokens)
1649
- elif ctype == "any":
1650
- ch = selectRandomFromList(tokens)
- else:
- raise ValueError("invalid character type " + ctype)
1651
- val = val[:j] + ch + val[j+1:]
1652
- mutations.add(j)
1653
- count += 1
1654
- return val
1655
-
1656
- def mutateList(values, numMutate, vmin, vmax, rabs=True):
1657
- """
1658
- mutate list multiple times
1659
-
1660
- Parameters
1661
- values : list value
1662
- numMutate : num of mutations
1663
- vmin : minimum of value range
1664
- vmax : maximum of value range
1665
- rabs : True if min max range is absolute otherwise relative
1666
- """
1667
- mutations = set()
1668
- count = 0
1669
- while count < numMutate:
1670
- j = randint(0, len(values)-1)
1671
- if j not in mutations:
1672
- s = np.random.uniform(vmin, vmax)
1673
- values[j] = s if rabs else values[j] * s
1674
- count += 1
1675
- mutations.add(j)
1676
- return values
1677
-
1678
-
1679
- def swap(values, first, second):
1680
- """
1681
- swap two elements
1682
-
1683
- Parameters
1684
- values : list value
1685
- first : first swap position
1686
- second : second swap position
1687
- """
1688
- t = values[first]
1689
- values[first] = values[second]
1690
- values[second] = t
1691
-
1692
- def swapBetweenLists(values1, values2):
1693
- """
1694
- swap two elements between 2 lists
1695
-
1696
- Parameters
1697
- values1 : first list of values
1698
- values2 : second list of values
1699
- """
1700
- p1 = randint(0, len(values1)-1)
1701
- p2 = randint(0, len(values2)-1)
1702
- tmp = values1[p1]
1703
- values1[p1] = values2[p2]
1704
- values2[p2] = tmp
1705
-
1706
- def safeAppend(values, value):
1707
- """
1708
- append only if not None
1709
-
1710
- Parameters
1711
- values : list value
1712
- value : value to append
1713
- """
1714
- if value is not None:
1715
- values.append(value)
1716
-
1717
- def getAllIndex(ldata, fldata):
1718
- """
1719
- get ALL indexes of list elements
1720
-
1721
- Parameters
1722
- ldata : list data to find index in
1723
- fldata : list data for values for index look up
1724
- """
1725
- return list(map(lambda e : fldata.index(e), ldata))
1726
-
1727
- def findIntersection(lOne, lTwo):
1728
- """
1729
- find intersection elements between 2 lists
1730
-
1731
- Parameters
1732
- lOne : first list of data
1733
- lTwo : second list of data
1734
- """
1735
- sOne = set(lOne)
1736
- sTwo = set(lTwo)
1737
- sInt = sOne.intersection(sTwo)
1738
- return list(sInt)
1739
-
1740
- def isIntvOverlapped(rOne, rTwo):
1741
- """
1742
- checks overlap between 2 intervals
1743
-
1744
- Parameters
1745
- rOne : first interval boundaries
1746
- rTwo : second interval boundaries
1747
- """
1748
- clear = rOne[1] <= rTwo[0] or rOne[0] >= rTwo[1]
1749
- return not clear
1750
-
1751
- def isIntvLess(rOne, rTwo):
1752
- """
1753
- checks if first interval is less than second
1754
-
1755
- Parameters
1756
- rOne : first interval boundaries
1757
- rTwo : second interval boundaries
1758
- """
1759
- less = rOne[1] <= rTwo[0]
1760
- return less
1761
-
1762
- def findRank(e, values):
1763
- """
1764
- find rank of value in a list
1765
-
1766
- Parameters
1767
- e : value to compare with
1768
- values : list data
1769
- """
1770
- count = 1
1771
- for ve in values:
1772
- if ve < e:
1773
- count += 1
1774
- return count
1775
-
1776
- def findRanks(toBeRanked, values):
1777
- """
1778
- find ranks of values in one list in another list
1779
-
1780
- Parameters
1781
- toBeRanked : list of values for which ranks are found
1782
- values : list in which rank is found
1783
- """
1784
- return list(map(lambda e: findRank(e, values), toBeRanked))
1785
-
1786
- def formatFloat(prec, value, label = None):
1787
- """
1788
- formats a float with optional label
1789
-
1790
- Parameters
1791
- prec : precision
1792
- value : data value
1793
- label : label for data
1794
- """
1795
- st = (label + " ") if label else ""
1796
- formatter = "{:." + str(prec) + "f}"
1797
- return st + formatter.format(value)
1798
-
1799
- def formatAny(value, label = None):
1800
- """
1801
- formats any object with optional label
1802
-
1803
- Parameters
1804
- value : data value
1805
- label : label for data
1806
- """
1807
- st = (label + " ") if label else ""
1808
- return st + str(value)
1809
-
1810
- def printList(values):
1811
- """
1812
- pretty print list
1813
-
1814
- Parameters
1815
- values : list of values
1816
- """
1817
- for v in values:
1818
- print(v)
1819
-
1820
- def printMap(values, klab, vlab, precision, offset=16):
1821
- """
1822
- pretty print hash map
1823
-
1824
- Parameters
1825
- values : dictionary of values
1826
- klab : label for key
1827
- vlab : label for value
1828
- precision : precision
1829
- offset : left justify offset
1830
- """
1831
- print(klab.ljust(offset, " ") + vlab)
1832
- for k in values.keys():
1833
- v = values[k]
1834
- ks = toStr(k, precision).ljust(offset, " ")
1835
- vs = toStr(v, precision)
1836
- print(ks + vs)
1837
-
1838
- def printPairList(values, lab1, lab2, precision, offset=16):
1839
- """
1840
- pretty print list of pairs
1841
-
1842
- Parameters
1843
- values : list of pairs
1844
- lab1 : first label
1845
- lab2 : second label
1846
- precision : precision
1847
- offset : left justify offset
1848
- """
1849
- print(lab1.ljust(offset, " ") + lab2)
1850
- for (v1, v2) in values:
1851
- sv1 = toStr(v1, precision).ljust(offset, " ")
1852
- sv2 = toStr(v2, precision)
1853
- print(sv1 + sv2)
1854
-
1855
- def createMap(*values):
1856
- """
1857
- creates dictionary from a sequence of key value pairs
1858
-
1859
- Parameters
1860
- values : sequence of key value pairs
1861
- """
1862
- result = dict()
1863
- for i in range(0, len(values), 2):
1864
- result[values[i]] = values[i+1]
1865
- return result
1866
-
1867
- def getColMinMax(table, col):
1868
- """
1869
- return min, max values of a column
1870
-
1871
- Parameters
1872
- table : tabular data
1873
- col : column index
1874
- """
1875
- vmin = None
1876
- vmax = None
1877
- for rec in table:
1878
- value = rec[col]
1879
- if vmin is None:
1880
- vmin = value
1881
- vmax = value
1882
- else:
1883
- if value < vmin:
1884
- vmin = value
1885
- elif value > vmax:
1886
- vmax = value
1887
- return (vmin, vmax, vmax - vmin)
1888
-
1889
- def createLogger(name, logFilePath, logLevName):
1890
- """
1891
- creates logger
1892
-
1893
- Parameters
1894
- name : logger name
1895
- logFilePath : log file path
1896
- logLevName : log level
1897
- """
1898
- logger = logging.getLogger(name)
1899
- fHandler = logging.handlers.RotatingFileHandler(logFilePath, maxBytes=1048576, backupCount=4)
1900
- logLev = logLevName.lower()
1901
- if logLev == "debug":
1902
- logLevel = logging.DEBUG
1903
- elif logLev == "info":
1904
- logLevel = logging.INFO
1905
- elif logLev == "warning":
1906
- logLevel = logging.WARNING
1907
- elif logLev == "error":
1908
- logLevel = logging.ERROR
1909
- elif logLev == "critical":
1910
- logLevel = logging.CRITICAL
1911
- else:
1912
- raise ValueError("invalid log level name " + logLevelName)
1913
- fHandler.setLevel(logLevel)
1914
- fFormat = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
1915
- fHandler.setFormatter(fFormat)
1916
- logger.addHandler(fHandler)
1917
- logger.setLevel(logLevel)
1918
- return logger
1919
-
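A usage sketch for createLogger (name and path hypothetical):

    logger = createLogger("mymodule", "app.log", "info")
    logger.info("processing started")   # log file rotates at 1 MB, keeping 4 backups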
1920
- @contextmanager
1921
- def suppressStdout():
1922
- """
1923
- suppress stdout
1924
-
1925
- Parameters
1926
-
1927
- """
1928
- with open(os.devnull, "w") as devnull:
1929
- oldStdout = sys.stdout
1930
- sys.stdout = devnull
1931
- try:
1932
- yield
1933
- finally:
1934
- sys.stdout = oldStdout
1935
-
1936
- def exitWithMsg(msg):
1937
- """
1938
- print message and exit
1939
-
1940
- Parameters
1941
- msg : message
1942
- """
1943
- print(msg + " -- quitting")
1944
- sys.exit(0)
1945
-
1946
- def drawLine(data, yscale=None):
1947
- """
1948
- line plot
1949
-
1950
- Parameters
1951
- data : list data
1952
- yscale : y axis scale
1953
- """
1954
- plt.plot(data)
1955
- if yscale:
1956
- step = int(yscale / 10)
1957
- step = int(step / 10) * 10
1958
- plt.yticks(range(0, yscale, step))
1959
- plt.show()
1960
-
1961
- def drawPlot(x, y, xlabel, ylabel):
1962
- """
1963
- line plot
1964
-
1965
- Parameters
1966
- x : x values
1967
- y : y values
1968
- xlabel : x axis label
1969
- ylabel : y axis label
1970
- """
1971
- if x is None:
1972
- x = list(range(len(y)))
1973
- plt.plot(x,y)
1974
- plt.xlabel(xlabel)
1975
- plt.ylabel(ylabel)
1976
- plt.show()
1977
-
1978
- def drawPairPlot(x, y1, y2, xlabel,ylabel, y1label, y2label):
1979
- """
1980
- line plot of 2 lines
1981
-
1982
- Parameters
1983
- x : x values
1984
- y1 : first y values
1985
- y2 : second y values
1986
- xlabel : x label
1987
- ylabel : y label
1988
- y1label : first plot label
1989
- y2label : second plot label
1990
- """
1991
- plt.plot(x, y1, label = y1label)
1992
- plt.plot(x, y2, label = y2label)
1993
- plt.xlabel(xlabel)
1994
- plt.ylabel(ylabel)
1995
- plt.legend()
1996
- plt.show()
1997
-
1998
- def drawHist(ldata, myTitle, myXlabel, myYlabel, nbins=10):
1999
- """
2000
- draw histogram
2001
-
2002
- Parameters
2003
- ldata : list data
2004
- myTitle : title
2005
- myXlabel : x label
2006
- myYlabel : y label
2007
- nbins : num of bins
2008
- """
2009
- plt.hist(ldata, bins=nbins, density=True)
2010
- plt.title(myTitle)
2011
- plt.xlabel(myXlabel)
2012
- plt.ylabel(myYlabel)
2013
- plt.show()
2014
-
2015
- def saveObject(obj, filePath):
2016
- """
2017
- saves an object
2018
-
2019
- Parameters
2020
- obj : object
2021
- filePath : file path for saved object
2022
- """
2023
- with open(filePath, "wb") as outfile:
2024
- pickle.dump(obj,outfile)
2025
-
2026
- def restoreObject(filePath):
2027
- """
2028
- restores an object
2029
-
2030
- Parameters
2031
- filePath : file path to restore object from
2032
- """
2033
- with open(filePath, "rb") as infile:
2034
- obj = pickle.load(infile)
2035
- return obj
2036
-
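A round trip sketch for the pickle helpers (object and path hypothetical):

    model = {"weights": [0.2, 0.8]}
    saveObject(model, "model.pkl")
    restored = restoreObject("model.pkl")   # equal to model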
2037
- def isNumeric(data):
2038
- """
2039
- true if all elements int or float
2040
-
2041
- Parameters
2042
- data : numeric data list
2043
- """
2044
- if type(data) == list or type(data) == np.ndarray:
2045
- col = pd.Series(data)
2046
- else:
2047
- col = data
2048
- return col.dtype == np.int32 or col.dtype == np.int64 or col.dtype == np.float32 or col.dtype == np.float64
2049
-
2050
- def isInteger(data):
2051
- """
2052
- true if all elements int
2053
-
2054
- Parameters
2055
- data : numeric data list
2056
- """
2057
- if type(data) == list or type(data) == np.ndarray:
2058
- col = pd.Series(data)
2059
- else:
2060
- col = data
2061
- return col.dtype == np.int32 or col.dtype == np.int64
2062
-
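- #note: this redefines the earlier scalar isFloat(val) helper; this later definition wins at import time, shadowing the tuple returning variant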
2063
- def isFloat(data):
2064
- """
2065
- true if all elements float
2066
-
2067
- Parameters
2068
- data : numeric data list
2069
- """
2070
- if type(data) == list or type(data) == np.ndarray:
2071
- col = pd.Series(data)
2072
- else:
2073
- col = data
2074
- return col.dtype == np.float32 or col.dtype == np.float64
2075
-
2076
- def isBinary(data):
2077
- """
2078
- true if all elements either 0 or 1
2079
-
2080
- Parameters
2081
- data : binary data
2082
- """
2083
- re = next((d for d in data if not (type(d) == int and (d == 0 or d == 1))), None)
2084
- return (re is None)
2085
-
2086
- def isCategorical(data):
2087
- """
2088
- true if all elements int or string
2089
-
2090
- Parameters
2091
- data : data value
2092
- """
2093
- re = next((d for d in data if not (type(d) == int or type(d) == str)), None)
2094
- return (re is None)
2095
-
2096
- def assertEqual(value, veq, msg):
2097
- """
2098
- assert equal to
2099
-
2100
- Parameters
2101
- value : value
2102
- veq : value to be equated with
2103
- msg : error msg
2104
- """
2105
- assert value == veq , msg
2106
-
2107
- def assertGreater(value, vmin, msg):
2108
- """
2109
- assert greater than
2110
-
2111
- Parameters
2112
- value : value
2113
- vmin : minimum value
2114
- msg : error msg
2115
- """
2116
- assert value > vmin , msg
2117
-
2118
- def assertGreaterEqual(value, vmin, msg):
2119
- """
2120
- assert greater than or equal
2121
-
2122
- Parameters
2123
- value : value
2124
- vmin : minimum value
2125
- msg : error msg
2126
- """
2127
- assert value >= vmin , msg
2128
-
2129
- def assertLesser(value, vmax, msg):
2130
- """
2131
- assert less than
2132
-
2133
- Parameters
2134
- value : value
2135
- vmax : maximum value
2136
- msg : error msg
2137
- """
2138
- assert value < vmax , msg
2139
-
2140
- def assertLesserEqual(value, vmax, msg):
2141
- """
2142
- assert less than or equal
2143
-
2144
- Parameters
2145
- value : value
2146
- vmax : maximum value
2147
- msg : error msg
2148
- """
2149
- assert value <= vmax , msg
2150
-
2151
- def assertWithinRange(value, vmin, vmax, msg):
2152
- """
2153
- assert within range
2154
-
2155
- Parameters
2156
- value : value
2157
- vmin : minimum value
2158
- vmax : maximum value
2159
- msg : error msg
2160
- """
2161
- assert value >= vmin and value <= vmax, msg
2162
-
2163
- def assertInList(value, values, msg):
2164
- """
2165
- assert contains in a list
2166
-
2167
- Parameters
2168
- value : value to check for inclusion
2169
- values : list data
2170
- msg : error msg
2171
- """
2172
- assert value in values, msg
2173
-
2174
- def maxListDist(l1, l2):
2175
- """
2176
- maximum list element difference between 2 lists
2177
-
2178
- Parameters
2179
- l1 : first list data
2180
- l2 : second list data
2181
- """
2182
- dist = max(list(map(lambda v : abs(v[0] - v[1]), zip(l1, l2))))
2183
- return dist
2184
-
2185
- def fileLineCount(fPath):
2186
- """
2187
- number of lines in a file
2188
-
2189
- Parameters
2190
- fPath : file path
2191
- """
2192
- with open(fPath) as f:
2193
- i = -1
- for i, li in enumerate(f):
2194
- pass
2195
- return (i + 1)
2196
-
2197
- def getAlphaNumCharCount(sdata):
2198
- """
2199
- number of alphabetic and numeric characters in a string
2200
-
2201
- Parameters
2202
- sdata : string data
2203
- """
2204
- acount = 0
2205
- ncount = 0
2206
- scount = 0
2207
- ocount = 0
2208
- assertEqual(type(sdata), str, "input must be string")
2209
- for c in sdata:
2210
- if c.isnumeric():
2211
- ncount += 1
2212
- elif c.isalpha():
2213
- acount += 1
2214
- elif c.isspace():
2215
- scount += 1
2216
- else:
2217
- ocount += 1
2218
- r = (acount, ncount, ocount)
2219
- return r
2220
-
2221
- def genPowerSet(cvalues, incEmpty=False):
2222
- """
2223
- generates power set i.e all possible subsets
2224
-
2225
- Parameters
2226
- cvalues : list of categorical values
2227
- incEmpty : include empty set if True
2228
- """
2229
- ps = list()
2230
- for cv in cvalues:
2231
- pse = list()
2232
- for s in ps:
2233
- sc = s.copy()
2234
- sc.add(cv)
2235
- #print(sc)
2236
- pse.append(sc)
2237
- ps.extend(pse)
2238
- es = set()
2239
- es.add(cv)
2240
- ps.append(es)
2241
- #print(es)
2242
-
2243
- if incEmpty:
2244
- ps.append(set())
2245
- return ps
2246
-
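A usage sketch for genPowerSet (values hypothetical):

    genPowerSet(["a", "b"])
    # -> [{"a"}, {"a", "b"}, {"b"}]; the empty set is added only when incEmpty=True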
2247
- class StepFunction:
2248
- """
2249
- step function
2250
-
2251
- Parameters
2252
-
2253
- """
2254
- def __init__(self, *values):
2255
- """
2256
- initializer
2257
-
2258
- Parameters
2259
- values : list of tuples, with each tuple containing 2 x values and the corresponding y value
2260
- """
2261
- self.points = values
2262
-
2263
- def find(self, x):
2264
- """
2265
- finds step function value
2266
-
2267
- Parameters
2268
- x : x value
2269
- """
2270
- found = False
2271
- y = 0
2272
- for p in self.points:
2273
- if (x >= p[0] and x < p[1]):
2274
- y = p[2]
2275
- found = True
2276
- break
2277
-
2278
- if not found:
2279
- l = len(self.points)
2280
- if (x < self.points[0][0]):
2281
- y = self.points[0][2]
2282
- elif (x > self.points[l-1][1]):
2283
- y = self.points[l-1][2]
2284
- return y
2285
-
2286
-
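A usage sketch for StepFunction (breakpoints hypothetical):

    sf = StepFunction((0, 10, 1.0), (10, 20, 2.0))
    sf.find(5)    # -> 1.0
    sf.find(25)   # -> 2.0, clamped to the last step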
2287
- class DummyVarGenerator:
2288
- """
2289
- dummy variable generator for categorical variable
2290
- """
2291
- def __init__(self, rowSize, catValues, trueVal, falseVal, delim=None):
2292
- """
2293
- initializer
2294
-
2295
- Parameters
2296
- rowSize : row size
2297
- catValues : dictionary with field index as key and list of categorical values as value
2298
- trueVal : true value, typically "1"
2299
- falseVal : false value, typically "0"
2300
- delim : field delimiter
2301
- """
2302
- self.rowSize = rowSize
2303
- self.catValues = catValues
2304
- numCatVar = len(catValues)
2305
- colCount = 0
2306
- for v in self.catValues.values():
2307
- colCount += len(v)
2308
- self.newRowSize = rowSize - numCatVar + colCount
2309
- #print ("new row size {}".format(self.newRowSize))
2310
- self.trueVal = trueVal
2311
- self.falseVal = falseVal
2312
- self.delim = delim
2313
-
2314
- def processRow(self, row):
2315
- """
2316
- encodes categorical variables, returning a delimiter separated string or list
2317
-
2318
- Parameters
2319
- row : row either delemeter separated string or list
2320
- """
2321
- if self.delim is not None:
2322
- rowArr = row.split(self.delim)
2323
- msg = "row does not have expected number of columns found " + str(len(rowArr)) + " expected " + str(self.rowSize)
2324
- assert len(rowArr) == self.rowSize, msg
2325
- else:
2326
- rowArr = row
2327
-
2328
- newRowArr = []
2329
- for i in range(len(rowArr)):
2330
- curVal = rowArr[i]
2331
- if (i in self.catValues):
2332
- values = self.catValues[i]
2333
- for val in values:
2334
- if val == curVal:
2335
- newVal = self.trueVal
2336
- else:
2337
- newVal = self.falseVal
2338
- newRowArr.append(newVal)
2339
- else:
2340
- newRowArr.append(curVal)
2341
- assert len(newRowArr) == self.newRowSize, "invalid new row size " + str(len(newRowArr)) + " expected " + str(self.newRowSize)
2342
- encRow = self.delim.join(newRowArr) if self.delim is not None else newRowArr
2343
- return encRow
2344
-
2345
-
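A one hot encoding sketch for DummyVarGenerator (schema hypothetical):

    # 3 column rows, with column 1 categorical over red / green / blue
    dgen = DummyVarGenerator(3, {1 : ["red", "green", "blue"]}, "1", "0", ",")
    dgen.processRow("5.2,green,7")   # -> "5.2,0,1,0,7"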
matumizi/pyproject.toml DELETED
@@ -1,6 +0,0 @@
1
- [build-system]
2
- requires = [
3
- "setuptools>=42",
4
- "wheel"
5
- ]
6
- build-backend = "setuptools.build_meta"
matumizi/requirements.txt DELETED
@@ -1,9 +0,0 @@
1
- hurst==0.0.5
2
- jprops==2.0.2
3
- matplotlib==3.3.0
4
- numpy==1.18.5
5
- pandas==1.1.0
6
- python_Levenshtein==0.12.2
7
- scikit_learn==1.0.2
8
- scipy==1.5.2
9
- statsmodels==0.11.1
matumizi/resources/spdata.txt DELETED
@@ -1,12 +0,0 @@
1
- WMT,171,22030
2
- PFE,226,9818
3
- NFLX,138,48338
4
- AMD,211,19423
5
- TSLA,57,55317
6
- AMZN,72,9604
7
- META,121,24221
8
- QCOM,83,13180
9
- CSCO,137,5854
10
- MSFT,67,16717
11
- SBUX,140,12640
12
- AAPL,78,11578
matumizi/setup.cfg DELETED
@@ -1,18 +0,0 @@
1
- [metadata]
2
- name = matumizi
3
- version = 0.0.7
4
- author = Pranab Ghosh
5
- author_email = pkghosh99@gmail.com
6
- description = Data exploration along with various utilities for Data Science
7
- long_description = file: README.md
8
- long_description_content_type = text/markdown
9
- url = https://github.com/pranab/whakapai/tree/master/matumizi
10
- classifiers =
11
- Programming Language :: Python :: 3
12
- License :: OSI Approved :: GNU General Public License v2 (GPLv2)
13
- Operating System :: OS Independent
14
-
15
- [options]
16
- packages = find:
17
- python_requires = >=3.7
18
- include_package_data = True