diff --git "a/datasets/amputation_removed_duplicates_and_balanced_report.html" "b/datasets/amputation_removed_duplicates_and_balanced_report.html"
new file mode 100644
--- /dev/null
+++ "b/datasets/amputation_removed_duplicates_and_balanced_report.html"
@@ -0,0 +1,7169 @@
+Pandas Profiling Report

Overview

Dataset statistics

Number of variables: 5
Number of observations: 210
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 8.3 KiB
Average record size in memory: 40.6 B

Variable types

Numeric: 1
Categorical: 4

Alerts

AGE is highly correlated with AMPUTATION (High correlation)
AMPUTATION is highly correlated with AGE (High correlation)
AMPUTATION is uniformly distributed (Uniform)

Reproduction

Analysis started: 2021-11-16 20:48:41.142486
Analysis finished: 2021-11-16 20:48:42.048926
Duration: 0.91 seconds
Software version: pandas-profiling v3.1.0
Download configuration: config.json
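For orientation, a report of this kind is usually produced by profiling a DataFrame and writing the result to HTML. The following is a minimal sketch, assuming pandas and pandas-profiling are installed; the helper name build_report and the CSV path in the usage note are hypothetical, since the diff only shows the generated report file.

```python
def build_report(csv_path, html_path, title="Pandas Profiling Report"):
    """Profile a CSV file and write an HTML report.

    Hypothetical helper: the source does not show how the report was generated.
    """
    # Imported lazily so this module can be loaded without the libraries present.
    import pandas as pd
    from pandas_profiling import ProfileReport

    df = pd.read_csv(csv_path)
    report = ProfileReport(df, title=title)
    report.to_file(html_path)
    return report
```

A call might look like build_report("datasets/amputation_removed_duplicates_and_balanced.csv", "datasets/amputation_removed_duplicates_and_balanced_report.html"), where the input file name is an assumption derived from the report name.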

Variables

AGE
Real number (ℝ≥0)

HIGH CORRELATION

Distinct: 71
Distinct (%): 33.8%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 55.0952381
Minimum: 4
Maximum: 89
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 1.8 KiB

Quantile statistics

Minimum: 4
5-th percentile: 16.9
Q1: 47.25
Median: 59
Q3: 68
95-th percentile: 80
Maximum: 89
Range: 85
Interquartile range (IQR): 20.75

Descriptive statistics

Standard deviation: 18.58024047
Coefficient of variation (CV): 0.3372385911
Kurtosis: 0.4244667576
Mean: 55.0952381
Median Absolute Deviation (MAD): 10
Skewness: -0.8567484108
Sum: 11570
Variance: 345.2253361
Monotonicity: Not monotonic
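Several of the AGE statistics are derived from one another, so they can be cross-checked with simple arithmetic. The sketch below copies the values from the report and tests the standard identities (mean = sum / n, variance = std², CV = std / mean, IQR = Q3 - Q1, range = max - min); the tolerance is an assumption chosen to absorb the rounding of the printed figures.

```python
import math

# Values copied from the AGE section of the report.
n = 210
total = 11570
mean = 55.0952381
std = 18.58024047
variance = 345.2253361
cv = 0.3372385911
q1, q3, iqr = 47.25, 68, 20.75
minimum, maximum, value_range = 4, 89, 85

# Each entry maps an identity to whether the reported numbers satisfy it.
checks = {
    "mean = sum / n": math.isclose(total / n, mean, rel_tol=1e-6),
    "variance = std ** 2": math.isclose(std ** 2, variance, rel_tol=1e-6),
    "cv = std / mean": math.isclose(std / mean, cv, rel_tol=1e-6),
    "iqr = q3 - q1": q3 - q1 == iqr,
    "range = max - min": maximum - minimum == value_range,
}

for name, ok in checks.items():
    print(f"{name}: {'ok' if ok else 'MISMATCH'}")
```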
Histogram with fixed size bins (bins=50)
Value              Count  Frequency (%)
61                 10     4.8%
62                 9      4.3%
69                 8      3.8%
60                 8      3.8%
52                 7      3.3%
54                 7      3.3%
56                 7      3.3%
73                 6      2.9%
50                 6      2.9%
65                 6      2.9%
Other values (61)  136    64.8%

Value  Count  Frequency (%)
4      2      1.0%
5      2      1.0%
7      1      0.5%
8      1      0.5%
9      1      0.5%
11     1      0.5%
12     1      0.5%
15     1      0.5%
16     1      0.5%
18     2      1.0%

Value  Count  Frequency (%)
89     1      0.5%
88     2      1.0%
85     1      0.5%
84     1      0.5%
83     1      0.5%
81     1      0.5%
80     5      2.4%
79     2      1.0%
78     2      1.0%
77     3      1.4%

GENDER
Categorical

Distinct: 2
Distinct (%): 1.0%
Missing: 0
Missing (%): 0.0%
Memory size: 1.8 KiB
F: 112
M: 98

Length

Max length: 1
Median length: 1
Mean length: 1
Min length: 1

Characters and Unicode

Total characters: 0
Distinct characters: 0
Distinct categories: 0
Distinct scripts: 0
Distinct blocks: 0
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: M
2nd row: M
3rd row: F
4th row: F
5th row: F

Common Values

Value  Count  Frequency (%)
F      112    53.3%
M      98     46.7%

Length

Histogram of lengths of the category

Pie chart

Value  Count  Frequency (%)
f      112    53.3%
m      98     46.7%

Most occurring characters, categories, scripts and blocks: No values found.

RACE
Categorical

Distinct: 5
Distinct (%): 2.4%
Missing: 0
Missing (%): 0.0%
Memory size: 1.8 KiB
Asian: 89
Black: 57
White: 29
Coloured: 25
Other: 10

Length

Max length: 8
Median length: 5
Mean length: 5.628571429
Min length: 5
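The reported mean length can be compared against the category counts. The sketch below computes the count-weighted mean label length twice: once with the labels as printed, and once under the assumption (not confirmed by the report) that the stored "Black" label carries one trailing space. Only the second variant reproduces the reported 5.628571429, so trailing whitespace in that category is one plausible explanation.

```python
# Category counts copied from the RACE section of the report.
counts = {"Asian": 89, "Black": 57, "White": 29, "Coloured": 25, "Other": 10}


def mean_length(labels_with_counts):
    """Count-weighted mean of the label lengths."""
    total_chars = sum(len(label) * count for label, count in labels_with_counts.items())
    total_rows = sum(labels_with_counts.values())
    return total_chars / total_rows


# Labels exactly as printed in the report.
clean_mean = mean_length(counts)

# Assumption: the stored label is "Black " with one trailing space.
padded = {("Black " if label == "Black" else label): count for label, count in counts.items()}
padded_mean = mean_length(padded)

print(round(clean_mean, 9), round(padded_mean, 9))
```

If this explanation holds, stripping whitespace from the column before profiling would bring the mean length down to about 5.357.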

Characters and Unicode

Total characters: 0
Distinct characters: 0
Distinct categories: 0
Distinct scripts: 0
Distinct blocks: 0
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: Black
2nd row: Black
3rd row: Asian
4th row: Black
5th row: White

Common Values

Value     Count  Frequency (%)
Asian     89     42.4%
Black     57     27.1%
White     29     13.8%
Coloured  25     11.9%
Other     10     4.8%
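The Frequency (%) column is each count divided by the number of observations. A minimal sketch using the RACE counts from the table, rounded to one decimal place as in the report:

```python
# Category counts copied from the RACE common-values table.
counts = {"Asian": 89, "Black": 57, "White": 29, "Coloured": 25, "Other": 10}
n_rows = sum(counts.values())  # should equal the 210 observations in the overview

# Share of each category as a percentage, rounded like the report.
percentages = {label: round(100 * count / n_rows, 1) for label, count in counts.items()}

for label, pct in percentages.items():
    print(f"{label}: {pct}%")
```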

Length

Histogram of lengths of the category

Pie chart

Value     Count  Frequency (%)
asian     89     42.4%
black     57     27.1%
white     29     13.8%
coloured  25     11.9%
other     10     4.8%

Most occurring characters, categories, scripts and blocks: No values found.

DIABETES_CLASS
Categorical

Distinct: 2
Distinct (%): 1.0%
Missing: 0
Missing (%): 0.0%
Memory size: 1.8 KiB
Type 2 diabetes: 135
Type 1 diabetes: 75

Length

Max length: 15
Median length: 15
Mean length: 15
Min length: 15

Characters and Unicode

Total characters: 0
Distinct characters: 0
Distinct categories: 0
Distinct scripts: 0
Distinct blocks: 0
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: Type 2 diabetes
2nd row: Type 2 diabetes
3rd row: Type 2 diabetes
4th row: Type 2 diabetes
5th row: Type 2 diabetes

Common Values

Value            Count  Frequency (%)
Type 2 diabetes  135    64.3%
Type 1 diabetes  75     35.7%

Length

Histogram of lengths of the category

Pie chart

Value     Count  Frequency (%)
diabetes  210    33.3%
type      210    33.3%
2         135    21.4%
1         75     11.9%
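The second table above counts individual words rather than whole category values, which is why "diabetes" and "type" each appear 210 times. The sketch below rebuilds the column from the category counts and tallies words, assuming they are lowercased and split on whitespace; that tokenisation is an assumption that happens to reproduce the reported figures, not a confirmed description of the profiler's internals.

```python
from collections import Counter

# Rebuild the column from the category counts in the report.
values = ["Type 2 diabetes"] * 135 + ["Type 1 diabetes"] * 75

# Assumption: words are lowercased and split on whitespace.
words = Counter(word for value in values for word in value.lower().split())
total_words = sum(words.values())

# Share of each word among all words, rounded like the report.
shares = {word: round(100 * count / total_words, 1) for word, count in words.items()}

for word, count in words.most_common():
    print(word, count, f"{shares[word]}%")
```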

Most occurring characters, categories, scripts and blocks: No values found.

AMPUTATION
Categorical

HIGH CORRELATION
UNIFORM

Distinct: 2
Distinct (%): 1.0%
Missing: 0
Missing (%): 0.0%
Memory size: 1.8 KiB
0: 105
1: 105

Length

Max length: 1
Median length: 1
Mean length: 1
Min length: 1

Characters and Unicode

Total characters: 0
Distinct characters: 0
Distinct categories: 0
Distinct scripts: 0
Distinct blocks: 0
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: 1
2nd row: 1
3rd row: 1
4th row: 1
5th row: 1

Common Values

Value  Count  Frequency (%)
0      105    50.0%
1      105    50.0%

Length

Histogram of lengths of the category

Pie chart

Value  Count  Frequency (%)
1      105    50.0%
0      105    50.0%

Most occurring characters, categories, scripts and blocks: No values found.

Interactions

Correlations

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables, and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
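The definition above can be sketched in plain Python: rank both variables (ties receive the average of their positions), then apply the covariance-over-standard-deviations formula to the ranks. The function names and the toy data are illustrative assumptions, not taken from the report.

```python
from statistics import mean


def average_ranks(values):
    """Rank values starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Pearson correlation of the rank variables."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: y grows monotonically but not linearly with x.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
rho = spearman(x, y)
r = pearson(x, y)
print(rho, r)
```

On this toy data ρ is 1 (up to floating-point error) while r is slightly below 1, which illustrates why ρ is the better choice for nonlinear monotonic relationships.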

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, -1 indicating total negative linear correlation, 0 indicating no linear correlation and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
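As a small illustration of the formula and of the invariance property, the sketch below computes r for toy data and again after shifting and rescaling each variable separately with positive scale factors. The data and the function name are illustrative assumptions.

```python
from statistics import mean


def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r = pearson(x, y)

# Change location and scale of each variable separately (positive scale factors).
x_shifted = [10 * a + 3 for a in x]
y_shifted = [0.5 * b - 7 for b in y]
r_shifted = pearson(x_shifted, y_shifted)

print(r, r_shifted)
```

Both values agree up to floating-point error, so the units in which the two variables are measured do not change r.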

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, -1 indicating total negative correlation, 0 indicating no correlation and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs.
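The pair-counting definition above corresponds to the variant usually called tau-a. A minimal sketch, with an illustrative function name and toy data; library implementations often use a tie-adjusted variant (tau-b) instead, so results can differ when the data contain ties.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """(concordant - discordant) / total number of pairs; tied pairs count as neither."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
tau = kendall_tau_a(x, y)
print(tau)
```

In this toy example 8 of the 10 pairs are concordant and 2 are discordant, giving τ = (8 - 2) / 10.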

Cramér's V (φc)

Cramér's V is an association measure for nominal random variables. The coefficient ranges from 0 to 1, with 0 indicating independence and 1 indicating perfect association. The empirical estimators used for Cramér's V have been proved to be biased, even for large samples. We use a bias-corrected measure proposed by Bergsma in 2013.
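A sketch of a bias-corrected Cramér's V along the lines of Bergsma's correction is shown below, written in plain Python. It is an illustration under stated assumptions: the function name and toy data are hypothetical, and the report's own implementation may differ in details such as continuity correction.

```python
from collections import Counter
from math import sqrt


def cramers_v_corrected(x, y):
    """Bias-corrected Cramér's V for two equally long sequences of labels."""
    n = len(x)
    joint = Counter(zip(x, y))
    row_totals = Counter(x)
    col_totals = Counter(y)

    # Pearson chi-squared statistic over the full contingency table.
    chi2 = 0.0
    for row, row_total in row_totals.items():
        for col, col_total in col_totals.items():
            expected = row_total * col_total / n
            observed = joint.get((row, col), 0)
            chi2 += (observed - expected) ** 2 / expected

    phi2 = chi2 / n
    r, k = len(row_totals), len(col_totals)

    # Bias correction of phi squared and of the table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(k_corr - 1, r_corr - 1)
    return sqrt(phi2_corr / denom) if denom > 0 else 0.0


# Toy data: perfectly associated labels, then independent labels.
v_perfect = cramers_v_corrected(["a", "b"] * 20, ["u", "v"] * 20)
v_independent = cramers_v_corrected(["a", "a", "b", "b"] * 10, ["u", "v", "u", "v"] * 10)
print(v_perfect, v_independent)
```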

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in case of a bivariate normal input distribution. There is extensive documentation available here.

Missing values

A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.

Sample

First rows

     AGE  GENDER  RACE      DIABETES_CLASS   AMPUTATION
0    50   M       Black     Type 2 diabetes  1
1    47   M       Black     Type 2 diabetes  1
2    76   F       Asian     Type 2 diabetes  1
3    57   F       Black     Type 2 diabetes  1
4    67   F       White     Type 2 diabetes  1
5    56   F       White     Type 2 diabetes  1
6    66   F       Asian     Type 2 diabetes  1
7    62   F       Coloured  Type 1 diabetes  1
8    65   F       Black     Type 2 diabetes  1
9    80   F       Asian     Type 1 diabetes  1

Last rows

     AGE  GENDER  RACE      DIABETES_CLASS   AMPUTATION
200  60   M       Coloured  Type 2 diabetes  0
201  69   M       White     Type 2 diabetes  0
202  73   F       Other     Type 2 diabetes  0
203  59   F       Asian     Type 2 diabetes  0
204  75   F       Asian     Type 2 diabetes  0
205  48   F       Coloured  Type 1 diabetes  0
206  50   M       Coloured  Type 2 diabetes  0
207  19   F       White     Type 1 diabetes  0
208  88   F       Black     Type 2 diabetes  0
209  65   F       Other     Type 2 diabetes  0
\ No newline at end of file