from .analysis_tools import univariate_analysis, bivariate_analysis, multivariate_analysis
from .data_cleaning_tools import handle_outliers, handle_missing_values

tool_library = {
    # "HandleMissingValues": {
    #     "name": "Missing Values Handler",
    #     "function": handle_missing_values,
    #     "metadata": '''
    #     1. Fills missing values with the median of each column.
    #     2. Fills missing values with the mode, if available; otherwise, logs a warning.
    #     3. Fills missing values with the most frequent value or an empty string if mode is unavailable.
    #     ''',
    # },
    "handle_outliers": {
        "name": "Outlier Handler",
        "function": handle_outliers,
        "metadata": '''
        1. Uses median and MAD (Median Absolute Deviation) to detect outliers.
        2. Identifies extreme values based on a set threshold and either excludes them from the dataset or keeps them marked for reference.
        ''',
    },
    "univariate_analysis": {
        "name": "Univariate Analysis",
        "function": univariate_analysis,
        "metadata": '''
        1. Provides a high-level summary of dataset structure, data types, and missing value statistics.
        2. Analyzes missing values, their distribution, and the correlation of missingness between columns.
        3. Performs feature-specific analysis based on detected data types.
        4. Computes descriptive statistics, normality tests, and outlier detection for numerical columns.
        5. Analyzes categorical distributions, entropy, and category frequencies with top values.
        6. Likely also extracts patterns, ranges, and trends from datetime columns (unconfirmed; the summarized source was truncated).
        ''',
    },
    "bivariate_analysis": {
        "name": "Bivariate Analysis",
        "function": bivariate_analysis,
        "metadata": '''
        1. Uses Pearson, Spearman, and Kendall correlations for numerical variables, chi-square tests and Cramér's V for categorical associations, and statistical tests such as ANOVA for numerical vs. categorical analysis. Identifies best-fit relationships (linear, polynomial, etc.) for numerical pairs.
        2. Provides a detailed bivariate analysis of all variable pairs in a dataframe, summarizing key correlations, associations, and insights. Optionally generates and saves visualizations such as scatterplots and heatmaps.
        3. Uses chi-square tests and Cramér's V to assess categorical feature associations, calculates the Phi coefficient for 2x2 tables, and computes Goodman & Kruskal's Lambda for predictive strength.
        4. Identifies statistically significant relationships between categorical variables, ranks them by strength, and optionally visualizes contingency tables as heatmaps.
        5. Performs ANOVA (one-way and Welch's), point-biserial correlation (for binary categories), and Levene's test to analyze relationships between numerical and categorical features, and reports effect sizes (eta-squared, omega-squared) alongside the significance tests.
        ''',
    },
    "multivariate_analysis": {
        "name": "Multivariate Analysis",
        "function": multivariate_analysis,
        "metadata": '''
        1. Calculates the pairwise correlation coefficients between all numerical columns in a given DataFrame, generating a correlation matrix.
        2. Identifies pairs of numerical features whose absolute correlation exceeds a threshold of 0.7, indicating strong linear relationships.
        3. Calculates the Variance Inflation Factor (VIF) for each numerical feature to detect multicollinearity, flagging features with VIF values greater than 10 as potential issues.
        4. Applies PCA, Factor Analysis, t-SNE, and MDS for dimensionality reduction, identifying principal components or latent factors and aiming for 80% variance retention in PCA.
        5. Finds the optimal number of clusters using the silhouette score and evaluates cluster quality. Applies density-based clustering to smaller datasets (<= 5000 rows) and identifies noise points. Fits Gaussian mixture models and evaluates model fit.
        6. Uses statistical tests and mutual information to rank individual feature relevance, Random Forest models to determine feature contribution to prediction, and iterative feature removal to select the top features (max 10).
        7. Detects outliers by isolating them in random partitions, using a contamination rate of 5%. Identifies local density deviations for smaller datasets (<= 5000 rows), also using a 5% contamination rate and 20 neighbors. Reports the number and percentage of detected outliers for each method.
        8. MANOVA: Tests mean differences across categorical target groups for multiple numerical features. LDA: Dimensionality reduction and classification for categorical targets.
        '''
    }
}
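

# --- Usage sketch (illustrative; not part of the original module) ---
# A minimal, hedged example of how a registry shaped like `tool_library`
# could be queried. `list_tools` and `run_tool` are hypothetical helper
# names introduced here; they assume only that each entry carries the
# "name", "function", and "metadata" keys used in the dictionary above.
def list_tools(library):
    """Return (key, display name, cleaned metadata) for each registered tool."""
    summaries = []
    for key, entry in library.items():
        # Drop blank lines and per-line indentation from the triple-quoted text.
        lines = [ln.strip() for ln in entry["metadata"].splitlines() if ln.strip()]
        summaries.append((key, entry["name"], "\n".join(lines)))
    return summaries


def run_tool(library, key, *args, **kwargs):
    """Look up a tool by key and call its function with the given arguments."""
    if key not in library:
        raise KeyError(f"Unknown tool {key!r}; available: {sorted(library)}")
    return library[key]["function"](*args, **kwargs)


# Example call. This assumes `df` is a pandas DataFrame and that the tool
# accepts it as its first positional argument, which this file does not confirm:
#     result = run_tool(tool_library, "univariate_analysis", df)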