from .analysis_tools import univariate_analysis, bivariate_analysis, multivariate_analysis
from .data_cleaning_tools import handle_outliers, handle_missing_values
tool_library = {
# "HandleMissingValues": {
# "name": "Missing Values Handler",
# "function": handle_missing_values,
# "metadata": '''
# 1. Fills missing values with the median of each column.
# 2. Fills missing values with the mode, if available; otherwise, logs a warning.
# 3. Fills missing values with the most frequent value or an empty string if mode is unavailable.
# ''',
# },
"handle_outliers": {
"name": "Outlier Handler",
"function": handle_outliers,
"metadata": '''
1. Uses median and MAD (Median Absolute Deviation) to detect outliers.
2. Identifies extreme values based on a set threshold and either excludes them from the dataset or keeps them marked for reference.
''',
},
'univariate_analysis': {
"name": "Univariate Analysis",
"function": univariate_analysis,
"metadata": '''
1. Provides a high-level summary of dataset structure, data types, and missing value statistics.
2. Analyzes missing values, their distribution, and correlation between missing columns.
3. Performs feature-specific analysis based on detected data types.
4. Computes descriptive statistics, normality tests, and outlier detection for numerical columns.
5. Analyzes categorical distributions, entropy, and category frequencies with top values.
6. Likely extracts patterns, ranges, and trends from datetime columns (the source for this step was truncated).
''',
},
'bivariate_analysis': {
"name": "Bivariate Analysis",
"function": bivariate_analysis,
"metadata": '''
1. Uses Pearson, Spearman, and Kendall correlations for numerical variables, chi-square/Cramér’s V for categorical associations, and statistical tests like ANOVA for numerical vs. categorical analysis. Identifies best-fit relationships (linear, polynomial, etc.) for numerical pairs.
2. Provides a detailed bivariate analysis of all variable pairs in a dataframe, summarizing key correlations, associations, and insights. Optionally generates and saves visualizations like scatterplots and heatmaps.
3. Uses chi-square tests and Cramér’s V to assess categorical feature associations, calculates the Phi coefficient for 2x2 tables, and computes Goodman & Kruskal’s Lambda for predictive strength.
4. Identifies statistically significant relationships between categorical variables, ranks them by strength, and optionally visualizes contingency tables as heatmaps.
5. Performs ANOVA (One-Way & Welch’s ANOVA), Point-Biserial Correlation (for binary categories), and Levene’s test to analyze relationships between numerical and categorical features, and reports effect sizes (eta-squared, omega-squared) alongside the significance tests.
''',
},
'multivariate_analysis': {
"name": "Multivariate Analysis",
"function": multivariate_analysis,
"metadata": '''
1. Calculates the pairwise correlation coefficients between all numerical columns in a given DataFrame, generating a correlation matrix.
2. Identifies pairs of numerical features whose absolute correlation exceeds a threshold of 0.7, indicating strong linear relationships.
3. Calculates the Variance Inflation Factor (VIF) for each numerical feature to detect multicollinearity, flagging features with VIF values greater than 10 as potential issues.
4. Uses PCA, Factor Analysis, t-SNE, and MDS. Identifies principal components or latent factors, aiming for 80% variance retention in PCA.
5. Finds the optimal number of clusters using the silhouette score and evaluates cluster quality. Runs density-based clustering on smaller datasets (<= 5000 rows) and identifies noise points. Fits Gaussian mixture models and evaluates model fit.
6. Ranks individual feature relevance using statistical tests and mutual information, estimates each feature's contribution to prediction with Random Forest models, and applies iterative feature removal to select the top features (max 10).
7. Detects outliers by isolating them in random partitions, using a contamination rate of 5%. For smaller datasets (<= 5000 rows), also identifies local density deviations, using a 5% contamination rate and 20 neighbors. Reports the number and percentage of detected outliers for each method.
8. MANOVA: Tests mean differences across categorical target groups for multiple numerical features. LDA: Dimensionality reduction and classification for categorical targets.
''',
},
}
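

# Illustrative sketch, not part of the original module: one way a caller might
# consume a registry shaped like `tool_library`. The helper names
# `describe_tools` and `run_tool` are hypothetical, and both assume every entry
# carries "name", "function", and "metadata" keys, as the entries above do.
# How the real application reads this registry is not shown in this file.
def describe_tools(library):
    """Return one readable block per tool: key, display name, metadata lines."""
    blocks = []
    for key, spec in library.items():
        # Metadata is a triple-quoted string; drop blank lines and indentation.
        lines = [ln.strip() for ln in spec["metadata"].splitlines() if ln.strip()]
        header = f"{key} ({spec['name']}):\n"
        blocks.append(header + "\n".join(f"  {ln}" for ln in lines))
    return "\n\n".join(blocks)


def run_tool(library, key, *args, **kwargs):
    """Look up a tool by key and call it; arguments are passed through unchanged."""
    if key not in library:
        raise KeyError(f"Unknown tool {key!r}; available: {sorted(library)}")
    return library[key]["function"](*args, **kwargs)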