import streamlit as st


def run():
    st.title("Data Understanding")
    st.write("## Overview")
    st.write("""
Data Understanding is the second phase of the CRISP-DM process. It involves collecting initial data, describing the data, exploring the data, and verifying data quality.
""")
    st.write("## Key Concepts & Explanations")
    st.markdown("""
- **Data Collection**: Gathering data from various sources.
- **Data Description**: Summarizing the main characteristics of the data.
- **Data Exploration**: Using statistical and visualization techniques to understand the data.
- **Data Quality Verification**: Ensuring the data is accurate, complete, and reliable.
""")
    st.write("## Introduction")
    st.write("""
The Data Understanding phase is crucial for identifying potential issues with the data and gaining insights that will inform the subsequent phases of the CRISP-DM process.
""")
    st.header("Objectives")
    st.write("""
- **Collect Initial Data**: Gather data from various sources to get a comprehensive dataset.
- **Describe the Data**: Summarize the main characteristics of the data, including its structure and content.
- **Explore the Data**: Use statistical and visualization techniques to identify patterns, trends, and anomalies.
- **Verify Data Quality**: Assess the quality of the data to ensure it is suitable for analysis.
""")
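    st.markdown("""
As an illustration of the "describe the data" objective, here is a minimal sketch that summarizes each column and counts its missing values. It uses only the Python standard library so it runs anywhere; the `records` sample and the `describe_column` helper are hypothetical names invented for this example. On real projects, `pandas.DataFrame.describe()` is a common way to get similar figures.

```python
import statistics

# Hypothetical sample records; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": 29, "income": None},
    {"age": 41, "income": 61000},
    {"age": None, "income": 58000},
    {"age": 38, "income": 47000},
]


def describe_column(rows, column):
    # Summarize one column: non-missing count, missing count,
    # mean, median, and sample standard deviation.
    values = [row[column] for row in rows if row[column] is not None]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


summary = {column: describe_column(records, column) for column in ("age", "income")}
for column, stats in summary.items():
    print(column, stats)
```

Reporting the missing count next to the statistics makes it clear how many observations each figure is based on.
""")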
    st.header("Key Activities")
    st.write("""
- **Data Collection**: Gather data from internal and external sources.
- **Data Description**: Generate summary statistics and visualizations to describe the data.
- **Data Exploration**: Perform exploratory data analysis (EDA) to uncover patterns and relationships.
- **Data Quality Verification**: Check for missing values, outliers, and inconsistencies in the data.
""")
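    st.markdown("""
One simple way to look for relationships during exploration is to compute a correlation coefficient between two numeric variables. The sketch below implements the Pearson coefficient by hand for illustration; the `pearson` function and the sample lists are hypothetical, and the coefficient only measures linear association, so it complements scatter plots rather than replacing them.

```python
import math


def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length numeric sequences.
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical paired observations.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 61, 70, 74]

r = pearson(hours_studied, exam_score)
print(round(r, 3))
```

A value close to 1 or -1 suggests a strong linear relationship; a value near 0 suggests little linear relationship, though a non-linear one may still exist.
""")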
    st.write("## Detailed Steps")
    st.write("""
1. **Collect Initial Data**:
   - Identify relevant data sources.
   - Extract data from various sources and consolidate it into a single dataset.
2. **Describe the Data**:
   - Generate summary statistics (e.g., mean, median, standard deviation).
   - Create visualizations (e.g., histograms, box plots) to describe the data distribution.
3. **Explore the Data**:
   - Perform exploratory data analysis (EDA) to identify patterns, trends, and anomalies.
   - Use visualization tools (e.g., scatter plots, heatmaps) to explore relationships between variables.
4. **Verify Data Quality**:
   - Check for missing values and handle them appropriately.
   - Identify and address outliers and inconsistencies in the data.
   - Assess the overall quality of the data to ensure it is suitable for analysis.
""")
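    st.markdown("""
The quality checks in step 4 can be sketched in a few lines. The example below flags outliers with the interquartile range (IQR) rule and finds exact duplicate rows as one simple kind of inconsistency. The helper names, the sample data, and the multiplier `k=1.5` are assumptions made for illustration; the right threshold depends on the dataset.

```python
import statistics


def iqr_outliers(values, k=1.5):
    # Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is a common convention.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


def duplicate_rows(rows):
    # Return every row that exactly repeats an earlier row.
    seen, duplicates = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates.append(row)
        seen.add(key)
    return duplicates


# Hypothetical measurements with one obvious anomaly.
readings = [10, 12, 11, 13, 12, 95, 11, 12]
outliers = iqr_outliers(readings)

# Hypothetical records where the first row appears twice.
rows = [
    {"id": 1, "city": "Oslo"},
    {"id": 2, "city": "Bergen"},
    {"id": 1, "city": "Oslo"},
]
repeats = duplicate_rows(rows)

print(outliers)
print(repeats)
```

A flagged value is not necessarily an error. It is a candidate to investigate before deciding whether to keep, correct, or exclude it.
""")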
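    st.write("## Quiz: Conceptual Questions")
    q1 = st.radio(
        "What is the main purpose of the Data Understanding phase?",
        ["Collect data", "Describe data", "Explore data", "All of the above"],
        index=None,  # no preselected answer (needs a recent Streamlit release)
    )
    # Show feedback only after the user has picked an answer.
    if q1 == "All of the above":
        st.success("✅ Correct!")
    elif q1 is not None:
        st.error("❌ Incorrect. This phase covers collecting, describing, and exploring data, as well as verifying its quality.")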
    st.write("## Learning Resources")
    st.markdown("""
- 📘 [CRISP-DM Guide](https://www.sv-europe.com/crisp-dm-methodology/)
- 🎓 [Data Understanding in Data Science](https://towardsdatascience.com/data-understanding-in-data-science-1a1d5e8b1c3d)
- 🔬 [Exploratory Data Analysis (EDA)](https://www.analyticsvidhya.com/blog/2021/06/exploratory-data-analysis-eda-a-step-by-step-guide/)
""")