---
title: "Data Processing Interface for GitHub Repositories"
emoji: "📊"
colorFrom: "indigo"
colorTo: "pink"
sdk: "streamlit"
sdk_version: "1.8.0"
app_file: "app.py"
pinned: false
---

# Data Processing Interface

This project is a Streamlit-based interface for mining, processing, and embedding data from public GitHub repositories. It lets you interactively select and configure data sources, model parameters, and processing options, which makes data extraction and transformation tasks easier to manage.

## Installation

Before running the app, install the necessary dependencies. This project requires Python 3.6 or later.

1. Clone the repository to your local machine:

   ```
   git clone https://github.com/yourusername/yourprojectname.git
   cd yourprojectname
   ```

2. Install the required Python packages:

   ```
   pip install streamlit pandas tqdm
   ```

Also install any other dependencies specific to your project.

## Running the App

To run the app, navigate to the project directory in your terminal and execute the following command:

```
streamlit run streamlit_app.py
```

Note that the metadata at the top of this file sets `app_file` to `app.py`. If your checkout has no `streamlit_app.py`, run `streamlit run app.py` instead.

## App Structure

The app is organized into multiple pages, each dedicated to a specific part of the data processing workflow:

- **Main Page:** Provides an overview and status of the data processing steps.
- **Data Source Configuration:** Lets you select a GitHub repository and specify an output directory for generated data.
- **Data Loading:** Lets you select directories within the repository and filter files by type for processing.
- **Model Selection and Configuration:** Offers options to select and configure the embedding model and the question-answering model.
- **Processing and Embedding:** Displays the process status, allows parameter tuning, and provides options to save preprocessed pages, processed pages, and vector store data.

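
The directory selection and file type filtering described for the **Data Loading** page can be illustrated with a short standard-library sketch. This is only a minimal example under stated assumptions: `select_files` is a hypothetical helper name, not taken from this project's code, and it assumes the repository has already been cloned to a local directory.

```python
from pathlib import Path
from typing import Iterable, List


def select_files(repo_root: Path, subdir: str, extensions: Iterable[str]) -> List[Path]:
    """Return repo-relative paths of files under repo_root/subdir with a wanted extension.

    Hypothetical helper for illustration; not part of the project's actual code.
    """
    # Normalize extensions so "md", ".md" and ".MD" all match the same files.
    wanted = {"." + ext.lower().lstrip(".") for ext in extensions}

    base = repo_root / subdir
    if not base.is_dir():
        raise NotADirectoryError(f"{base} is not a directory")

    # Walk the selected directory recursively and keep only matching files.
    return sorted(
        path.relative_to(repo_root)
        for path in base.rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )
```

In a Streamlit page, the `subdir` and `extensions` arguments could plausibly come from widgets such as `st.text_input` and `st.multiselect`, though how this app wires them up is not specified here.
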
## Navigating the Interface

After launching the app, use the sidebar to navigate between the pages. Each page includes interactive elements, such as input fields, dropdown menus, and checkboxes, that let you customize each step of the data processing pipeline. Follow the instructions on each page to configure and run the data processing tasks correctly.

## Contributing

We welcome contributions to this project! If you have suggestions for improvements or encounter any issues, please open an issue or submit a pull request.