SmolFactory / docs /Data_Pipeline.md
Tonic's picture
adds readme, removes quantization, adds readtoken logic, updates trackio , spaces
3c37508

A newer version of the Gradio SDK is available: 5.44.1

Upgrade
graph LR
    EntryPoint["EntryPoint"]
    Configuration["Configuration"]
    Model_Abstraction["Model Abstraction"]
    Data_Pipeline["Data Pipeline"]
    Training_Logic["Training Logic"]
    Utilities["Utilities"]
    EntryPoint -- "instructs" --> Data_Pipeline
    EntryPoint -- "loads settings from" --> Configuration
    EntryPoint -- "initializes models via" --> Model_Abstraction
    EntryPoint -- "invokes" --> Training_Logic
    Configuration -- "provides settings to" --> EntryPoint
    Configuration -- "informs" --> Model_Abstraction
    Configuration -- "guides" --> Data_Pipeline
    Model_Abstraction -- "provides models to" --> EntryPoint
    Model_Abstraction -- "receives settings from" --> Configuration
    Model_Abstraction -- "interacts with" --> Training_Logic
    Data_Pipeline -- "provides processed data to" --> EntryPoint
    Data_Pipeline -- "receives parameters from" --> Configuration
    Data_Pipeline -- "supplies batches to" --> Training_Logic
    Training_Logic -- "receives control from" --> EntryPoint
    Training_Logic -- "consumes data from" --> Data_Pipeline
    Training_Logic -- "operates on models from" --> Model_Abstraction
    Training_Logic -- "uses" --> Utilities
    Utilities -- "used by" --> EntryPoint
    Utilities -- "provides functionalities to" --> Training_Logic
    Utilities -- "assists" --> Data_Pipeline
    click Model_Abstraction href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Model_Abstraction.md" "Details"
    click Data_Pipeline href "https://github.com/Josephrp/SmolFactory/blob/main/docs/Data_Pipeline.md" "Details"

CodeBoardingDemoContact

Details

Final component overview for the smollm3_finetune project, based on the provided analysis and adhering to Machine Learning Training and Fine-tuning Framework patterns.

EntryPoint

The main entry point of the application, responsible for orchestrating the entire training and fine-tuning workflow. It initializes other core components, loads configurations, and manages the overall execution flow.

Related Classes/Methods:

  • smollm3_finetune.train (1:1)

Configuration

Centralizes and defines all parameters and settings required for the training and fine-tuning process, including model hyperparameters, dataset paths, and training arguments. It promotes a configuration-driven architecture, allowing easy modification and versioning of experimental setups.

Related Classes/Methods:

Model Abstraction [Expand]

Encapsulates the logic for loading, initializing, and managing different machine learning models and their variants (e.g., different architectures, quantization settings). It provides a consistent interface for interacting with various model architectures.

Related Classes/Methods:

Data Pipeline [Expand]

Handles the entire data lifecycle, including dataset loading, preprocessing (e.g., tokenization, formatting), and creating efficient data loaders for both training and evaluation phases. It ensures data is prepared correctly and efficiently for the model.

Related Classes/Methods:

  • smollm3_finetune.data.load_and_preprocess_data (1:1)

Training Logic

Contains the core algorithms and routines for training and fine-tuning machine learning models. This includes the training loop, optimization steps, loss calculation, gradient accumulation, and potentially specialized fine-tuning methods (e.g., LoRA, QLoRA).

Related Classes/Methods:

Utilities

A collection of common helper functions, reusable modules, and general-purpose tools that support various parts of the training framework but do not belong to a specific core component. This includes functions for logging, metrics calculation, device management, etc.

Related Classes/Methods:

  • utils (1:1)

FAQ