Step 6000
- .gitignore +215 -0
- README.md +77 -0
- baseline/adversarial_config.yaml +32 -0
- baseline/baseline_config.yaml +32 -0
- baseline/custom_dataset.py +110 -0
- baseline/custom_params.py +114 -0
- baseline/full_finetune.py +455 -0
- colorful/adversarial_config.yaml +39 -0
- colorful/basic_config.yaml +39 -0
- colorful/custom_dataset.py +179 -0
- colorful/custom_model.py +267 -0
- colorful/custom_params.py +110 -0
- colorful/full_finetune.py +511 -0
- colorful/masked_apply.py +73 -0
.gitignore
ADDED
@@ -0,0 +1,215 @@
# Created by https://www.toptal.com/developers/gitignore/api/python,macos
# Edit at https://www.toptal.com/developers/gitignore?templates=python,macos

### TorchTune ###

output/
model/
wandb/

### macOS ###
# General
.DS_Store
.AppleDouble
.LSOverride

# Icon must end with two \r
Icon


# Thumbnails
._*

# Files that might appear in the root of a volume
.DocumentRevisions-V100
.fseventsd
.Spotlight-V100
.TemporaryItems
.Trashes
.VolumeIcon.icns
.com.apple.timemachine.donotpresent

# Directories potentially created on remote AFP share
.AppleDB
.AppleDesktop
Network Trash Folder
Temporary Items
.apdisk

### macOS Patch ###
# iCloud generated files
*.icloud

### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

### Python Patch ###
# Poetry local configuration file - https://python-poetry.org/docs/configuration/#local-configuration
poetry.toml

# ruff
.ruff_cache/

# LSP config files
pyrightconfig.json

# End of https://www.toptal.com/developers/gitignore/api/python,macos
README.md
ADDED
@@ -0,0 +1,77 @@
# torchtune research repo: token coloring (colorful llama)

A playground to try out [token coloring](https://docs.google.com/document/d/1Win9vhddD-pu5P3SsG7E-dzN5oQl5DYWW1DhO7sBOgI/edit#heading=h.oqq00pt8expe) with TorchTune.

The repo was generated using the alpha version of [torchtune](https://github.com/pytorch-labs/torchtune).

Brief notes:

- The starting recipe is based on the Alpaca Llama2 7B full finetune recipe (switched to bf16).
- I assume `output/` is used to store model outputs and `model/` is used to store the base model checkpoints.

For the `colorful` recipe:

- I copied a lot of functionality (like the actual model definition, dataset, etc.) from the torchtune repository directly, since I needed to make changes.
- I reduced the flexibility of the recipe (e.g. you cannot specify the model or tokenizer) and increased it in other ways (e.g. you can pass in a dataset path directly).
- I added intermediate checkpointing (i.e. every `n` steps) that automatically uploads the checkpoint to the HuggingFace Hub.

## Getting started

The instructions below can be copy-pasted as-is onto a running instance. They assume that the `HF_TOKEN` environment variable is set with a valid token.

```bash
# for RunPod
cd /workspace
git clone git@github.com:pytorch-labs/torchtune.git
cd torchtune
pip install -e .

cd /workspace
git clone git@github.com:laurencer/torchtune-colorful-llama.git
cd torchtune-colorful-llama

# for wandb support
pip install wandb
```

```bash
mkdir -p model/
tune download --repo-id meta-llama/Llama-2-7b --output-dir model/
```

```bash
tune convert_checkpoint --checkpoint-path model/consolidated.00.pth --output-path model/llama2_native.tune
```

```bash
mkdir -p output/
# tune --nnodes 1 --nproc_per_node 1 ./colorful/full_finetune.py --config ./colorful/basic_config.yaml
nohup tune --nnodes 1 --nproc_per_node 1 ./colorful/full_finetune.py --config ./colorful/basic_config.yaml 2>&1 > training_log_$(date "+%Y.%m.%d_%H.%M.%S").log &
sleep 1
tail -f training_log_*.log
```

## Baselines

Two baseline configs are provided in the `baseline` directory.
We forked the original recipe to support customizing the location/path of the Alpaca dataset.

```bash
# tune --nnodes 1 --nproc_per_node 1 ./baseline/full_finetune.py --config ./baseline/baseline_config.yaml
nohup tune --nnodes 1 --nproc_per_node 1 ./baseline/full_finetune.py --config ./baseline/baseline_config.yaml 2>&1 > training_log_$(date "+%Y.%m.%d_%H.%M.%S").log &
sleep 1
tail -f training_log_*.log
```

The adversarial config uses a dataset that is equivalent to 4x the original alpaca-cleaned dataset, with extra examples that include prompt injection attempts. See the [token coloring description](https://docs.google.com/document/d/1Win9vhddD-pu5P3SsG7E-dzN5oQl5DYWW1DhO7sBOgI/edit#heading=h.oqq00pt8expe) for more info.

```bash
# tune --nnodes 1 --nproc_per_node 1 ./baseline/full_finetune.py --config ./baseline/adversarial_config.yaml
nohup tune --nnodes 1 --nproc_per_node 1 ./baseline/full_finetune.py --config ./baseline/adversarial_config.yaml 2>&1 > training_log_$(date "+%Y.%m.%d_%H.%M.%S").log &
sleep 1
tail -f training_log_*.log
```

## Colorful

The `colorful` directory implements the changes required to support token coloring. This includes a custom dataset implementation and training script.
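The README note about intermediate checkpointing with automatic upload to the HuggingFace Hub refers to logic in `colorful/full_finetune.py`, which is part of this commit but not shown in this excerpt. As a rough, hedged sketch of the general pattern only (the repo id, file names, and helper are placeholders, not the recipe's actual code), assuming `huggingface_hub` is installed and `HF_TOKEN` is set:

```python
import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ["HF_TOKEN"])

def upload_intermediate_checkpoint(local_path: str, step: int, repo_id: str) -> None:
    # Push a mid-training checkpoint so progress survives preemption.
    api.upload_file(
        path_or_fileobj=local_path,
        path_in_repo=f"checkpoints/model_step_{step}.ckpt",
        repo_id=repo_id,
    )

# e.g. every `n` steps inside the training loop (hypothetical names):
# if total_training_steps % checkpoint_every_n_steps == 0:
#     upload_intermediate_checkpoint("output/model_0.ckpt", total_training_steps, "<your-hf-repo>")
```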
baseline/adversarial_config.yaml
ADDED
@@ -0,0 +1,32 @@
# Runs the full_finetune.py recipe
#
# To launch, run the following command from root:
# tune --nnodes 1 --nproc_per_node 1 --config alpaca_llama2_full_finetune --override model_checkpoint=<your_checkpoint_dir> ...

# Dataset and Dataloader
dataset: laurencer/yahma-alpaca-cleaned-adversarial
seed: 42
shuffle: True

# Model Arguments
model: llama2_7b
model_checkpoint: model/llama2_native.tune
tokenizer: llama2_tokenizer
tokenizer_checkpoint: model/tokenizer.model

# Fine-tuning arguments
batch_size: 8
lr: 2e-5
epochs: 1
optimizer: SGD
loss: CrossEntropyLoss
output_dir: output/alpaca-llama2-adversarial
device: cuda
dtype: bf16
enable_fsdp: False
enable_activation_checkpointing: True
resume_from_checkpoint: False

# Logging arguments
metric_logger_type: wandb
project: torchtune
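The `dataset` value above is passed straight through to Hugging Face `datasets` by the custom `AlpacaDataset` shown later in this diff, which expects `instruction`/`input`/`output` columns. A quick, assumed sanity check (requires network access to the Hub):

```python
from datasets import load_dataset

# Same call the custom AlpacaDataset makes internally.
data = load_dataset("laurencer/yahma-alpaca-cleaned-adversarial", split="train")
print(len(data), data.column_names)  # expecting instruction / input / output columns
```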
baseline/baseline_config.yaml
ADDED
@@ -0,0 +1,32 @@
# Runs the full_finetune.py recipe
#
# To launch, run the following command from root:
# tune --nnodes 1 --nproc_per_node 1 --config alpaca_llama2_full_finetune --override model_checkpoint=<your_checkpoint_dir> ...

# Dataset and Dataloader
dataset: yahma/alpaca-cleaned
seed: 42
shuffle: True

# Model Arguments
model: llama2_7b
model_checkpoint: model/llama2_native.tune
tokenizer: llama2_tokenizer
tokenizer_checkpoint: model/tokenizer.model

# Fine-tuning arguments
batch_size: 8
lr: 2e-5
epochs: 4
optimizer: SGD
loss: CrossEntropyLoss
output_dir: output/alpaca-llama2-baseline
device: cuda
dtype: bf16
enable_fsdp: False
enable_activation_checkpointing: True
resume_from_checkpoint: False

# Logging arguments
metric_logger_type: wandb
project: torchtune
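Every key in these baseline configs corresponds to a field on `FullFinetuneParams` in `baseline/custom_params.py`, further down in this diff. A minimal sketch of that mapping, assuming PyYAML is available; the recipe itself reads the config through torchtune's `TuneArgumentParser` rather than this direct path:

```python
import yaml
from custom_params import FullFinetuneParams  # baseline/custom_params.py

with open("baseline/baseline_config.yaml") as f:
    cfg = yaml.safe_load(f)

# Each YAML key maps onto a dataclass field; __post_init__ then validates the values.
params = FullFinetuneParams(**cfg)
print(params.dataset, params.batch_size, params.dtype)  # yahma/alpaca-cleaned 8 bf16
```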
baseline/custom_dataset.py
ADDED
@@ -0,0 +1,110 @@
from typing import List, Tuple

from datasets import load_dataset
from torch.utils.data import Dataset

# Not ideal to import this type here but it's needed for the transform function
from torchtune.modules import Tokenizer


CROSS_ENTROPY_IGNORE_IDX = -100

_PROMPT_TEMPLATE = {
    "prompt_input": (
        "Below is an instruction that describes a task, paired with an input that provides further context. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
    ),
    "prompt_no_input": (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        "### Instruction:\n{instruction}\n\n### Response:\n"
    ),
}


class AlpacaDataset(Dataset):
    """
    See torchtune.datasets.AlpacaDataset for the original implementation.
    This version supports custom dataset paths.
    """

    def __init__(
        self,
        dataset_path: str,
        tokenizer: Tokenizer,
        train_on_input: bool = True,
        **kwargs
    ) -> None:
        self._data = load_dataset(dataset_path, split="train")
        self._tokenizer = tokenizer
        self.train_on_input = train_on_input

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index: int) -> Tuple[List[int], List[int]]:
        sample = self._data[index]

        return self._transform(
            instruction=sample["instruction"],
            input=sample["input"],
            output=sample["output"],
        )

    def _transform(
        self, instruction: str, input: str, output: str
    ) -> Tuple[List[int], List[int]]:
        """
        Split a sample on ``response`` tag to create input and labels.

        Args:
            instruction (str): Instruction text.
            input (str): Input text. Can be an empty string. Determines the prompt generation template
                used.
            output (str): Response text.

        Returns:
            Tuple of encoded inputs and labels.
        """
        prompt = self._generate_prompt(instruction, input)
        prompt_with_response = prompt + output

        # add bos always; LlamaTokenizer sets this to True by default and neither
        # alpaca-lora or the original authors change this
        encoded_prompt = self._tokenizer.encode(
            text=prompt, add_bos=True, add_eos=False
        )
        encoded_prompt_with_response = self._tokenizer.encode(
            text=prompt_with_response, add_bos=True, add_eos=True
        )
        labels = encoded_prompt_with_response.copy()

        if not self.train_on_input:
            labels[: len(encoded_prompt)] = [CROSS_ENTROPY_IGNORE_IDX] * len(
                encoded_prompt
            )

        assert len(encoded_prompt_with_response) == len(labels)

        return encoded_prompt_with_response, labels

    def _generate_prompt(self, instruction: str, input: str) -> str:
        """
        Generate prompt from instruction and input.

        Args:
            instruction (str): Instruction text.
            input (str): Input text.

        Returns:
            Prompt text.
        """
        if input:
            prompt = _PROMPT_TEMPLATE["prompt_input"].format(
                instruction=instruction, input=input
            )
        else:
            prompt = _PROMPT_TEMPLATE["prompt_no_input"].format(instruction=instruction)
        return prompt
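A hypothetical usage sketch of the dataset above. The tokenizer is loaded the same way the recipe does it (`models.get_tokenizer`), and the path assumes the `model/` layout from the README:

```python
from torchtune import models
from custom_dataset import AlpacaDataset

tokenizer = models.get_tokenizer("llama2_tokenizer", path="model/tokenizer.model")

ds = AlpacaDataset("yahma/alpaca-cleaned", tokenizer=tokenizer, train_on_input=False)
tokens, labels = ds[0]

# With train_on_input=False the prompt positions are masked out of the loss.
assert len(tokens) == len(labels)
print(sum(label == -100 for label in labels), "prompt tokens ignored by the loss")
```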
baseline/custom_params.py
ADDED
@@ -0,0 +1,114 @@
# Customized to remove dataset validation.

from dataclasses import dataclass, field, fields
from typing import List, Optional

from torchtune.datasets import ALL_DATASETS
from torchtune.models import ALL_MODELS, ALL_TOKENIZERS
from torchtune.utils.metric_logging import ALL_METRIC_LOGGERS
from torchtune.utils.precision import PRECISION_STR_TO_DTYPE


@dataclass
class FullFinetuneParams:
    """Arguments for the finetune_llm recipe.

    Args:
        device (str): Device to use for training. Options are "cpu" and "cuda"
        dtype (str): Data type to use for training.
        seed (int): Random seed to use for training.
        model (str): String specifying model architecture to fine-tune. See ``torchtune.models.get_model`` for options.
        model_checkpoint (str): Local path to load model checkpoint from.
        tokenizer (str): String specifying tokenizer to use. See ``torchtune.models.get_tokenizer`` for options.
        tokenizer_checkpoint (str): Local path to load tokenizer checkpoint from.
        dataset (str): String specifying dataset to use. See ``torchtune.datasets.get_dataset`` for options.
            Currently, only predefined datasets in library are supported.
        shuffle (bool): Whether to shuffle dataset.
        batch_size (int): Batch size to use for training.
        epochs (int): Number of epochs to train for.
        optimizer (str): String specifying optimizer to use. See ``torchtune.optim.get_optimizer`` for options.
        loss (str): String specifying loss function to use. See ``torchtune.losses.get_loss`` for options.
        lr (float): Learning rate to use for optimizer.
        activation_checkpointing (bool): Whether to use activation checkpointing.
        output_dir (str): Local path to save checkpoints and logs to.
        run_generation (int): Run eval on a prompt every ``run_generation`` steps. Set to 0 to disable.
        max_steps_per_epoch (int): Maximum number of steps to take per epoch.
        metric_logger_type (str): String specifying metric logger to use. See ``torchtune.utils.get_metric_logger``
            for options.
        project (str): Project name to use for logging. Used by ``WandBLogger``.
        resume_from_previous_checkpoint (bool): Whether to resume fine-tuning from a previous checkpoint.
        cpu_offload (bool): Whether to offload model to CPU.

    Raises:
        ValueError: If ``cpu_offload`` is ``True`` but ``device`` is not ``cuda`` and <= 1 GPUs.
    """

    # Model
    model: str = ""
    model_checkpoint: str = ""

    # Tokenizer
    tokenizer: str = ""
    tokenizer_checkpoint: str = ""

    # Dataset and Sampler
    dataset: str = ""
    train_on_input: bool = True
    shuffle: bool = True
    batch_size: int = 2

    # Optimizer and Scheduler
    optimizer: str = "SGD"
    lr: float = 2e-5
    loss: str = "CrossEntropyLoss"
    gradient_accumulation_steps: int = 1

    # Training
    epochs: int = 3
    max_steps_per_epoch: Optional[int] = None
    resume_from_checkpoint: bool = False
    run_generation: Optional[int] = None

    # Distributed
    cpu_offload: bool = False
    enable_fsdp: bool = True
    enable_activation_checkpointing: bool = True

    # Environment
    device: str = "cuda"
    dtype: str = "fp32"
    seed: Optional[int] = None

    # Logging
    output_dir: str = "/tmp/full_finetune_output"
    metric_logger_type: str = "disk"
    project: Optional[str] = None
    log_every_n_steps: Optional[int] = None

    def __post_init__(self):
        for param in fields(self):
            if getattr(self, param.name) == "":
                raise TypeError(f"{param.name} needs to be specified")

        if self.cpu_offload and self.device != "cuda":
            raise ValueError(
                "Cannot offload model to CPU if device is not cuda or <= 1 GPUs."
            )
        if self.enable_fsdp and self.device == "cpu":
            raise ValueError("FSDP is not supported on CPU.")
        if self.model not in ALL_MODELS:
            raise ValueError(
                f"Model not recognized. Expected one of {ALL_MODELS}, received {self.model}."
            )
        if self.tokenizer not in ALL_TOKENIZERS:
            raise ValueError(
                f"Tokenizer not recognized. Expected one of {ALL_TOKENIZERS}, received {self.tokenizer}."
            )
        if self.metric_logger_type not in ALL_METRIC_LOGGERS:
            raise ValueError(
                f"Metric logger not recognized. Expected one of {ALL_METRIC_LOGGERS}, received {self.metric_logger_type}."
            )
        if self.dtype not in PRECISION_STR_TO_DTYPE:
            raise ValueError(
                f"Dtype {self.dtype} must be one of {', '.join(PRECISION_STR_TO_DTYPE.keys())} for finetuning."
            )
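A small illustration of the `__post_init__` validation above: fields left at their empty-string defaults are rejected before any of the registry checks run. This is only a sketch; the exact registry contents (`ALL_MODELS`, etc.) come from the installed torchtune version.

```python
from custom_params import FullFinetuneParams

try:
    # model_checkpoint, tokenizer, dataset, ... are still "" and fail the first check hit.
    FullFinetuneParams(model="llama2_7b")
except TypeError as err:
    print(err)  # -> "model_checkpoint needs to be specified"
```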
baseline/full_finetune.py
ADDED
@@ -0,0 +1,455 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

import argparse
import os
import sys

from functools import partial
from typing import Any, Dict, Optional, Tuple
from warnings import warn

import torch

from torch import nn
from torch.cuda.amp import GradScaler
from torch.distributed import init_process_group
from torch.optim import Optimizer
from torch.utils.data import DataLoader, DistributedSampler

from torchtune import models, modules, utils
from torchtune.utils.constants import (
    EPOCHS_KEY,
    MAX_STEPS_KEY,
    MODEL_KEY,
    OPT_KEY,
    SEED_KEY,
    TOTAL_EPOCHS_KEY,
)

from tqdm import tqdm

from recipes.interfaces import FTRecipeInterface


from custom_params import FullFinetuneParams
from custom_dataset import AlpacaDataset

log = utils.get_logger("DEBUG")


class FullFinetuneRecipe(FTRecipeInterface):
    """
    Full finetuning recipe for dense transformer-based LLMs such as Llama2.

    This recipe supports:
        - FSDP and activation checkpointing. This is enabled by default but can be
            configured using the ``enable_fsdp`` and ``enable_activation_checkpointing`` flags.
        - Mixed precision training - fp32, fp16 and bf16 are supported.
        - Checkpointing of model weights, optimizer state and the recipe state (epoch and seed).
        - Resuming from checkpoints saved using the ``save_checkpoint`` functionality.
        - Logging to terminal. WandB and TensorBoard are currently not supported.

    Assumptions:
        - Training is launched with the Tune CLI (recommended) which uses TorchRun under the
            hood. Setting up the env variables is handled by TorchRun.
        - Training happens on CUDA (CPU training is not supported)
        - Checkpoints are ONLY saved at epoch boundaries. Mid-epoch checkpointing is NOT supported.
        - Datasets are Map-style and data fits in memory (not streamed).
    """

    def __init__(self, params: FullFinetuneParams) -> None:

        self._device = utils.get_device(device=params.device)
        self._dtype = utils.get_dtype(dtype=params.dtype)

        # logging attributes
        self._output_dir = params.output_dir
        self._metric_logger = utils.get_metric_logger(
            metric_logger_type=params.metric_logger_type,
            project=params.project,
            log_dir=params.output_dir,
        )
        self._log_every_n_steps = (
            params.log_every_n_steps if params.log_every_n_steps else 1
        )

        # _is_rank_zero is used primarily for logging. In the future, the logger
        # should directly take care of this
        _, rank = utils.get_world_size_and_rank()
        self._is_rank_zero = rank == 0

        # Training params
        self._resume_from_checkpoint = params.resume_from_checkpoint
        self._enable_fsdp = params.enable_fsdp
        self._gradient_accumulation_steps = params.gradient_accumulation_steps

        # These are public properties which are updated by the checkpoint loader
        # when ``resume_from_checkpoint`` is `True` or validated in tests
        self.seed = utils.set_seed(seed=params.seed)
        self.epochs_run = 0
        self.total_epochs = params.epochs
        self.max_steps_per_epoch = params.max_steps_per_epoch
        self.total_training_steps = 0

    def load_checkpoint(self, ckpt_path: str):
        """
        Extract the checkpoint state from file and validate.
        """
        ckpt_dict = torch.load(ckpt_path, map_location="cpu", weights_only=True)
        utils.validate_checkpoint(ckpt_dict, self._resume_from_checkpoint)
        return ckpt_dict

    def setup(self, params: FullFinetuneParams) -> None:
        """
        Sets up the recipe state correctly. This includes setting recipe attributes based
        on the ``resume_from_checkpoint`` flag.
        """

        ckpt_dict = self.load_checkpoint(ckpt_path=params.model_checkpoint)

        # If we're resuming from checkpoint, the recipe's state should be updated before
        # initializing the training components. This ensures that the seed is correctly
        # propagated to the relevant components
        if self._resume_from_checkpoint:
            self._update_recipe_state(ckpt_dict)

        # ``_setup_model`` handles initialization and loading the state dict. This method
        # should be called before ``_setup_optimizer`` since transforming the optimizer
        # state dict requires the model
        self._model = self._setup_model(
            model=params.model,
            enable_fsdp=params.enable_fsdp,
            enable_activation_checkpointing=params.enable_activation_checkpointing,
            model_state_dict=ckpt_dict[MODEL_KEY],
        )

        self._tokenizer = self._setup_tokenizer(
            tokenizer=params.tokenizer, tokenizer_checkpoint=params.tokenizer_checkpoint
        )

        # _setup_optimizer should take in ckpt_dict only if training is resumed from
        # checkpoint. Transforming the opt state dict is handled by this method
        self._optimizer = self._setup_optimizer(
            optimizer=params.optimizer,
            lr=params.lr,
            opt_state_dict=ckpt_dict[OPT_KEY] if self._resume_from_checkpoint else None,
        )

        self._loss_fn = self._setup_loss(loss=params.loss)

        # sampler and dataloader depend on the tokenizer and loss_fn and should be
        # setup after both of these are initialized
        self._sampler, self._dataloader = self._setup_data(
            dataset=params.dataset,
            train_on_input=params.train_on_input,
            shuffle=params.shuffle,
            batch_size=params.batch_size,
        )

        # training setup
        self._autocast = utils.get_autocast(self._dtype, self._device)
        self._grad_scaler = None
        if self._dtype == torch.float16:
            self._grad_scaler = utils.get_gradient_scaler(fsdp=params.enable_fsdp)
        else:
            self._grad_scaler = GradScaler(enabled=False)

        # Finally update the recipe state which can only be correctly set after all of the
        # other components have been initialized and updated.
        #
        # Number of training steps in each epoch depends on the number of batches produced
        # by the dataloader, the max_steps_per_epoch param set by the user and the
        # gradient_accumulation_steps param. This value is used for logging and tracking
        # training state. The computation should happen after the dataloader has been setup
        self._steps_per_epoch = (
            len(self._dataloader) // self._gradient_accumulation_steps
        )
        if (
            self.max_steps_per_epoch is not None
            and self.max_steps_per_epoch < self._steps_per_epoch
        ):
            self._steps_per_epoch = self.max_steps_per_epoch
        self.total_training_steps = self.epochs_run * self._steps_per_epoch

    def _update_recipe_state(self, ckpt_dict: Dict[str, Any]) -> None:
        """
        Updates the recipe state from checkpoint.
        """
        # If seed, total_epoch or max_steps_per_epoch don't match,
        # warn the user and overwrite
        if (
            self.seed != ckpt_dict[SEED_KEY]
            or self.total_epochs != ckpt_dict[TOTAL_EPOCHS_KEY]
            or self.max_steps_per_epoch != ckpt_dict[MAX_STEPS_KEY]
        ):
            warn(
                message="""Configured value for seed, epochs or max_steps_per_epoch
                does not match the value stored in checkpoint."""
            )
        self.seed = utils.set_seed(seed=ckpt_dict[SEED_KEY])
        self.epochs_run = ckpt_dict[EPOCHS_KEY]
        self.total_epochs = ckpt_dict[TOTAL_EPOCHS_KEY]
        self.max_steps_per_epoch = ckpt_dict[MAX_STEPS_KEY]

    def _setup_model(
        self,
        model: str,
        enable_fsdp: bool,
        enable_activation_checkpointing: bool,
        model_state_dict: Dict[str, Any],
    ) -> nn.Module:
        """
        Set up the model including enabling FSDP and activation checkpointing. For this recipe,
        ``enable_fsdp`` should always be ``True``. This is currently a configurable flag for
        running tests on CPUs.
        """
        model = models.get_model(model, device=self._device)
        model = (
            utils.wrap_fsdp(
                model=model,
                device=self._device,
                dtype=self._dtype,
                strategy="FULL_SHARD",
                auto_wrap_policy={modules.TransformerDecoderLayer},
            )
            if enable_fsdp
            else model
        )
        if enable_activation_checkpointing:
            utils.set_activation_checkpointing(
                model, auto_wrap_policy={modules.TransformerDecoderLayer}
            )

        model.load_state_dict(model_state_dict)

        if self._is_rank_zero:
            log.info(
                "Model is initialized. FSDP and Activation Checkpointing are enabled."
            )
        return model

    def _setup_tokenizer(
        self, tokenizer: str, tokenizer_checkpoint: str
    ) -> modules.Tokenizer:
        """
        Unlike ```setup_model```, this takes in the checkpoint and loads the sentencepiece
        tokenizer model. This is related to how the tokenizer is implemented and should
        change in a future iteration.
        """
        tokenizer = models.get_tokenizer(tokenizer, path=tokenizer_checkpoint)

        if self._is_rank_zero:
            log.info("Tokenizer is initialized from file.")
        return tokenizer

    def _setup_optimizer(
        self, optimizer: str, lr: float, opt_state_dict: Optional[Dict[str, Any]] = None
    ) -> Optimizer:
        """
        Set up the optimizer. This method also handles transforming the state dict
        for FSDP.
        """
        optimizer = modules.get_optimizer(optimizer, self._model, lr)
        if opt_state_dict:
            opt_state_dict = utils.transform_opt_state_dict(
                opt_state_dict, self._model, optimizer
            )
            optimizer.load_state_dict(opt_state_dict)

        if self._is_rank_zero:
            log.info("Optimizer is initialized.")
        return optimizer

    def _setup_loss(self, loss: str) -> nn.Module:
        loss_fn = modules.get_loss(loss)

        if self._is_rank_zero:
            log.info("Loss is initialized.")

        return loss_fn

    def _setup_data(
        self, dataset: str, shuffle: bool, batch_size: int, train_on_input: bool
    ) -> Tuple[DistributedSampler, DataLoader]:
        """
        All data related setup happens here. Currently this recipe only supports the
        DistributedSamplers with Map-style Datasets which fit into memory. Other samplers,
        iterable datasets and streaming datasets are not supported.
        """
        world_size, rank = utils.get_world_size_and_rank()
        ds = AlpacaDataset(dataset, tokenizer=self._tokenizer, train_on_input=train_on_input)

        sampler = DistributedSampler(
            ds,
            num_replicas=world_size,
            rank=rank,
            shuffle=shuffle,
            seed=0,
        )
        dataloader = DataLoader(
            dataset=ds,
            batch_size=batch_size,
            sampler=sampler,
            collate_fn=partial(
                utils.padded_collate,
                padding_idx=self._tokenizer.pad_id,
                ignore_idx=self._loss_fn.ignore_index,  # TODO support loss without ignore_index
            ),
        )

        if self._is_rank_zero:
            log.info("Dataset and Sampler are initialized.")

        return sampler, dataloader

    def save_checkpoint(self, epoch: int) -> None:
        """
        Checkpoint the relevant state of a recipe.

        This makes use of the `save_checkpoint` utility which is responsible for
        writing the checkpoint dictionary to file. The contents of the dict are dictated
        by whether training is complete or not.

        If training is ongoing, optimizer state, seed and epochs_run are saved along with the
        model weights.
        """
        os.makedirs(self._output_dir, exist_ok=True)
        output_loc = f"{self._output_dir}/model_{epoch}.ckpt"
        ckpt_dict = {MODEL_KEY: self._model}

        # if training is in-progress, checkpoint the optimizer state as well
        if epoch + 1 < self.total_epochs:
            ckpt_dict.update(
                {
                    OPT_KEY: self._optimizer,
                    SEED_KEY: self.seed,
                    EPOCHS_KEY: self.epochs_run,
                    TOTAL_EPOCHS_KEY: self.total_epochs,
                    MAX_STEPS_KEY: self.max_steps_per_epoch,
                }
            )
        utils.save_checkpoint(ckpt_dict, output_loc)

        if self._is_rank_zero:
            log.info(
                f"Model checkpoint of size {os.path.getsize(output_loc) >> 20} MB saved to {output_loc}"
            )

    def _should_update_weights(self, curr_step: int) -> bool:
        """
        Determines whether the weights should be updated on the current step or not.
        True is returned either if we've accumulated gradients for enough steps or if this
        is the last step in the epoch.
        """
        should_update_weights = (
            curr_step + 1
        ) % self._gradient_accumulation_steps == 0 or (
            curr_step + 1
        ) == self._steps_per_epoch
        return should_update_weights

    def train(self) -> None:
        """
        The core training loop. Supports training on subsets of the dataset using the
        ``max_steps_per_epoch``.
        """
        _, rank = utils.get_world_size_and_rank()

        # zero out the gradients before starting training
        self._optimizer.zero_grad()

        # self.epochs_run should be non-zero when we're resuming from a checkpoint
        for curr_epoch in range(self.epochs_run, self.total_epochs):

            # Update the sampler to ensure data is correctly shuffled across epochs
            # in case shuffle is True
            self._sampler.set_epoch(curr_epoch)

            for idx, batch in enumerate(
                pbar := tqdm(self._dataloader, disable=not (rank == 0))
            ):
                if (
                    self.max_steps_per_epoch is not None
                    and (idx // self._gradient_accumulation_steps)
                    == self.max_steps_per_epoch
                ):
                    break

                input_ids, labels = batch
                input_ids = input_ids.to(self._device)
                labels = labels.to(self._device)

                with self._autocast:
                    logits = self._model(input_ids)
                    # Shift so that tokens < n predict n
                    logits = logits[..., :-1, :].contiguous()
                    labels = labels[..., 1:].contiguous()
                    logits = logits.transpose(1, 2)
                    # Compute loss
                    loss = self._loss_fn(logits, labels)

                # Note: We're always logging the loss before normalizing it
                # Check if this is the norm or not
                pbar.set_description(f"{curr_epoch+1}|{idx+1}|Loss: {loss.item()}")

                if self.total_training_steps % self._log_every_n_steps == 0:
                    self._metric_logger.log_dict(
                        {
                            "loss": loss.item(),
                            "lr": self._optimizer.param_groups[0]["lr"],
                            "gpu_resources": torch.cuda.memory_allocated(),
                        },
                        step=self.total_training_steps,
                    )

                # Does loss normalization need to happen within autocast context?
                loss = loss / self._gradient_accumulation_steps
                self._grad_scaler.scale(loss).backward()

                if self._should_update_weights(idx):
                    self._grad_scaler.step(self._optimizer)
                    self._grad_scaler.update()
                    self._optimizer.zero_grad(set_to_none=True)

                    # Update the number of steps when the weights are updated
                    self.total_training_steps += 1

            self.epochs_run += 1
            self.save_checkpoint(epoch=curr_epoch)

    def cleanup(self) -> None:
        self._metric_logger.close()


def recipe_main() -> None:
    """
    Entry point for the recipe.

    Configurable parameters are read in the following order:
        - Parameters specified in ``FullFinetuneParams``
        - Overwritten by Parameters specified in ``alpaca_llama2_full_finetune.yaml``
        - Overwritten by arguments from the command-line using ``TuneArgumentParser``
    """
    parser = utils.TuneArgumentParser(
        description=FullFinetuneParams.__doc__,
        formatter_class=argparse.RawDescriptionHelpFormatter,
    )
    args, _ = parser.parse_known_args()
    args = vars(args)
    recipe_params = FullFinetuneParams(**args)

    # Env variables set by torch run; only need to initialize process group
    # init_process_group(backend="nccl")

    recipe = FullFinetuneRecipe(params=recipe_params)
    recipe.setup(params=recipe_params)
    recipe.train()
    recipe.cleanup()


if __name__ == "__main__":
    sys.exit(recipe_main())
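The shift inside the training loop above (`logits[..., :-1, :]` against `labels[..., 1:]`) is the standard next-token alignment. A worked miniature with made-up shapes, for readers unfamiliar with the pattern:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(2, 5, 32)           # [batch, seq, vocab]
labels = torch.randint(0, 32, (2, 5))    # [batch, seq]

shifted_logits = logits[..., :-1, :].contiguous()  # position t scores token t+1,
shifted_labels = labels[..., 1:].contiguous()      # so drop the first target token

# Cross-entropy expects [batch, vocab, seq], hence the transpose in the recipe.
loss = F.cross_entropy(shifted_logits.transpose(1, 2), shifted_labels)
print(float(loss))
```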
colorful/adversarial_config.yaml
ADDED
@@ -0,0 +1,39 @@
# Runs the full_finetune.py recipe
#
# To launch, run the following command from root:
# tune --nnodes 1 --nproc_per_node 1 --config alpaca_llama2_full_finetune --override model_checkpoint=<your_checkpoint_dir> ...

# Dataset and Dataloader
dataset: laurencer/yahma-alpaca-cleaned-adversarial
seed: 42
shuffle: True

# Checkpointing
# Removed for now given poor upload speeds for checkpoints
# hf_repo_id: laurencer/Llama7b-Alpaca-Tune-4epochs-WithColoring
checkpoint_every_n_steps: 500 # 6k steps per epoch

# Model Arguments
model_checkpoint: model/llama2_native.tune
tokenizer_checkpoint: model/tokenizer.model

color_layer_initialization: zeros
norm_before_color_layer: True

# Fine-tuning arguments
compile: False
batch_size: 8
lr: 2e-5
epochs: 4
optimizer: SGD
loss: CrossEntropyLoss
output_dir: output/alpaca-colorful-llama2-finetune
device: cuda
dtype: bf16
enable_fsdp: False
enable_activation_checkpointing: True
resume_from_checkpoint: False

# Logging arguments
metric_logger_type: wandb
project: torchtune
colorful/basic_config.yaml
ADDED
@@ -0,0 +1,39 @@
# Runs the full_finetune.py recipe
#
# To launch, run the following command from root:
# tune --nnodes 1 --nproc_per_node 1 --config alpaca_llama2_full_finetune --override model_checkpoint=<your_checkpoint_dir> ...

# Dataset and Dataloader
dataset: yahma/alpaca-cleaned
seed: 42
shuffle: True

# Checkpointing
# Removed for now given poor upload speeds for checkpoints
# hf_repo_id: laurencer/Llama7b-Alpaca-Tune-4epochs-WithColoring
checkpoint_every_n_steps: 500 # 6k steps per epoch

# Model Arguments
model_checkpoint: model/llama2_native.tune
tokenizer_checkpoint: model/tokenizer.model

color_layer_initialization: zeros
norm_before_color_layer: True

# Fine-tuning arguments
compile: True
batch_size: 8
lr: 2e-5
epochs: 4
optimizer: SGD
loss: CrossEntropyLoss
output_dir: output/alpaca-colorful-llama2-finetune
device: cuda
dtype: bf16
enable_fsdp: False
enable_activation_checkpointing: True
resume_from_checkpoint: False

# Logging arguments
metric_logger_type: wandb
project: torchtune
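`color_layer_initialization: zeros` selects the `zeros` entry of `INITIALIZATION_OPTIONS` in `colorful/custom_model.py`, later in this diff. Because the color layer's output is added residually to the token embeddings in `ColoringTransformerDecoder.forward`, zero initialization makes the color path a no-op at the start of training, so the colored model begins numerically identical to the base model. A tiny sketch of that property:

```python
import torch
from torch import nn

dim = 8
color_layer = nn.Linear(dim, dim)
color_layer.weight.data.zero_()
color_layer.bias.data.zero_()

h = torch.randn(2, 3, dim)                 # stand-in token embeddings
assert torch.equal(h + color_layer(h), h)  # zero contribution at step 0
```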
colorful/custom_dataset.py
ADDED
@@ -0,0 +1,179 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

from typing import List, Tuple

import torch

import torch.nn.functional as F
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import Dataset

from datasets import load_dataset

# Not ideal to import this type here but it's needed for the transform function
from torchtune.modules import Tokenizer


CROSS_ENTROPY_IGNORE_IDX = -100


DEFAULT = 0
INSTRUCTION = 1
INPUT = 2
RESPONSE = 3


class ColoringAlpacaDataset(Dataset):
    """
    See torchtune.datasets.alpaca.AlpacaDataset for the original implementation.

    Constructor now takes in a dataset path directly.

    This implementation returns 3 lists representing the tokens, labels, and token colors
    (as opposed to just the tokens & labels from the original).
    """

    def __init__(
        self,
        tokenizer: Tokenizer,
        dataset_path: str = "yahma/alpaca-cleaned",
        train_on_input: bool = True,
        **kwargs
    ) -> None:
        self._data = load_dataset(dataset_path, split="train")
        self._tokenizer = tokenizer
        self.train_on_input = train_on_input
        self.num_colors = 4  # matches the above usage of DEFAULT, INSTRUCTION, INPUT, RESPONSE

    def __len__(self):
        return len(self._data)

    def __getitem__(self, index: int) -> Tuple[List[int], List[int], List[int]]:
        sample = self._data[index]

        return self._transform(
            instruction=sample["instruction"],
            input=sample["input"],
            output=sample["output"],
        )

    def _transform(
        self, instruction: str, input: str, output: str
    ) -> Tuple[List[int], List[int], List[int]]:
        """
        Split a sample on ``response`` tag to create input and labels.

        Args:
            instruction (str): Instruction text.
            input (str): Input text. Can be an empty string. Determines the prompt generation template
                used.
            output (str): Response text.

        Returns:
            Tuple of encoded inputs, labels, token colors.
        """
        prompt = self._generate_prompt(instruction, input)

        # First handle the prompt
        colors = []
        tokenized = []
        labels = []
        is_first = True
        for token_type, text in prompt:
            tokenized_part = self._tokenizer.encode(
                text=text, add_bos=is_first, add_eos=False
            )
            is_first = False

            tokenized += tokenized_part
            colors += [token_type] * len(tokenized_part)
            if not self.train_on_input:
                labels += [CROSS_ENTROPY_IGNORE_IDX] * len(tokenized_part)
            else:
                labels += tokenized_part

        # Now add the response tokens
        tokenized_part = self._tokenizer.encode(
            text=output, add_bos=False, add_eos=True
        )
        tokenized += tokenized_part
        colors += [RESPONSE] * len(tokenized_part)
        labels += tokenized_part

        assert len(tokenized) == len(labels)
        assert len(tokenized) == len(colors)

        return tokenized, labels, colors

    def _generate_prompt(self, instruction: str, input: str) -> List[Tuple[(int, str)]]:
        """
        Generate prompt from instruction and input.

        Args:
            instruction (str): Instruction text.
            input (str): Input text.

        Returns:
            List of (int, templated text)
        """
        if input:
            return [
                (DEFAULT, (
                    "Below is an instruction that describes a task, paired with an input that provides further context. "
                    "Write a response that appropriately completes the request.\n\n"
                    "### Instruction:\n"
                )),
                (INSTRUCTION, instruction),
                (DEFAULT, "\n\n### Input:\n"),
                (INPUT, input),
                (DEFAULT, "\n\n### Response:\n"),
            ]
        else:
            return [
                (DEFAULT, (
                    "Below is an instruction that describes a task. "
                    "Write a response that appropriately completes the request.\n\n"
                    "### Instruction:\n"
                )),
                (INSTRUCTION, instruction),
                (DEFAULT, "\n\n### Response:\n"),
            ]


# TokenPair is a pair (tuple) of three lists: tokenized text inputs, labels, colors.
TokenPair = Tuple[List[int], List[int], List[int]]


def padded_collate(
    batch: List[TokenPair],
    padding_idx: int = 0,
    ignore_idx: int = -100,
) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    input_ids = pad_sequence(
        [torch.tensor(x[0]) for x in batch],
        batch_first=True,
        padding_value=padding_idx,
    )
    labels = pad_sequence(
        [torch.tensor(x[1]) for x in batch],
        batch_first=True,
        padding_value=ignore_idx,
    )
    colors = pad_sequence(
        [torch.tensor(x[2]) for x in batch],
        batch_first=True,
        padding_value=padding_idx,
    )

    input_ids_seq_len = input_ids.shape[-1]
    labels_seq_len = labels.shape[-1]
    colors_seq_len = colors.shape[-1]

    assert input_ids_seq_len == labels_seq_len
    assert input_ids_seq_len == colors_seq_len

    return input_ids, labels, colors
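A hypothetical usage sketch of the coloring dataset and its collate function; the tokenizer loading mirrors the baseline recipe, and the paths assume the `model/` layout from the README:

```python
from torch.utils.data import DataLoader
from torchtune import models
from custom_dataset import ColoringAlpacaDataset, padded_collate

tokenizer = models.get_tokenizer("llama2_tokenizer", path="model/tokenizer.model")
ds = ColoringAlpacaDataset(tokenizer=tokenizer, dataset_path="yahma/alpaca-cleaned")

dl = DataLoader(ds, batch_size=2, collate_fn=padded_collate)
input_ids, labels, colors = next(iter(dl))

# All three tensors share the padded shape [batch, max_seq_len_in_batch].
print(input_ids.shape, labels.shape, colors.shape)
```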
colorful/custom_model.py
ADDED
@@ -0,0 +1,267 @@
+from typing import Optional
+import copy
+import math
+
+import torch
+from torch import nn, Tensor
+
+from torchtune.modules import (
+    CausalSelfAttention,
+    FeedForward,
+    KVCache,
+    RMSNorm,
+    RotaryPositionalEmbeddings,
+    # TransformerDecoder, replaced with our custom implementation.
+    TransformerDecoderLayer,
+)
+
+from masked_apply import MaskedApply
+
+
+def initialize_identity_linear(size):
+    layer = nn.Linear(size, size)
+    layer.weight.data.copy_(torch.eye(size))
+    layer.bias.data.copy_(torch.zeros(size))
+    return layer
+
+
+def initialize_linear(size):
+    return nn.Linear(size, size)
+
+
+def initialize_kaiming_uniform_linear(size):
+    layer = nn.Linear(size, size)
+    nn.init.kaiming_uniform_(layer.weight, a=math.sqrt(5))
+    layer.bias.data.copy_(torch.zeros(size))
+    return layer
+
+
+def initialize_zeros_linear(size):
+    layer = nn.Linear(size, size)
+    layer.weight.data.copy_(torch.zeros(size))
+    layer.bias.data.copy_(torch.zeros(size))
+    return layer
+
+
+INITIALIZATION_OPTIONS = {
+    "identity": initialize_identity_linear,
+    "default": initialize_linear,
+    "kaiming_uniform": initialize_kaiming_uniform_linear,
+    "zeros": initialize_zeros_linear,
+}
+
+
+def _get_clones(module: nn.Module, n: int) -> nn.ModuleList:
+    """
+    Return a list of ``n`` identical layers.
+
+    Args:
+        module (nn.Module): module to be cloned
+        n (int): number of clones
+
+    Returns:
+        nn.ModuleList: list of ``n`` identical layers
+    """
+    # FIXME: copy.deepcopy() is not defined on nn.Module
+    return nn.ModuleList([copy.deepcopy(module) for i in range(n)])
+
+
+class ColoringTransformerDecoder(nn.Module):
+    """
+    See torchtune.models.llama2.TransformerDecoder for the original implementation.
+    """
+
+    def __init__(
+        self,
+        tok_embeddings: nn.Embedding,
+        embedding_transform: nn.Module,
+        layer: TransformerDecoderLayer,
+        num_layers: int,
+        norm: nn.Module,
+        output: nn.Linear,
+        embedding_norm: nn.Module = None,
+    ) -> None:
+        super().__init__()
+        self.tok_embeddings = tok_embeddings
+        self.embedding_transform = embedding_transform
+        self.embedding_norm = embedding_norm
+        self.layers = _get_clones(layer, num_layers)
+        self.norm = norm
+        self.output = output
+
+    def forward(
+        self,
+        tokens: Tensor,
+        mask: Optional[Tensor] = None,
+        colors: Optional[Tensor] = None,
+        curr_pos: int = 0,
+    ) -> Tensor:
+        """
+        Args:
+            tokens (Tensor): input tensor with shape [b x s]
+            mask (Optional[Tensor]): attention mask tensor, defaults to None.
+            colors (Optional[Tensor]): per-token color indices with shape [b x s], defaults to None.
+            curr_pos (int): current position in the seq, defaults to 0.
+                Only relevant when incrementally decoding.
+
+        Returns:
+            Tensor: output tensor with shape [b x s x v]
+
+        Notation used for tensor shapes:
+            - b: batch size
+            - s: sequence length
+            - v: vocab size
+            - d: embed dim
+        """
+        # input tensor of shape [b, s]
+        bsz, seq_len = tokens.shape
+
+        # shape: [b, s, d]
+        h = self.tok_embeddings(tokens)
+
+        # Apply normalization before the embedding transform to improve
+        # training stability.
+        ch = h
+        if self.embedding_norm is not None:
+            # TODO: norm does an in-place operation, so we need to clone the input
+            ch = self.embedding_norm(h.clone())
+
+        # Apply the embedding transform (e.g. color layer)
+        ch = self.embedding_transform(ch, colors)
+
+        # Add the output of the color transform to the embeddings
+        h = h + ch
+
+        # TODO: Fix the masking logic to not rely on checking kv_cache
+        if seq_len > 1 and self.layers[0].attn.kv_cache is not None:
+            mask = torch.full(
+                (1, 1, seq_len, seq_len), float("-inf"), device=tokens.device
+            )
+            mask = torch.triu(mask, diagonal=curr_pos + 1)
+
+        for layer in self.layers:
+            # shape: [b, s, d]
+            h = layer(h, mask, curr_pos)
+
+        # shape: [b, s, d]
+        h = self.norm(h)
+
+        # shape: [b, s, v]
+        output = self.output(h).float()
+        return output
+
+
+def coloring_llama2_7b(
+    color_layer_initialization: str,
+    norm_before_color_layer: bool = False,
+    max_batch_size: Optional[int] = None,
+) -> ColoringTransformerDecoder:
+    """Builder for creating a Llama2 model initialized w/ the default 7b parameter values.
+    From https://arxiv.org/abs/2307.09288, these default values are:
+    - vocab_size: 32,000
+    - embed_dim: 4,096
+    - num_layers: 32
+    - num_heads: 32
+    - num_kv_heads: 32
+    - max_seq_len: 4,096
+    - norm_eps: 1e-5
+
+    Args:
+        color_layer_initialization (str): Initialization strategy for the color layers.
+        norm_before_color_layer (bool): Whether to apply RMSNorm before the color layers.
+        max_batch_size (Optional[int]): Maximum batch size to be passed to KVCache.
+
+    Returns:
+        A ``ColoringTransformerDecoder`` instance of the Llama2 model.
+    """
+    return coloring_llama2(
+        color_layer_initialization=color_layer_initialization,
+        vocab_size=32_000,
+        num_layers=32,
+        num_heads=32,
+        num_kv_heads=32,
+        embed_dim=4096,
+        max_seq_len=4096,
+        num_colors=4,  # color for default, instruction, input, response
+        max_batch_size=max_batch_size,
+        attn_dropout=0.0,
+        norm_eps=1e-5,
+        norm_before_color_layer=norm_before_color_layer,
+    )
+
+
+def _scale_hidden_dim_for_mlp(dim: int, multiple_of: int = 256) -> int:
+    """Scale hidden dimension for MLP to keep number of parameters and computation constant.
+
+    Args:
+        dim (int): Input dimension.
+        multiple_of (int): Round scaled dimension to nearest multiple of `multiple_of` for clean computation.
+
+    Returns:
+        Scaled hidden dimension.
+    """
+    # Scale hidden dimension by (2/3)4d for SwiGLU to keep number of
+    # parameters and computation constant
+    hidden_dim = 4 * int(2 * dim / 3)
+    # Round hidden dimension to nearest multiple of `multiple_of`
+    hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)
+    return hidden_dim
+
+
+def coloring_llama2(
+    color_layer_initialization: str,
+    vocab_size: int,
+    num_layers: int,
+    num_heads: int,
+    num_kv_heads: int,
+    embed_dim: int,
+    max_seq_len: int,
+    num_colors: int,
+    norm_before_color_layer: bool = False,
+    attn_dropout: float = 0.0,
+    max_batch_size: Optional[int] = None,
+    norm_eps: float = 1e-5,
+):
+    if color_layer_initialization not in INITIALIZATION_OPTIONS:
+        raise ValueError(f"Invalid color_layer_initialization: {color_layer_initialization}. Expected one of {list(INITIALIZATION_OPTIONS.keys())}.")
+    color_layer_initializer = INITIALIZATION_OPTIONS[color_layer_initialization]
+
+    head_dim = embed_dim // num_heads
+    num_kv_heads = num_kv_heads if num_kv_heads else num_heads
+    kv_cache = (
+        KVCache(
+            max_batch_size=max_batch_size,
+            max_seq_len=max_seq_len,
+            n_kv_heads=num_heads,
+            head_dim=head_dim,
+        )
+        if max_batch_size is not None
+        else None
+    )
+    rope = RotaryPositionalEmbeddings(dim=head_dim, max_seq_len=max_seq_len)
+    self_attn = CausalSelfAttention(
+        embed_dim=embed_dim,
+        num_heads=num_heads,
+        num_kv_heads=num_kv_heads,
+        head_dim=head_dim,
+        q_proj=nn.Linear(embed_dim, num_heads * head_dim, bias=False),
+        k_proj=nn.Linear(embed_dim, num_kv_heads * head_dim, bias=False),
+        v_proj=nn.Linear(embed_dim, num_kv_heads * head_dim, bias=False),
+        output_proj=nn.Linear(embed_dim, embed_dim, bias=False),
+        pos_embeddings=rope,
+        kv_cache=kv_cache,
+        max_seq_len=max_seq_len,
+        attn_dropout=attn_dropout,
+    )
+    hidden_dim = _scale_hidden_dim_for_mlp(embed_dim)
+    mlp = FeedForward(dim=embed_dim, hidden_dim=hidden_dim, linear_class=nn.Linear)
+    layer = TransformerDecoderLayer(
+        attn=self_attn,
+        mlp=mlp,
+        sa_norm=RMSNorm(dim=embed_dim, eps=norm_eps),
+        mlp_norm=RMSNorm(dim=embed_dim, eps=norm_eps),
+    )
+    tok_embeddings = nn.Embedding(vocab_size, embed_dim)
+    output_proj = nn.Linear(embed_dim, vocab_size, bias=False)
+    embedding_transform = MaskedApply(
+        [color_layer_initializer(embed_dim) for _ in range(num_colors)],
+        strict=True,
+    )
+    embedding_norm = RMSNorm(embed_dim, eps=norm_eps) if norm_before_color_layer else None
+
+    return ColoringTransformerDecoder(
+        tok_embeddings=tok_embeddings,
+        embedding_transform=embedding_transform,
+        embedding_norm=embedding_norm,
+        layer=layer,
+        num_layers=num_layers,
+        norm=RMSNorm(embed_dim, eps=norm_eps),
+        output=output_proj,
+    )
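
For reference, a small-scale usage sketch of the builder above. The sizes are illustrative only (not the 7b defaults), and it assumes the torchtune modules imported at the top of this file are available with the signatures used here:

import torch
from custom_model import coloring_llama2

# Tiny illustrative configuration so the forward pass runs quickly on CPU.
model = coloring_llama2(
    color_layer_initialization="zeros",
    vocab_size=128,
    num_layers=2,
    num_heads=4,
    num_kv_heads=4,
    embed_dim=64,
    max_seq_len=64,
    num_colors=4,
)

tokens = torch.randint(0, 128, (2, 16))   # [b, s]
colors = torch.randint(0, 4, (2, 16))     # one color id per token
logits = model(tokens, colors=colors)     # [b, s, vocab_size]
print(logits.shape)                       # torch.Size([2, 16, 128])
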
colorful/custom_params.py
ADDED
@@ -0,0 +1,110 @@
+from dataclasses import dataclass, field, fields
+from typing import List, Optional
+
+from torchtune.datasets import ALL_DATASETS
+from torchtune.models import ALL_MODELS, ALL_TOKENIZERS
+from torchtune.utils.metric_logging import ALL_METRIC_LOGGERS
+from torchtune.utils.precision import PRECISION_STR_TO_DTYPE
+
+
+@dataclass
+class ColoringFinetuneParams:
+    """Arguments for the finetune_llm recipe.
+
+    Args:
+        device (str): Device to use for training. Options are "cpu" and "cuda"
+        dtype (str): Data type to use for training.
+        seed (int): Random seed to use for training.
+        model (str): String specifying model architecture to fine-tune. See ``torchtune.models.get_model`` for options.
+        model_checkpoint (str): Local path to load model checkpoint from.
+        tokenizer (str): String specifying tokenizer to use. See ``torchtune.models.get_tokenizer`` for options.
+        tokenizer_checkpoint (str): Local path to load tokenizer checkpoint from.
+        dataset (str): String specifying dataset to use. See ``torchtune.datasets.get_dataset`` for options.
+            Currently, only predefined datasets in the library are supported.
+        shuffle (bool): Whether to shuffle the dataset.
+        batch_size (int): Batch size to use for training.
+        epochs (int): Number of epochs to train for.
+        optimizer (str): String specifying optimizer to use. See ``torchtune.optim.get_optimizer`` for options.
+        loss (str): String specifying loss function to use. See ``torchtune.losses.get_loss`` for options.
+        lr (float): Learning rate to use for optimizer.
+        activation_checkpointing (bool): Whether to use activation checkpointing.
+        output_dir (str): Local path to save checkpoints and logs to.
+        run_generation (int): Run eval on a prompt every ``run_generation`` steps. Set to 0 to disable.
+        max_steps_per_epoch (int): Maximum number of steps to take per epoch.
+        metric_logger_type (str): String specifying metric logger to use. See ``torchtune.utils.get_metric_logger``
+            for options.
+        project (str): Project name to use for logging. Used by ``WandBLogger``.
+        resume_from_previous_checkpoint (bool): Whether to resume fine-tuning from a previous checkpoint.
+        cpu_offload (bool): Whether to offload model to CPU.
+
+    Raises:
+        ValueError: If ``cpu_offload`` is ``True`` but ``device`` is not ``cuda`` and <= 1 GPUs.
+    """
+
+    # Model
+    model_checkpoint: str = ""
+
+    color_layer_initialization: str = "default"
+    norm_before_color_layer: bool = False
+
+    # Tokenizer
+    tokenizer_checkpoint: str = ""
+
+    hf_repo_id: Optional[str] = None
+    checkpoint_every_n_steps: Optional[int] = None
+
+    # Dataset and Sampler
+    dataset: str = ""
+    train_on_input: bool = True
+    shuffle: bool = True
+    batch_size: int = 2
+
+    # Optimizer and Scheduler
+    optimizer: str = "SGD"
+    lr: float = 2e-5
+    loss: str = "CrossEntropyLoss"
+    gradient_accumulation_steps: int = 1
+
+    # Training
+    compile: bool = False
+    epochs: int = 3
+    max_steps_per_epoch: Optional[int] = None
+    resume_from_checkpoint: bool = False
+    run_generation: Optional[int] = None
+
+    # Distributed
+    cpu_offload: bool = False
+    enable_fsdp: bool = True
+    enable_activation_checkpointing: bool = True
+
+    # Environment
+    device: str = "cuda"
+    dtype: str = "fp16"
+    seed: Optional[int] = None
+
+    # Logging
+    output_dir: str = "/tmp/full_finetune_output"
+    metric_logger_type: str = "disk"
+    project: Optional[str] = None
+    log_every_n_steps: Optional[int] = None
+
+    def __post_init__(self):
+        for param in fields(self):
+            if getattr(self, param.name) == "":
+                raise TypeError(f"{param.name} needs to be specified")
+
+        if self.cpu_offload and self.device != "cuda":
+            raise ValueError(
+                "Cannot offload model to CPU if device is not cuda or <= 1 GPUs."
+            )
+        if self.enable_fsdp and self.device == "cpu":
+            raise ValueError("FSDP is not supported on CPU.")
+
+        if self.metric_logger_type not in ALL_METRIC_LOGGERS:
+            raise ValueError(
+                f"Metric logger not recognized. Expected one of {ALL_METRIC_LOGGERS}, received {self.metric_logger_type}."
+            )
+        if self.dtype not in PRECISION_STR_TO_DTYPE:
+            raise ValueError(
+                f"Dtype {self.dtype} must be one of {', '.join(PRECISION_STR_TO_DTYPE.keys())} for finetuning."
+            )
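
A short sketch of how the params dataclass behaves. The paths and the dataset name below are placeholders, not values confirmed by this repo:

from custom_params import ColoringFinetuneParams

params = ColoringFinetuneParams(
    model_checkpoint="/path/to/model.ckpt",        # placeholder
    tokenizer_checkpoint="/path/to/tokenizer.model",  # placeholder
    dataset="alpaca",                               # placeholder dataset name
)
print(params.color_layer_initialization)  # "default"
print(params.batch_size)                  # 2

# Required string fields left as "" raise at construction time, e.g.
# ColoringFinetuneParams()  ->  TypeError: model_checkpoint needs to be specified
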
colorful/full_finetune.py
ADDED
@@ -0,0 +1,511 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+
+import argparse
+import os
+import sys
+
+from functools import partial
+from typing import Any, Dict, Optional, Tuple
+from warnings import warn
+
+import torch
+
+from torch import nn
+from torch.cuda.amp import GradScaler
+from torch.distributed import init_process_group
+from torch.optim import Optimizer
+from torch.utils.data import DataLoader, DistributedSampler
+from torchtune.utils import get_device
+
+from torchtune import models, modules, utils
+from torchtune.utils.constants import (
+    EPOCHS_KEY,
+    MAX_STEPS_KEY,
+    MODEL_KEY,
+    OPT_KEY,
+    SEED_KEY,
+    TOTAL_EPOCHS_KEY,
+)
+
+from tqdm import tqdm
+
+from recipes.interfaces import FTRecipeInterface
+from recipes.params import FullFinetuneParams
+
+from torchtune.models.llama2 import llama2_tokenizer
+
+from huggingface_hub import HfApi
+
+from custom_params import ColoringFinetuneParams
+from custom_model import ColoringTransformerDecoder, coloring_llama2_7b
+from custom_dataset import ColoringAlpacaDataset, padded_collate
+
+log = utils.get_logger("DEBUG")
+
+
+class ColoringFinetuneRecipe(FTRecipeInterface):
+    """
+    Full finetuning recipe for dense transformer-based LLMs such as Llama2.
+
+    This recipe supports:
+        - FSDP and activation checkpointing. This is enabled by default but can be
+          configured using the ``enable_fsdp`` and ``enable_activation_checkpointing`` flags.
+        - Mixed precision training - fp32, fp16 and bf16 are supported.
+        - Checkpointing of model weights, optimizer state and the recipe state (epoch and seed).
+        - Resuming from checkpoints saved using the ``save_checkpoint`` functionality.
+        - Logging to terminal. WandB and TensorBoard are currently not supported.
+
+    Assumptions:
+        - Training is launched with the Tune CLI (recommended) which uses TorchRun under the
+          hood. Setting up the env variables is handled by TorchRun.
+        - Training happens on CUDA (CPU training is not supported)
+        - Checkpoints are ONLY saved at epoch boundaries. Mid-epoch checkpointing is NOT supported.
+        - Datasets are Map-style and data fits in memory (not streamed).
+    """
+
+    _model: ColoringTransformerDecoder
+
+    def __init__(self, params: ColoringFinetuneParams) -> None:
+        self._params = params
+
+        self._device = utils.get_device(device=params.device)
+        self._dtype = utils.get_dtype(dtype=params.dtype)
+
+        self._hf_hub = HfApi()
+        self._hf_repo_id = params.hf_repo_id
+
+        if self._hf_repo_id is not None:
+            self._hf_hub.create_repo(
+                repo_id=self._hf_repo_id,
+                repo_type="model",
+                private=True,
+                exist_ok=True,
+            )
+
+        # logging attributes
+        self._output_dir = params.output_dir
+        self._metric_logger = utils.get_metric_logger(
+            metric_logger_type=params.metric_logger_type,
+            project=params.project,
+            log_dir=params.output_dir,
+        )
+        self._log_every_n_steps = (
+            params.log_every_n_steps if params.log_every_n_steps else 1
+        )
+
+        self._checkpoint_every_n_steps = params.checkpoint_every_n_steps
+
+        # _is_rank_zero is used primarily for logging. In the future, the logger
+        # should directly take care of this
+        _, rank = utils.get_world_size_and_rank()
+        self._is_rank_zero = rank == 0
+
+        # Training params
+        self._compile = params.compile
+        self._resume_from_checkpoint = params.resume_from_checkpoint
+        self._enable_fsdp = params.enable_fsdp
+        self._gradient_accumulation_steps = params.gradient_accumulation_steps
+
+        # These are public properties which are updated by the checkpoint loader
+        # when ``resume_from_checkpoint`` is ``True`` or validated in tests
+        self.seed = utils.set_seed(seed=params.seed)
+        self.epochs_run = 0
+        self.total_epochs = params.epochs
+        self.max_steps_per_epoch = params.max_steps_per_epoch
+        self.total_training_steps = 0
+
+    def load_checkpoint(self, ckpt_path: str):
+        """
+        Extract the checkpoint state from file and validate.
+        """
+        ckpt_dict = torch.load(ckpt_path, map_location="cpu", weights_only=True)
+        utils.validate_checkpoint(ckpt_dict, self._resume_from_checkpoint)
+        return ckpt_dict
+
+    def setup(self, params: FullFinetuneParams) -> None:
+        """
+        Sets up the recipe state correctly. This includes setting recipe attributes based
+        on the ``resume_from_checkpoint`` flag.
+        """
+
+        ckpt_dict = self.load_checkpoint(ckpt_path=params.model_checkpoint)
+
+        # If we're resuming from checkpoint, the recipe's state should be updated before
+        # initializing the training components. This ensures that the seed is correctly
+        # propagated to the relevant components
+        if self._resume_from_checkpoint:
+            self._update_recipe_state(ckpt_dict)
+
+        # ``_setup_model`` handles initialization and loading the state dict. This method
+        # should be called before ``_setup_optimizer`` since transforming the optimizer
+        # state dict requires the model
+        self._model = self._setup_model(
+            enable_fsdp=params.enable_fsdp,
+            enable_activation_checkpointing=params.enable_activation_checkpointing,
+            model_state_dict=ckpt_dict[MODEL_KEY],
+        )
+
+        self._tokenizer = self._setup_tokenizer(
+            tokenizer_checkpoint=params.tokenizer_checkpoint
+        )
+
+        # _setup_optimizer should take in ckpt_dict only if training is resumed from
+        # checkpoint. Transforming the opt state dict is handled by this method
+        self._optimizer = self._setup_optimizer(
+            optimizer=params.optimizer,
+            lr=params.lr,
+            opt_state_dict=ckpt_dict[OPT_KEY] if self._resume_from_checkpoint else None,
+        )
+
+        self._loss_fn = self._setup_loss(loss=params.loss)
+
+        # sampler and dataloader depend on the tokenizer and loss_fn and should be
+        # setup after both of these are initialized
+        self._sampler, self._dataloader = self._setup_data(
+            dataset=params.dataset,
+            train_on_input=params.train_on_input,
+            shuffle=params.shuffle,
+            batch_size=params.batch_size,
+        )
+
+        # training setup
+        self._autocast = utils.get_autocast(self._dtype, self._device)
+        self._grad_scaler = None
+        if self._dtype == torch.float16:
+            self._grad_scaler = utils.get_gradient_scaler(fsdp=params.enable_fsdp)
+        else:
+            self._grad_scaler = GradScaler(enabled=False)
+
+        # Finally update the recipe state which can only be correctly set after all of the
+        # other components have been initialized and updated.
+        #
+        # Number of training steps in each epoch depends on the number of batches produced
+        # by the dataloader, the max_steps_per_epoch param set by the user and the
+        # gradient_accumulation_steps param. This value is used for logging and tracking
+        # training state. The computation should happen after the dataloader has been setup
+        self._steps_per_epoch = (
+            len(self._dataloader) // self._gradient_accumulation_steps
+        )
+        if (
+            self.max_steps_per_epoch is not None
+            and self.max_steps_per_epoch < self._steps_per_epoch
+        ):
+            self._steps_per_epoch = self.max_steps_per_epoch
+        self.total_training_steps = self.epochs_run * self._steps_per_epoch
+
+    def _update_recipe_state(self, ckpt_dict: Dict[str, Any]) -> None:
+        """
+        Updates the recipe state from checkpoint.
+        """
+        # If seed, total_epoch or max_steps_per_epoch don't match,
+        # warn the user and overwrite
+        if (
+            self.seed != ckpt_dict[SEED_KEY]
+            or self.total_epochs != ckpt_dict[TOTAL_EPOCHS_KEY]
+            or self.max_steps_per_epoch != ckpt_dict[MAX_STEPS_KEY]
+        ):
+            warn(
+                message="""Configured value for seed, epochs or max_steps_per_epoch
+                does not match the value stored in checkpoint."""
+            )
+        self.seed = utils.set_seed(seed=ckpt_dict[SEED_KEY])
+        self.epochs_run = ckpt_dict[EPOCHS_KEY]
+        self.total_epochs = ckpt_dict[TOTAL_EPOCHS_KEY]
+        self.max_steps_per_epoch = ckpt_dict[MAX_STEPS_KEY]
+
+    def _setup_model(
+        self,
+        enable_fsdp: bool,
+        enable_activation_checkpointing: bool,
+        model_state_dict: Dict[str, Any],
+    ) -> nn.Module:
+        """
+        Set up the model including enabling FSDP and activation checkpointing. For this recipe,
+        ``enable_fsdp`` should always be ``True``. This is currently a configurable flag for
+        running tests on CPUs.
+        """
+
+        with get_device(self._device):
+            model = coloring_llama2_7b(
+                self._params.color_layer_initialization,
+                norm_before_color_layer=self._params.norm_before_color_layer,
+            )
+
+        model = (
+            utils.wrap_fsdp(
+                model=model,
+                device=self._device,
+                dtype=self._dtype,
+                strategy="FULL_SHARD",
+                auto_wrap_policy={modules.TransformerDecoderLayer},
+            )
+            if enable_fsdp
+            else model
+        )
+        if enable_activation_checkpointing:
+            utils.set_activation_checkpointing(
+                model, auto_wrap_policy={modules.TransformerDecoderLayer}
+            )
+
+        model.load_state_dict(model_state_dict, strict=False)
+
+        if self._is_rank_zero:
+            log.info(
+                "Model is initialized. FSDP and Activation Checkpointing are enabled."
+            )
+
+        if self._compile:
+            log.info("Compiling model using torch.compile. The first batch may take a few minutes while compilation occurs.")
+            model = torch.compile(model)
+        else:
+            log.info("Skipping model compilation")
+
+        return model
+
+    def _setup_tokenizer(
+        self, tokenizer_checkpoint: str
+    ) -> modules.Tokenizer:
+        """
+        Unlike ``_setup_model``, this takes in the checkpoint and loads the sentencepiece
+        tokenizer model. This is related to how the tokenizer is implemented and should
+        change in a future iteration.
+        """
+        tokenizer = llama2_tokenizer(tokenizer_checkpoint)
+
+        if self._is_rank_zero:
+            log.info("Tokenizer is initialized from file.")
+        return tokenizer
+
+    def _setup_optimizer(
+        self, optimizer: str, lr: float, opt_state_dict: Optional[Dict[str, Any]] = None
+    ) -> Optimizer:
+        """
+        Set up the optimizer. This method also handles transforming the state dict
+        for FSDP.
+        """
+        optimizer = modules.get_optimizer(optimizer, self._model, lr)
+        if opt_state_dict:
+            opt_state_dict = utils.transform_opt_state_dict(
+                opt_state_dict, self._model, optimizer
+            )
+            optimizer.load_state_dict(opt_state_dict)
+
+        if self._is_rank_zero:
+            log.info("Optimizer is initialized.")
+        return optimizer
+
+    def _setup_loss(self, loss: str) -> nn.Module:
+        loss_fn = modules.get_loss(loss)
+
+        if self._is_rank_zero:
+            log.info("Loss is initialized.")
+
+        return loss_fn
+
+    def _setup_data(
+        self, dataset: str, shuffle: bool, batch_size: int, train_on_input: bool
+    ) -> Tuple[DistributedSampler, DataLoader]:
+        """
+        All data related setup happens here. Currently this recipe only supports the
+        DistributedSamplers with Map-style Datasets which fit into memory. Other samplers,
+        iterable datasets and streaming datasets are not supported.
+        """
+        world_size, rank = utils.get_world_size_and_rank()
+        ds = ColoringAlpacaDataset(tokenizer=self._tokenizer, dataset=dataset, train_on_input=train_on_input)
+
+        sampler = DistributedSampler(
+            ds,
+            num_replicas=world_size,
+            rank=rank,
+            shuffle=shuffle,
+            seed=0,
+        )
+
+        dataloader = DataLoader(
+            dataset=ds,
+            batch_size=batch_size,
+            sampler=sampler,
+            collate_fn=partial(
+                padded_collate,
+                padding_idx=self._tokenizer.pad_id,
+                ignore_idx=self._loss_fn.ignore_index,  # TODO support loss without ignore_index
+            ),
+        )
+
+        if self._is_rank_zero:
+            log.info("Dataset and Sampler are initialized.")
+
+        return sampler, dataloader
+
+    def save_checkpoint(self, epoch: int) -> None:
+        """
+        Checkpoint the relevant state of a recipe.
+
+        This makes use of the `save_checkpoint` utility which is responsible for
+        writing the checkpoint dictionary to file. The contents of the dict are dictated
+        by whether training is complete or not.
+
+        If training is ongoing, optimizer state, seed and epochs_run are saved along with the
+        model weights.
+        """
+        os.makedirs(self._output_dir, exist_ok=True)
+        output_loc = f"{self._output_dir}/model_{epoch}.ckpt"
+        ckpt_dict = {MODEL_KEY: self._model}
+
+        # if training is in-progress, checkpoint the optimizer state as well
+        if epoch + 1 < self.total_epochs:
+            ckpt_dict.update(
+                {
+                    OPT_KEY: self._optimizer,
+                    SEED_KEY: self.seed,
+                    EPOCHS_KEY: self.epochs_run,
+                    TOTAL_EPOCHS_KEY: self.total_epochs,
+                    MAX_STEPS_KEY: self.max_steps_per_epoch,
+                }
+            )
+        utils.save_checkpoint(ckpt_dict, output_loc)
+
+        if self._is_rank_zero:
+            log.info(
+                f"Model checkpoint of size {os.path.getsize(output_loc) >> 20} MB saved to {output_loc}"
+            )
+
+        if self._hf_repo_id is not None:
+            log.info(f"Uploading checkpoint to HuggingFace Hub: {self._hf_repo_id}")
+            self._hf_hub.upload_folder(
+                folder_path=self._output_dir,
+                repo_id=self._hf_repo_id,
+                repo_type="model",
+                run_as_future=True,
+                commit_message=f"Checkpoint for epoch {epoch} (step {self.total_training_steps})",
+            )
+        else:
+            log.info("Skipping uploading to HuggingFace Hub (no repo id specified)")
+
+    def _should_update_weights(self, curr_step: int) -> bool:
+        """
+        Determines whether the weights should be updated on the current step or not.
+        True is returned either if we've accumulated gradients for enough steps or if this
+        is the last step in the epoch.
+        """
+        should_update_weights = (
+            curr_step + 1
+        ) % self._gradient_accumulation_steps == 0 or (
+            curr_step + 1
+        ) == self._steps_per_epoch
+        return should_update_weights
+
+    def train(self) -> None:
+        """
+        The core training loop. Supports training on subsets of the dataset using the
+        ``max_steps_per_epoch``.
+        """
+        _, rank = utils.get_world_size_and_rank()
+
+        # zero out the gradients before starting training
+        self._optimizer.zero_grad()
+
+        # self.epochs_run should be non-zero when we're resuming from a checkpoint
+        for curr_epoch in range(self.epochs_run, self.total_epochs):
+
+            # Update the sampler to ensure data is correctly shuffled across epochs
+            # in case shuffle is True
+            self._sampler.set_epoch(curr_epoch)
+
+            for idx, batch in enumerate(
+                pbar := tqdm(self._dataloader, disable=not (rank == 0))
+            ):
+                if (
+                    self.max_steps_per_epoch is not None
+                    and (idx // self._gradient_accumulation_steps)
+                    == self.max_steps_per_epoch
+                ):
+                    break
+
+                input_ids, labels, colors = batch
+
+                input_ids = input_ids.to(self._device)
+                labels = labels.to(self._device)
+                colors = colors.to(self._device)
+
+                with self._autocast:
+                    logits = self._model(input_ids, colors=colors)
+                    # Shift so that tokens < n predict n
+                    logits = logits[..., :-1, :].contiguous()
+                    labels = labels[..., 1:].contiguous()
+                    logits = logits.transpose(1, 2)
+                    # Compute loss
+                    loss = self._loss_fn(logits, labels)
+
+                # Note: We're always logging the loss before normalizing it
+                # Check if this is the norm or not
+                pbar.set_description(f"{curr_epoch+1}|{idx+1}|Loss: {loss.item()}")
+
+                if self.total_training_steps % self._log_every_n_steps == 0:
+                    self._metric_logger.log_dict(
+                        {
+                            "loss": loss.item(),
+                            "lr": self._optimizer.param_groups[0]["lr"],
+                            "gpu_resources": torch.cuda.memory_allocated(),
+                        },
+                        step=self.total_training_steps,
+                    )
+
+                if self._checkpoint_every_n_steps is not None:
+                    if self.total_training_steps > 0 and self.total_training_steps % self._checkpoint_every_n_steps == 0:
+                        self.save_checkpoint(epoch=curr_epoch)
+
+                # Does loss normalization need to happen within autocast context?
+                loss = loss / self._gradient_accumulation_steps
+                self._grad_scaler.scale(loss).backward()
+
+                if self._should_update_weights(idx):
+                    self._grad_scaler.step(self._optimizer)
+                    self._grad_scaler.update()
+                    self._optimizer.zero_grad(set_to_none=True)
+
+                    # Update the number of steps when the weights are updated
+                    self.total_training_steps += 1
+
+            self.epochs_run += 1
+            self.save_checkpoint(epoch=curr_epoch)
+
+    def cleanup(self) -> None:
+        self._metric_logger.close()
+
+
+def recipe_main() -> None:
+    """
+    Entry point for the recipe.
+
+    Configurable parameters are read in the following order:
+        - Parameters specified in ``ColoringFinetuneParams``
+        - Overwritten by Parameters specified in ``alpaca_llama2_full_finetune.yaml``
+        - Overwritten by arguments from the command-line using ``TuneArgumentParser``
+    """
+    parser = utils.TuneArgumentParser(
+        description=ColoringFinetuneParams.__doc__,
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+    )
+    args, _ = parser.parse_known_args()
+    args = vars(args)
+    recipe_params = ColoringFinetuneParams(**args)
+
+    # Env variables set by torch run; only need to initialize process group
+    # Disabled since this breaks for now on RunPod.
+    # init_process_group(backend="nccl")
+
+    recipe = ColoringFinetuneRecipe(params=recipe_params)
+    recipe.setup(params=recipe_params)
+    recipe.train()
+    recipe.cleanup()
+
+
+if __name__ == "__main__":
+    sys.exit(recipe_main())
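
The shift inside ``train`` above is the standard next-token alignment. A minimal standalone illustration with toy tensors (plain PyTorch only, sizes are arbitrary):

import torch

b, s, v = 1, 5, 11
logits = torch.randn(b, s, v)
labels = torch.randint(0, v, (b, s))

# Position i of the shifted logits is scored against token i+1,
# so the model learns to predict the *next* token.
shifted_logits = logits[..., :-1, :].contiguous()   # [b, s-1, v]
shifted_labels = labels[..., 1:].contiguous()        # [b, s-1]

# CrossEntropyLoss expects the class dim second, hence the transpose.
loss = torch.nn.functional.cross_entropy(
    shifted_logits.transpose(1, 2), shifted_labels, ignore_index=-100
)
print(loss.item())
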
colorful/masked_apply.py
ADDED
@@ -0,0 +1,73 @@
+import torch
+import torch.nn as nn
+
+
+class MaskedApply(nn.Module):
+    """
+    Uses an index mask to select a subset of the input and apply a layer to it.
+
+    E.g. if mask is [[0, 1, 0]] layers[0] will be applied to the first and third element
+    and layers[1] will be applied to the second element.
+    """
+
+    def __init__(self, layers, strict=False):
+        super(MaskedApply, self).__init__()
+        self.num_layers = len(layers)
+        self.layers = nn.ModuleList(layers)
+        self.strict = strict
+
+        # Create a CPU tensor to store the maximum value found.
+        # This will prevent the GPU being blocked while we check
+        # whether an index is > num_layers in strict mode.
+        self._maximum_found_cpu = torch.tensor([-1], device='cpu')
+        self._maximum_found = torch.tensor([-1])
+        if torch.cuda.is_available():
+            self._maximum_found_cpu = self._maximum_found_cpu.pin_memory()
+
+    def forward(self, x, mask):
+        # If in strict mode, check if we previously violated the maximum found.
+        if self._maximum_found_cpu >= self.num_layers:
+            raise ValueError(f'Unexpected index value found {self._maximum_found_cpu}. Should be less than {self.num_layers}')
+
+        # Ensure mask is a long tensor
+        mask = mask.long()
+
+        # Flatten x and mask for easier processing
+        batch_size, seq_length, embedding_size = x.shape
+
+        x_flat = x.view(-1, embedding_size)
+        mask_flat = mask.view(-1)
+
+        # Output placeholder
+        output_flat = torch.zeros_like(x_flat)
+
+        # Process each mask value
+        for i in range(self.num_layers):
+            # Find indices for current mask value
+            indices = torch.where(mask_flat == i)[0]
+
+            # Select relevant inputs for the current linear layer
+            selected_inputs = torch.index_select(x_flat, 0, indices)
+
+            # Apply linear layer
+            transformed = self.layers[i](selected_inputs)
+
+            # TODO: figure out why this is necessary.
+            transformed = transformed.to(x_flat.dtype)
+
+            # Place results back in the output tensor
+            output_flat.index_copy_(0, indices, transformed)
+
+        # Handle any out-of-range indices
+        if self.strict:
+            # This check is done asynchronously.
+            self._maximum_found = max(max(mask_flat), self._maximum_found)
+            self._maximum_found_cpu.copy_(self._maximum_found, non_blocking=True)
+        else:
+            # Copy any out-of-range positions through unchanged.
+            indices = torch.where(mask_flat >= self.num_layers)[0]
+            selected_inputs = torch.index_select(x_flat, 0, indices)
+            output_flat.index_copy_(0, indices, selected_inputs)
+
+        # Reshape output to original dimensions
+        output = output_flat.view(batch_size, seq_length, embedding_size)
+        return output
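
Finally, a small usage sketch of ``MaskedApply`` on its own (pure PyTorch; the layer count and sizes are illustrative):

import torch
from torch import nn
from masked_apply import MaskedApply

# Two per-color transforms over a 4-dim embedding.
masked = MaskedApply([nn.Linear(4, 4), nn.Linear(4, 4)], strict=True)

x = torch.randn(1, 3, 4)            # [batch, seq, embed]
mask = torch.tensor([[0, 1, 0]])    # layers[0] -> positions 0 and 2, layers[1] -> position 1
out = masked(x, mask)
print(out.shape)                    # torch.Size([1, 3, 4])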