Mergekit Low VRAM Graph Patch
Merge models in minutes instead of hours on low VRAM
This is a significant and sophisticated modification to mergekit/graph.py. It transforms the standard Executor from an "optimistic" runner (assuming tensors fit in VRAM) into a robust, adaptive execution engine designed specifically to survive low-VRAM environments.
Here is a detailed analysis of the changes and how they achieve the goal of running on RTX 3060-class hardware.
Core Strategy: "Fail Gracefully and Chunk"
The original Executor simply moved tensors to the GPU, executed, and moved them back. If VRAM ran out, the process crashed. This modified version implements a three-tier fallback strategy inside _run:
- Tier 1: Standard GPU Execution. Try to run the task normally on the GPU.
- Tier 2: Adaptive Chunking. If Tier 1 throws an OOM (torch.OutOfMemoryError), catch it, clear the cache, and attempt to split the operation into smaller batches (chunks).
- Tier 3: CPU Fallback. If chunking fails (or isn't applicable), fall back to system RAM (CPU), which is much slower but usually has higher capacity.
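In outline, the ladder behaves like the following minimal sketch. This is illustrative only: run_with_fallback and op are placeholder names, and the real patch operates on mergekit Task objects and argument dictionaries rather than a single tensor.

```python
import gc
import torch

def run_with_fallback(op, tensor: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Tier 1: whole tensor on the GPU; Tier 2: row-wise chunks; Tier 3: CPU."""
    # Tier 1: optimistic path -- move everything to the GPU and run once.
    try:
        return op(tensor.to(device)).cpu()
    except torch.OutOfMemoryError:
        gc.collect()
        torch.cuda.empty_cache()

    # Tier 2: retry with progressively smaller dim-0 chunks.
    for chunk_size in (4096, 2048, 1024, 512, 256, 128, 64):
        try:
            parts = [op(tensor[i:i + chunk_size].to(device)).cpu()
                     for i in range(0, tensor.shape[0], chunk_size)]
            return torch.cat(parts, dim=0)
        except torch.OutOfMemoryError:
            gc.collect()
            torch.cuda.empty_cache()

    # Tier 3: last resort -- run the whole operation in system RAM.
    return op(tensor)
```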
Key Code Modifications
1. Windows/NVIDIA Allocator Tuning
if sys.platform == "win32":
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"
Analysis: This is a crucial addition for consumer hardware, particularly on Windows, where PyTorch often suffers from memory fragmentation. Setting max_split_size_mb caps the size of cached blocks the allocator is allowed to split, reducing "fragmentation OOMs" where free memory exists but isn't contiguous.
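One practical caveat, not specific to this patch: the caching allocator reads PYTORCH_CUDA_ALLOC_CONF once, when it initializes, so the variable must be set before the first CUDA allocation in the process. A small illustrative snippet:

```python
import os

# Must be set before any CUDA tensor is created; the caching allocator
# reads this at initialization and ignores later changes in this process.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:32")

import torch  # imported only after the env var is in place

if torch.cuda.is_available():
    _ = torch.zeros(1, device="cuda")  # allocator initializes with the tuned config
```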
2. The _execute_chunked Method
This is a new helper method that implements the logic for breaking a large tensor operation into smaller pieces.
- Logic: It identifies a reference tensor in the arguments, determines the total number of rows (dim 0), and iterates through the data in chunk_size increments.
- Memory Efficiency:
- It slices inputs on the CPU.
- Moves only the current slice to the GPU.
- Executes the task.
- Moves the result immediately back to the CPU.
- Deletes the GPU tensors and clears the cache.
- Result: The peak VRAM usage becomes proportional to chunk_size rather than the full model layer size.
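The same idea in a stripped-down, hypothetical form (single positional tensor, no kwargs or multi-tensor handling; execute_chunked and op are illustrative names, not mergekit's):

```python
import torch

def execute_chunked(op, full_input: torch.Tensor, chunk_size: int,
                    device: str = "cuda") -> torch.Tensor:
    """Apply a row-independent `op` to dim-0 slices of a CPU tensor,
    keeping only one chunk_size-row slice on the GPU at a time."""
    outputs = []
    for start in range(0, full_input.shape[0], chunk_size):
        # Slice on the CPU, then move only the current slice to the GPU.
        chunk = full_input[start:start + chunk_size].to(device)
        result = op(chunk)
        # Bring the result straight back to the CPU and free GPU memory.
        outputs.append(result.cpu())
        del chunk, result
        torch.cuda.empty_cache()
    return torch.cat(outputs, dim=0)
```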
3. The Adaptive Execution Loop (_run)
The _run method has been completely rewritten to handle the fallback logic.
The Heuristic Filter:
is_io_task = task_type in ["LoadTensor", "GatherTensors", "SaveTensor", ...]
want_gpu = is_gpu_execution and (task.uses_accelerator() or not is_io_task)
Analysis: The code explicitly prevents I/O tasks (loading/saving) from clogging up the GPU. PermutedEmbeddings is also excluded, which is smart because embedding tables are massive (often 250MB+) and permuting them is memory-bandwidth bound, not compute bound.
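As a toy illustration of that routing decision (the task names and the uses_accelerator() flag are stand-ins here, not the exact mergekit API):

```python
# Task types treated as I/O- or bandwidth-bound; kept off the GPU by default.
IO_LIKE_TASKS = {"LoadTensor", "GatherTensors", "SaveTensor", "PermutedEmbeddings"}

def pick_device(task_type: str, uses_accelerator: bool, gpu_available: bool) -> str:
    # Mirrors the filter above: compute-heavy tasks go to the GPU,
    # load/save/permute tasks stay on the CPU.
    want_gpu = gpu_available and (uses_accelerator or task_type not in IO_LIKE_TASKS)
    return "cuda" if want_gpu else "cpu"

print(pick_device("LoadTensor", uses_accelerator=False, gpu_available=True))  # cpu
print(pick_device("SlerpTask", uses_accelerator=True, gpu_available=True))    # cuda
```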
The OOM Handler:
except torch.OutOfMemoryError:
    # ... cleanup ...
    chunk_sizes = [4096, 2048, 1024, 512, 256, 128, 64]
    for chunk_size in chunk_sizes:
        try:
            res = self._execute_chunked(task, arguments, chunk_size=chunk_size)
            # ... success ...
            break
Analysis: This is the "magic" that allows 3060s to work. If a layer is too big, it tries progressively smaller chunks until it finds a size that fits in the remaining VRAM.
Aggressive Garbage Collection:
if is_gpu_execution:
    gc.collect()
    if accelerator:
        accelerator.empty_cache()
Analysis: This runs at the end of every iteration of the task execution loop.
- Pros: It ensures VRAM is as clean as possible for the next task.
- Cons: cuda.empty_cache() forces a device synchronization and adds overhead. This makes the merge significantly slower than a standard run, but it trades speed for the ability to run at all.
Potential Risks & Limitations
Assumption of Row-Independence: The _execute_chunked method assumes that the task.execute method operates independently on rows (dimension 0).
- Safe: Linear merges, SLERP (usually), and element-wise operations.
- Unsafe: Operations that require global statistics across the batch dimension (e.g., softmax over dim 0, though rare in weight merging) or matrix multiplications where the split dimension is the reduction dimension.
However, for standard LLM weight merging (which is usually element-wise weighted averaging), this assumption holds.
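A toy example (not mergekit code) makes the distinction concrete: chunking along dim 0 reproduces an element-wise weighted average exactly, but changes the result of a softmax taken over dim 0.

```python
import torch

a, b = torch.randn(8, 4), torch.randn(8, 4)

def weighted_avg(x, y):   # element-wise: row-independent
    return 0.3 * x + 0.7 * y

def col_softmax(x, y):    # softmax over dim 0: needs the whole batch at once
    return torch.softmax(x + y, dim=0)

def chunked(op, x, y, chunk=2):
    # Apply op to dim-0 slices and stitch the results back together.
    return torch.cat([op(x[i:i + chunk], y[i:i + chunk])
                      for i in range(0, x.shape[0], chunk)], dim=0)

print(torch.allclose(weighted_avg(a, b), chunked(weighted_avg, a, b)))  # True
print(torch.allclose(col_softmax(a, b), chunked(col_softmax, a, b)))    # False: each chunk normalizes separately
```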
Performance Overhead: The constant gc.collect() and empty_cache() calls, combined with moving data back and forth between CPU and GPU for every chunk, will result in low GPU utilization. The merge will take longer, but it will complete.
Conclusion
This is a highly effective patch for low-VRAM users. It trades execution speed for memory safety.
- For a 3090/4090 user: This script might be slower than the original due to the aggressive GC.
- For a 3060/3060 Ti user: This script enables functionality that is otherwise impossible (merging 70B models or large 7B merges with --cuda).
The implementation is robust because it doesn't force chunking; it only attempts it when the standard approach fails.