Running in llama.cpp

#1
by yuuko-eth - opened

The official llama.cpp doesn't support this model architecture yet. Please clone a separate fork https://github.com/yuuko-eth/llama.cpp/tree/mtmd-grounders to your desired location, and build for your platform. Inspect the code for accuracy if you may. Then, for example, CUDA with cmake:

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

You can then use the build the built binaries:

# multimodal CLI
llama-mtmd-cli \
    -m LocateAnything-3B-Q4_K_M.gguf \
    --mmproj mmproj-LocateAnything-3B-BF16.gguf \
    --image screenshot.jpg \
    -p "Locate the Apple logo." \
    -ngl 99
# -> <ref>Apple logo</ref><box><12><1><25><22></box>

# server mode
llama-server \
    -m LocateAnything-3B-Q4_K_M.gguf \
    --mmproj mmproj-LocateAnything-3B-BF16.gguf \
    -ngl 99 --special                       # <-- required for grounding tokens

Make sure you pass --special to server mode so that the coordinates tokens are emitted. Have fun testing!

Please contribute to the discussion at #24020 if you have great ideas on PBD implementation or suggestions for fixes, thank you!

Sign up or log in to comment