How to run Qwen/Qwen2.5-Omni-7B model on Mac?

#30
by CHSFM - opened

Hello everyone,

I'm trying to run the Qwen/Qwen2.5-Omni-7B model on my Mac device but I'm not sure about the best approach. Could someone please provide some guidance on:

- The recommended software/tools for running this model on macOS
- Any specific settings or configurations needed for optimal performance
- Whether Apple Silicon (M1/M2/M3/M4) is supported and if there are any special considerations
- Approximate memory requirements and performance expectations

Any help or pointers to relevant resources would be greatly appreciated. Thank you!

I ran it by using the Cursor AI agent to stand it up, but honestly I have no idea how I did it beyond prompting in natural language. That said, I wasn't able to configure the voice; it only had 2 by default. I have an M1 Max with 64 GB of RAM and only got it to respond with voice after about 2500 ms, which is too long for my current project.
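For what it's worth, two voices seems to be all the 7B checkpoint offers (Chelsie and Ethan, going by the model card). If you are on the preview Transformers build, voice selection is, as far as I can tell, exposed through a speaker argument on generate(); the class names, the speaker argument and the system-prompt requirement in the sketch below are assumptions taken from my reading of the model card, so please verify them there.

# Minimal sketch based on my reading of the Qwen2.5-Omni model card for the
# preview transformers build; the class names, the `speaker` argument and the
# exact system prompt are assumptions to be checked against the card.
import soundfile as sf
from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor

model = Qwen2_5OmniModel.from_pretrained(
    "Qwen/Qwen2.5-Omni-7B",
    torch_dtype="auto",
    device_map="auto",  # on Apple Silicon this should resolve to the MPS device
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")

conversation = [
    # The model card specifies an exact system prompt that is required for
    # speech output; copy it verbatim from the card into this message.
    {"role": "system", "content": [{"type": "text", "text": "<system prompt from the model card>"}]},
    {"role": "user", "content": [{"type": "text", "text": "Introduce yourself in one sentence."}]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = processor(text=text, return_tensors="pt", padding=True).to(model.device)

# The 7B checkpoint ships exactly two voices, "Chelsie" and "Ethan".
text_ids, audio = model.generate(**inputs, speaker="Ethan")
print(processor.batch_decode(text_ids, skip_special_tokens=True)[0])
sf.write("reply.wav", audio.reshape(-1).detach().cpu().numpy(), samplerate=24000)

Selecting a voice will not help the latency, though; as I understand the architecture, speech is produced by a separate talker module on top of the text pass, so some extra delay on an M1 Max is probably unavoidable.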

I can just about get everything to run on an M2 Max 32GB by doing one medium at a time and not running anything else at all. On an M4 Max 128GB it is a breeze.
You have to pip uninstall transformers as in the docs and then follow the sequence given there to install a custom build of transformers, because HF don't yet have it (as of 2025-04-04) in their main library; the stock pip-installed transformers will throw a "cannot load model Qwen2_5Omni_Model" (or similar) error. A quick way to check which build you are on is sketched below.
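In case it helps anyone following the same sequence, here is the check I mean; the class names and the install lines in the comments reflect my reading of the model card at the time, and the exact branch/commit is deliberately left as a placeholder to be copied from the card:

# The Qwen2.5-Omni classes only exist in the custom transformers build that the
# model card points to, roughly:
#   pip uninstall transformers
#   pip install git+https://github.com/huggingface/transformers@<ref-from-model-card>
#   pip install accelerate qwen-omni-utils
# (The <ref-from-model-card> placeholder is deliberate: take it from the card.)
try:
    from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor  # noqa: F401
    print("Preview build detected: Qwen2.5-Omni classes import cleanly.")
except ImportError as err:
    print(f"Stock transformers detected; install the preview build first ({err}).")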
Text conversations work well even with limited hardware.
Image description is very good.
Audio transcription is a bit 'iffy' because of memory requirements.
Video can be described, but only with very short videos and reduced resolution/frame count.
To use it seriously requires more than 32GB, but you should be OK with 64GB.

@pudepiedj did you change any parameters? I still run out of memory with 64GB on a video of about 200 frames.

Yes, I reduced the frame-count and the frame-size (to make any video run on my 32GB M2 Max). 200 frames requires enormous resources if they are, for example, 1008x560. The docs suggest you need something like 93GB for a 15s video.
Try down-sizing by sampling fewer frames and reducing the resolution.
This (below) is more than a bit of a hack, but it works for me when I enter 4 at each of the three prompts:

if videos and len(videos) > 0:
    # Original tensor, e.g. 200 frames of 1008x560 RGB
    original_video = videos[0]  # shape [200, 3, 1008, 560]
    print(f"Original video shape: {original_video.shape}")

    # Ask for integer reduction factors for frames, height and width (default 1 = no reduction)
    frame_factor = int(input(f"Reduce frame count ({original_video.shape[0]}) by factor: ") or 1)
    height_factor = int(input(f"Reduce height ({original_video.shape[2]}) by factor: ") or 1)
    width_factor = int(input(f"Reduce width ({original_video.shape[3]}) by factor: ") or 1)

    # Simple downsampling by striding: keep every Nth frame, row and column
    reduced_video = original_video[::frame_factor, :, ::height_factor, ::width_factor]
    # With factors 4, 4, 4 this gives shape [50, 3, 252, 140]

    # Replace the original tensor in the list
    videos[0] = reduced_video

    print(f"Reduced video shape: {reduced_video.shape}")

Hope this helps. Basically the problem is that these multimedia models are resource-hungry, so running on the cloud is often a better option than trying to run them locally unless you want to do something specific like fine-tuning.
