Appreciate the model drop!


But why is it only 4k? It's 2024 man, those are rookie numbers.

Haha.


Very good question. The model's training concluded this June, and we have been fighting for a long time to release a detailed tech report; the release has proven to be difficult.

Meanwhile, a different version of post-training has been conducted, with a focus on multilingual and long-context ability. That model supports 128k and is released at https://huggingface.co/microsoft/Phi-3.5-MoE-instruct : )

@LiyuanLucasLiu would love to try Phi-3.5-MoE-Instruct and Vision locally in llama.cpp, but there has been absolutely zero movement on adding support. The feature request is still open: https://github.com/ggerganov/llama.cpp/issues/9119

@YorkieOH10 I understand. It pains me as well... Meanwhile, you can try the demo at https://huggingface.co/spaces/GRIN-MoE-Demo/GRIN-MoE (not sure how long I can keep it alive).

@LiyuanLucasLiu do you know how to run this efficiently on multiple A100 GPUs? It seems the MoE router is not using the experts efficiently across multiple GPUs, with utilization under 10%. Is there any specific setting for this in Transformers?

@dtanow great question!

  1. With A100-80G GPUs, you should be able to run inference on a single GPU. You may need to install flash-attention-2 and pass `_attn_implementation="flash_attention_2"` (together with the other settings shown below); this also greatly improves performance in the multi-GPU setting. A quick end-to-end generation sketch follows after this list.
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/GRIN-MoE",
    device_map="sequential",          # fill one GPU at a time
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
```
  2. With multiple GPUs, I would recommend converting the weights and serving the model with vLLM instead; it gives you much better throughput. We haven't had a chance to merge the code back into the vLLM repo, but it's not complicated: the only thing you need to change is the router implementation (see the sketch after this list).
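
To expand on point 1, here is a minimal end-to-end generation sketch. The chat-template usage, prompt, and generation settings are my own assumptions for illustration, not an official recipe; check the model card for the recommended prompt format.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load on a single A100-80G with flash-attention-2, as in the snippet above.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/GRIN-MoE",
    device_map="sequential",
    trust_remote_code=True,
    _attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/GRIN-MoE", trust_remote_code=True)

# Assumed chat-style prompt; adjust if the model card specifies otherwise.
messages = [{"role": "user", "content": "Explain mixture-of-experts routing in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```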
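
And to give a sense of the scale of the change in point 2: at inference time the routing boils down to top-k gating, so the piece you would reimplement inside vLLM's MoE layer looks roughly like the sketch below. This is an illustrative stand-in, not the actual GRIN-MoE or vLLM code; the class and argument names are mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Illustrative top-k softmax router: the kind of module you would
    adapt when porting a new MoE variant to a serving engine like vLLM."""

    def __init__(self, hidden_size: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (num_tokens, hidden_size)
        logits = self.gate(hidden_states)                     # (num_tokens, num_experts)
        weights, expert_ids = torch.topk(logits, self.top_k)  # select top-k experts per token
        weights = F.softmax(weights, dim=-1)                  # normalize over selected experts
        return weights, expert_ids                            # used to dispatch tokens to experts
```

As far as I understand, GRIN's sparsemixer tricks only affect gradient estimation during training, so for serving you only need the inference-time dispatch to match.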
