Comparison to the 70B model?
How does this 51B model perform compared to the 70B model?
I'm currently running Bartowski's Q8 quant, and it performs exceptionally well on 4x3090: https://huggingface.co/bartowski/Llama-3.1-Nemotron-70B-Instruct-HF-GGUF
However, I am looking for a Nemotron model that will run at 3-4 tokens per second on a MacBook Pro with an M4 Pro (48GB) or an M4 Max (64GB). Faster would be nice, but 3-4 tokens per second is the minimum I can live with.
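As a rough back-of-the-envelope check: single-stream decode on Apple Silicon is mostly memory-bandwidth-bound, so tokens/sec is roughly memory bandwidth divided by model size in bytes. Here's a minimal sketch, assuming the published bandwidth figures for the M4 Pro (~273 GB/s) and base M4 Max (~410 GB/s) and my guesses at typical GGUF quant sizes for a 51B model; the efficiency factor is also an assumption, not a measured number:

```python
# Rough feasibility check: on Apple Silicon, single-stream decode is
# approximately memory-bandwidth-bound, so tok/s ~= bandwidth / model bytes.
# Bandwidth figures and quant sizes below are assumptions, not benchmarks.

def est_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float,
                       efficiency: float = 0.5) -> float:
    """Theoretical upper bound, scaled by an assumed real-world efficiency."""
    return bandwidth_gb_s / model_size_gb * efficiency

# Assumed memory bandwidths (GB/s): M4 Pro ~273, base M4 Max ~410.
# Assumed GGUF sizes for a 51B model: Q4_K_M ~31 GB, Q5_K_M ~36 GB.
for chip, bw in [("M4 Pro (48GB)", 273), ("M4 Max (64GB)", 410)]:
    for quant, size_gb in [("Q4_K_M", 31), ("Q5_K_M", 36)]:
        print(f"{chip} / {quant}: ~{est_tokens_per_sec(bw, size_gb):.1f} tok/s")
```

If those assumptions are in the right ballpark, a Q4/Q5 quant of a 51B model should land right around my 3-4 tok/s floor on the M4 Pro, with more headroom on the Max.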
So I'm wondering how this model stacks up against the 70B in practice.
For example, the 70B model is verbose and sometimes pushes back when I give it writing instructions (I'm a copywriter). And I'm embarrassed to say that Nemotron was right to challenge the instruction I gave it. The fact that this model challenged me (I consider myself an expert copywriter) and was right... that really made an impression on me.
Nemotron also stays highly engaged in the conversation, which is something I really value.
If any of you Nvidia guys are reading this post, thank you for Nemotron, and I hope you're working on a newer and better version. In my opinion, Nemotron is honestly better at writing than Claude 3 (see the Claude subreddit for a ton of disappointed users complaining about its poor performance), and ChatGPT is literal slop from the gutter when it comes to professional writing.
Thanks!