How much memory is needed?

#2
by nlp-zhikai - opened

I used 8 × A100 GPUs and it still reported insufficient GPU memory.

My code:
from transformers import AutoTokenizer, AutoModelForCausalLM
math_path = "./res/deepseek-coder-33b-base-transform/deepseek-coder-33b-base"
tokenizer = AutoTokenizer.from_pretrained(math_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(math_path, trust_remote_code=True, device_map="auto").cuda()
input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

I had the same issue myself on a 24 GB RTX 4090 when running the code from GitHub.
It took me a good few hours of searching, but I eventually found the source code of their Hugging Face Space, and now it runs smoothly on my server. The results with just the 7B version are impressive, both quality- and performance-wise, just like on the website.

The GitHub page should really be updated to reflect the optimised settings, such as torch_dtype=torch.bfloat16 and the model.generate parameters (temperature, etc.).
I've created an issue there: https://github.com/deepseek-ai/DeepSeek-Coder/issues/39
It would also be good to include hardware requirements in the docs for running the 7B and 33B models respectively.
For example, show the hardware needed to run them unoptimised, then the various optimisations that make them fit on more modest setups, together with the perceived quality loss. That would have been really nice to have when I was searching for this. I've also seen several people on various forums (Reddit, etc.) asking why the model is so slow.
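
As a rough rule of thumb (my own back-of-envelope estimate, not an official figure from DeepSeek): the weights alone need roughly parameter count × bytes per parameter, and activations plus the KV cache come on top of that during generation. Something like:

# Back-of-envelope GPU memory for the weights only; activations, KV cache
# and framework overhead come on top of this.
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    return num_params * bytes_per_param / 1024**3

for name, params in [("deepseek-coder-6.7b", 6.7e9), ("deepseek-coder-33b", 33e9)]:
    for dtype, nbytes in [("float32", 4), ("bfloat16", 2), ("int8", 1), ("4-bit", 0.5)]:
        print(f"{name} in {dtype}: ~{weight_memory_gb(params, nbytes):.0f} GB")

So in float32 (what transformers loads into by default unless you pass torch_dtype) the 33B weights alone are over 120 GB, while in bfloat16 they drop to roughly 60 GB, which is why torch_dtype=torch.bfloat16 makes such a difference.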

It is in the Files section, here:
https://huggingface.co/spaces/deepseek-ai/deepseek-coder-7b-instruct/tree/main
And the interesting bit is in app.py, in the generate method:

def generate(
    message: str,
    chat_history: list[tuple[str, str]],
    system_prompt: str,
    max_new_tokens: int = 1024,
    temperature: float = 0.6,
    top_p: float = 0.9,
    top_k: int = 50,
    repetition_penalty: float = 1,
) -> Iterator[str]:
... 
if torch.cuda.is_available():
    model_id = "deepseek-ai/deepseek-coder-6.7b-instruct"
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.use_default_system_prompt = False

    conversation.append({"role": "user", "content": message})

    input_ids = tokenizer.apply_chat_template(conversation, return_tensors="pt")
    if input_ids.shape[1] > MAX_INPUT_TOKEN_LENGTH:
        input_ids = input_ids[:, -MAX_INPUT_TOKEN_LENGTH:]
        gr.Warning(f"Trimmed input from conversation as it was longer than {MAX_INPUT_TOKEN_LENGTH} tokens.")
    input_ids = input_ids.to(model.device)

    streamer = TextIteratorStreamer(tokenizer, timeout=10.0, skip_prompt=True, skip_special_tokens=True)
    generate_kwargs = dict(
        {"input_ids": input_ids},
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        top_p=top_p,
        top_k=top_k,
        num_beams=1,
        # temperature=temperature,
        repetition_penalty=repetition_penalty,
        eos_token_id=32021
    )
    t = Thread(target=model.generate, kwargs=generate_kwargs)
    t.start()

    outputs = []
    for text in streamer:
        outputs.append(text)
        yield "".join(outputs).replace("<|EOT|>","")
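
For reference, here is roughly how those settings would carry over to the 33B base snippet from the original question. This is just a minimal sketch, not an official recipe: the local path is the one from the question, and whether it fits still depends on how many GPUs device_map="auto" can spread the weights over.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "./res/deepseek-coder-33b-base-transform/deepseek-coder-33b-base"  # local path from the question

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # halves the weight memory compared with the float32 default
    device_map="auto",           # shards the weights across all visible GPUs
)
# Note: no .cuda() call here; with device_map="auto" the weights are already placed on the GPUs.

input_text = "#write a quick sort algorithm"
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)  # the original snippet used max_length=128
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The main changes from the snippet in the question are the explicit torch_dtype=torch.bfloat16 and dropping the trailing .cuda(), since device_map="auto" already places the weights.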
