A few questions about the model

#3
by jukofyork - opened

It says on the GitHub repo: "Fine-tuned from Mistral-7B and CodeLLaMA-70B", but the context length and RoPE base frequency are not the 16k/1M used by the base CodeLLaMA-70B:

 "max_position_embeddings": 4096,
 "rope_theta": 10000,

Is this a mistake, or was the fine-tuning performed on the instruction-tuned version of CodeLLaMA-70B, which uses the 4k/10k settings?


Also what is the correct prompt template to use for this model?


I haven't fully read the paper yet, but one interesting observation: in Table 1 you note DPO is a special case of InfoNCA with (K=2, α→0), but you could also think of the (K=2, α>0) case as a weighted logistic regression model with recalibration - known as "Platt Scaling" in the Machine Learning community, but also known by different names in the Discrete Choice community, etc.
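
To make that concrete, here is a rough sketch of what I mean (my own notation, not code from the paper): the K=2 case written as a soft-label logistic loss on the DPO-style implicit reward margin. As α→0 the target hardens to 0/1 and you get DPO; for α>0 it is exactly a weighted logistic regression / Platt-style recalibration.

import torch
import torch.nn.functional as F

def infonca_pair_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, r_w, r_l, beta, alpha):
    """K=2 InfoNCA written as soft-label logistic regression (all args are tensors)."""
    # DPO-style implicit reward margin from the policy/reference log-probabilities.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Soft target from the explicit rewards; as alpha -> 0 this hardens to a 0/1 label (= DPO).
    target = torch.sigmoid((r_w - r_l) / alpha)
    # Soft-label logistic loss == weighted logistic regression / Platt-style recalibration.
    return F.binary_cross_entropy_with_logits(margin, target)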

Also what is the correct prompt template to use for this model?

"chat_template": "{% if messages[0]['role'] == 'system' %}{% set user_index = 1 %}{% else %}{% set user_index = 0 %}{% endif %}{% for message in messages %}{% if (message['role'] == 'user') != ((loop.index0 + user_index) % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if loop.index0 == 0 %}{{ '<s>' }}{% endif %}{% set content = 'Source: ' + message['role'] + '\n\n ' + message['content'].strip() %}{{ content + ' <step> ' }}{% endfor %}{{'Source: assistant\nDestination: user\n\n '}}",

This looks to be the same strange prompt template as used by the instruction-tuned version of CodeLLaMA-70B?
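
For reference, you can see exactly what that template renders to without guessing (repo name assumed here):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openbmb/Eurus-70b-nca")  # repo name assumed
messages = [{"role": "user", "content": "Write Python code to solve the task: ..."}]
print(tok.apply_chat_template(messages, tokenize=False))
# With the template above this prints the CodeLLaMA-70B-Instruct style output:
# "<s>Source: user\n\n <message> <step> Source: assistant\nDestination: user\n\n "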

From the model card:

Coding

[INST] Write Python code to solve the task:
{Instruction} [/INST]

Also noticed:

"torch_dtype": "float32",

Yet the model is only ~280GB --> float16?

and the model card says:

Safetensors
Model size: 69B params
Tensor type: FP16
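
Once the files are downloaded, the actual dtype of the uploaded weights can be checked straight from the safetensors shards (the shard file name below is just an example):

from safetensors import safe_open

# Open one shard and inspect a tensor's dtype (shard name assumed).
with safe_open("model-00001-of-00029.safetensors", framework="pt") as f:
    name = next(iter(f.keys()))
    print(name, f.get_tensor(name).dtype)  # expect torch.float16 if the weights really are fp16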

Has there been some mix-up when copying the models to HF, with the CodeLlama-70b-Instruct-hf files being used instead of the CodeLlama-70b-hf ones?

I've also tried raising this on their GitHub (https://github.com/OpenBMB/Eurus/issues/3).

I'm almost certain now that there has been a mix-up, as I found this in the appendix:

On MMLU, EURUS outperforms baselines dedicated to coding and math, and achieves higher results than Mistral-Instruct-v0.2 and CodeLLaMA-70B-Instruct, the official aligned versions of our base model built by their authors.

I'm downloading the safetensors data now... I'm going to try copying the rest of the files from the original CodeLlama-70b-hf and then editing in the suggested [INST] <prompt> [/INST] chat template to see if it works.
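
In case it's useful to anyone else trying the same thing, this is roughly how I'm swapping in a simple [INST]-style template (the Jinja string is my own guess at a template matching the model card, and the repo/path names are assumed):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openbmb/Eurus-70b-sft")  # repo name assumed
tok.chat_template = (
    "{{ bos_token }}{% for message in messages %}"
    "{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'].strip() + ' [/INST]' }}"
    "{% elif message['role'] == 'assistant' %}{{ ' ' + message['content'].strip() + eos_token }}"
    "{% endif %}{% endfor %}"
)  # note: system messages are not handled in this simple sketch
tok.save_pretrained("./eurus-70b-sft-fixed")  # writes the new template into tokenizer_config.json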

Very interested to see how this performs, as the only other fine-tune of CodeLlama-70b is Phind-70b, which is private; in their release article they praised how good the base CodeLlama-70b was (rather than the completely ruined CodeLlama-70b-Instruct we got given here).

It's possible this could be one of the best coding models currently available, and it would be sad if this mix-up caused it to get lost and go unnoticed... :(

OpenBMB org

Thanks for your interest in Eurus!

It says on the GitHub repo: "Fine-tuned from Mistral-7B and CodeLLaMA-70B", but the context length and RoPE base frequency are not the 16k/1M used by the base CodeLLaMA-70B:

 "max_position_embeddings": 4096,
 "rope_theta": 10000,

Is this a mistake, or was the fine-tuning performed on the instruction-tuned version of CodeLLaMA-70B, which uses the 4k/10k settings?

We fine-tune CodeLLaMA-70B-base with the same 4k/10k settings as CodeLLaMA-70B-Instruct.

Also what is the correct prompt template to use for this model?
"torch_dtype": "float32",

Sorry for the ambiguity. We used the prompt template described in the model card and the dtype is fp16, but we used the config file of CodeLLaMA-70B-Instruct. We will fix that soon.

OpenBMB org

Hi @jukofyork , we have fixed the config file and chat template. I think the model works fine now.

Thanks!

You probably want to delete the added_tokens.json file too:

{
  "<step>": 32015
}

and remove it from generation_config.json:

{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": [
    2,
    32015
  ],
  "transformers_version": "4.35.0"
}

and the tokenizer_config.json file:

...

    "32015": {
      "content": "<step>",
      "lstrip": true,
      "normalized": false,
      "rstrip": true,
      "single_word": true,
      "special": false
    }
...

as that was just required for the (very) strange prompt template used by CodeLLaMA-70B-Instruct.
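
Something like this is all the cleanup I mean (local paths assumed; the folder name is just an example, and the 32015 entry in tokenizer_config.json's added_tokens_decoder can be deleted by hand):

import json, os

repo = "./Eurus-70b-nca"  # local clone, folder name assumed

# Drop the <step> token file entirely.
os.remove(os.path.join(repo, "added_tokens.json"))

# Remove 32015 (<step>) from the EOS list, keeping the normal </s> (id 2).
gen_cfg_path = os.path.join(repo, "generation_config.json")
with open(gen_cfg_path) as f:
    gen_cfg = json.load(f)
gen_cfg["eos_token_id"] = 2
with open(gen_cfg_path, "w") as f:
    json.dump(gen_cfg, f, indent=2)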


I can also confirm both the nca and sft models run fine with:

  "max_position_embeddings": 16384

and:

"rope_theta": 1000000

I've run perplexity tests at 4k, 16k, and even 32k, and all were fine with no sign of the performance dropping off a cliff, as would be expected if you RoPE-scaled an actual 4k/10k model.
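
Roughly speaking the check was just the mean negative log-likelihood over long source files, something like this (reusing the model/tokenizer loaded above; the file name is just a placeholder):

import torch

text = open("long_repo_level_file.cpp").read()
ids = tok(text, return_tensors="pt").input_ids[:, :16384].to(model.device)

with torch.no_grad():
    loss = model(input_ids=ids, labels=ids).loss  # mean per-token NLL over the 16k window
print("perplexity @ 16k:", torch.exp(loss).item())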

I've also tested each model with multi-turn conversations about C++ and they have been completely coherent at 10k+ tokens (using the "rope_theta": 1000000 version).

This is a great sign, as a 4k-context model isn't really useful for programmers, and many of the CodeLLaMA-34B fine-tunes actually destroy the long-context ability of the model during their fine-tuning to target the leaderboards, etc.

I hope more people find and try out these two models!

OpenBMB org

Thank you so much for the test!

Hi again,

I've done even more testing with this model and think it might be worth rerunning your evaluations with rope_theta set back to 1000000:

  • The models suffer from extreme "lazy-GPTness" with rope_theta at 10k and will often try to avoid writing any code by adding "// TODO" comments - probably because they are perceiving a 100:1 "time contraction" in the embeddings (rough numerical sketch after this list)! This will likely have a big negative effect on the benchmark scores...
  • You might have accidentally invented a new (more efficient) method of fine-tuning RoPE-extended long-context models on a shorter context (e.g. pretrain Llama2-70b at 4k/10k, continue pretraining CodeLlama-70b at 16k/1M, fine-tune at 4k/10k, then set the final model back to 16k/1M). One of the biggest hurdles for people here is that they can't afford the extra cost of using 16k+ context for their fine-tuning and/or lack the datasets needed for long-context multi-turn conversations about code.
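
As a rough numerical illustration of the "time contraction" point (standard RoPE frequencies only, nothing model-specific; head_dim=128 as in CodeLlama-70B):

def rope_angle(pos, pair, head_dim=128, theta=10000.0):
    # Standard RoPE rotation angle for dimension pair `pair` at position `pos`:
    # angle = pos / theta ** (2 * pair / head_dim)
    return pos / theta ** (2 * pair / head_dim)

# The same token position gets rotated much faster when theta drops from 1M to 10k,
# especially in the higher (slow-moving) dimension pairs used for long-range ordering.
for theta in (1000000.0, 10000.0):
    print(theta, [round(rope_angle(4096, p, theta=theta), 3) for p in (0, 32, 63)])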

Nearly all the coding models here on HF that have tried to fine-tune a RoPE-extended long-context model like CodeLlama-34b or Deepseek-coder-33b have massively hurt its long-context ability compared to the official instruction-tuned versions. This can easily be tested by giving them a 10-12k token "repository level" file: nearly all the fine-tuned models will either output gibberish or just the EOS token and quit.

Your models and Phind-CodeLlama-34b-v2 (which actually did continued pretraining on 500B extra tokens IIRC) are the only fine-tuned coding models that don't suffer from this (so long as you reset rope_theta back to 1M).

OpenBMB org

Quite interesting, thanks again and we will look into it.
