general.architecture = 'llama' in .gguf metadata

#6
by mattjcly - opened

Hi, I have a question about what I'm seeing in the GGUF metadata (across various GGUF preview tools such as https://github.com/ggerganov/llama.cpp/blob/4e96a812b3ce7322a29a3008db2ed73d9087b176/gguf-py/scripts/gguf-dump.py, https://netron.app/, and LM Studio). It appears that general.architecture = 'llama' and general.name = 'LLaMA v2':

python3 gguf-dump.py Phi-3-mini-4k-instruct-q4.gguf
* Loading: Phi-3-mini-4k-instruct-q4.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.

* Dumping 28 key/value pair(s)
      1: UINT32     |        1 | GGUF.version = 3
      2: UINT64     |        1 | GGUF.tensor_count = 291
      3: UINT64     |        1 | GGUF.kv_count = 25
      4: STRING     |        1 | general.architecture = 'llama'
      5: STRING     |        1 | general.name = 'LLaMA v2'
      6: UINT32     |        1 | llama.vocab_size = 32064
      7: UINT32     |        1 | llama.context_length = 4096
      8: UINT32     |        1 | llama.embedding_length = 3072
      9: UINT32     |        1 | llama.block_count = 32
     10: UINT32     |        1 | llama.feed_forward_length = 8192
     11: UINT32     |        1 | llama.rope.dimension_count = 96
     12: UINT32     |        1 | llama.attention.head_count = 32
     13: UINT32     |        1 | llama.attention.head_count_kv = 32
     14: FLOAT32    |        1 | llama.attention.layer_norm_rms_epsilon = 9.999999747378752e-06
     15: FLOAT32    |        1 | llama.rope.freq_base = 10000.0
     16: UINT32     |        1 | general.file_type = 15
     17: STRING     |        1 | tokenizer.ggml.model = 'llama'
     18: [STRING]   |    32064 | tokenizer.ggml.tokens
     19: [FLOAT32]  |    32064 | tokenizer.ggml.scores
     20: [INT32]    |    32064 | tokenizer.ggml.token_type
     21: UINT32     |        1 | tokenizer.ggml.bos_token_id = 1
     22: UINT32     |        1 | tokenizer.ggml.eos_token_id = 32000
     23: UINT32     |        1 | tokenizer.ggml.unknown_token_id = 0
     24: UINT32     |        1 | tokenizer.ggml.padding_token_id = 32000
     25: BOOL       |        1 | tokenizer.ggml.add_bos_token = True
     26: BOOL       |        1 | tokenizer.ggml.add_eos_token = False
     27: STRING     |        1 | tokenizer.chat_template = "{{ bos_token }}{% for message in messages %}{{'<|' + message"
     28: UINT32     |        1 | general.quantization_version = 2

Is this intentional, or a bug? For comparison, Phi-2 from https://huggingface.co/TheBloke/phi-2-GGUF shows:

      4: STRING     |        1 | general.architecture = 'phi2'
      5: STRING     |        1 | general.name = 'Phi2'
      6: UINT32     |        1 | phi2.context_length = 2048
      7: UINT32     |        1 | phi2.embedding_length = 2560
      8: UINT32     |        1 | phi2.feed_forward_length = 10240
      9: UINT32     |        1 | phi2.block_count = 32
     10: UINT32     |        1 | phi2.attention.head_count = 32
     11: UINT32     |        1 | phi2.attention.head_count_kv = 32
     12: FLOAT32    |        1 | phi2.attention.layer_norm_epsilon = 9.999999747378752e-06
     13: UINT32     |        1 | phi2.rope.dimension_count = 32
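For anyone reproducing this without the dump script, the same keys can also be read programmatically with the gguf-py package. A minimal sketch; the string-decoding detail follows what gguf-dump.py itself does, so treat it as an assumption about the reader's internal field layout:

```python
# pip install gguf  (the gguf-py package that ships with llama.cpp)
from gguf import GGUFReader, GGUFValueType

reader = GGUFReader("Phi-3-mini-4k-instruct-q4.gguf")

for name in ("general.architecture", "general.name"):
    field = reader.fields[name]
    # String values are stored as raw bytes in the last part of the field,
    # mirroring how gguf-dump.py decodes them.
    if field.types[0] == GGUFValueType.STRING:
        value = bytes(field.parts[-1]).decode("utf-8")
        print(f"{name} = {value!r}")
```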
gugarosa (Microsoft org)

That's because we are still waiting for https://github.com/abetlen/llama-cpp-python to add support for Phi-3, so we ended up using the "Llama" conversion script to avoid breaking existing GGUF use cases.
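In other words, because the metadata advertises the 'llama' architecture, existing llama.cpp-based stacks load the file through their stock Llama code path even without dedicated Phi-3 support. A minimal sketch with llama-cpp-python; the model path and prompt are illustrative, not taken from the thread:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Loads via the generic 'llama' architecture path, since that is what
# general.architecture declares for this file.
llm = Llama(model_path="Phi-3-mini-4k-instruct-q4.gguf", n_ctx=4096)

out = llm("Why is the sky blue?", max_tokens=64)
print(out["choices"][0]["text"])
```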

gugarosa changed discussion status to closed
