Discussion: naming pattern to converge on to better identify fine-tunes

#761
by ThiloteE - opened

I want to find finetunes of mistral-7b-v0.3, which has a new tokenizer and is said to be better with its 32k context, but the leaderboard is so full of mistral-7b-v0.1 finetunes that it is impossible to find the newer models. The issue is caused by most model authors not following a standardized naming scheme, which renders the leaderboard's search bar useless in this case. Since both models I am looking for have the same parameter count, filtering for this property doesn't work either (even if it worked flawlessly, which it doesn't!). Additionally, sometimes only the architecture is mentioned in the model's config.json file, but not the real name of the base model. There is not even an option to filter by "last added to the leaderboard", which would at least have been a (very unsatisfactory) workaround.

I am a little at a loss. The few possibilities for improvement I can think of are:

  • standardization of base model names within the model name itself. Example: https://github.com/ggerganov/ggml/issues/820.
    <BaseModel>-<Version>-<Model>-<Version>-<ExpertsCount>x<Parameters>-<Quantization>.gguf or <Model>-<Version>-<BaseModel>-<Version>-<ExpertsCount>x<Parameters>-<Quantization>.gguf
  • standardization of base model names in the accompanying configuration files, while allowing arbitrary model names.
  • Having a leaderboard "cleanup crew" or a script that manually adds tags, labels and notes to models whose authors are unresponsive, and that hides models with unsatisfactory model cards and model names from the default view of the leaderboard. Forcefully rename models in documented, exceptional circumstances.

TL;DR: There is no standardized naming scheme, the search feature is insufficient, and model authors fail to provide relevant information. How can one find finetunes of specific base models?

Open LLM Leaderboard org

Hi!

I agree that more standard naming conventions would be great, and I like the pattern you are suggesting in your first bullet!
At the moment, we already apply the third option within the time we have available for this - we don't allow models without model cards, and we manually add tags to the leaderboard's view of a model based on user reports. However, we won't manually manage the naming convention issues of all available models.

For your initial question about how to find fine-tunes of specific base models, I don't have a better solution for you right now.

I'm going to leave the discussion open to gather feedback from other users on which conventions would be interesting to follow and see what we converge on.

I am not particularly good at coding, but at the very least I could create a regex that checks whether a model name deviates from a particular standardized naming scheme.
Instead of checking whether the name is fully correct, it could check whether the name fails to adhere to basic syntax.
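For illustration, a minimal sketch of that negative check (the required tokens, the regexes and the example names are my own assumptions, not an agreed standard):

import re

# Hypothetical "basic syntax" check: flag names that lack a "-v<digits>" version token
# or a "<digits>B" parameter-count token anywhere in the name.
HAS_VERSION = re.compile(r"-v\d+(\.\d+)*", re.IGNORECASE)
HAS_PARAM_COUNT = re.compile(r"\b\d+(\.\d+)?[bB]\b")

def deviates_from_basic_syntax(model_name: str) -> bool:
    """Return True if the name is missing either a version or a parameter-count token."""
    return not (HAS_VERSION.search(model_name) and HAS_PARAM_COUNT.search(model_name))

print(deviates_from_basic_syntax("Mistral-7B-v0.3"))            # False: both tokens present
print(deviates_from_basic_syntax("my-cool-model-final-final"))  # True: flag for review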

I do feel the naming scheme would best follow a progression pattern that tells the story of the model.
<BaseModel>-<Version>-<Model>-<Version>-<ExpertsCount>x<Parameters>-<Methods>-<Quantization>.gguf

Without needing to know the history of LLMs...
I would know what it is based on: <BaseModel>-<Version>
I would know which variant it is: <Model>-<Version>-<ExpertsCount>x<Parameters>
I would know how it was modified: <Methods>-<Quantization>

clefourrier changed discussion title from How to find finetunes with specific basemodel? to Discussion: naming pattern to converge on to better identify fine-tunes

I thought about it some more. Quantization is nice to have, but it is not a strict requirement on this leaderboard; models that are not quantized are evaluated as well.

<BaseModel>-<Version>-<Model>-<Version>-<ExpertsCount>x<Parameters>-<MethodsorVariant>

I have experimented with a regex that would detect said pattern.

If it helps, I've also done a similar experiment. Anyhow, this is the current shape my PR https://github.com/ggerganov/llama.cpp/pull/7499 is taking (FYI, it is mostly inspired by TheBloke's naming scheme, e.g. https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/tree/main )

js regex pattern detection example
#!/usr/bin/env node

// Expected shape: <ModelName>-[<ExpertsCount>x]<Weights>[-<FineTune>][-<Version>]-<Encoding>[-<#####>-of-<#####>].gguf
const ggufRegex = /^(?<model_name>[A-Za-z0-9\s-]+)-(?:(?<experts_count>\d+)x)?(?<model_weights>\d+[A-Za-z]+)(?:-(?<fine_tune>[A-Za-z0-9\s-]+))?(?:-(?<version_string>v\d+(?:\.\d+)*))?-(?<encoding_scheme>[\w_]+)(?:-(?<shard>\d{5})-of-(?<shardTotal>\d{5}))?\.gguf$/;

function parseGGUFFilename(filename) {
  const match = ggufRegex.exec(filename);
  if (!match) {
    return null;
  }
  const { model_name, version_string = null, experts_count = null, model_weights,
          fine_tune = null, encoding_scheme, shard = null, shardTotal = null } = match.groups;
  return {
    modelName: model_name.trim().replace(/-/g, ' '), // dashes within the model name stand in for spaces
    expertsCount: experts_count ? +experts_count : null,
    model_weights,
    fine_tune,
    version: version_string,
    encodingScheme: encoding_scheme,
    shard: shard ? +shard : null,
    shardTotal: shardTotal ? +shardTotal : null,
  };
}

const testCases = [
  {filename: 'Llama-7B-Q4_0.gguf', expected: { modelName: 'Llama', expertsCount: null, model_weights: '7B', fine_tune: null, version: null, encodingScheme: 'Q4_0', shard: null, shardTotal: null }},
  {filename: 'Llama-7B-v1.0-Q4_0.gguf', expected: { modelName: 'Llama', expertsCount: null, model_weights: '7B', fine_tune: null, version: 'v1.0', encodingScheme: 'Q4_0', shard: null, shardTotal: null }},
  {filename: 'GPT-3-175B-v3.0.1-F16.gguf', expected: { modelName: 'GPT 3', expertsCount: null, model_weights: '175B', fine_tune: null, version: 'v3.0.1', encodingScheme: 'F16', shard: null, shardTotal: null }},
  {filename: 'GPT-NeoX-20B-v0.9-Q4_K-00001-of-00010.gguf', expected: { modelName: 'GPT NeoX', expertsCount: null, model_weights: '20B', fine_tune: null, version: 'v0.9', encodingScheme: 'Q4_K', shard: 1, shardTotal: 10 }},
  {filename: 'EleutherAI-13B-v2.1.4-IQ4_XS-00002-of-00005.gguf', expected: { modelName: 'EleutherAI', expertsCount: null, model_weights: '13B', fine_tune: null, version: 'v2.1.4', encodingScheme: 'IQ4_XS', shard: 2, shardTotal: 5 }},
  {filename: 'Llama-7B-Research-v1.0-Q4_0.gguf', expected: { modelName: 'Llama', expertsCount: null, model_weights: '7B', fine_tune: 'Research', version: 'v1.0', encodingScheme: 'Q4_0', shard: null, shardTotal: null }},
  {filename: 'GPT-3-175B-Instruct-v3.0.1-F16.gguf', expected: { modelName: 'GPT 3', expertsCount: null, model_weights: '175B', fine_tune: 'Instruct', version: 'v3.0.1', encodingScheme: 'F16', shard: null, shardTotal: null }},
  {filename: 'not-a-known-arrangement.gguf', expected: null},
];

testCases.forEach(({ filename, expected }) => {
  const result = parseGGUFFilename(filename);
  const passed = JSON.stringify(result) === JSON.stringify(expected);
  console.log(`${filename}: ${passed ? "PASS" : "FAIL"}`);
});

Regarding the question about differentiating between the base and fine-tune model versions, there are two approaches I can think of so far:

  • Two separate version numbers: Mixtral-8x7B-v2.3-Instruct-v1.0-Q2_K.gguf
    • Pros: allows a semver approach on both the base and fine-tune sections
    • Cons: longer filename, and two versions in one string is less obvious
  • One version number, where v<base major>.<base minor>.<finetune>:
    • Mixtral-8x7B-Instruct-v2.3.1-Q2_K.gguf means base model version v2.3 and fine-tune edition 1
    • Pros: only one string to visually track
    • Cons: the version string is less flexible and there are fewer digits for the fine-tune

Have a chat and see what you like, and I'll give it consideration. The second approach of just using one version string would fortunately mean I won't have to do any extra coding.
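To make the second approach concrete, here is a rough sketch of splitting such a combined version string back into its two halves (this is not part of the llama.cpp PR, just an illustration):

import re

def split_combined_version(version: str):
    """'v2.3.1' -> ('v2.3', 1): base model version v2.3, fine-tune edition 1."""
    m = re.fullmatch(r"v(\d+)\.(\d+)\.(\d+)", version)
    if not m:
        return None
    base_major, base_minor, finetune = m.groups()
    return f"v{base_major}.{base_minor}", int(finetune)

print(split_combined_version("v2.3.1"))  # ('v2.3', 1), as in Mixtral-8x7B-Instruct-v2.3.1-Q2_K.gguf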


FYI, I previously added the initial naming standard in https://github.com/ggerganov/ggml/blob/master/docs/gguf.md#gguf-naming-convention but was not happy with the pattern as it stands, as I felt that versions wouldn't be grouped correctly if there was a different parameter mix or model name (or, as I learned, fine-tuning). Hence this new PR to experiment with a better file naming convention, which I found by studying TheBloke's naming approach.

Your JavaScript regex is great!
I don't think users will like having two version numbers merged into one, such as v<base major>.<base minor>.<finetune>. It is confusing.
Either have two version numbers or leave out the base model's version number completely.

Can we not store and then fetch the base model information from within the model file, the README, config.json, or another configuration file? I think it is sometimes mentioned in config.json, but not every time. Anyway, that would require improving the leaderboard's search feature; basically, this is a feature request to add full-text search to the leaderboard.
I am not sure about old models (that would require pull requests and some nagging), but at least new models could be required to be uploaded following the new standard.
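For illustration, a rough sketch of that lookup (this assumes the huggingface_hub ModelCard and hf_hub_download helpers and that the author filled in the optional base_model field in the card metadata, neither of which is guaranteed; the repo id is just an example):

import json
from huggingface_hub import ModelCard, hf_hub_download

def guess_base_model(repo_id: str):
    # Prefer the card metadata (YAML front matter of the README), which may declare base_model.
    card = ModelCard.load(repo_id)
    base = getattr(card.data, "base_model", None)
    if base:
        return base
    # Fall back to config.json, which often only names the architecture, not the true base model.
    config_path = hf_hub_download(repo_id, "config.json")
    with open(config_path) as f:
        config = json.load(f)
    return config.get("_name_or_path") or config.get("architectures")

print(guess_base_model("mistralai/Mistral-7B-v0.3"))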

By the way, here is my Python regex for <BaseModel>-<Version>-<Model>-<Version>-<ExpertsCount>x<Parameters>-<MethodsorVariant>:

  • ^(?!-|_|\d)\w+(?<!\db)-v\d+\.\d+-\w+-v\d+\.\d+(-\d+b-\w+|-\d+x\d+b-\w+-*|-\d+b-\w+-.*|-\w+-\d+b|-\w+-\w+-.*)

It is not yet perfect, but it is not half bad either, as you can see in the test cases.

https://regex101.com/r/yFZD9f/6
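If you want to try it locally, here is a quick script applying the same pattern (the test names are made up, not real repositories):

import re

pattern = re.compile(
    r"^(?!-|_|\d)\w+(?<!\db)-v\d+\.\d+-\w+-v\d+\.\d+"
    r"(-\d+b-\w+|-\d+x\d+b-\w+-*|-\d+b-\w+-.*|-\w+-\d+b|-\w+-\w+-.*)"
)

names = [
    "Mistral-v0.3-Dolphin-v2.9-7b-Instruct",   # base, base version, model, model version, params, variant
    "Mixtral-v0.1-Hermes-v2.5-8x7b-DPO",       # mixture-of-experts style parameter count
    "some-random-model-name",                  # should be rejected
]
for name in names:
    print(name, "->", "matches" if pattern.match(name) else "does not match")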

I don't think we need to worry much about name length if it is merely due to a version section.

I can see how several version numbers would enhance the naming convention while assisting organization (similar to yyyy/mm/dd); the small sorting example after the list below illustrates this.
As long as the naming convention remains progressive, this will only serve to enrich the situation.
Llama-ver-Hermes-ver-Instruct-DPO-ver

Llama-v1.0-Hermes-v1.0-Instruct-DPO-v0.1
Llama-v1.0-Hermes-v1.0-Instruct-DPO-v0.2
Llama-v2.0-Hermes-v1.5-Instruct-DPO-v0.1
Llama-v2.0-Hermes-v2.0-Instruct-DPO-v0.1
Llama-v2.0-Hermes-v2.0-Instruct-DPO-v0.2
Llama-v2.0-Hermes-v2.0-Instruct-SLERP-v0.1
Llama-v3.0-HermesPro-v2.5-Instruct-DPO-v0.1
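A tiny illustration of the yyyy/mm/dd analogy: because each segment runs from most general (base model) to most specific (fine-tune method version), a plain lexicographic sort already groups the lineage together (the names are the hypothetical examples above):

names = [
    "Llama-v2.0-Hermes-v2.0-Instruct-DPO-v0.2",
    "Llama-v1.0-Hermes-v1.0-Instruct-DPO-v0.1",
    "Llama-v3.0-HermesPro-v2.5-Instruct-DPO-v0.1",
    "Llama-v2.0-Hermes-v1.5-Instruct-DPO-v0.1",
]
for name in sorted(names):  # base model first, then fine-tune, then variant version
    print(name)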

Okay, I've updated the PR https://github.com/ggerganov/llama.cpp/pull/7499 further to take the model card as an extra background metadata source (main discussion to go to https://github.com/ggerganov/ggml/issues/820).

As for the base model version, I'm now convinced that we should not include it in the name. Instead, I'm now focusing on putting as much information as is useful for the leaderboard into the gguf KV store. The Hugging Face team previously mentioned that they can easily parse the gguf KV store, so it won't be an issue.

The information that goes into the filename MUST, in my opinion, be only the information directly related to the fine-tuned model itself.

So these are the KV keys that I think may be of interest to the leaderboard (a mix of existing KV names and some new ones marked with +); a rough sketch of reading them follows the list:

general.name
general.basename +
general.finetune +
general.author
general.version
general.base_version +
general.url
general.description
general.license
general.license.name +
general.license.link +
general.source.url
general.source.huggingface.repository
general.file_type
general.parameter_size_class +
general.tags +
general.languages +
general.datasets +
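For what it's worth, a rough sketch of how a leaderboard-side script might read such keys with the gguf Python package that ships with llama.cpp (the value-decoding details are my assumption about the reader's internals and may need adjusting; "model.gguf" is a placeholder path):

from gguf import GGUFReader

KEYS_OF_INTEREST = [
    "general.name", "general.basename", "general.finetune", "general.author",
    "general.version", "general.base_version", "general.license", "general.source.url",
]

reader = GGUFReader("model.gguf")
for key in KEYS_OF_INTEREST:
    field = reader.fields.get(key)
    if field is None:
        continue  # key not present in this file
    # String values live in field.parts; the exact indexing may differ between gguf-py versions.
    raw = field.parts[field.data[0]]
    print(f"{key}: {bytes(raw).decode('utf-8', errors='replace')}")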

By the way, we were also wondering if it would make sense to include a model hash ID (UUID?).

Also, if so, should the model hash be dependent on or independent of the quantization that was applied to it?

  1. We could either just hash the GGUF tensor payload (excluding metadata) straight up (easy to do; a rough sketch follows below)... but any change to quantization will change the hash. This is good if you consider different quantizations to be different models.

  2. Investigate some form of hashing that would survive quantization. This means multiple files converted from the same model would share the same hash. This is proving technically difficult to do, and I'm not sure it matters much to the community if the lineage can be traced anyway via general.source in the KV store.

The benefit I can see in having some form of UUID would be disambiguating specific models by hash on Hugging Face, especially if there are multiple models sharing the same name.
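To make option 1 concrete, here is a rough sketch (again using the gguf package; the .tensors/.name/.data attributes are what I believe the reader exposes, so treat them as assumptions):

import hashlib
from gguf import GGUFReader

def tensor_payload_sha256(path: str) -> str:
    """Hash only the tensor payloads, excluding the KV metadata header."""
    reader = GGUFReader(path)
    h = hashlib.sha256()
    for tensor in reader.tensors:
        h.update(tensor.name.encode("utf-8"))  # include names so a reordering changes the hash
        h.update(tensor.data.tobytes())        # quantized bytes, so requantization changes the hash
    return h.hexdigest()

print(tensor_payload_sha256("model.gguf"))  # placeholder path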

Heads up that https://github.com/ggerganov/llama.cpp/pull/7499 has now been merged and we just need to update the documentation etc...

For an example of how I am using it:

Check out this metadata override file https://huggingface.co/mofosyne/TinyLLama-v0-5M-F16-llamafile/blob/main/maykeye_tinyllama-metadata.json and this bash script https://huggingface.co/mofosyne/TinyLLama-v0-5M-F16-llamafile/blob/main/llamafile-creation.sh

Open LLM Leaderboard org

For the model hash, we use the commit hash on HF as unique id for model versions.

Open LLM Leaderboard org

Thanks @mofosyne , we'll take a look! :)
@alozowski it could interest you re-normalisation of naming

Open LLM Leaderboard org

@Wauplin if enough people start adding such a metadata.json, do you think your lib could allow pulling it?

Open LLM Leaderboard org

Not sure I have the full context here (I haven't followed this thread). Could you point me to the comments describing the metadata.json? And what do you mean by "pulling it"? huggingface_hub.hf_hub_download can download any file from the Hub, so I guess the question is not that :D

Open LLM Leaderboard org

Ha yep forgot about this for a minute, should be good then if people use a consistent name for this file!

TLDR is: it's super hard to identify which models come from which, so users have been thinking about the best way to provide such metadata. Should we use a specific naming system (parent-child-v.21)? A specific metadata file?
In llama.cpp, they are updating the metadata file, notably with base model keys, to make this easier to follow.

Open LLM Leaderboard org

Ok ok, I see. cc @julien-c for viz on naming convention

Well if it makes it easier to follow, we've updated the gguf documentation today for naming conventions

Regarding tracking parents, we noticed that some model cards, like this one, have base_model keys, which will go under general.base_model.*. As for the source we are converting from, that is tracked by general.source.*. That part of the documentation has been updated to add these KV keys to the gguf standard.

Regarding the format of the metadata override file... just a heads up that general.base_models doesn't follow the usual direct key naming arrangement, as the gguf KV store doesn't allow for structs in arrays.
For illustrative purposes, this is how it looks below (also placed in this wiki page):

Metadata Override Mockup (note: non valid json comment added)
{
    // Example Metadata Override Fields
    "general.name"           : "Example Model Six",
    "general.author"         : "John Smith",
    "general.version"        : "v1.0",
    "general.organization"   : "SparkExampleMind",
    "general.quantized_by"   : "Abbety Jenson",
    "general.description"    : "This is an example of a model",
    // Useful for cleanly regenerating default naming conventions
    "general.finetune"       : "instruct",
    "general.basename"       : "llamabase",
    "general.size_label"     : "8x2.3Q",
    // Licensing details
    "general.license"        : "apache-2.0",
    "general.license.name"   : "Apache License Version 2.0, January 2004",
    "general.license.link"   : "https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md",
    // Typically represents the converted GGUF repo (Unless native)
    "general.url"            : "https://huggingface.co/SparkExampleMind/llamabase-8x2.3Q-instruct-v1.0-F16/blob/main/README.md",
    "general.doi"            : "doi:10.1080/02626667.2018.1560449", 
    "general.uuid"           : "f18383df-ceb9-4ef3-b929-77e4dc64787c", 
    "general.repo_url"       : "https://huggingface.co/SparkExampleMind/llamabase-8x2.3Q-instruct-v1.0-F16",
    // Model Source during conversion
    "general.source.url"     : "https://huggingface.co/SparkExampleMind/llamabase-8x2.3Q-instruct-v1.0-safetensor/blob/main/README.md",
    "general.source.doi"     : "doi:10.1080/02626667.2018.1560449", 
    "general.source.uuid"    : "a72998bf-3b84-4ff4-91c6-7a6b780507bc", 
    "general.source.repo_url": "https://huggingface.co/SparkExampleMind/llamabase-8x2.3Q-instruct-v1.0-safetensor",
    // Model Parents (Merges, Pre-tuning, etc...)
    "general.base_models"    : [
        {
            "name" : "base model example" ,
            "author" : "example parent" ,
            "version" : "v3.2" ,
            "organization" : "grandOldMaster" ,
            "url" : "https://huggingface.co/SparkExampleMind/parentlalama-1Q-v1.0-safetensor/blob/main/README.md",
            "doi" : "doi:10.1080/02626667.2018.1560449",
            "uuid" : "52d8c7ef-1de5-43f1-87a4-0c7c9c3d07c4" ,
            "repo_url" : "https://huggingface.co/SparkExampleMind/parentlalama-1Q-v1.0-safetensor"
        }
    ],
    // Array based metadata
    "general.tags": ["text generation", "transformer", "llama", "tiny", "tiny model"],
    "general.languages": ["en"],
    "general.datasets": ["https://huggingface.co/datasets/roneneldan/TinyStories/blob/main/TinyStoriesV2-GPT4-train.txt", "https://huggingface.co/datasets/roneneldan/TinyStories/blob/main/TinyStoriesV2-GPT4-valid.txt"]
}

If anything is annoying you in terms of ergonomics, let me know. I used direct KV key maps to make it as obvious as I could.
