Quantization made by Richard Erkhov.

llama2-0b-unit-test - GGUF

Model creator: https://huggingface.co/MaxJeblick/
Original model: https://huggingface.co/MaxJeblick/llama2-0b-unit-test/

Name	Quant method	Size
llama2-0b-unit-test.Q2_K.gguf	Q2_K	0.0GB
llama2-0b-unit-test.IQ3_XS.gguf	IQ3_XS	0.0GB
llama2-0b-unit-test.IQ3_S.gguf	IQ3_S	0.0GB
llama2-0b-unit-test.Q3_K_S.gguf	Q3_K_S	0.0GB
llama2-0b-unit-test.IQ3_M.gguf	IQ3_M	0.0GB
llama2-0b-unit-test.Q3_K.gguf	Q3_K	0.0GB
llama2-0b-unit-test.Q3_K_M.gguf	Q3_K_M	0.0GB
llama2-0b-unit-test.Q3_K_L.gguf	Q3_K_L	0.0GB
llama2-0b-unit-test.IQ4_XS.gguf	IQ4_XS	0.0GB
llama2-0b-unit-test.Q4_0.gguf	Q4_0	0.0GB
llama2-0b-unit-test.IQ4_NL.gguf	IQ4_NL	0.0GB
llama2-0b-unit-test.Q4_K_S.gguf	Q4_K_S	0.0GB
llama2-0b-unit-test.Q4_K.gguf	Q4_K	0.0GB
llama2-0b-unit-test.Q4_K_M.gguf	Q4_K_M	0.0GB
llama2-0b-unit-test.Q4_1.gguf	Q4_1	0.0GB
llama2-0b-unit-test.Q5_0.gguf	Q5_0	0.0GB
llama2-0b-unit-test.Q5_K_S.gguf	Q5_K_S	0.0GB
llama2-0b-unit-test.Q5_K.gguf	Q5_K	0.0GB
llama2-0b-unit-test.Q5_K_M.gguf	Q5_K_M	0.0GB
llama2-0b-unit-test.Q5_1.gguf	Q5_1	0.0GB
llama2-0b-unit-test.Q6_K.gguf	Q6_K	0.0GB
llama2-0b-unit-test.Q8_0.gguf	Q8_0	0.0GB

Original model description:

Small dummy Llama2-type model usable for unit and integration tests. It is suitable for CPU-only machines; see H2O LLM Studio for an example integration test.

Model was created as follows:

from transformers import AutoConfig, AutoTokenizer, AutoModelForCausalLM

repo_name = "MaxJeblick/llama2-0b-unit-test"
model_name = "h2oai/h2ogpt-4096-llama2-7b-chat"
config = AutoConfig.from_pretrained(model_name)
config.hidden_size = 12
config.max_position_embeddings = 1024
config.intermediate_size = 24
config.num_attention_heads = 2
config.num_hidden_layers = 2
config.num_key_value_heads = 2

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_config(config)
print(model.num_parameters())  # 770_940

model.push_to_hub(repo_name, private=False)
tokenizer.push_to_hub(repo_name, private=False)
config.push_to_hub(repo_name, private=False)
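The parameter count printed above can be checked by hand from the config values. The following is a minimal sketch, not part of the original model card: the helper `llama_param_count` is a hypothetical name, and it assumes a bias-free Llama-style decoder, untied input and output embeddings, and a vocabulary of 32,000 tokens (the vocabulary size is not stated in the card and is an assumption here).

```python
def llama_param_count(vocab_size, hidden_size, intermediate_size,
                      num_hidden_layers, num_attention_heads,
                      num_key_value_heads, tie_word_embeddings=False):
    """Count trainable parameters of a bias-free Llama-style decoder."""
    head_dim = hidden_size // num_attention_heads
    kv_dim = head_dim * num_key_value_heads

    # Attention: q and o projections are hidden x hidden,
    # k and v projections are hidden x kv_dim.
    attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
    # Gated MLP: gate, up, and down projections.
    mlp = 3 * hidden_size * intermediate_size
    # Two RMSNorm weight vectors per layer.
    norms = 2 * hidden_size
    per_layer = attention + mlp + norms

    embeddings = vocab_size * hidden_size
    lm_head = 0 if tie_word_embeddings else vocab_size * hidden_size
    final_norm = hidden_size

    return embeddings + num_hidden_layers * per_layer + final_norm + lm_head


total = llama_param_count(
    vocab_size=32000,  # assumption: not stated in the model card
    hidden_size=12,
    intermediate_size=24,
    num_hidden_layers=2,
    num_attention_heads=2,
    num_key_value_heads=2,
)
print(total)  # 770940
```

Under these assumptions almost all of the parameters sit in the two embedding matrices (2 x 384,000), while the two transformer layers contribute only 2,928, which is why the model is cheap to run on CPU.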

Below is a small example that runs in about 1 second.

import torch
from transformers import AutoModelForCausalLM


def test_manual_greedy_generate():
    max_new_tokens = 10

    # note this is on CPU!
    model = AutoModelForCausalLM.from_pretrained("MaxJeblick/llama2-0b-unit-test").eval()
    input_ids = model.dummy_inputs["input_ids"]

    y = model.generate(input_ids, max_new_tokens=max_new_tokens)

    assert y.shape == (3, input_ids.shape[1] + max_new_tokens)

    for _ in range(max_new_tokens):
        with torch.no_grad():
            outputs = model(input_ids)

        next_token_logits = outputs.logits[:, -1, :]
        next_token_id = torch.argmax(next_token_logits, dim=-1).unsqueeze(-1)

        input_ids = torch.cat([input_ids, next_token_id], dim=-1)

    assert torch.allclose(y, input_ids)

Tip:

Use fixtures with session scope to load the model only once. This will decrease test runtime further.

import pytest
from transformers import AutoModelForCausalLM


@pytest.fixture(scope="session")
def model():
    # Loaded once per test session and shared by every test that requests it.
    return AutoModelForCausalLM.from_pretrained("MaxJeblick/llama2-0b-unit-test").eval()
