
Quantizations of https://huggingface.co/01-ai/Yi-6B-200K

Experiment

Quants ending in "_X" are experimental. They are identical to the normal quants except for their token embedding weights, which are stored as Q8_0 in most cases; for Q6_K and Q8_0 the token embeddings are stored as F16 instead. This change makes the experimental quants larger but should, in theory, improve performance.

List of experimental quants:

  • Q2_K_X
  • Q4_K_M_X
  • Q5_K_M_X
  • Q6_K_X
  • Q8_0_X
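
The embedding rule described above can be restated as a small lookup. This is only an illustrative sketch: the helper name `token_embedding_type` is hypothetical and not part of any quantization tool; it simply encodes the rule as worded in this card.

```python
# Hypothetical helper that restates this card's rule for "_X" quants.
EXPERIMENTAL_QUANTS = ["Q2_K_X", "Q4_K_M_X", "Q5_K_M_X", "Q6_K_X", "Q8_0_X"]


def token_embedding_type(quant_name: str) -> str:
    """Return the token-embedding weight type for an experimental quant."""
    if not quant_name.endswith("_X"):
        raise ValueError(f"{quant_name} is not an experimental quant")
    base = quant_name[:-2]  # drop the "_X" suffix
    # Q6_K and Q8_0 get F16 embeddings; every other base quant gets Q8_0.
    return "F16" if base in ("Q6_K", "Q8_0") else "Q8_0"


for name in EXPERIMENTAL_QUANTS:
    print(f"{name}: token embeddings stored as {token_embedding_type(name)}")
```

The two F16 cases follow from the fact that Q8_0 embeddings would give no benefit over a base quant that is already at or near 8 bits.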

From original readme

Perform inference with Yi chat model
  1. Create a file named quick_start.py and copy the following content to it.

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    model_path = '<your-model-path>'
    
    tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
    
    # Since transformers 4.35.0, GPTQ/AWQ models can be loaded using AutoModelForCausalLM.
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        device_map="auto",
        torch_dtype='auto'
    ).eval()
    
    # Prompt content: "hi"
    messages = [
        {"role": "user", "content": "hi"}
    ]
    
    input_ids = tokenizer.apply_chat_template(conversation=messages, tokenize=True, add_generation_prompt=True, return_tensors='pt')
    # Move the inputs to the model's device instead of hard-coding 'cuda'.
    output_ids = model.generate(input_ids.to(model.device))
    response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
    
    # Model response: "Hello! How can I assist you today?"
    print(response)
    
  2. Run quick_start.py.

    python quick_start.py
    

    Then you can see an output similar to the one below. 🥳

    Hello! How can I assist you today?
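
The decode step in quick_start.py slices `output_ids[0][input_ids.shape[1]:]` because, for causal language models, `generate` returns the prompt tokens followed by the newly generated ones. The sketch below illustrates that slicing with plain Python lists standing in for tensors; the token IDs and the helper name `strip_prompt` are made up for illustration.

```python
# Plain lists stand in for the tensors used in quick_start.py.
def strip_prompt(output_ids, prompt_length):
    """Keep only the tokens generated after the prompt."""
    return output_ids[prompt_length:]


prompt_ids = [101, 7, 42]            # made-up token IDs for the prompt
generated = prompt_ids + [9, 15, 3]  # prompt followed by new tokens
new_tokens = strip_prompt(generated, len(prompt_ids))
print(new_tokens)
```

Without the slice, the decoded response would repeat the prompt text before the model's answer.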
    
Perform inference with Yi base model
  • Yi-34B

    The steps are similar to those in "Perform inference with Yi chat model" above.

    You can use the existing file text_generation.py.

    python demo/text_generation.py --model <your-model-path>
    

    Then you can see an output similar to the one below. 🥳


    Prompt: Let me tell you an interesting story about cat Tom and mouse Jerry,

    Generation: Let me tell you an interesting story about cat Tom and mouse Jerry, which happened in my childhood. My father had a big house with two cats living inside it to kill mice. One day when I was playing at home alone, I found one of the tomcats lying on his back near our kitchen door, looking very much like he wanted something from us but couldn’t get up because there were too many people around him! He kept trying for several minutes before finally giving up...

  • Yi-9B

    Input

    from transformers import AutoModelForCausalLM, AutoTokenizer
    
    MODEL_DIR = "01-ai/Yi-9B"
    model = AutoModelForCausalLM.from_pretrained(MODEL_DIR, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, use_fast=False)
    
    input_text = "# write the quick sort algorithm"
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_length=256)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    

    Output

    # write the quick sort algorithm
    def quick_sort(arr):
        if len(arr) <= 1:
            return arr
        pivot = arr[len(arr) // 2]
        left = [x for x in arr if x < pivot]
        middle = [x for x in arr if x == pivot]
        right = [x for x in arr if x > pivot]
        return quick_sort(left) + middle + quick_sort(right)
    
    # test the quick sort algorithm
    print(quick_sort([3, 6, 8, 10, 1, 2, 1]))
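
Generated code is not guaranteed to be correct, so it can be worth checking it before use. A minimal sketch, assuming the model output shown above is copied verbatim into a script, compares it against Python's built-in `sorted`:

```python
# quick_sort as produced in the sample output above.
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quick_sort(left) + middle + quick_sort(right)


# Compare against the built-in sort on a few inputs, including edge cases.
samples = [[3, 6, 8, 10, 1, 2, 1], [], [5], [2, 2, 2], [9, -1, 0, 4]]
for sample in samples:
    assert quick_sort(sample) == sorted(sample)
print("all checks passed")
```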
    
Format: GGUF
Model size: 6.06B params
Architecture: llama
