Discussion Thread - Let's Get To Know the (Allegedly) Leaked "mistral-medium" better!

#4
by cekal - opened

Firstly, I'd like to thank the original author of this repository for the conversion. It works great! From my quick testing, the model definitely outperforms Mixtral-8x7B and other state-of-the-art open-source models (not to mention I had it loaded in only 4-bit during testing!).

However, we should probably acknowledge that while it is great to have a new state-of-the-art model from Mistral (maybe; still unconfirmed as of Jan 30), the leak could do a lot of damage to them, since mistral-medium is their main cash cow.

We'll see. Feel free to share your screenshots, prompts, and other info that could help others. I'll start with the inference code, which you can use to test the model yourself:

# install necessary packages
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q -U datasets scipy ipywidgets matplotlib

# authenticate to hf
from huggingface_hub import interpreter_login
interpreter_login()

# import packages and load the model in 4-bit (remove bnb_config for full precision, or adjust 4-bit ---> 8-bit, 16-bit...)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TextStreamer

base_model_id = "152334H/miqu-1-70b-sf"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,  # the converted miqu-1-70b weights
    quantization_config=bnb_config,  # 4-bit NF4 config defined above
    device_map="auto",
    trust_remote_code=True,
)
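
# (optional) sanity-check how much VRAM the 4-bit weights occupy;
# get_memory_footprint() counts the quantized parameters/buffers, not activations or the KV cache
print(f"model footprint: {base_model.get_memory_footprint() / 1e9:.1f} GB")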

# load llama tokenizer (note: requires authentication)
eval_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf", add_bos_token=True, trust_remote_code=True, token=True)

# stream generated tokens to stdout as they are produced
streamer = TextStreamer(eval_tokenizer)

# prompt the model
eval_prompt = "[INST] Why is the sky blue? [/INST]"
model_input = eval_tokenizer(eval_prompt, return_tensors="pt").to("cuda")

base_model.eval()
with torch.no_grad():
    output_ids = base_model.generate(**model_input, max_new_tokens=4096, repetition_penalty=1.1, do_sample=True, temperature=1.0, streamer=streamer)
    print(eval_tokenizer.decode(output_ids[0], skip_special_tokens=True))
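
If you want multi-turn conversations, here's a small helper for building prompts in the same [INST] ... [/INST] format as eval_prompt above. This is just a sketch: it assumes the leaked model follows the usual Mistral/Llama-2-style instruct template (with </s> closing each assistant turn), which hasn't been confirmed.

# hypothetical helper: fold prior (user, assistant) turns plus a new user message
# into the [INST] ... [/INST] format used for eval_prompt above
def build_prompt(history, user_message):
    prompt = ""
    for user_turn, assistant_turn in history:
        prompt += f"[INST] {user_turn} [/INST] {assistant_turn}</s>"
    prompt += f"[INST] {user_message} [/INST]"
    return prompt

eval_prompt = build_prompt(
    [("Why is the sky blue?", "Mostly because of Rayleigh scattering.")],
    "Explain that like I'm five.",
)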

I'll do some further fine-tuning and get back with updates. @152334H please ensure you have a backup of this repo in case it gets taken down. I've made one just in case.
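
Since peft is already installed above, here's a minimal QLoRA-style sketch for attaching trainable adapters to the 4-bit model. The rank, alpha, and target modules below are illustrative defaults, not a tested recipe:

# attach LoRA adapters to the 4-bit base model (QLoRA-style)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = prepare_model_for_kbit_training(base_model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()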

cekal changed discussion title from Discussion Thread - Let's Get To Know the (Alleged) Leaked "mistral-medium" better! to Discussion Thread - Let's Get To Know the (Allegedly) Leaked "mistral-medium" better!

# inspect the loaded architecture
print(base_model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 8192, padding_idx=0)
    (layers): ModuleList(
      (0-79): 80 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=8192, out_features=8192, bias=False)
          (k_proj): Linear4bit(in_features=8192, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=8192, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=8192, out_features=8192, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=8192, out_features=28672, bias=False)
          (up_proj): Linear4bit(in_features=8192, out_features=28672, bias=False)
          (down_proj): Linear4bit(in_features=28672, out_features=8192, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=8192, out_features=32000, bias=False)
)
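
For reference, the printed shapes give a quick back-of-the-envelope parameter count (8192-dim hidden states, grouped-query attention with 1024-dim k/v projections, 28672-dim MLP, 80 layers, 32000-token vocabulary):

# rough parameter count from the printed module shapes (RMSNorm weights are negligible)
hidden, kv, inter, vocab, layers = 8192, 1024, 28672, 32000, 80

attn = 2 * hidden * hidden + 2 * hidden * kv   # q/o projections + k/v projections
mlp = 3 * hidden * inter                       # gate, up, down projections
embed = 2 * vocab * hidden                     # embed_tokens + lm_head (untied)

print(f"~{(layers * (attn + mlp) + embed) / 1e9:.1f}B parameters")  # roughly 69B, i.e. Llama-2-70B-shaped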

Any suggestions on how a GPU-poor person like myself can run it on a single A100? :-(
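
For what it's worth, the 4-bit NF4 config from the snippet above should already fit on a single 80 GB A100; the quantized weights come out to roughly 35-40 GB. A minimal sketch that pins everything to one GPU, reusing base_model_id and bnb_config from the code above:

# same repo and 4-bit config as above, but pinned to a single GPU;
# device_map={"": 0} keeps every layer on cuda:0 instead of letting accelerate shard or offload
single_gpu_model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map={"": 0},
)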
