from_pretrained() extremely slow

#5 · opened by jmjzz

Hello, I'm trying to load the model using from_pretrained(), but it is extremely slow. My server has 256 GB of CPU memory and 4×A6000 GPUs with 48 GB of VRAM each, so I believe that is enough to load the model.

I use the following standard loading method:

from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("Crystalcareai/GemMoE-Base-Random", trust_remote_code=True)
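For reference, the standard knobs for speeding up from_pretrained on a setup like this are low_cpu_mem_usage=True (load weights lazily instead of materialising a full fp32 copy in RAM) and device_map="auto" (shard the model across the available GPUs via accelerate). This is only a minimal sketch with the same model id; whether it actually helps here depends on the custom code this repo ships:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Crystalcareai/GemMoE-Base-Random",
    torch_dtype=torch.bfloat16,   # half precision halves the copy work and memory
    low_cpu_mem_usage=True,       # avoid building the full fp32 model in CPU RAM first
    device_map="auto",            # spread layers across the available GPUs
    trust_remote_code=True,
)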

You're not wrong - the current merge method/model implementation is unbelievably inefficient. I'm currently working with some others to correct this and we'll have something to announce shortly.

Yes, perhaps use LlamaIndex:

%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-llama-cpp
!pip install llama-index

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import (
    messages_to_prompt,
    completion_to_prompt,
)

# use the /resolve/ URL so the raw GGUF file is downloaded rather than the HTML page
model_url = "https://huggingface.co/LeroyDyer/Mixtral_AI_128k_7b/resolve/main/Mixtral_AI_128k_7b_q8_0.gguf"

llm = LlamaCPP(
    # you can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path=None,
    temperature=0.1,
    max_new_tokens=256,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set to at least 1 to use the GPU
    model_kwargs={"n_gpu_layers": 1},
    # transform inputs into Llama 2 chat format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

prompt = input("Enter your prompt: ")
response = llm.complete(prompt)
print(response.text)
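One follow-up on the config above: with n_gpu_layers set to 1, almost the whole model stays on the CPU, so loading and inference remain slow on a machine with four A6000s. In llama-cpp-python (which LlamaCPP wraps), n_gpu_layers=-1 offloads all layers to the GPU. A minimal sketch reusing the model_url and prompt helpers defined above:

from llama_index.llms.llama_cpp import LlamaCPP

# same setup as above, but offload every layer to the GPU;
# requires llama-cpp-python built with CUDA (or Metal) support
llm_gpu = LlamaCPP(
    model_url=model_url,
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    model_kwargs={"n_gpu_layers": -1},  # -1 = offload all layers
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)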

Or use Flash Attention!

pip install transformers==4.34.0
pip install flash-attn==2.3.1.post1 --no-build-isolation
pip install accelerate==0.23.0

from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers
import torch

model_id = "LeroyDyer/Mixtral_AI_128K_B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    use_flash_attention_2=True,
    device_map="auto",
    trust_remote_code=True,
)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)
prompt = "<|prompter|>What are the main challenges to support a long context for LLM?<|assistant|>"

sequences = pipeline(
    prompt,
    max_new_tokens=400,
    do_sample=False,
    return_full_text=False,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"{seq['generated_text']}")
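Note that use_flash_attention_2 matches the pinned transformers==4.34.0 above; on newer transformers releases (4.36+) that flag has been replaced by the attn_implementation argument. A minimal sketch of the newer spelling, assuming the same model id:

import torch
from transformers import AutoModelForCausalLM

# transformers >= 4.36 style: select the attention backend explicitly
model = AutoModelForCausalLM.from_pretrained(
    "LeroyDyer/Mixtral_AI_128K_B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
    trust_remote_code=True,
)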
