[Bug] Does not work

#3
by catid - opened

I used the example script with the latest pytorch, einops, and transformers, but it does not work:

Traceback (most recent call last):
  File "/home/catid/sources/supercharger/test_falcon_basic.py", line 8, in <module>
    pipeline = transformers.pipeline(
  File "/home/catid/mambaforge/envs/supercharger/lib/python3.10/site-packages/transformers/pipelines/__init__.py", line 788, in pipeline
    framework, model = infer_framework_load_model(
  File "/home/catid/mambaforge/envs/supercharger/lib/python3.10/site-packages/transformers/pipelines/base.py", line 278, in infer_framework_load_model
    raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
ValueError: Could not load model tiiuae/falcon-40b with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>,).

Possibly related, I get "The model 'RWForCausalLM' is not supported for text-generation".

I do see that this warning pops up on 7b, which then goes on to work fine, so it might be a misleading warning here; just thought I'd share it.

The model doesn't work. I get the same error on 40B:

ValueError: Could not load model tiiuae/falcon-40b-instruct with any of the following classes: (<class
'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class
'transformers.models.auto.modeling_tf_auto.TFAutoModelForCausalLM'>).

Oh good, I thought I was just doing something dumb.

I am able to run the model on my end, but the answer just keeps going and does not end. It is also pretty slow in streaming the response. Running on 4x A10G (96 GB).

model = AutoModelForCausalLM.from_pretrained(mname,trust_remote_code=True, torch_dtype=torch.bfloat16, device_map='auto')

Loading like this, I'm getting an error after one answer:
RuntimeError: The size of tensor a (9) must match the size of tensor b (488) at non-singleton dimension 1
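
A hedged aside on the runaway generation (separate from the size-mismatch error): capping max_new_tokens and passing an explicit EOS/pad token usually bounds the output. A minimal sketch, assuming mname points at whichever Falcon checkpoint you are loading and that there is enough memory for it; the prompt is only illustrative:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

mname = "tiiuae/falcon-40b"  # assumption: same checkpoint as above
tokenizer = AutoTokenizer.from_pretrained(mname)
model = AutoModelForCausalLM.from_pretrained(
    mname, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Daniel: Hello, Girafatron!\nGirafatron:", return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=200,                   # hard cap so generation cannot run on forever
    eos_token_id=tokenizer.eos_token_id,  # stop at end-of-sequence
    pad_token_id=tokenizer.eos_token_id,  # Falcon has no pad token; reuse EOS
)
print(tokenizer.decode(output[0], skip_special_tokens=True))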

Have we solved the problem?

Facing the same issue; how do I solve it?

ValueError: Could not load model tiiuae/falcon-7b with any of the following classes: (<class
'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>,).

When using this code:

https://huggingface.co/tiiuae/falcon-40b

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = "tiiuae/falcon-40b"  # Optionally use the smaller sibling: tiiuae/falcon-7b

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

I get this output:

Downloading (…)okenizer_config.json: 100% 175/175 [00:00<00:00, 6.61kB/s]
Downloading (…)/main/tokenizer.json: 100% 2.73M/2.73M [00:00<00:00, 5.61MB/s]
Downloading (…)cial_tokens_map.json: 100% 281/281 [00:00<00:00, 1.34kB/s]
Downloading (…)lve/main/config.json: 100% 656/656 [00:00<00:00, 947B/s]
Downloading (…)/configuration_RW.py: 100% 2.51k/2.51k [00:00<00:00, 3.46kB/s]
A new version of the following files was downloaded from https://huggingface.co/tiiuae/falcon-40b:
- configuration_RW.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Downloading (…)main/modelling_RW.py: 100% 47.1k/47.1k [00:00<00:00, 108kB/s]
A new version of the following files was downloaded from https://huggingface.co/tiiuae/falcon-40b:
- modelling_RW.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Downloading (…)model.bin.index.json: 100% 39.3k/39.3k [00:00<00:00, 697kB/s]
Downloading shards: 67% 6/9 [05:54<02:52, 57.54s/it]
Downloading (…)l-00001-of-00009.bin: 100% 9.50G/9.50G [00:46<00:00, 258MB/s]
Downloading (…)l-00002-of-00009.bin: 100% 9.51G/9.51G [01:14<00:00, 257MB/s]
Downloading (…)l-00003-of-00009.bin: 100% 9.51G/9.51G [00:50<00:00, 262MB/s]
Downloading (…)l-00004-of-00009.bin: 100% 9.51G/9.51G [00:55<00:00, 246MB/s]
Downloading (…)l-00005-of-00009.bin: 100% 9.51G/9.51G [00:57<00:00, 224MB/s]
Downloading (…)l-00006-of-00009.bin: 100% 9.51G/9.51G [00:58<00:00, 170MB/s]
Downloading (…)l-00007-of-00009.bin: 18% 1.74G/9.51G [00:12<00:44, 174MB/s]

Traceback (most recent call last):
  File "<cell line: 13>", line 13, in <module>
  File "/usr/local/lib/python3.10/dist-packages/transformers/pipelines/__init__.py", line 788, in pipeline
    framework, model = infer_framework_load_model(
  File "/usr/local/lib/python3.10/dist-packages/transformers/pipelines/base.py", line 279, in infer_framework_load_model
    raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
ValueError: Could not load model tiiuae/falcon-40b with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.auto.modeling_tf_auto.TFAutoModelForCausalLM'>).


I got the same bug in google colab. Switched to using GPU and then it worked fine.

Now I see this using the V100 GPU on Colab:

The model 'RWForCausalLM' is not supported for text-generation. Supported models are ['BartForCausalLM', 'BertLMHeadModel', 'BertGenerationDecoder', 'BigBirdForCausalLM', 'BigBirdPegasusForCausalLM', 'BioGptForCausalLM', 'BlenderbotForCausalLM', 'BlenderbotSmallForCausalLM', 'BloomForCausalLM', 'CamembertForCausalLM', 'CodeGenForCausalLM', 'CpmAntForCausalLM', 'CTRLLMHeadModel', 'Data2VecTextForCausalLM', 'ElectraForCausalLM', 'ErnieForCausalLM', 'GitForCausalLM', 'GPT2LMHeadModel', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTNeoForCausalLM', 'GPTNeoXForCausalLM', 'GPTNeoXJapaneseForCausalLM', 'GPTJForCausalLM', 'LlamaForCausalLM', 'MarianForCausalLM', 'MBartForCausalLM', 'MegaForCausalLM', 'MegatronBertForCausalLM', 'MvpForCausalLM', 'OpenLlamaForCausalLM', 'OpenAIGPTLMHeadModel', 'OPTForCausalLM', 'PegasusForCausalLM', 'PLBartForCausalLM', 'ProphetNetForCausalLM', 'QDQBertLMHeadModel', 'ReformerModelWithLMHead', 'RemBertForCausalLM', 'RobertaForCausalLM', 'RobertaPreLayerNormForCausalLM', 'RoCBertForCausalLM', 'RoFormerForCausalLM', 'RwkvForCausalLM', 'Speech2Text2ForCausalLM', 'TransfoXLLMHeadModel', 'TrOCRForCausalLM', 'XGLMForCausalLM', 'XLMWithLMHeadModel', 'XLMProphetNetForCausalLM', 'XLMRobertaForCausalLM', 'XLMRobertaXLForCausalLM', 'XLNetLMHeadModel', 'XmodForCausalLM'].
Setting pad_token_id to eos_token_id:11 for open-end generation.

The same problem (ValueError: Could not load model tiiuae/falcon-40b with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.auto.modeling_tf_auto.TFAutoModelForCausalLM'>)) happens as well when running with a downloaded model, using the code below:

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = "X:\\ai\\falcon-40b"

tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    offload_folder="N:\\AI\\offload_folder",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
sequences = pipeline(
   "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

The following code gets it to run, but it never (5+ min) outputs a result (on a 3090 with 24 GB):


model = "X:\\ai\\falcon-40b"
rrmodel = AutoModelForCausalLM.from_pretrained(model, 
    offload_folder="N:\AI\offload_folder",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",)

tokenizer = AutoTokenizer.from_pretrained(model)

# Define the input text
input_text = "What is girrafe?"
input_ids = tokenizer.encode(input_text, return_tensors='pt').to(device)

attention_mask = torch.ones(input_ids.shape).to(device)

# Generate text
output = rrmodel.generate(input_ids,      
            attention_mask=attention_mask,
            max_length=200,
            do_sample=True,
            top_k=10,
            pad_token_id=tokenizer.pad_token_id,
            num_return_sequences=1,
            eos_token_id=tokenizer.eos_token_id,)

# Decode the output
output_text = tokenizer.decode(output[0], skip_special_tokens=True)

print(output_text)

Now it is taking up my entire space on my Colab Pro tier. Does anyone know how much space it is supposed to take? And how long does it take to run on Colab Pro with a V100?


It will take 80 GB of VRAM or so, plus some extra for overhead (it doesn't fit in a single A100, with 80 GB memory). But you can try https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ which should take only 20 GB of VRAM.
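
If you try that GPTQ checkpoint, the usual route at the time was the auto-gptq package; the sketch below is an assumption on my part rather than the model card's exact snippet, so double-check the card for the recommended arguments and versions:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/falcon-40b-instruct-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the pre-quantized weights; roughly 20 GB of VRAM per the comment above
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device_map="auto",
    trust_remote_code=True,  # Falcon still ships custom modelling code
    use_safetensors=True,
)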

@maccam912 how much space does it take on the colab?


I got the same bug in google colab. Switched to using GPU and then it worked fine.

I am running this on Google Colab (free version).
When I switch to GPU (T4 GPU runtime in Colab) I still get this error.
I also tried switching to TPU on Colab (which is possible because of the use of the accelerate lib!), and I still get the same error.

Now it is taking up my entire space on my Colab Pro tier. Does anyone know how much space it is supposed to take? And how long does it take to run on Colab Pro with a V100?

It will take 80 GB of VRAM or so, plus some extra for overhead (it doesn't fit in a single A100, with 80 GB memory). But you can try https://huggingface.co/TheBloke/falcon-40b-instruct-GPTQ which should take only 20 GB of VRAM.

@leonlahoud @maccam912 Falcon-40B doesn't load on 1x NVIDIA H100 GPU with 80 GB VRAM; Falcon-7B works, though I don't like the answer repetition.

I also have issues running Falcon-40B on 1x H100 GPU with 80 GB of VRAM using 8-bit quantization. It fails with "Exception: cublasLt ran into an error!". I tried with everything built for both CUDA 11.8 and CUDA 12.1, and it still fails, even though bitsandbytes says everything is OK.

Same error at my end also.

 % python falcon-demo.py
Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Downloading (…)l-00007-of-00009.bin: 100%|██████████| 9.51G/9.51G [20:08<00:00, 7.87MB/s]
Downloading (…)l-00008-of-00009.bin: 100%|██████████| 9.51G/9.51G [18:28<00:00, 8.58MB/s]
Downloading (…)l-00009-of-00009.bin: 100%|██████████| 7.58G/7.58G [18:49<00:00, 6.71MB/s]
Downloading shards: 100%|██████████| 9/9 [57:32<00:00, 383.65s/it]
Traceback (most recent call last):
  File "/LLM/falcon-demo.py", line 10, in <module>
    pipeline = transformers.pipeline(
  File "/Users/sumitagrawal/opt/anaconda3/lib/python3.9/site-packages/transformers/pipelines/__init__.py", line 779, in pipeline
    framework, model = infer_framework_load_model(
  File "/Users/sumitagrawal/opt/anaconda3/lib/python3.9/site-packages/transformers/pipelines/base.py", line 271, in infer_framework_load_model
    raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
ValueError: Could not load model tiiuae/falcon-40b-instruct with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>,).

Reran it, this is the result again.

% python falcon-demo.py
Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a configuration with custom code to ensure no malicious code has been contributed in a newer revision.
Explicitly passing a `revision` is encouraged when loading a model with custom code to ensure no malicious code has been contributed in a newer revision.
Traceback (most recent call last):
  File "/LLM/falcon-demo.py", line 10, in <module>
    pipeline = transformers.pipeline(
  File "/opt/anaconda3/lib/python3.9/site-packages/transformers/pipelines/__init__.py", line 779, in pipeline
    framework, model = infer_framework_load_model(
  File "/opt/anaconda3/lib/python3.9/site-packages/transformers/pipelines/base.py", line 271, in infer_framework_load_model
    raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
ValueError: Could not load model tiiuae/falcon-40b-instruct with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>,).

I can't run this on my machine because I don't have the hardware, but I was able to get past the above errors by adjusting the code as follows, specifically the model = line:

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-40b", trust_remote_code=True)

# The pipeline below still needs a tokenizer; load it from the same repo id
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b")

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
sequences = pipeline(
   "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Does anyone know how to move the .cache file generated by the model data download from C drive to another drive? My C drive doesn't have that much space =)

Possibly by adding this to your python script that runs the transformer

cache_dir = "/path/to/new/cache_directory"
CacheConfig().cache_dir = cache_dir


I tried doing this but it doesn't seem to work

Technology Innovation Institute org

@Kitachan You should be able to set environment variables HUGGINGFACE_HUB_CACHE or HF_HOME to where you want the cache to be.

See https://huggingface.co/docs/huggingface_hub/guides/manage-cache
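
For example (a minimal sketch; the target path is a placeholder for any drive with enough space, and the variable should be set before transformers/huggingface_hub are imported so it is picked up):

import os
os.environ["HF_HOME"] = "D:/hf-cache"  # hypothetical directory on a larger drive
# or: os.environ["HUGGINGFACE_HUB_CACHE"] = "D:/hf-cache/hub"

from transformers import AutoTokenizer  # import only after setting the variable
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b")  # files land in the new cache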


This did work for my problem, thanks

I am having the same issue for Falcon 40b instruct. See the error below. Any solution?

ValueError Traceback (most recent call last)
Cell In[2], line 8
5 model = "tiiuae/falcon-40b-instruct"
7 tokenizer = AutoTokenizer.from_pretrained(model)
----> 8 pipeline = transformers.pipeline(
9 "text-generation",
10 model=model,
11 tokenizer=tokenizer,
12 torch_dtype=torch.bfloat16,
13 trust_remote_code=True,
14 device_map="auto",
15 )

File ~/anaconda3/envs/Falcon/lib/python3.10/site-packages/transformers/pipelines/__init__.py:788, in pipeline(task, model, config, tokenizer, feature_extractor, image_processor, framework, revision, use_fast, use_auth_token, device, device_map, torch_dtype, trust_remote_code, model_kwargs, pipeline_class, **kwargs)
784 # Infer the framework from the model
785 # Forced if framework already defined, inferred if it's None
786 # Will load the correct model if possible
787 model_classes = {"tf": targeted_task["tf"], "pt": targeted_task["pt"]}
--> 788 framework, model = infer_framework_load_model(
789 model,
790 model_classes=model_classes,
791 config=config,
792 framework=framework,
793 task=task,
794 **hub_kwargs,
795 **model_kwargs,
796 )
798 model_config = model.config
799 hub_kwargs["_commit_hash"] = model.config._commit_hash

File ~/anaconda3/envs/Falcon/lib/python3.10/site-packages/transformers/pipelines/base.py:279, in infer_framework_load_model(model, config, model_classes, task, framework, **model_kwargs)
276 continue
278 if isinstance(model, str):
--> 279 raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
281 framework = "tf" if "keras.engine.training.Model" in str(inspect.getmro(model.__class__)) else "pt"
282 return framework, model

ValueError: Could not load model tiiuae/falcon-40b-instruct with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>,).

Hmm, I was wondering how much memory I needed to run it, and my computer didn't seem to have enough RAM =( Maybe I should try Google Colab, but it feels cumbersome to have to re-download the model every time.

I am on an M2 Max chip, 12-core CPU, 38-core GPU, 96 GB unified memory, 2 TB storage. It was downloading the .bin files, but then stopped and gave the error mentioned above. Still waiting for someone to answer it.

M2 Max with 96 GB: definitely not a machine/hardware issue!

Same error as many of you:

raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
ValueError: Could not load model tiiuae/falcon-40b-instruct with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>,).

Does anyone know how to move the .cache file generated by the model data download from C drive to another drive? My C drive doesn't have that much space =)

Move your cache to a new directory
and use this:
import os
os.environ['HF_HOME'] = ':\/HuggingFace'


It really works, thx =)

Same here. :(

Was this problem fixed?

I have a Ryzen 9 7900X (12 cores) and 64 GB of RAM with 2 TB of disk space. Everything downloaded, but it doesn't run and gives the same error as the OP.

I am on an M2 Max chip, 12-core CPU, 38-core GPU, 96 GB unified memory, 2 TB storage. It was downloading the .bin files, but then stopped and gave the error mentioned above. Still waiting for someone to answer it.

It has not been resolved yet...

On AWS g5.12xlarge with 96 GB of VRAM and the Deep Learning AMI GPU PyTorch 2.0.0 (Ubuntu 20.04) 20230530, it works. ~$5/hour. Storage should be 100 GB.

Here are hopefully complete steps from zero to running Falcon-40B (tested on Falcon-40B-instruct):
On AWS g5.12xlarge with 96 GB of VRAM and the Deep Learning AMI GPU PyTorch 2.0.0 (Ubuntu 20.04) 20230530, it worked using the following steps:

  1. Assuming you know how to create a new EC2 instance and set it up. Let's say you name your EC2 instance "GpuTest".
  2. Open port 8888 toward your EC2 instance (i.e. "GpuTest") (this is how you will access the Jupyter notebook).
  3. You need to get the PEM key. Watch out for chmod and file privileges; it may complain. If it complains, "chmod 400 your_key.pem" should fix it (group and others should have no access).
  4. ssh -i "name of the pem key" ubuntu@dns-address-of-your-ec2-instance.com (please use the correct one).
  5. conda init
  6. Restart bash (or "ctrl-c" or "exit") and re-enter the instance (using step 4).
  7. conda activate pytorch
  8. jupyter notebook --no-browser
  9. Open a second terminal and execute: ssh -i "name of the pem key" -L localhost:8888:localhost:8888 ubuntu@dns-address-of-your-ec2-instance.com (this creates an SSH tunnel from your computer to the running instance). It will not work without this.
  10. At this point you should be able to go to your local browser, type http://127.0.0.1:8888 (or http://localhost:8888), and connect to Jupyter on the "GpuTest" instance (it will ask for a token, which should be visible in the output of step 8).
  11. Now comes the fun part:
  12. pip install transformers
  13. pip install einops
  14. pip install accelerate
  15. pip install xformers

At this point provided sample worked perfectly.

There will be a big business in AI cards with 100Gb on them.

Hi @Sloba,
I am not using AWS, but an M2 Max MacBook. Do you have any recommendation for macOS systems that do not have NVIDIA but do have MPS (CPU and GPU on the same chip)?
Thanks

Hi @phdykd , unfortunately no. I think quantization will be needed due to size.

I can't run this on my machine because I don't have the hardware, but I was able to get past the above errors by adjusting the code as follows, specifically the model = line:

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-40b", trust_remote_code=True)

I tried your method and the code worked. However, same as you, my hardware could not support it (only an 8 GB GPU).
Here is my modification:

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b-instruct",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b-instruct")

To those who have questions not related to this, please start another discussion. This one is for discussing "ValueError: Could not load model tiiuae/falcon-40b with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>,)." Does anyone know how to solve this problem?

Newbie here, so take this with a grain of salt. In my case, I was running out of GPU memory. Remember, even 'small' 7B models can require ~14 GB of memory during inference without optimization (e.g. quantization) when using 16-bit floats, so it's really easy to max out even modern GPUs. To confirm whether that is the case for you, you want to track memory usage during training (or inference). There are a bunch of easy ways to do that and rule it in or out, depending on your setup:

# Baseline GPU memory
nvidia-smi
nvidia-smi --format=csv --query-gpu=power.draw,utilization.gpu,memory.used,memory.free,fan.speed,temperature.gpu
# Watch GPU memory interactively
watch nvidia-smi --format=csv --query-gpu=power.draw,utilization.gpu,memory.used,memory.free,fan.speed,temperature.gpu
nvidia-smi -l 1
# Use HF trainer callbacks to integration with WanDB etc - https://huggingface.co/transformers/v3.4.0/main_classes/callback.html#available-callbacks
# Use Colab's Runtime > Manage sessions UI to watch what the usage is, interactively

There could be other reasons to hit this error, I am sure; for instance, I tried 7B on a TPU and that seemed to fail as well. But experimenting with memory usage was how I got through this on a 16 GB GPU (Colab Tesla T4) for a 7B model.
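
For a quick in-Python check of GPU headroom (a small sketch, assuming a CUDA GPU and PyTorch are available):

import torch

if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()  # free vs. total memory on the current device
    print(f"GPU memory: {free_bytes / 1e9:.1f} GB free of {total_bytes / 1e9:.1f} GB total")
else:
    print("No CUDA GPU visible")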

Agree @Mlemoyne, this topic has pivoted.
I'm having the same issue. I'm running this on SageMaker but it doesn't seem to work. I'll consider giving Lambda Labs a shot later on to see whether this error is related or not.

Just for the record, the same happens with model falcon-7b-instruct in Sagemaker:

ValueError: Could not load model tiiuae/falcon-7b-instruct with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>,).

Can someone give a final solution to the error which is

ValueError: Could not load model tiiuae/falcon-7b-instruct with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>,).

Thanks :'')

I tried to run the 7B model on CPU only and got this message, removing device_map="auto" from the pipeline seemed to solve it.

I tried to run the 7B model on CPU only and got this message, removing device_map="auto" from the pipeline seemed to solve it.
This removes the error, but the process gets killed after a while. Do you know what the minimum CPU requirements are?

Removing device_map="auto" worked for me; the model loads on CPU.
Still, though, my Colab VM crashes because of lack of RAM...

You are correct, I was a bit too quick with my answer. I do not recommend removing device_map due to the amount of RAM it requires (it crashed for me as well after 20 GB of allocated RAM). Instead I found a solution by simply upgrading torch to 2.0.1, xformers to 0.0.20, and accelerate to 0.20.1.
Now it runs fine for me with device_map="auto".
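
For reference, those upgrades as a single command (versions exactly as reported above; adjust to your CUDA build if needed):

pip install --upgrade torch==2.0.1 xformers==0.0.20 accelerate==0.20.1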

Hello, I have a computer with 64 GB of RAM and two Xeon processors with 32 cores each. I did NOT enable the graphics card (it is very small). I downloaded the falcon-40b model files and installed the following in the Dockerfile:

ENV TRANSFORMERS_CACHE=/app/transformercache

RUN mkdir -p $TRANSFORMERS_CACHE

RUN pip3 install transformers

RUN pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu

RUN pip3 install einops
RUN pip3 install accelerate
RUN pip3 install xformers

When trying to test the script:

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

# This instruction loads the pre-trained model
model = "./model/falcon-40b"

rmodel = AutoModelForCausalLM.from_pretrained(
    model,
    offload_folder='/app/model/falcon-40b',
    trust_remote_code=True,
    device_map='auto',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    chunk_size_feed_forward=512000,
    cache_dir="./transformercache",
)

Python just exits:

>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> import transformers
>>> import torch
>>> model = "./model/falcon-40b"
>>> rmodel = AutoModelForCausalLM.from_pretrained(
...     model,
...     offload_folder='/app/model/falcon-40b',
...     trust_remote_code=True,
...     device_map='auto',
...     torch_dtype=torch.float16,
...     low_cpu_mem_usage=True,
...     chunk_size_feed_forward=512000,
...     cache_dir="./transformercache"
... )
Loading checkpoint shards:  67%|██████▋   | 6/9 [01:04<00:32, 10.75s/it] Killed
root@75b049882bd4:/app#

I guess it's because of memory. Does anyone have any idea what I can do?

Technology Innovation Institute org

We recommend having at least 80-100 GB of memory to fit the 40B model comfortably.

If you do not have that much memory available, you can have a look at FalconTune to run the model in 4-bit, or at this blogpost from HuggingFace.
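
For the 4-bit route through transformers/bitsandbytes, a minimal sketch (assuming a recent transformers with BitsAndBytesConfig, plus bitsandbytes and accelerate installed; this is not the FalconTune path itself, just the general quantized-loading idea):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "tiiuae/falcon-40b"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 while weights stay 4-bit
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)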

Still getting this error for 7b:

Traceback (most recent call last):
  File "/home/sehadn1/falcon7b-modif.py", line 14, in <module>
    pipeline = transformers.pipeline(
  File "/home/sehadn1/transformers/src/transformers/pipelines/__init__.py", line 788, in pipeline
    framework, model = infer_framework_load_model(
  File "/home/sehadn1/transformers/src/transformers/pipelines/base.py", line 278, in infer_framework_load_model
    raise ValueError(f"Could not load model {model} with any of the following classes: {class_tuple}.")
ValueError: Could not load model tiiuae/falcon-7b-instruct with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.auto.modeling_tf_auto.TFAutoModelForCausalLM'>).

Possibly related, I get "The model 'RWForCausalLM' is not supported for text-generation".

I do see that this warning pops up on 7b, which then goes on to work fine, so it might be a misleading warning here; just thought I'd share it.

pip install transformers
pip install einops
pip install accelerate
pip install xformers

If you pip install these packages, it may be OK.

If you pip install these packages, the problem of "ValueError: Could not load model tiiuae/falcon-7b-instruct with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>, <class 'transformers.models.auto.modeling_tf_auto.TFAutoModelForCausalLM'>)" may be solved.

I am loading the model onto an A6000 GPU with 48 GB of VRAM, with torch.int8. I am getting the same error:
ValueError: Could not load model tiiuae/falcon-40b-instruct with any of the following classes: (<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>,).

Kernel Restarting
The kernel for Desktop/LLM/Falcon/Fl.ipynb appears to have died. It will restart automatically.
It does not work on an M2 Apple MacBook.

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-40b", trust_remote_code=True)

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b")
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    # torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    # device_map="auto",
)
sequences = pipeline(
    "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

I was able to get past the AutoModelForCausalLM error in falcon-7b-instruct by using the line @alexwall77 provided below:

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-40b", trust_remote_code=True)

Thank you, Alex!



I got the same bug in google colab. Switched to using GPU and then it worked fine.


How did you solve the problem? I ran it on an EC2 instance with 8x A100 and also got the same problem.

I am still getting the same error. Unable to load Falcon-40b-instruct or Falcon-40b. This is the error:
ValueError: Could not load model tiiuae/falcon-40b with any of the following classes: (<class
'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>,).

Also, I have enough space in RAM. It could be an issue with text generation. Any help on this?

I found the solution to this.

I had to create a folder to offload the existing weights to, which I named "device_map_weights", to get it to work, though.

from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = "tiiuae/falcon-40b-instruct"

tokenizer = AutoTokenizer.from_pretrained(
    model,
    device_map="auto",
    trust_remote_code=True,
    offload_folder="device_map_weights"
    )
model = AutoModelForCausalLM.from_pretrained(
    model,
    device_map="auto",
    trust_remote_code=True,
    offload_folder="device_map_weights"
    )

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
sequences = pipeline(
   "Girafatron is obsessed with giraffes, the most glorious animal on the face of this Earth. Giraftron believes all other animals are irrelevant when compared to the glorious majesty of the giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")

Have we solved the original issue?
