missing nlpaueb/legal-bert-base/resolve/main/tokenizer_config.json ?

#5
by rkbelew - opened

doing what I generally do to load a model:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base")
model = AutoModelForSeq2SeqLM.from_pretrained("nlpaueb/legal-bert-base")

generates this error:

Traceback (most recent call last):
File "/Users/rik/data/pkg/miniconda3/envs/ai4law/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 270, in hf_raise_for_status
response.raise_for_status()
File "/Users/rik/data/pkg/miniconda3/envs/ai4law/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/nlpaueb/legal-bert-base/resolve/main/tokenizer_config.json

But the file seems to be there?

https://huggingface.co/nlpaueb/legal-bert-base-uncased/blob/main/tokenizer_config.json

Hi @rkbelew, it seems you were trying to load nlpaueb/legal-bert-base instead of nlpaueb/legal-bert-base-uncased?
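Something like this should work once the full repo id is used (a minimal sketch; I'm loading it with AutoModel here because this checkpoint is a plain BERT encoder, so AutoModelForSeq2SeqLM doesn't apply):

from transformers import AutoTokenizer, AutoModel

model_name = "nlpaueb/legal-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)  # tokenizer_config.json now resolves
model = AutoModel.from_pretrained(model_name)          # loads a BertModel encoder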

duh! thanks very much @fendiprime for noticing my bug.

and so i'm now able to load the model, but bumping into:

input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("mps")
outputs = model.generate(input_ids)

TypeError: The current model class (BertModel) is not compatible with .generate(), as it doesn't have a language model head. Please use one of the following classes instead: {'BertLMHeadModel'}

this despite the fact that dir(model) includes generate as one of its attributes? am i just being thick again?

Happy to help @rkbelew! I think you're correct: the generate method can't actually be used here, even though it shows up as an attribute of the BertModel class.
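To make the dir(model) observation concrete (a minimal sketch, assuming a transformers version like the one in this thread, where every PreTrainedModel inherits generate from GenerationMixin; newer releases may drop it from models that can't generate):

from transformers import AutoModel

model = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased")  # resolves to a BertModel
print(type(model).__name__)        # BertModel
print(hasattr(model, "generate"))  # True: inherited from GenerationMixin
# calling model.generate(...) still raises the TypeError above, because there is no LM head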

That being said, it is possible to use the model for masking tasks. I tried that successfully:

from transformers import BertForMaskedLM, AutoTokenizer
from torch import no_grad

model_name = "nlpaueb/legal-bert-base-uncased"
model = BertForMaskedLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("This [MASK] Agreement is between General Motors and John Murray.", return_tensors="pt")
with no_grad():
    logits = model(**inputs).logits

mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]  # position of [MASK] in the input

predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)  # most likely token id at that position
print(tokenizer.decode(predicted_token_id))
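By the way, the fill-mask pipeline wraps the same steps if you just want to eyeball candidate tokens (a minimal sketch; top_k=5 is just an illustrative value):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="nlpaueb/legal-bert-base-uncased")
for candidate in fill_mask("This [MASK] Agreement is between General Motors and John Murray.", top_k=5):
    print(candidate["token_str"], round(candidate["score"], 3))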

I also managed to use the BertLMHeadModel class mentioned in the error for the masking task, but making that work was even more hacky than the approach above. I'll be happy to share if you're interested though.

Cheers

i can't even appreciate how your example is hacky, so I'm sure I'd be interested in your other experiment with BertLMHeadModel. still finding my way in this LLM ecosystem. thanks for your help.

Alright then, here's how you could go about generating outputs from the model:

from transformers import AutoTokenizer, BertLMHeadModel
from torch import topk

def decode_predictions(tokenizer, model_output, num_labels=1):
    """Decode the top-k most likely tokens at each position from the model's output logits."""
    top_k = topk(model_output.logits, k=num_labels)  # top-k over the vocabulary dimension
    top_k_indices = top_k.indices                    # shape: [batch, seq_len, k]

    decoded_tokens = []
    for idx in top_k_indices:  # iterate over the batch (works as written for the default num_labels=1)
        decoded_tokens.append(tokenizer.convert_ids_to_tokens(idx))

    return ' '.join(decoded_tokens[0])

model_name = "nlpaueb/legal-bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BertLMHeadModel.from_pretrained(model_name, is_decoder=False)  # is_decoder=False keeps BERT's bidirectional attention (expect a warning)
input_text = "This [MASK] Agreement is between General Motors and John Murray ."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])  # labels are only needed if you also want the LM loss in the output

predicted_tokens = decode_predictions(tokenizer, outputs)
print(predicted_tokens)
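One caveat about decode_predictions above: it returns the top-1 token for every position in the sequence, not just the masked slot. If you only want the [MASK] position, the same indexing trick from the BertForMaskedLM example carries over (a sketch reusing the inputs/outputs names from the snippet above):

mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_token_id = outputs.logits[0, mask_token_index].argmax(axis=-1)
print(tokenizer.decode(predicted_token_id))  # prediction for the masked slot only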

I hope this helps in your explorations @rkbelew

Thanks again @fendiprime, more experiments to do!
