missing nlpaueb/legal-bert-base/resolve/main/tokenizer_config.json ?

#5
by rkbelew - opened

doing what I generally do to load a model:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base")
model = AutoModelForSeq2SeqLM.from_pretrained("nlpaueb/legal-bert-base")

generates this error:

Traceback (most recent call last):
File "/Users/rik/data/pkg/miniconda3/envs/ai4law/lib/python3.11/site-packages/huggingface_hub/utils/_errors.py", line 270, in hf_raise_for_status
response.raise_for_status()
File "/Users/rik/data/pkg/miniconda3/envs/ai4law/lib/python3.11/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/nlpaueb/legal-bert-base/resolve/main/tokenizer_config.json

But the file seems to be there?

https://huggingface.co/nlpaueb/legal-bert-base-uncased/blob/main/tokenizer_config.json

Hi @rkbelew, it seems you were trying to load nlpaueb/legal-bert-base instead of nlpaueb/legal-bert-base-uncased?
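Something like this should work once the full repo id is used (a minimal sketch; I'm loading it with AutoModel here because this checkpoint is a plain BERT encoder, so AutoModelForSeq2SeqLM doesn't apply):

from transformers import AutoTokenizer, AutoModel

model_name = "nlpaueb/legal-bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)  # tokenizer_config.json now resolves
model = AutoModel.from_pretrained(model_name)          # loads a BertModel encoder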

duh! thanks very much @fendiprime for noticing my bug.

and so i'm now able to load the model, but bumping into:

input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("mps")
outputs = model.generate(input_ids)

TypeError: The current model class (BertModel) is not compatible with .generate(), as it doesn't have a language model head. Please use one of the following classes instead: {'BertLMHeadModel'}

this despite the fact that dir(model) includes generate as one of its attributes? am i just being thick again?

Happy to help @rkbelew! I think you're correct: the generate method can't actually be used here, even though it shows up as an attribute of the BertModel class.
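To make the dir(model) observation concrete (a minimal sketch, assuming a transformers version like the one in this thread, where every PreTrainedModel inherits generate from GenerationMixin; newer releases may drop it from models that can't generate):

from transformers import AutoModel

model = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased")  # resolves to a BertModel
print(type(model).__name__)        # BertModel
print(hasattr(model, "generate"))  # True: inherited from GenerationMixin
# calling model.generate(...) still raises the TypeError above, because there is no LM head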

That being said, it is possible to use the model for masking tasks. I tried that successfully:

from transformers import BertForMaskedLM, AutoTokenizer
from torch import no_grad

model_name = "nlpaueb/legal-bert-base-uncased"
model = BertForMaskedLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer("This [MASK] Agreement is between General Motors and John Murray.", return_tensors="pt")
with no_grad():
    logits = model(**inputs).logits

mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]  # position of [MASK] in the input

predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)  # most likely token id at that position
print(tokenizer.decode(predicted_token_id))
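By the way, the fill-mask pipeline wraps the same steps if you just want to eyeball candidate tokens (a minimal sketch; top_k=5 is just an illustrative value):

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="nlpaueb/legal-bert-base-uncased")
for candidate in fill_mask("This [MASK] Agreement is between General Motors and John Murray.", top_k=5):
    print(candidate["token_str"], round(candidate["score"], 3))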

I also managed to use the BertLMHeadModel class mentioned in the error for the masking task, but making that work was even more hacky than the approach above. I'll be happy to share if you're interested though.

Cheers

i can't even appreciate how your example is hacky, so I'm sure I'd be interested in your other experiment with BertLMHeadModel. still finding my way in this LLM ecosystem. thanks for your help.

Alright then, here's how you could go about generating outputs from the model:

from transformers import AutoTokenizer, BertLMHeadModel
from torch import topk

def decode_predictions(tokenizer, model_output, num_labels=1):
    """Decode the top-k most likely tokens at each position from the model's output logits."""
    top_k = topk(model_output.logits, k=num_labels)  # top-k over the vocabulary dimension
    top_k_indices = top_k.indices                    # shape: [batch, seq_len, k]

    decoded_tokens = []
    for idx in top_k_indices:  # iterate over the batch (works as written for the default num_labels=1)
        decoded_tokens.append(tokenizer.convert_ids_to_tokens(idx))

    return ' '.join(decoded_tokens[0])

model_name = "nlpaueb/legal-bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BertLMHeadModel.from_pretrained(model_name, is_decoder=False)  # is_decoder=False keeps BERT's bidirectional attention (expect a warning)
input_text = "This [MASK] Agreement is between General Motors and John Murray ."
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])  # labels are only needed if you also want the LM loss in the output

predicted_tokens = decode_predictions(tokenizer, outputs)
print(predicted_tokens)
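One caveat about decode_predictions above: it returns the top-1 token for every position in the sequence, not just the masked slot. If you only want the [MASK] position, the same indexing trick from the BertForMaskedLM example carries over (a sketch reusing the inputs/outputs names from the snippet above):

mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_token_id = outputs.logits[0, mask_token_index].argmax(axis=-1)
print(tokenizer.decode(predicted_token_id))  # prediction for the masked slot only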

I hope this helps in your explorations @rkbelew

Thanks again @fendiprime, more experiments to do!
