Answer is always lowercase & adds spaces around special characters

#1
by ClaudiaWu - opened

The inference widget on the original model page (https://huggingface.co/deutsche-telekom/electra-base-de-squad2) answers with the exact string it finds in the context.

In this model (for transformers.js) I always get the answer in lowercase. That wouldn't be a big issue if I also got a start and end index with the answer (then I could do something like text.slice(answer.start, answer.end)). But it seems random when I actually get a start and end back.

I also get extra spaces in the answer string when special characters occur in the context.

text = 'Experte mit Cloud-Plattformen (z. B. AWS, Azure, Google Cloud)\n\nWende dich per Mail an bla@bla.com\nBeste Grüße\nKlaus-Peter Ulrich van Müller'
question = 'Wer hat den Text geschrieben?'

expected answer = 'Klaus-Peter Ulrich van Müller'

actual answer = 'klaus - peter ulrich van müller'

Am I doing something wrong?
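
For reference, this is roughly how I call the pipeline (a minimal sketch; the model id below is just a placeholder for this repo):

    import { pipeline } from '@xenova/transformers';

    // placeholder id for this repo's transformers.js model
    const answerer = await pipeline('question-answering', 'conventic/electra-base-de-squad2');

    const text = 'Experte mit Cloud-Plattformen (z. B. AWS, Azure, Google Cloud)\n\n'
        + 'Wende dich per Mail an bla@bla.com\nBeste Grüße\nKlaus-Peter Ulrich van Müller';
    const question = 'Wer hat den Text geschrieben?';

    const result = await answerer(question, text);
    console.log(result.answer); // "klaus - peter ulrich van müller" instead of "Klaus-Peter Ulrich van Müller"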

ClaudiaWu changed discussion title from Answer is always lowercase to Answer is always lowercase & adds spaces around special characters
conventic org

Hi Claudia, thanks for the feedback, I'll check this and get back to you.

conventic org

Could you please try again? I noticed that the original model already has do_lower_case set to true; I set it to false and re-created the ONNX files.
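
If it helps to verify, a quick round trip through the tokenizer should show whether casing now survives (a rough sketch; the model id is again a placeholder for this repo):

    import { AutoTokenizer } from '@xenova/transformers';

    // placeholder id for this repo
    const tokenizer = await AutoTokenizer.from_pretrained('conventic/electra-base-de-squad2');

    const ids = tokenizer.encode('Klaus-Peter Ulrich van Müller');
    console.log(tokenizer.decode(ids, { skip_special_tokens: true }));
    // with lowercasing disabled, "Klaus", "Peter" etc. should keep their capitalization;
    // the extra spaces around "-" come from the word-piece decode and may remain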

Thank you very much for your efforts!
I am sorry for the late response. I thought I broke my code, but it's actually the model that cannot be loaded anymore. Here's the stack trace:

Uncaught (in promise) SyntaxError: Unexpected token '<', "<!DOCTYPE "... is not valid JSON
    at JSON.parse (<anonymous>)
    at getModelJSON (webpack-internal:///./node_modules/@xenova/transformers/src/utils/hub.js:597:17)
    at async Promise.all (index 0)
    at async loadTokenizer (webpack-internal:///./node_modules/@xenova/transformers/src/tokenizers.js:103:18)
    at async AutoTokenizer.from_pretrained (webpack-internal:///./node_modules/@xenova/transformers/src/tokenizers.js:4390:50)
    at async Promise.all (index 0)
    at async loadItems (webpack-internal:///./node_modules/@xenova/transformers/src/pipelines.js:3107:5)
    at async pipeline (webpack-internal:///./node_modules/@xenova/transformers/src/pipelines.js:3047:21)
    at async QuestionAnsweringSingleton.getInstance (webpack-internal:///./shared/ai/questionAnswering.ts:20:29)
    at async getContactName (webpack-internal:///./shared/ai/questionAnswering.ts:11:22)

When I only swap the model to e.g. 'Xenova/distilbert-base-cased-distilled-squad', it works as expected.
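
The only change on my side is the model id passed to the pipeline (sketch; the failing id is again a placeholder for this repo):

    import { pipeline } from '@xenova/transformers';

    // fails with the JSON parse error above (placeholder id for this repo):
    // const answerer = await pipeline('question-answering', 'conventic/electra-base-de-squad2');

    // works as expected:
    const answerer = await pipeline('question-answering', 'Xenova/distilbert-base-cased-distilled-squad');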

Thank you for rolling it back. I use a workaround for now:

    // workaround for the model not returning the exact answer span;
    // "Karl-Heinz Müller" comes back as "karl - heinz müller"
    const senderAnswerModified = sender.answer.replaceAll(' - ', '-');
    const textModified = text.toLowerCase();

    // find the normalized answer in the lowercased text, then slice the
    // original (correctly cased) text at the same position and length
    const startIndex = textModified.indexOf(senderAnswerModified);
    const endIndex = startIndex + senderAnswerModified.length;

    // fall back to the raw answer if the span cannot be located
    const result = startIndex === -1 ? sender.answer : text.slice(startIndex, endIndex);

Would love to hear if I can drop this workaround :) THANKS in advance!
