Appropriate tokenizer?

#25
by arpastrana - opened

Hello! I was wondering what is meant by an “appropriate tokenizer” in the body of the question for the art tooling residency application.

Does that mean we should use a specific tokenizer, for example one that matches the model checkpoint (e.g. the `distilbert-base-cased` tokenizer), or is the choice of tokenizer open to simpler options, such as using vanilla Python to split the example string on whitespace?
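
For concreteness, here is roughly what I mean by the two options; the checkpoint name `distilbert-base-cased` is just an example I picked, not something the application specifies:

```python
from transformers import AutoTokenizer

text = "Hello! Which tokenizer is “appropriate” here?"

# Option 1: the tokenizer that matches a model checkpoint
# (the checkpoint name is only an illustrative choice)
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
subword_tokens = tokenizer.tokenize(text)

# Option 2: plain whitespace splitting in vanilla Python
whitespace_tokens = text.split()

print(subword_tokens)
print(whitespace_tokens)
```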

In the first case, I would like to know whether we can use the tokenizer with its default parameters, or whether we are expected to tweak the input parameters to give specific treatment (e.g., to ignore) to funky punctuation characters such as “\’”.
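
To make the second part concrete, this is the kind of difference I have in mind between defaults and a tweak; the manual clean-up of “\’” is purely my own assumption about what such treatment could look like:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")

text = "It\\’s a funky string."

# Default parameters: just call the tokenizer and let its own
# normalization handle the raw input.
print(tokenizer.tokenize(text))

# Hypothetical tweak: strip the stray backslash before tokenizing
# (my assumption of what "specific treatment" might mean).
cleaned = text.replace("\\", "")
print(tokenizer.tokenize(cleaned))
```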

Thanks!

I am only an applicant, so I can't vouch for the correctness of my answer, but at least it may help you keep working on the task.

Check out the tokenizer base class description and examples (https://huggingface.co/docs/transformers/main_classes/tokenizer); maybe you can come up with a simple solution for your task.
If your question is purely general, "appropriate" sounds to me like we are asked to choose the tokenizer ourselves.
But keep in mind Hugging Face's mission; I don't believe they will make you implement a tokenizer yourself :)
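
For example, a minimal encode/decode round trip with the base-class API from that docs page might look like this (the checkpoint name is just one I picked for illustration):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

ids = tokenizer.encode("Appropriate tokenizer?")  # text -> token ids
tokens = tokenizer.convert_ids_to_tokens(ids)     # ids -> subword strings
text = tokenizer.decode(ids)                      # ids -> text (with special tokens)

print(tokens)
print(text)
```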

I hope this helps. Good luck with your application!

I think the second chapter of this course contains a hint 🤗:
https://huggingface.co/course

I am also an applicant here, and I think this question is meant to check the applicant's understanding of the product: whether they understand Hugging Face and whether they can apply its utilities to real-world problems.

Thank you.
Best regards.
