Spaces: autoevaluate/model-evaluator

Request for Changes in UI

by ghpkishore - opened Jun 29, 2022

Jun 29, 2022

The UI for this is pretty non-intuitive. Generally, folks (I think) would try to figure out which models to use based on a task rather than a dataset, for example Question Generation. The current process requires me to know beforehand which dataset can be used to evaluate question generation; if I do not know, I need to search for it in Datasets. Once I figure out the dataset, the "advanced config" section adds more complexity to the process. Finally, a model is given based on the dataset, advanced config and evaluation criteria. This is a very long-drawn-out process. @patrickvonplaten had created the Speech Bench, which directly showcases results for speech recognition.

Instead of choosing datasets, I would prefer to start from tasks, similar to how the rest of Hugging Face is structured.

lewtun

Evaluation on the Hub org Jun 29, 2022

Hi @ghpkishore, thank you for sharing this valuable feedback!

We designed the current UI from the perspective of an industry practitioner who typically starts with 1 dataset and wants to evaluate N models on it. Having said that, this is just the beginning, and we're definitely open to adapting to the community's needs!

Just so I can understand the UX a bit better in your proposal, can you share a rough workflow of what you had in mind?

ghpkishore

Jun 29, 2022

Sure. Let me give you some context first.

I am currently building a work automation tool. For one of my product's features, I need to generate questions based on a given passage. Since HF has a bunch of open-source models, I plan to use one of them in my product. What I need is the model that is best for question generation; that's pretty much all I care about.

This is the workflow I was thinking of:

I would then come to HF, open the model evaluator, choose my task (here it is Question Generation), then evaluate all models.
One of the datasets can be provided as default (one which the industry might typically choose) with an option to choose other datasets, similar to how it is in "The 🤗 Speech Bench" space.
The evaluation metrics can be ones that are chosen by HF by default or can be in a tabular format similar to the speech bench.
I can then choose, from a dropdown, a list of models that are trained for that task.

Once I run the evaluation for the task, the models would be evaluated and the results added to the leaderboard.
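
A rough sketch of that task-first flow in Python, just to make the steps concrete. Everything here is a hypothetical placeholder and not the model evaluator's actual API: the `TASKS` registry, the dataset, metric and model names, and the `evaluate_model` callable that the caller supplies. It also assumes a higher score is better for the first metric.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class TaskConfig:
    """Defaults attached to a task (all values below are illustrative placeholders)."""
    default_dataset: str
    default_metrics: List[str]
    models: List[str] = field(default_factory=list)


# Hypothetical registry keyed by task name instead of dataset name.
TASKS: Dict[str, TaskConfig] = {
    "question-generation": TaskConfig(
        default_dataset="example-qg-dataset",
        default_metrics=["example-metric"],
        models=["org/model-a", "org/model-b"],
    ),
}

# Signature of the caller-supplied scorer: (model_id, dataset, metrics) -> {metric: score}
Scorer = Callable[[str, str, List[str]], Dict[str, float]]


def run_task_evaluation(
    task: str,
    evaluate_model: Scorer,
    dataset: Optional[str] = None,
    models: Optional[List[str]] = None,
) -> List[dict]:
    """Start from a task, fall back to its defaults, and return leaderboard rows."""
    cfg = TASKS[task]
    dataset = dataset or cfg.default_dataset   # default dataset, overridable
    models = models or cfg.models              # models picked "from the dropdown"
    rows = []
    for model_id in models:
        scores = evaluate_model(model_id, dataset, cfg.default_metrics)
        rows.append({"model": model_id, "dataset": dataset, **scores})
    # Rank by the first default metric, best first (assumes higher is better).
    primary = cfg.default_metrics[0]
    return sorted(rows, key=lambda row: row[primary], reverse=True)
```

The user only has to name the task; the dataset and metrics come from the task's defaults unless overridden, and the returned rows are what would be appended to a leaderboard.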

Let me know if this makes sense, or if it seems unnecessary for your product roadmap. My understanding is that a lot of software devs with minimal understanding of ML can now integrate ML models into their workflows, and expecting them to know which dataset to use for which task can be challenging.

Tristan

Jul 4, 2022

There is now a task filter on the leaderboards space, which is a step towards addressing this issue.
