Building out the model card!
This model card is awesome, and it would be great to build it out even more! Could you share details that would help estimate the CO2 emissions, such as the training time, cloud provider, and compute region?
Also, the model card for the RoBERTa base model (https://huggingface.co/roberta-base) discusses how bias issues with RoBERTa will affect fine-tuned versions of the model, too. Are there resources we can use to better understand potential biases associated with this model?
Hey Marissa,
Thanks a lot for the model card praise. I highly appreciate that you are pushing awareness of CO2 efficiency and bias in LMs.
Carbon footprint
Since the model is only fine-tuned on the SQuAD QA dataset and not pretrained, I suspect the footprint for this training run is low. It was trained on 4x V100 GPUs (p3.8xlarge) for 2 epochs, each taking roughly 15 minutes, so about 30 minutes in total. I am reasonably certain the AWS region was Ireland (eu-west-1).
By the way, could you share how you then compute the carbon footprint? Also an additional suggestion (not sure if you already discussed this): how about including inference footprints for standard GPU hardware in the model card? Pretraining RoBERTa from scratch plus fine-tuning on SQuAD seems to me to be only a tiny fraction of the real footprint, considering this model is downloaded, and hopefully used, more than 1 million times each month...
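To make that scale argument concrete, here is a tiny back-of-the-envelope sketch; every number in it (per-query energy, one query per download) is an assumption for illustration, not a measurement:

```python
# Illustrative only: compare the fine-tuning energy with one month of
# cumulative inference energy. All figures below are assumptions.

train_kwh = 4 * 0.3 * 0.5       # 4x V100 at an assumed ~300 W each for ~0.5 h
per_query_kwh = 1e-6            # assumed ~1 mWh per QA inference on a GPU
monthly_queries = 1_000_000     # if each monthly download ran just one query

inference_kwh = per_query_kwh * monthly_queries
print(f"fine-tuning: {train_kwh:.1f} kWh vs. one month of inference: {inference_kwh:.1f} kWh")
# Even under these crude assumptions, serving energy overtakes the
# fine-tuning run within a single month.
```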
Bias
The bias in vanilla RoBERTa is of course rather large. Since the fine-tuning data was SQuAD, i.e. English Wikipedia pages with factual questions and answers, I suspect the QA-specific bias is rather small. I found a fairly recent publication on bias in QA datasets, including SQuAD: https://aclanthology.org/2021.mrqa-1.9.pdf. TL;DR: the bias is rather small for this dataset and the closed-domain use case. They found some bias in the retriever for open-domain question answering when the questions are underspecified (not part of this model), which is understandable.
Regarding annotator selection bias: I know that we carefully selected a diverse set of annotators for our GermanQuAD dataset. Unfortunately, I think SQuAD used Mechanical Turk without annotator selection...
Model card
We are also happy with the level of detail in our model card(s), and it seems clear that a well-structured model card helps adoption of the model. It is of course not perfect by any means, and we are striving to make our model cards more standardized so people can quickly judge whether a model is fit for their purpose. I know HF has initiated an internship on improving model cards (is this you? :D). If you have any more info or pointers on how we can help/improve, please let us know.
Kind regards, Timo
Thanks so much for this info, Timo! And yes, @Es-O and I are working on model cards! If it’s OK with you, we can open a PR suggesting some optional additions to your model card that incorporate the information you shared. For example, we could add information on the model's bias and limitations (perhaps in the "Usage" section) and estimated carbon emissions (here is an example of a model card with sections covering those topics: https://huggingface.co/distilgpt2).
For the carbon footprint, we might use the MLCO2 calculator (https://mlco2.github.io/impact/#home). There's more information about including carbon emissions metadata in model cards here: https://huggingface.co/docs/hub/model-repos#carbon-footprint-metadata. Including inference footprints is a really interesting idea -- also looping in @sasha, who is the expert on this!
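For a rough sense of the estimate the calculator produces, here is a minimal sketch plugging in the training details you shared (4x V100 for ~30 minutes in eu-west-1); the per-GPU power draw, PUE, and grid carbon intensity below are assumed placeholder values, not measurements:

```python
# Back-of-the-envelope CO2e estimate in the style of the MLCO2 calculator.
# All constants below are assumptions for illustration.

NUM_GPUS = 4                 # p3.8xlarge has 4x V100
GPU_POWER_KW = 0.3           # assumed ~300 W draw per V100
TRAIN_HOURS = 0.5            # 2 epochs x ~15 min, from the thread
PUE = 1.5                    # assumed datacenter power usage effectiveness
GRID_KGCO2_PER_KWH = 0.3     # assumed carbon intensity for eu-west-1

energy_kwh = NUM_GPUS * GPU_POWER_KW * TRAIN_HOURS * PUE
emissions_g = energy_kwh * GRID_KGCO2_PER_KWH * 1000
print(f"~{energy_kwh:.2f} kWh, ~{emissions_g:.0f} g CO2e")
# -> ~0.90 kWh, ~270 g CO2e: a very small training footprint, as suspected
```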
I'm so glad to hear that your model card has helped with adoption of the model!!
Sure, please go ahead and create a PR.
Happy to discuss the inference carbon footprint further.
Hi Timo,
Additional Information
If you and the team have any additional information regarding the “Potential Users” and “Out-of-Scope Uses” parts of the “Uses, Limitations and Risks” section, as well as further “Limitations and Risks” related to the model, that would help when detailing the ethical considerations.
Citations
Does Deepset have a preferred citation style?