Need info on pre-training and instruction-tuning data

#64
by markding - opened

We're cataloguing 'open' LLMs and looking for the highest-quality metadata to support evidence-based openness judgements. So far we've been unable to locate any specifics on the pretraining data or on the instruction-tuning data and methods. Gemma currently ranks in the bottom 5 of over 30 'open' LLMs on our live openness tracker (for comparison, LLM360's AmberChat, another recent release, ranked second). We will be watching developments with interest.

Google org

Hey! Surya from the Gemma team here. We didn't release detailed specifics of our pretraining or our finetuning (for either the SFT or RLHF phase). We are exploring ways to share more with the community in the future. Many thanks for raising this!

Thanks @suryabhupa, and great to hear you are exploring ways to share more! Any update on this?

The industry standards for disclosing this information are model cards and datasheets (as you probably know, since some well-known (ex-)Googlers co-authored them). Does anything specific stand in the way of complying with these standards? A sketch of what the relevant metadata can look like follows the references below.

  • Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. ‘Datasheets for Datasets’. Communications of the ACM 64 (12): 86–92. https://doi.org/10.1145/3458723.
  • Mitchell, Margaret, Simone Wu, Andrew Zaldivar, Parker Barnes, Lucy Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. ‘Model Cards for Model Reporting’. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 220–29. FAT* ’19. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3287560.3287596.
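
For concreteness, here's a minimal sketch of the machine-readable metadata a model card can carry to declare training-data provenance, assuming the huggingface_hub library; the license and dataset identifiers are hypothetical placeholders, not Gemma's actual data:

```python
# Minimal sketch, assuming huggingface_hub; all identifiers below are
# hypothetical placeholders, not Gemma's actual license or datasets.
from huggingface_hub import ModelCardData

card_data = ModelCardData(
    language="en",
    license="apache-2.0",  # placeholder license
    datasets=[
        "example-org/pretraining-corpus",  # hypothetical pretraining corpus
        "example-org/sft-mixture",         # hypothetical SFT mixture
    ],
)

# Prints the YAML front matter that sits at the top of a model card README,
# which is where dataset provenance is declared in machine-readable form.
print(card_data.to_yaml())
```

Even metadata at this level of granularity would support the kind of evidence-based openness judgements we're after.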
Google org

We have some of our model cards uploaded on Kaggle: https://www.kaggle.com/models/google/gemma, but we haven't released any information about the finetuning data itself. No updates yet, but hopefully we'll have more to share soon. Thanks for following up.
