Detailed Gemini Summary


🚀 Gemini highlights:

Dataset:

  • Multimodal and multilingual
  • Web documents, books, and code

Dataset size

  • #Tokens used to train Pro and Ultra chosen using Chinchilla scaling laws (see the sketch after this list)
  • Nano models trained on more tokens than Chinchilla scaling laws would predict, following LLaMA
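
A quick sketch of that rule of thumb (Chinchilla: roughly 20 training tokens per parameter); the parameter counts below are made-up guesses, since Gemini's sizes are undisclosed:

```python
# Chinchilla rule of thumb (Hoffmann et al., 2022): compute-optimal
# training uses roughly 20 tokens per model parameter.
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal training-token count for a given model size."""
    return tokens_per_param * n_params

# Hypothetical sizes -- Gemini's real parameter counts are undisclosed.
for name, n_params in [("Pro (guess)", 70e9), ("Ultra (guess)", 500e9)]:
    print(f"{name}: ~{chinchilla_optimal_tokens(n_params) / 1e12:.1f}T tokens")
```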

Filtering

  • Quality filtering: Heuristics + model classifiers
  • Safety filtering done to remove harmful content

Dataset composition

  • Which datasets to include + weighting determined by ablations on smaller models
  • Training done in multiple stages: weight of domain-relevant data increased towards the end of training
  • Data quality is critical

Model architecture:

  • Transformer, decoder-only
  • 32,768 context length
  • Uses multi-query attention (and other transformer efficiency improvements, which are not specified); see the sketch after this list
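
A minimal NumPy sketch of multi-query attention, where every query head attends over a single shared key/value head (this shrinks the KV cache at inference time); the shapes and names are illustrative, not Gemini's actual implementation:

```python
import numpy as np

def multi_query_attention(x, w_q, w_k, w_v, n_heads):
    """Multi-query attention: per-head queries, one shared K/V head.

    x:   (seq, d_model) input activations
    w_q: (d_model, n_heads * d_head) -- separate projection per query head
    w_k: (d_model, d_head)           -- single shared key projection
    w_v: (d_model, d_head)           -- single shared value projection
    """
    seq, _ = x.shape
    d_head = w_k.shape[1]
    q = (x @ w_q).reshape(seq, n_heads, d_head)   # (seq, heads, d_head)
    k, v = x @ w_k, x @ w_v                       # (seq, d_head), shared by all heads
    scores = np.einsum("shd,td->hst", q, k) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)     # softmax over key positions
    out = np.einsum("hst,td->shd", weights, v)    # (seq, heads, d_head)
    return out.reshape(seq, n_heads * d_head)
```

Compared to standard multi-head attention, only the query projection grows with the head count, so the per-token KV cache is n_heads times smaller.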

Model sizes

  • 4 model sizes
  • Ultra <-> GPT-4V (performance-wise on benchmarks)
  • Pro <-> GPT3.5/4-Turbo (empirically, as reported by people using Bard today)
  • Nano: Nano-1 is 1.8B, Nano-2 is 3.25B; both are 4-bit quantized and distilled from larger models. Nano-1 is designed for low-memory devices, Nano-2 for high-memory devices. (Gemini Nano ships on Pixel 8 phones as of today.)

Input:

  • Text, interleaved with images, audio, video
  • Multi-modal starting from pre-training, as opposed to adding other modalities to text later
  • Text input: SentencePiece tokenizer (unknown whether BPE, WordPiece, or unigram)
  • Visual input: "Inspired by previous work: Flamingo, CoCa, PaLI" - i.e. a ViT encoder (probably PaLI-style, as it is the simplest and most recent)
  • Video: Frames encoded as image inputs (evaluation uses 16 equally spaced frames; see the sketch after this list)
  • Audio: Universal Speech Model (USM) features @ 16kHz
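
A one-line sketch of that evaluation-time sampling (each of the 16 frames then goes through the image encoder); the helper name is mine:

```python
import numpy as np

def sample_frame_indices(n_frames: int, n_samples: int = 16) -> np.ndarray:
    """Indices of n_samples equally spaced frames in a video of n_frames."""
    return np.linspace(0, n_frames - 1, n_samples).round().astype(int)

print(sample_frame_indices(300))  # e.g. a 10 s clip at 30 fps
```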

Output:

  • Text + image
  • Image generation: inspired by previous approaches (DALL-E and Parti are cited); probably Parti-style, since it is auto-regressive

Implementation Details

  • Implemented in JAX (unsurprisingly)
  • Trained using Pathways on TPUv5e and TPUv4 (also unsurprisingly)
  • Keeps redundant in-memory copies of the model state to recover from hardware failures, instead of recovering from checkpoints saved to disk. This shortens recovery time: the model trains usefully 97% of the time (up from 85%). It takes more training resources than checkpointing. (Jeff Dean on X: this only matters for larger models and shouldn't matter for smaller ones.)

Alignment:

  • Quality > quantity, esp. for larger models, when instruction tuning (SFT, reward-model training, RLHF), while avoiding dataset leakage
  • Cites Llama 2 on quality: Llama 2 uses fewer but higher-quality, more diverse, self-collected SFT examples (esp. for chat instructions) instead of millions of low-quality, low-diversity third-party SFT examples from various sources, which improves results (Llama 2 uses 27,540 annotations)
  • Reward-model training data must balance examples of refusals against helpful responses
  • Multi-objective optimization: a multi-headed reward model produces separate scores for helpfulness, factuality, and safety; both the RM loss and the final reward are weighted sums of these (see the sketch after this list)
  • To generate a harmful-response dataset: for each of 20 identified harm types, pass several variants of Google's content-policy language as "constitutions" to a pre-aligned model and use zero-shot CoT to revise responses and choose between multiple response candidates
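
A minimal PyTorch sketch of that multi-headed reward model; the backbone, pooling, and objective weights are placeholders, since the report doesn't give them:

```python
import torch.nn as nn

class MultiObjectiveRewardModel(nn.Module):
    """Shared backbone with one scalar head per objective; the final reward
    is a weighted sum of the per-objective scores (weights are placeholders)."""

    def __init__(self, backbone: nn.Module, d_model: int,
                 weights=(1.0, 1.0, 1.0)):  # helpfulness, factuality, safety
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleList([nn.Linear(d_model, 1) for _ in range(3)])
        self.weights = weights

    def forward(self, x):
        h = self.backbone(x)                              # (batch, d_model)
        scores = [head(h).squeeze(-1) for head in self.heads]
        reward = sum(w * s for w, s in zip(self.weights, scores))
        return reward, scores  # combined reward + per-objective scores
```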

Factuality-focused adaptation (part of instruction tuning):

  • Attribution: If asked to generate a response that is attributed to the prompt context, Gemini should be faithful to the context (incl. summarization, citation, QA given long prompt (e.g. book), prompted output format adherence)
  • Closed-book response generation: shouldn't answer fact-seeking prompts given without sources, whether the prompt asks for facts directly or a semi-creative prompt indirectly requires facts for a good answer
  • Hallucination: Should hedge instead of trying to answer "unanswerable" questions

Novel MMLU decoding scheme: uncertainty-routed chain-of-thought

  • Produce k chain-of-thought samples and select the majority-vote answer IF the model's consensus exceeds a confidence threshold (see the sketch after this list)
  • Otherwise, return greedy sample choice
  • Improves Gemini Ultra on MMLU by ~6 points (84.0 → 90.0), vs. ~3 points (84.2 → 87.3) for GPT-4
  • Plain CoT only improves Gemini Ultra's performance on MMLU by ~1 point
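
A sketch of that routing logic; sample_cot and greedy_answer stand in for real decoding calls, and the threshold (validation-tuned in the report) is an arbitrary placeholder here:

```python
from collections import Counter

def uncertainty_routed_cot(sample_cot, greedy_answer, k=32, threshold=0.8):
    """Majority-vote over k CoT samples if the model is confident enough,
    otherwise fall back to the greedy (non-CoT-voted) answer."""
    answers = [sample_cot() for _ in range(k)]
    majority, count = Counter(answers).most_common(1)[0]
    if count / k >= threshold:  # consensus is strong: trust the vote
        return majority
    return greedy_answer        # low consensus: fall back to greedy decoding
```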

(Everything else is just examples of use cases/benchmark results.)

Anyone got a guess or a leak on the models' parameter counts?!

Guesses so far:

  • Pro ≈ 20B, Ultra ≈ 200B
  • Pro ≈ 70B, Ultra ≈ 200B
  • Pro ≈ 30-70B, Ultra ≈ 150B-1.5T

For a better analysis, plot MMLU (or another metric) vs. log(#parameters) for a bunch of the newer LLMs and LMMs and extrapolate until you find a suitable number for Pro and Ultra, LOL

Edit: Or just find the maximum number of tokens you can train on from the Internet, then use Chinchilla scaling laws to find the corresponding model size (sketch below).
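
That estimate, as a two-liner (the 10T-token budget is a made-up illustration, not a leak):

```python
def chinchilla_optimal_params(n_tokens: float, tokens_per_param: float = 20.0) -> float:
    """Invert the ~20-tokens-per-parameter rule: optimal size ≈ tokens / 20."""
    return n_tokens / tokens_per_param

print(f"~{chinchilla_optimal_params(10e12) / 1e9:.0f}B params for a 10T-token budget")
```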

[Attached images: Gemini-nano-text.png, Gemini-nano-image.png]

@shermansiu thanks man!

Automatic Speech Recognition:
FLEURS is evaluated on 62 languages, even though the full dataset has 102, following USM. As for other metrics, I couldn't find the full Whisper v2/v3 (large, 1.5B) numbers on the Whisper paper/model page for a better comparison.

From the reported results, Gemini Nano-1 and Nano-2 beat Whisper v2 and v3: all Gemini models (even the 1.8B Nano-1) have beaten the previous SOTA, Whisper-v3-large (1.55B).

Machine Translation:
WMT23: Presented at EMNLP 2023, so metrics for other models aren't available yet

Pro and Ultra models

Text

Pro and Ultra are generally better than OSS models at text, esp. for math. OSS is not far behind, though.
I removed DROP for the same reason it was removed from HF Open LLM Leaderboard.
Not enough LLMs report results on Natural2Code.
[Attached image: Gemini-proultra-text.png]

Image

For images, 7-13B OSS models match Pro's performance. Improvements could come from scaling up the number of parameters.
[Attached image: Gemini-proultra-image.png]

Video

Gemini Pro and Ultra don't do nearly as well as I thought they would on video, esp. given the parameter sizes of the OSS models they're competing against.
[Attached image: Gemini-proultra-video.png]

Multilingual

Too few new LLM papers report results on MGSM (math), XLSum (summarization), or WikiLingua (summarization), opting instead to test multilinguality on Chinese benchmarks, e.g. CMMLU (language understanding), GAOKAO (university admissions exam questions), and C-Eval (multiple-choice questions across 52 disciplines, from the humanities to STEM, at 4 difficulty levels: middle school, high school, college, and professional).

Automatic Speech Recognition

Once again, the Nano models beat the OSS SOTA (Whisper-v3-large, 1.55B), so comparing it against the Pro and Ultra models is unnecessary.

And yes, I'm comparing generalist models to a bunch of specialized models. ¯\_(ツ)_/¯
