All You Need to Know About Phi-3 (Technical Report Walkthrough)
Summary of Summaries:

Phi-3-mini:
- Architecture specs: decoder-only transformer, 3.8 billion parameters, vocab size 32,064, trained on 3.3 trillion tokens in bfloat16.
- Default context length of 4K; a LongRoPE variant extends it to 128K.
- Rivals the performance of much larger models such as Mixtral 8x7B and GPT-3.5, while being small enough to run locally on a smartphone.
- Trained on a high-quality dataset of heavily filtered web data and LLM-generated synthetic data.
- Can be quantized to 4 bits, occupying roughly 1.8 GB of memory (see the sketch below).
- Runs natively on an iPhone 14 (A16 Bionic chip) at over 12 tokens per second, fully offline.
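To make the 4-bit memory figure concrete, here is a minimal sketch of loading phi-3-mini quantized to 4 bits with Hugging Face transformers and bitsandbytes. The model ID, generation settings, and `trust_remote_code` flag are my assumptions, not from the report; check the official model card before running.

```python
# Minimal sketch: 4-bit quantized inference with phi-3-mini.
# The model ID below is an assumption; verify on the Hugging Face Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed HF model ID

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights, ~1.8 GB per the report
    bnb_4bit_compute_dtype=torch.bfloat16,  # matches the bfloat16 training dtype
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
    trust_remote_code=True,  # may be needed on older transformers versions
)

prompt = "Explain why small language models matter."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```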
Phi-3-small:
- Architecture specs: also decoder-only, 7B parameters, vocab size 100,352, default context length of 8K, hidden dimension 4096, following the standard 7B-class layout of 32 heads and 32 layers.
- Uses the tiktoken tokenizer for enhanced multilingual tokenization (see the sketch below).
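As a quick illustration of why the tokenizer choice matters for multilingual text, the snippet below counts tokens with tiktoken's `cl100k_base` encoding. Mapping phi-3-small's 100,352-entry vocabulary to `cl100k_base` is my assumption based on the vocab size alone; the snippet only demonstrates the general behavior of a large BPE vocabulary on non-English text.

```python
# Illustration only: count tokens for the same sentence in several
# languages using tiktoken's cl100k_base encoding. Treating this as
# phi-3-small's exact vocabulary is an assumption from its size.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
samples = [
    "The quick brown fox jumps over the lazy dog.",
    "Der schnelle braune Fuchs springt über den faulen Hund.",
    "素早い茶色の狐がのろまな犬を飛び越える。",
]
for text in samples:
    token_ids = enc.encode(text)
    print(f"{len(token_ids):3d} tokens <- {text}")
```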
Phi-3-medium:
- Architecture specs: also decoder-only, 14B parameters, hidden dimension 5120, 40 heads, 40 layers, same tokenizer as phi-3-mini, trained on 4.8 trillion tokens (a rough parameter estimate from these specs is sketched below).
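As a sanity check on those specs, a standard back-of-envelope formula (roughly 12·L·d² weights for the transformer stack plus the embedding matrix) lands in the same ballpark as the 14B figure. This rule of thumb is a common approximation, not something from the report, and it ignores layer norms, the LM head, and any attention variants.

```python
# Back-of-envelope parameter estimate for phi-3-medium from its quoted
# specs. The 12 * L * d^2 rule of thumb (~4d^2 attention + ~8d^2 MLP
# per layer) is an approximation and slightly undercounts the real 14B.
layers, d_model, vocab = 40, 5120, 32064

block_params = 12 * layers * d_model**2   # attention + MLP weight matrices
embed_params = vocab * d_model            # token embedding table
total = block_params + embed_params
print(f"~{total / 1e9:.1f}B parameters")  # prints ~12.7B, same ballpark as 14B
```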
Training Methodology:
- Focuses on data quality over raw scale, deviating from standard compute-optimal scaling laws toward a "data optimal" regime.
- The models undergo two-phase pre-training: phase 1 uses mostly filtered web sources to teach general knowledge and language understanding, while phase 2 blends even more heavily filtered web data with synthetic data to teach logical reasoning and niche skills (a conceptual sketch follows below).
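To make the two-phase idea concrete, here is a conceptual sketch of a phase-dependent data mixture. The source names and mixture weights are illustrative assumptions; the actual Phi-3 pipeline and proportions are not public at this level of detail.

```python
# Conceptual sketch of two-phase pre-training data mixing. Source names
# and mixture weights are illustrative assumptions, not Phi-3's actual
# recipe: phase 1 leans on filtered web data, phase 2 shifts toward
# heavily filtered web data plus LLM-generated synthetic data.
import random

PHASES = [
    {"name": "phase-1", "mixture": {"filtered_web": 0.9, "code": 0.1}},
    {"name": "phase-2", "mixture": {"heavily_filtered_web": 0.6, "synthetic": 0.4}},
]

def sample_source(mixture: dict, rng: random.Random) -> str:
    """Draw one data source according to the phase's mixture weights."""
    sources = list(mixture)
    weights = list(mixture.values())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
for phase in PHASES:
    batch = [sample_source(phase["mixture"], rng) for _ in range(8)]
    print(phase["name"], "->", batch)
```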
Performance:
- Phi-3-mini achieves 69% on MMLU and 8.38 on MT-Bench, competitive with much larger models and indicative of strong reasoning capabilities.
- The larger variants scale further: phi-3-small reaches 75% on MMLU (8.7 on MT-Bench) and phi-3-medium 78% (8.9 on MT-Bench), suggesting effective scaling with model size.
Limitations:
- phi-3-mini: limited by its small size on tasks requiring extensive factual knowledge, and primarily supports English.
- phi-3-small: multilingual support remains limited despite the larger tokenizer vocabulary.
Hosting LLMs locally is a big win for OSS - private, secure inferencing on the go.