New architecture: TMT, dynamic graph attention + adaptive depth routing, 29.4 PPL at 48% compute (120M params)
#148
by vigneshwar234
TemporalMesh Transformer (TMT v3): a new efficient transformer with 29.4 PPL, 48% compute, 5 innovations
Hi everyone! I'm releasing TemporalMesh Transformer (TMT v3), an open-source PyTorch transformer architecture built for efficiency at the 120M-parameter scale: 29.4 PPL on WikiText-2 at 48% compute.
The problem with current transformers
Standard transformers have three hard inefficiencies that haven't been fixed together:
- Attention is O(S²): quadratic in sequence length
- Attention topology is static: fully connected, never adapts to semantic content
- Every token uses all N layers regardless of complexity
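To put rough numbers on the first and third points, here is a back-of-the-envelope sketch. The sequence length, the neighbour count `k`, and the per-token exit layers are hypothetical values picked for illustration, not measurements from any model.

```python
# Back-of-the-envelope cost accounting.
# All concrete numbers here are illustrative assumptions, not measured values.


def dense_attention_pairs(seq_len: int) -> int:
    """Query-key pairs scored by full attention: grows as O(S^2)."""
    return seq_len * seq_len


def sparse_attention_pairs(seq_len: int, k: int) -> int:
    """Pairs scored when each token attends to at most k neighbours: O(S*k)."""
    return seq_len * min(k, seq_len)


def depth_fraction(exit_layers: list[int], n_layers: int) -> float:
    """Share of token-layer steps executed when token i stops after exit_layers[i] layers."""
    return sum(exit_layers) / (n_layers * len(exit_layers))


if __name__ == "__main__":
    S, k, N = 2048, 32, 12  # hypothetical sequence length, neighbours, layers
    print("dense pairs :", dense_attention_pairs(S))
    print("sparse pairs:", sparse_attention_pairs(S, k))
    # Hypothetical exit layers for six tokens: easy tokens leave early.
    print("depth used  :", depth_fraction([2, 2, 4, 6, 12, 12], N))
```

With full depth for every token, `depth_fraction` is 1.0, which is the baseline the third bullet describes.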
TMT fixes all three simultaneously
| Innovation | What it does | Cost |
|---|---|---|
| Mesh Attention | Dynamic kNN graph rebuilt per-layer from cosine similarity | O(S·k) |
| Temporal Decay | Learned multiplicative attenuation of distant tokens post-softmax | ~0 overhead |
| Adaptive Exit | Per-token gate: punctuation exits at layer 2, rare words at layer 12 | -52% compute |
| Dual-Stream FFN | Parallel syntax + semantic streams, sigmoid fusion | Same FLOPs |
| EMA Anchors | 16 persistent fast-weight vectors, cross-sequence recall | 32KB params |
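The first two rows of the table can be illustrated with a dependency-free sketch. This is not the TMT implementation: it is single-head, uses the raw token vectors as queries, keys, and values (no learned projections), and replaces the learned decay with a fixed scalar `decay`. The function names are made up for this example. Whether TMT renormalises the weights after the decay is not stated here, so the sketch leaves them unnormalised.

```python
import math


def cosine(u, v):
    """Cosine similarity with a small epsilon to tolerate zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def knn_neighbours(x, k):
    """For each token, indices of its k most cosine-similar tokens (self included).

    Note: this brute-force search is itself O(S^2); only the attention
    step below is O(S*k). A real implementation would need a cheaper search.
    """
    n = len(x)
    return [
        sorted(range(n), key=lambda j: cosine(x[i], x[j]), reverse=True)[:k]
        for i in range(n)
    ]


def mesh_attention(x, k, decay):
    """Sparse attention over a kNN graph with post-softmax distance decay."""
    d = len(x[0])
    out = []
    for i, nbrs in enumerate(knn_neighbours(x, k)):
        # Scaled dot-product scores, restricted to the k graph neighbours.
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d) for j in nbrs]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Assumed decay form: multiply each weight by decay ** |i - j| after softmax.
        weights = [w * decay ** abs(i - j) for w, j in zip(weights, nbrs)]
        out.append([sum(w * x[j][c] for w, j in zip(weights, nbrs)) for c in range(d)])
    return out


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(knn_neighbours(tokens, 2))
    print(mesh_attention(tokens, k=2, decay=0.5))
```

Setting `decay=1.0` turns the attenuation off, and setting `k` to the sequence length recovers ordinary dense attention over the same scores.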
Key results (120M params, all seeds 42/1337/2024)
- WikiText-2: 29.4 PPL (vs 42.1 vanilla, 31.8 Mamba, 33.1 RWKV)
- WikiText-103: 36.1 PPL (vs 51.3 vanilla, 38.4 Mamba)
- LongBench: 53.4 avg score (vs 51.3 Mamba, 49.8 Longformer)
- C4: 27.4 PPL, The Pile: 35.8 PPL, OpenWebText: 30.1 PPL
- Throughput: 138K tokens/sec on an A100 in FP16
- Superadditive gain: the full model improves PPL by 12.7, versus 8.6 from summing the gains of each component added individually
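For readers who want the gaps in relative terms, the arithmetic on the reported WikiText-2 numbers can be sketched as below. It only uses figures from the list above. The helper names are made up for this example, and reading the 12.7 figure as the WikiText-2 gap to the vanilla baseline (42.1 vs 29.4) is an assumption.

```python
# WikiText-2 perplexities as reported in the post (120M parameters).
WIKITEXT2_PPL = {"TMT": 29.4, "vanilla": 42.1, "Mamba": 31.8, "RWKV": 33.1}


def relative_reduction(baseline: float, model: float) -> float:
    """Fractional perplexity reduction of `model` relative to `baseline`."""
    return (baseline - model) / baseline


def synergy(combined_gain: float, summed_individual_gain: float) -> float:
    """Extra gain of the full model beyond the sum of single-component gains."""
    return combined_gain - summed_individual_gain


if __name__ == "__main__":
    tmt = WIKITEXT2_PPL["TMT"]
    for name, ppl in WIKITEXT2_PPL.items():
        if name != "TMT":
            print(f"TMT vs {name}: {relative_reduction(ppl, tmt):.1%} lower PPL")
    # Assumption: 12.7 is the vanilla-to-TMT gap on WikiText-2.
    print("absolute gap to vanilla:", round(WIKITEXT2_PPL["vanilla"] - tmt, 1))
    print("gain beyond additive   :", round(synergy(12.7, 8.6), 1))
```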
Quick start
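```python
import torch

from tmt.model.config import TMTConfig
from tmt.model.model import TMTModel

model = TMTModel(TMTConfig(vocab_size=50257, d_model=512, n_heads=8, n_layers=12))
# Assumed input: a (batch, seq_len) tensor of token ids; adjust to your tokenizer output.
tokens = torch.randint(0, 50257, (1, 128))
out = model(tokens)  # out.logits, out.exit_masks, out.graph_edges, out.confidences
```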
Links
- Paper: https://zenodo.org/records/20287390 (DOI: 10.5281/zenodo.20287197)
- Code (226 tests): https://github.com/vignesh2027/TemporalMesh-Transformer
- Live demo: https://huggingface.co/spaces/vigneshwar234/TemporalMesh-Transformer-Demo
- Model: https://huggingface.co/vigneshwar234/TemporalMesh-Transformer
Happy to answer questions on the architecture, training, or ablations!