---
title: Gemma 3 270M Text Generation API
emoji: 🤗
colorFrom: purple
colorTo: indigo
sdk: docker
sdk_version: 0.0.1
app_file: app.py
pinned: false
---
# Gemma 3 270M FastAPI Inference
This project provides a high-performance, FastAPI-based inference server for the Google Gemma 3 270M language model, built on llama-cpp-python. It features thread-pool-based asynchronous request processing, rate limiting, and optimized GGUF model loading for fast response times.
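The thread-pool approach mentioned above matters because llama-cpp-python's generation call is blocking: running it directly in an async handler would stall the event loop. A minimal sketch of the pattern, using a stand-in function in place of the real model call (the actual handler and model invocation live in `app.py` and may differ):

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the blocking llama-cpp-python call; the real app would
# invoke the loaded GGUF model here instead of sleeping.
def blocking_generate(prompt: str) -> str:
    time.sleep(0.05)  # simulate model latency
    return f"echo: {prompt}"

executor = ThreadPoolExecutor(max_workers=2)

async def generate(prompt: str) -> str:
    # Offload the CPU-bound call to the pool so the event loop
    # (and therefore other concurrent requests) stays responsive.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, blocking_generate, prompt)

async def main() -> None:
    # Four concurrent "requests" share the two worker threads.
    results = await asyncio.gather(*(generate(f"req {i}") for i in range(4)))
    print(results)

asyncio.run(main())
```

Inside a FastAPI route the body of `generate` would be the same; FastAPI awaits the handler while the pool does the heavy lifting.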
This Space hosts the Google Gemma 3 270M model behind a FastAPI backend with:
- ⚡ High Performance: llama-cpp-python with the GGUF model format for faster inference
- 🚦 Rate Limiting: IP-based request throttling
- 🎛️ Flexible Input: support for both chat messages and direct prompts
- 📊 Monitoring: built-in health checks and metrics
- 🛡️ Production Ready: comprehensive error handling and logging
- 🔧 Configurable: environment-based configuration with CPU/GPU support
- 🐳 Docker Support: ready-to-deploy containerization
- 💾 Memory Efficient: GGUF quantized models for reduced memory usage
## Usage

### Health Check

```bash
curl https://<your-space>/health
```
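Beyond the health check, a generation request might look like the following. Both the route name (`/generate`) and the JSON fields here are illustrative assumptions; the real path and schema are defined in `app.py`.

```bash
# Hypothetical text-generation request; adjust path and fields to app.py.
payload='{"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 64}'
curl -s -X POST "https://<your-space>/generate" \
  -H "Content-Type: application/json" \
  -d "$payload"
```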