---
title: Gemma 3 270M Text Generation API
emoji: 🤖
colorFrom: purple
colorTo: indigo
sdk: docker
sdk_version: 0.0.1
app_file: app.py
pinned: false
---

# Gemma 3 270M FastAPI Inference

This project provides a high-performance FastAPI inference server for Google's Gemma 3 270M language model, built on llama-cpp-python. It features thread-pool-based asynchronous request handling, rate limiting, and optimized GGUF model loading for fast response times.
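The thread-pool pattern mentioned above can be sketched roughly as follows. Here `blocking_generate` is a hypothetical stand-in for the blocking llama-cpp-python call in `app.py`; the real server would invoke the loaded GGUF model instead:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the blocking llama-cpp-python inference call.
def blocking_generate(prompt: str) -> str:
    return f"generated: {prompt}"

# A small pool keeps concurrent llama.cpp calls from oversubscribing the CPU.
executor = ThreadPoolExecutor(max_workers=2)

async def generate_async(prompt: str) -> str:
    # Offload the blocking call to the pool so the event loop stays responsive.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, blocking_generate, prompt)

result = asyncio.run(generate_async("Hello, Gemma!"))
print(result)  # generated: Hello, Gemma!
```

In a FastAPI route handler this same `run_in_executor` call keeps other requests (such as health checks) from being blocked while a generation is in flight.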

This Space hosts the Google Gemma 3 270M model behind a FastAPI backend with:

- ⚡ **High Performance**: llama-cpp-python with the GGUF model format for faster inference
- 🔒 **Rate Limiting**: IP-based request throttling
- 🎛️ **Flexible Input**: support for both chat messages and direct prompts
- 📊 **Monitoring**: built-in health checks and metrics
- 🚀 **Production Ready**: comprehensive error handling and logging
- 🔧 **Configurable**: environment-based configuration with CPU/GPU support
- 🐳 **Docker Support**: ready-to-deploy containerization
- 💾 **Memory Efficient**: GGUF quantized models for reduced memory usage

## Usage

### Health Check

```bash
curl https://<your-space>/health
```
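Generation requests go to the route defined in `app.py`. The example below assumes a `/generate` endpoint that accepts chat-style messages; the path and field names are illustrative, so adjust them to match your deployment:

```bash
curl -X POST https://<your-space>/generate \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 128}'
```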