from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel
import torch
from flask import Flask, request, jsonify
from flask_cors import CORS

# Setup
app = Flask(__name__)
CORS(app)  # Enable CORS for frontend/backend calls

# Load base model + adapter
base_model_name = "unsloth/gemma-3-12b-it-unsloth-bnb-4bit"
adapter_name = "adarsh3601/my_gemma3_pt"

# 4-bit NF4 quantization with double quantization; compute in fp16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4"
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# Attach the LoRA adapter on top of the quantized base model
model = PeftModel.from_pretrained(base_model, adapter_name)


@app.route("/chat", methods=["POST"])
def chat():
    try:
        data = request.get_json(silent=True) or {}
        prompt = data.get("message", "")

        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True)

        # Decode only the newly generated tokens, not the echoed prompt
        new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
        response = tokenizer.decode(new_tokens, skip_special_tokens=True)

        return jsonify({"response": response})
    except Exception as e:
        return jsonify({"error": str(e)}), 500


# For Hugging Face Spaces to detect the server
@app.route("/", methods=["GET"])
def root():
    return "HF Space backend running"


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=7860)
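
# Example client call (a minimal sketch; it assumes the server above is running
# and reachable at http://localhost:7860, so adjust the URL for a deployed
# Space). Run this from a separate process:
#
#   import requests
#
#   resp = requests.post(
#       "http://localhost:7860/chat",
#       json={"message": "Hello, who are you?"},
#   )
#   print(resp.json()["response"])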