Spaces:
Sleeping
Sleeping
Initial Humanzise backend deployment
Browse filesFastAPI + desklib DeBERTa-v3 AI detector + rule-based humanizer.
Docker SDK, exposes port 7860, runs as non-root user (UID 1000).
- .dockerignore +45 -0
- .gitignore +10 -0
- Dockerfile +60 -0
- LICENSE +9 -0
- README.md +87 -6
- api/humanize_api.py +163 -0
- requirements.txt +0 -0
- utils/__init__.py +0 -0
- utils/ai_detection_utils.py +62 -0
- utils/desklib_model.py +45 -0
- utils/humanizer_core.py +261 -0
- utils/model_loaders.py +61 -0
- utils/pdf_utils.py +63 -0
.dockerignore
ADDED
|
@@ -0,0 +1,45 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Exclude everything the backend Docker image doesn't need.
|
| 2 |
+
# Keeps the HF Spaces build context small and fast.
|
| 3 |
+
|
| 4 |
+
# Frontend — deployed separately to Vercel
|
| 5 |
+
web/
|
| 6 |
+
|
| 7 |
+
# Local python env
|
| 8 |
+
venv/
|
| 9 |
+
__pycache__/
|
| 10 |
+
*.pyc
|
| 11 |
+
*.pyo
|
| 12 |
+
.pytest_cache/
|
| 13 |
+
.mypy_cache/
|
| 14 |
+
.ruff_cache/
|
| 15 |
+
|
| 16 |
+
# Model caches (let the container populate its own)
|
| 17 |
+
.cache/
|
| 18 |
+
~/.cache/
|
| 19 |
+
*.safetensors
|
| 20 |
+
*.bin
|
| 21 |
+
*.ckpt
|
| 22 |
+
|
| 23 |
+
# Git / IDE / OS
|
| 24 |
+
.git/
|
| 25 |
+
.github/
|
| 26 |
+
.vscode/
|
| 27 |
+
.idea/
|
| 28 |
+
.DS_Store
|
| 29 |
+
Thumbs.db
|
| 30 |
+
|
| 31 |
+
# Docs / backups / logs
|
| 32 |
+
*.md
|
| 33 |
+
!README.md
|
| 34 |
+
*-backup-*.zip
|
| 35 |
+
*.log
|
| 36 |
+
|
| 37 |
+
# Upstream fork artifacts not needed in production
|
| 38 |
+
pages/
|
| 39 |
+
main.py
|
| 40 |
+
setup.sh
|
| 41 |
+
Procfile
|
| 42 |
+
vercel.json
|
| 43 |
+
nltk.txt
|
| 44 |
+
requirements-local.txt
|
| 45 |
+
*.ttf
|
.gitignore
ADDED
|
@@ -0,0 +1,10 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
__pycache__/
|
| 2 |
+
*.pyc
|
| 3 |
+
*.pyo
|
| 4 |
+
.DS_Store
|
| 5 |
+
Thumbs.db
|
| 6 |
+
.vscode/
|
| 7 |
+
.idea/
|
| 8 |
+
venv/
|
| 9 |
+
.env
|
| 10 |
+
.env.local
|
Dockerfile
ADDED
|
@@ -0,0 +1,60 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Humanzise backend — Docker image for Hugging Face Spaces (Docker SDK).
|
| 2 |
+
#
|
| 3 |
+
# HF Spaces requirements met here:
|
| 4 |
+
# - Listens on 0.0.0.0:7860
|
| 5 |
+
# - Runs as non-root user with UID 1000 (`user`)
|
| 6 |
+
# - $HOME = /home/user so HF Hub cache persists under the user
|
| 7 |
+
#
|
| 8 |
+
# Build size strategy:
|
| 9 |
+
# - CPU-only torch wheel (~500 MB instead of ~2 GB CUDA)
|
| 10 |
+
# - --no-cache-dir on every pip install
|
| 11 |
+
# - Slim Debian base
|
| 12 |
+
|
| 13 |
+
FROM python:3.11-slim
|
| 14 |
+
|
| 15 |
+
# System deps needed for occasional source builds
|
| 16 |
+
RUN apt-get update && apt-get install -y --no-install-recommends \
|
| 17 |
+
build-essential \
|
| 18 |
+
git \
|
| 19 |
+
&& rm -rf /var/lib/apt/lists/*
|
| 20 |
+
|
| 21 |
+
# HF Spaces mandates a non-root user with UID 1000
|
| 22 |
+
RUN useradd --create-home --uid 1000 user
|
| 23 |
+
USER user
|
| 24 |
+
ENV HOME=/home/user \
|
| 25 |
+
PATH=/home/user/.local/bin:$PATH \
|
| 26 |
+
HF_HOME=/home/user/.cache/huggingface \
|
| 27 |
+
TRANSFORMERS_CACHE=/home/user/.cache/huggingface \
|
| 28 |
+
PYTHONDONTWRITEBYTECODE=1 \
|
| 29 |
+
PYTHONUNBUFFERED=1
|
| 30 |
+
|
| 31 |
+
WORKDIR /home/user/app
|
| 32 |
+
|
| 33 |
+
# Install CPU-only torch first so transformers picks it up and doesn't pull CUDA
|
| 34 |
+
RUN pip install --no-cache-dir --user --upgrade pip && \
|
| 35 |
+
pip install --no-cache-dir --user \
|
| 36 |
+
--index-url https://download.pytorch.org/whl/cpu \
|
| 37 |
+
torch
|
| 38 |
+
|
| 39 |
+
# Install the rest of the deps
|
| 40 |
+
COPY --chown=user:user requirements.txt .
|
| 41 |
+
RUN pip install --no-cache-dir --user -r requirements.txt
|
| 42 |
+
|
| 43 |
+
# Pre-download the small NLP models so cold requests don't pay the download tax
|
| 44 |
+
RUN python -m spacy download en_core_web_sm && \
|
| 45 |
+
python -c "import nltk; \
|
| 46 |
+
nltk.download('punkt', quiet=True); \
|
| 47 |
+
nltk.download('punkt_tab', quiet=True); \
|
| 48 |
+
nltk.download('averaged_perceptron_tagger', quiet=True); \
|
| 49 |
+
nltk.download('averaged_perceptron_tagger_eng', quiet=True); \
|
| 50 |
+
nltk.download('wordnet', quiet=True)"
|
| 51 |
+
|
| 52 |
+
# Copy application code
|
| 53 |
+
COPY --chown=user:user api ./api
|
| 54 |
+
COPY --chown=user:user utils ./utils
|
| 55 |
+
|
| 56 |
+
EXPOSE 7860
|
| 57 |
+
|
| 58 |
+
# The desklib model (~1.75 GB) downloads lazily on the first /detect request
|
| 59 |
+
# and is cached under $HF_HOME for the life of the container.
|
| 60 |
+
CMD ["uvicorn", "api.humanize_api:app", "--host", "0.0.0.0", "--port", "7860"]
|
LICENSE
ADDED
|
@@ -0,0 +1,9 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
MIT License
|
| 2 |
+
|
| 3 |
+
Copyright (c) 2025 DADA NANJESHA for project AI Content Detector & Humanizer
|
| 4 |
+
|
| 5 |
+
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
|
| 6 |
+
|
| 7 |
+
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
|
| 8 |
+
|
| 9 |
+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
README.md
CHANGED
|
@@ -1,12 +1,93 @@
|
|
| 1 |
---
|
| 2 |
-
title: Humanzise
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
-
colorTo:
|
| 6 |
sdk: docker
|
|
|
|
| 7 |
pinned: false
|
| 8 |
-
license: mit
|
| 9 |
short_description: Free AI text humanizer and detector
|
| 10 |
---
|
| 11 |
|
| 12 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
+
title: Humanzise API
|
| 3 |
+
emoji: 🪄
|
| 4 |
+
colorFrom: green
|
| 5 |
+
colorTo: indigo
|
| 6 |
sdk: docker
|
| 7 |
+
app_port: 7860
|
| 8 |
pinned: false
|
|
|
|
| 9 |
short_description: Free AI text humanizer and detector
|
| 10 |
---
|
| 11 |
|
| 12 |
+
# Humanzise
|
| 13 |
+
|
| 14 |
+
Free, open-source **AI text humanizer** + **AI detector**. Paste any AI-generated text and rewrite it to sound more natural — or check how likely an existing text was written by AI.
|
| 15 |
+
|
| 16 |
+
- **Frontend**: Next.js 16 + shadcn/ui + Tailwind CSS (deployed on Vercel)
|
| 17 |
+
- **Backend**: FastAPI + PyTorch + DeBERTa-v3 detector (deployed on Hugging Face Spaces)
|
| 18 |
+
- **Detector model**: [`desklib/ai-text-detector-v1.01`](https://huggingface.co/desklib/ai-text-detector-v1.01) — current leader on the RAID benchmark
|
| 19 |
+
- **Humanizer**: rule-based pipeline (WordNet synonyms + contraction expansion + academic transitions + citation preservation)
|
| 20 |
+
|
| 21 |
+
## Repository layout
|
| 22 |
+
|
| 23 |
+
```
|
| 24 |
+
humanzise/
|
| 25 |
+
├── api/ FastAPI app (entry point: api.humanize_api:app)
|
| 26 |
+
│ └── humanize_api.py
|
| 27 |
+
├── utils/ Backend logic
|
| 28 |
+
│ ├── humanizer_core.py Text humanization pipeline
|
| 29 |
+
│ ├── ai_detection_utils.py
|
| 30 |
+
│ ├── desklib_model.py Custom DeBERTa-v3 wrapper for desklib weights
|
| 31 |
+
│ ├── model_loaders.py
|
| 32 |
+
│ └── pdf_utils.py PDF text extraction
|
| 33 |
+
├── web/ Next.js frontend
|
| 34 |
+
│ └── src/
|
| 35 |
+
│ ├── app/
|
| 36 |
+
│ ├── components/
|
| 37 |
+
│ └── lib/
|
| 38 |
+
├── Dockerfile HF Spaces Docker image
|
| 39 |
+
├── requirements.txt Production deps (lean, CPU-only torch)
|
| 40 |
+
├── requirements-local.txt All dev deps
|
| 41 |
+
└── DEPLOY.md Step-by-step deployment guide
|
| 42 |
+
```
|
| 43 |
+
|
| 44 |
+
## Running locally
|
| 45 |
+
|
| 46 |
+
### Backend (Python 3.12)
|
| 47 |
+
|
| 48 |
+
```bash
|
| 49 |
+
python -m venv venv
|
| 50 |
+
source venv/Scripts/activate # or venv/bin/activate on macOS/Linux
|
| 51 |
+
pip install -r requirements-local.txt
|
| 52 |
+
python -m spacy download en_core_web_sm
|
| 53 |
+
|
| 54 |
+
python -m uvicorn api.humanize_api:app --reload --port 8000
|
| 55 |
+
```
|
| 56 |
+
|
| 57 |
+
Scalar docs: http://localhost:8000/docs
|
| 58 |
+
|
| 59 |
+
### Frontend (Node 20+)
|
| 60 |
+
|
| 61 |
+
```bash
|
| 62 |
+
cd web
|
| 63 |
+
npm install
|
| 64 |
+
npm run dev
|
| 65 |
+
```
|
| 66 |
+
|
| 67 |
+
Open http://localhost:3000. Set `NEXT_PUBLIC_API_BASE_URL` in `web/.env.local` if your backend isn't on `http://127.0.0.1:8000`.
|
| 68 |
+
|
| 69 |
+
## API endpoints
|
| 70 |
+
|
| 71 |
+
| Method | Path | Description |
|
| 72 |
+
|---|---|---|
|
| 73 |
+
| `GET` | `/health` | Liveness probe |
|
| 74 |
+
| `POST` | `/humanize` | Rewrite AI text to sound more natural |
|
| 75 |
+
| `POST` | `/detect` | Score text for AI likelihood (desklib DeBERTa-v3) |
|
| 76 |
+
| `POST` | `/extract-file` | Extract text from uploaded PDF/TXT/MD |
|
| 77 |
+
|
| 78 |
+
All endpoints use JSON request/response; `/extract-file` uses `multipart/form-data`.
|
| 79 |
+
|
| 80 |
+
## Deployment
|
| 81 |
+
|
| 82 |
+
Free deployment path is documented in [DEPLOY.md](./DEPLOY.md):
|
| 83 |
+
|
| 84 |
+
- **Frontend** → Vercel (free, `web/` subfolder)
|
| 85 |
+
- **Backend** → Hugging Face Spaces (Docker SDK, free 16 GB RAM)
|
| 86 |
+
|
| 87 |
+
## Credits
|
| 88 |
+
|
| 89 |
+
Forked from [DadaNanjesha/AI-content-detector-Humanizer](https://github.com/DadaNanjesha/AI-content-detector-Humanizer) — original Streamlit app. This fork replaced the Streamlit UI with a Next.js frontend, modernized the backend, and swapped in the desklib detector.
|
| 90 |
+
|
| 91 |
+
## License
|
| 92 |
+
|
| 93 |
+
MIT — see [LICENSE](./LICENSE).
|
api/humanize_api.py
ADDED
|
@@ -0,0 +1,163 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import re
|
| 2 |
+
from typing import Dict, Optional
|
| 3 |
+
|
| 4 |
+
from fastapi import FastAPI, File, HTTPException, UploadFile
|
| 5 |
+
from fastapi.middleware.cors import CORSMiddleware
|
| 6 |
+
from pydantic import BaseModel, Field
|
| 7 |
+
|
| 8 |
+
from utils.ai_detection_utils import classify_text_hf
|
| 9 |
+
from utils.pdf_utils import extract_text_from_pdf
|
| 10 |
+
from utils.humanizer_core import (
|
| 11 |
+
count_sentences,
|
| 12 |
+
count_words,
|
| 13 |
+
extract_citations,
|
| 14 |
+
minimal_rewriting,
|
| 15 |
+
preserve_linebreaks_rewrite,
|
| 16 |
+
restore_citations,
|
| 17 |
+
)
|
| 18 |
+
|
| 19 |
+
|
| 20 |
+
DESCRIPTION = """
|
| 21 |
+
AI Text Humanizer & Detector API
|
| 22 |
+
|
| 23 |
+
Provides server-side access to the project's text humanization and AI-detection
|
| 24 |
+
pipelines. The API is consumed by the Next.js frontend in /web.
|
| 25 |
+
"""
|
| 26 |
+
|
| 27 |
+
tags_metadata = [
|
| 28 |
+
{"name": "humanize", "description": "Transform AI-generated text into human-like prose."},
|
| 29 |
+
{"name": "detect", "description": "Classify text as AI-generated or human-written."},
|
| 30 |
+
]
|
| 31 |
+
|
| 32 |
+
app = FastAPI(
|
| 33 |
+
title="AI Text Humanizer API",
|
| 34 |
+
version="0.3",
|
| 35 |
+
description=DESCRIPTION,
|
| 36 |
+
openapi_tags=tags_metadata,
|
| 37 |
+
)
|
| 38 |
+
|
| 39 |
+
app.add_middleware(
|
| 40 |
+
CORSMiddleware,
|
| 41 |
+
allow_origins=["*"],
|
| 42 |
+
allow_credentials=True,
|
| 43 |
+
allow_methods=["*"],
|
| 44 |
+
allow_headers=["*"],
|
| 45 |
+
)
|
| 46 |
+
|
| 47 |
+
|
| 48 |
+
class HumanizeRequest(BaseModel):
|
| 49 |
+
text: str = Field(..., description="The input text to humanize. Must be non-empty.")
|
| 50 |
+
p_syn: Optional[float] = Field(0.2, ge=0.0, le=1.0)
|
| 51 |
+
p_trans: Optional[float] = Field(0.2, ge=0.0, le=1.0)
|
| 52 |
+
preserve_linebreaks: Optional[bool] = Field(True)
|
| 53 |
+
|
| 54 |
+
|
| 55 |
+
class HumanizeResponse(BaseModel):
|
| 56 |
+
humanized_text: str
|
| 57 |
+
orig_word_count: int
|
| 58 |
+
orig_sentence_count: int
|
| 59 |
+
new_word_count: int
|
| 60 |
+
new_sentence_count: int
|
| 61 |
+
words_added: int
|
| 62 |
+
sentences_added: int
|
| 63 |
+
|
| 64 |
+
|
| 65 |
+
class DetectRequest(BaseModel):
|
| 66 |
+
text: str = Field(..., description="The input text to analyze.")
|
| 67 |
+
|
| 68 |
+
|
| 69 |
+
class DetectResponse(BaseModel):
|
| 70 |
+
percentages: Dict[str, float]
|
| 71 |
+
classification: Dict[str, str]
|
| 72 |
+
ai_score: float
|
| 73 |
+
human_score: float
|
| 74 |
+
|
| 75 |
+
|
| 76 |
+
@app.get("/health", tags=["humanize"], summary="Health check")
|
| 77 |
+
def health():
|
| 78 |
+
return {"status": "ok"}
|
| 79 |
+
|
| 80 |
+
|
| 81 |
+
@app.post("/humanize", response_model=HumanizeResponse, tags=["humanize"])
|
| 82 |
+
def humanize(req: HumanizeRequest):
|
| 83 |
+
text = req.text or ""
|
| 84 |
+
if not text.strip():
|
| 85 |
+
raise HTTPException(status_code=400, detail="`text` must be a non-empty string")
|
| 86 |
+
|
| 87 |
+
orig_wc = count_words(text)
|
| 88 |
+
orig_sc = count_sentences(text)
|
| 89 |
+
|
| 90 |
+
no_refs_text, placeholders = extract_citations(text)
|
| 91 |
+
|
| 92 |
+
if req.preserve_linebreaks:
|
| 93 |
+
rewritten = preserve_linebreaks_rewrite(no_refs_text, p_syn=req.p_syn, p_trans=req.p_trans)
|
| 94 |
+
else:
|
| 95 |
+
rewritten = minimal_rewriting(no_refs_text, p_syn=req.p_syn, p_trans=req.p_trans)
|
| 96 |
+
|
| 97 |
+
final_text = restore_citations(rewritten, placeholders)
|
| 98 |
+
final_text = re.sub(r"[ \t]+([.,;:!?])", r"\1", final_text)
|
| 99 |
+
final_text = re.sub(r"(\()[ \t]+", r"\1", final_text)
|
| 100 |
+
final_text = re.sub(r"[ \t]+(\))", r"\1", final_text)
|
| 101 |
+
final_text = re.sub(r"[ \t]{2,}", " ", final_text)
|
| 102 |
+
final_text = re.sub(r"``\s*(.+?)\s*''", r'"\1"', final_text)
|
| 103 |
+
|
| 104 |
+
new_wc = count_words(final_text)
|
| 105 |
+
new_sc = count_sentences(final_text)
|
| 106 |
+
|
| 107 |
+
return {
|
| 108 |
+
"humanized_text": final_text,
|
| 109 |
+
"orig_word_count": orig_wc,
|
| 110 |
+
"orig_sentence_count": orig_sc,
|
| 111 |
+
"new_word_count": new_wc,
|
| 112 |
+
"new_sentence_count": new_sc,
|
| 113 |
+
"words_added": new_wc - orig_wc,
|
| 114 |
+
"sentences_added": new_sc - orig_sc,
|
| 115 |
+
}
|
| 116 |
+
|
| 117 |
+
|
| 118 |
+
@app.post("/extract-file", tags=["humanize"], summary="Extract text from uploaded file")
|
| 119 |
+
async def extract_file(file: UploadFile = File(...)):
|
| 120 |
+
"""Accept a PDF, TXT or MD file and return its plain-text contents."""
|
| 121 |
+
if not file.filename:
|
| 122 |
+
raise HTTPException(status_code=400, detail="No file provided")
|
| 123 |
+
|
| 124 |
+
content = await file.read()
|
| 125 |
+
name = file.filename.lower()
|
| 126 |
+
|
| 127 |
+
try:
|
| 128 |
+
if name.endswith(".pdf"):
|
| 129 |
+
text = extract_text_from_pdf(content)
|
| 130 |
+
elif name.endswith((".txt", ".md")):
|
| 131 |
+
text = content.decode("utf-8", errors="ignore")
|
| 132 |
+
else:
|
| 133 |
+
raise HTTPException(
|
| 134 |
+
status_code=400,
|
| 135 |
+
detail="Unsupported file type. Use .pdf, .txt, or .md",
|
| 136 |
+
)
|
| 137 |
+
except HTTPException:
|
| 138 |
+
raise
|
| 139 |
+
except Exception as exc:
|
| 140 |
+
raise HTTPException(status_code=500, detail=f"Failed to extract: {exc}")
|
| 141 |
+
|
| 142 |
+
return {"text": text, "filename": file.filename}
|
| 143 |
+
|
| 144 |
+
|
| 145 |
+
@app.post("/detect", response_model=DetectResponse, tags=["detect"])
|
| 146 |
+
def detect(req: DetectRequest):
|
| 147 |
+
text = req.text or ""
|
| 148 |
+
if not text.strip():
|
| 149 |
+
raise HTTPException(status_code=400, detail="`text` must be a non-empty string")
|
| 150 |
+
|
| 151 |
+
classification_map, percentages, mean_ai_prob = classify_text_hf(text)
|
| 152 |
+
|
| 153 |
+
# Use the raw mean probability as the headline score — it's a more honest
|
| 154 |
+
# signal than bucket-counting (which collapses to 0 for borderline text).
|
| 155 |
+
ai_score = round(mean_ai_prob * 100, 2)
|
| 156 |
+
human_score = round(100 - ai_score, 2)
|
| 157 |
+
|
| 158 |
+
return {
|
| 159 |
+
"percentages": percentages,
|
| 160 |
+
"classification": classification_map,
|
| 161 |
+
"ai_score": ai_score,
|
| 162 |
+
"human_score": human_score,
|
| 163 |
+
}
|
requirements.txt
ADDED
|
Binary file (1.15 kB). View file
|
|
|
utils/__init__.py
ADDED
|
File without changes
|
utils/ai_detection_utils.py
ADDED
|
@@ -0,0 +1,62 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
AI text detection powered by the desklib DeBERTa-v3 classifier.
|
| 3 |
+
|
| 4 |
+
Scores the FULL text and each sentence. Returns the per-sentence bucket
|
| 5 |
+
breakdown the frontend expects PLUS the honest raw mean probability.
|
| 6 |
+
"""
|
| 7 |
+
import nltk
|
| 8 |
+
from nltk.tokenize import sent_tokenize
|
| 9 |
+
|
| 10 |
+
from utils.model_loaders import load_detector_model, predict_ai_probability
|
| 11 |
+
|
| 12 |
+
nltk.download("punkt", quiet=True)
|
| 13 |
+
|
| 14 |
+
|
| 15 |
+
def classify_text_hf(text, threshold_ai=0.75, threshold_mid=0.4, threshold_soft=0.15):
|
| 16 |
+
"""Classify the input text.
|
| 17 |
+
|
| 18 |
+
Returns:
|
| 19 |
+
classification_map: dict[sentence] -> label bucket
|
| 20 |
+
percentages: dict[bucket] -> percentage of sentences
|
| 21 |
+
mean_ai_probability: float 0..1 (full-text score)
|
| 22 |
+
|
| 23 |
+
The full-text probability is also used as the headline AI score because
|
| 24 |
+
detectors are more reliable on full paragraphs than individual sentences.
|
| 25 |
+
"""
|
| 26 |
+
model, tokenizer, device = load_detector_model()
|
| 27 |
+
|
| 28 |
+
# Overall score: run the full text through the model once
|
| 29 |
+
full_prob = predict_ai_probability(text, model, tokenizer, device)
|
| 30 |
+
|
| 31 |
+
sentences = sent_tokenize(text) or [text]
|
| 32 |
+
classification_map = {}
|
| 33 |
+
counts = {
|
| 34 |
+
"AI-generated": 0,
|
| 35 |
+
"AI-generated & AI-refined": 0,
|
| 36 |
+
"Human-written": 0,
|
| 37 |
+
"Human-written & AI-refined": 0,
|
| 38 |
+
}
|
| 39 |
+
|
| 40 |
+
for sentence in sentences:
|
| 41 |
+
if not sentence.strip():
|
| 42 |
+
continue
|
| 43 |
+
prob = predict_ai_probability(sentence, model, tokenizer, device)
|
| 44 |
+
|
| 45 |
+
if prob >= threshold_ai:
|
| 46 |
+
label = "AI-generated"
|
| 47 |
+
elif prob >= threshold_mid:
|
| 48 |
+
label = "AI-generated & AI-refined"
|
| 49 |
+
elif prob >= threshold_soft:
|
| 50 |
+
label = "Human-written & AI-refined"
|
| 51 |
+
else:
|
| 52 |
+
label = "Human-written"
|
| 53 |
+
|
| 54 |
+
classification_map[sentence] = label
|
| 55 |
+
counts[label] += 1
|
| 56 |
+
|
| 57 |
+
total = sum(counts.values())
|
| 58 |
+
percentages = {
|
| 59 |
+
cat: round((count / total) * 100, 2) if total > 0 else 0
|
| 60 |
+
for cat, count in counts.items()
|
| 61 |
+
}
|
| 62 |
+
return classification_map, percentages, full_prob
|
utils/desklib_model.py
ADDED
|
@@ -0,0 +1,45 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Custom model class for the desklib AI text detector.
|
| 3 |
+
|
| 4 |
+
The repo ships `model.safetensors` containing a DeBERTa-v3-large backbone plus
|
| 5 |
+
a single-logit classifier head. There's no modeling code in the repo, so we
|
| 6 |
+
recreate the architecture here verbatim from the README and call
|
| 7 |
+
`from_pretrained()` on THIS class (not `AutoModelForSequenceClassification`)
|
| 8 |
+
to load the weights.
|
| 9 |
+
|
| 10 |
+
Source: https://huggingface.co/desklib/ai-text-detector-v1.01
|
| 11 |
+
"""
|
| 12 |
+
import torch
|
| 13 |
+
import torch.nn as nn
|
| 14 |
+
from transformers import AutoConfig, AutoModel, PreTrainedModel
|
| 15 |
+
|
| 16 |
+
|
| 17 |
+
class DesklibAIDetectionModel(PreTrainedModel):
|
| 18 |
+
config_class = AutoConfig
|
| 19 |
+
|
| 20 |
+
def __init__(self, config):
|
| 21 |
+
super().__init__(config)
|
| 22 |
+
self.model = AutoModel.from_config(config)
|
| 23 |
+
self.classifier = nn.Linear(config.hidden_size, 1)
|
| 24 |
+
self.init_weights()
|
| 25 |
+
|
| 26 |
+
def forward(self, input_ids, attention_mask=None, labels=None):
|
| 27 |
+
outputs = self.model(input_ids, attention_mask=attention_mask)
|
| 28 |
+
last_hidden_state = outputs[0]
|
| 29 |
+
|
| 30 |
+
# Mean pooling over non-padding tokens
|
| 31 |
+
mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
|
| 32 |
+
summed = torch.sum(last_hidden_state * mask, dim=1)
|
| 33 |
+
counts = torch.clamp(mask.sum(dim=1), min=1e-9)
|
| 34 |
+
pooled = summed / counts
|
| 35 |
+
|
| 36 |
+
logits = self.classifier(pooled)
|
| 37 |
+
loss = None
|
| 38 |
+
if labels is not None:
|
| 39 |
+
loss_fct = nn.BCEWithLogitsLoss()
|
| 40 |
+
loss = loss_fct(logits.view(-1), labels.float())
|
| 41 |
+
|
| 42 |
+
out = {"logits": logits}
|
| 43 |
+
if loss is not None:
|
| 44 |
+
out["loss"] = loss
|
| 45 |
+
return out
|
utils/humanizer_core.py
ADDED
|
@@ -0,0 +1,261 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Pure humanization helpers (no Streamlit).
|
| 3 |
+
|
| 4 |
+
Extracted from the original pages/humanize_text.py so the FastAPI backend and
|
| 5 |
+
any frontend can import these functions without pulling in Streamlit.
|
| 6 |
+
"""
|
| 7 |
+
import logging
|
| 8 |
+
import random
|
| 9 |
+
import re
|
| 10 |
+
import ssl
|
| 11 |
+
import warnings
|
| 12 |
+
|
| 13 |
+
import nltk
|
| 14 |
+
import spacy
|
| 15 |
+
from nltk.corpus import wordnet
|
| 16 |
+
from nltk.tokenize import sent_tokenize, word_tokenize
|
| 17 |
+
|
| 18 |
+
warnings.filterwarnings("ignore", category=FutureWarning)
|
| 19 |
+
|
| 20 |
+
logger = logging.getLogger(__name__)
|
| 21 |
+
|
| 22 |
+
|
| 23 |
+
def download_nltk_resources():
|
| 24 |
+
try:
|
| 25 |
+
_create_unverified_https_context = ssl._create_unverified_context
|
| 26 |
+
except AttributeError:
|
| 27 |
+
pass
|
| 28 |
+
else:
|
| 29 |
+
ssl._create_default_https_context = _create_unverified_https_context
|
| 30 |
+
|
| 31 |
+
resources = [
|
| 32 |
+
"punkt",
|
| 33 |
+
"averaged_perceptron_tagger",
|
| 34 |
+
"punkt_tab",
|
| 35 |
+
"wordnet",
|
| 36 |
+
"averaged_perceptron_tagger_eng",
|
| 37 |
+
]
|
| 38 |
+
for r in resources:
|
| 39 |
+
nltk.download(r, quiet=True)
|
| 40 |
+
|
| 41 |
+
|
| 42 |
+
download_nltk_resources()
|
| 43 |
+
|
| 44 |
+
try:
|
| 45 |
+
nlp = spacy.load("en_core_web_sm")
|
| 46 |
+
except OSError:
|
| 47 |
+
logger.warning(
|
| 48 |
+
"spaCy en_core_web_sm model not found. Install with: python -m spacy download en_core_web_sm"
|
| 49 |
+
)
|
| 50 |
+
nlp = None
|
| 51 |
+
|
| 52 |
+
|
| 53 |
+
CITATION_REGEX = re.compile(
|
| 54 |
+
r"\(\s*[A-Za-z&\-,\.\s]+(?:et al\.\s*)?,\s*\d{4}(?:,\s*(?:pp?\.\s*\d+(?:-\d+)?))?\s*\)"
|
| 55 |
+
)
|
| 56 |
+
|
| 57 |
+
|
| 58 |
+
def count_words(text):
|
| 59 |
+
return len(word_tokenize(text))
|
| 60 |
+
|
| 61 |
+
|
| 62 |
+
def count_sentences(text):
|
| 63 |
+
return len(sent_tokenize(text))
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
def extract_citations(text):
|
| 67 |
+
refs = CITATION_REGEX.findall(text)
|
| 68 |
+
placeholder_map = {}
|
| 69 |
+
replaced_text = text
|
| 70 |
+
for i, r in enumerate(refs, start=1):
|
| 71 |
+
placeholder = f"[[REF_{i}]]"
|
| 72 |
+
placeholder_map[placeholder] = r
|
| 73 |
+
replaced_text = replaced_text.replace(r, placeholder, 1)
|
| 74 |
+
return replaced_text, placeholder_map
|
| 75 |
+
|
| 76 |
+
|
| 77 |
+
PLACEHOLDER_REGEX = re.compile(r"\[\s*\[\s*REF_(\d+)\s*\]\s*\]")
|
| 78 |
+
|
| 79 |
+
|
| 80 |
+
def restore_citations(text, placeholder_map):
|
| 81 |
+
def replace_placeholder(match):
|
| 82 |
+
idx = match.group(1)
|
| 83 |
+
key = f"[[REF_{idx}]]"
|
| 84 |
+
return placeholder_map.get(key, match.group(0))
|
| 85 |
+
|
| 86 |
+
return PLACEHOLDER_REGEX.sub(replace_placeholder, text)
|
| 87 |
+
|
| 88 |
+
|
| 89 |
+
WHOLE_CONTRACTIONS = {
|
| 90 |
+
"can't": "cannot",
|
| 91 |
+
"won't": "will not",
|
| 92 |
+
"shan't": "shall not",
|
| 93 |
+
"ain't": "is not",
|
| 94 |
+
"i'm": "i am",
|
| 95 |
+
"it's": "it is",
|
| 96 |
+
"we're": "we are",
|
| 97 |
+
"they're": "they are",
|
| 98 |
+
"you're": "you are",
|
| 99 |
+
"he's": "he is",
|
| 100 |
+
"she's": "she is",
|
| 101 |
+
"that's": "that is",
|
| 102 |
+
"there's": "there is",
|
| 103 |
+
"what's": "what is",
|
| 104 |
+
"who's": "who is",
|
| 105 |
+
"let's": "let us",
|
| 106 |
+
"didn't": "did not",
|
| 107 |
+
"doesn't": "does not",
|
| 108 |
+
"don't": "do not",
|
| 109 |
+
"couldn't": "could not",
|
| 110 |
+
"shouldn't": "should not",
|
| 111 |
+
"wouldn't": "would not",
|
| 112 |
+
"isn't": "is not",
|
| 113 |
+
"aren't": "are not",
|
| 114 |
+
"weren't": "were not",
|
| 115 |
+
"hasn't": "has not",
|
| 116 |
+
"haven't": "have not",
|
| 117 |
+
"hadn't": "had not",
|
| 118 |
+
}
|
| 119 |
+
|
| 120 |
+
SUFFIX_CONTRACTIONS = {
|
| 121 |
+
"n't": " not",
|
| 122 |
+
"'re": " are",
|
| 123 |
+
"'s": " is",
|
| 124 |
+
"'ll": " will",
|
| 125 |
+
"'ve": " have",
|
| 126 |
+
"'d": " would",
|
| 127 |
+
"'m": " am",
|
| 128 |
+
}
|
| 129 |
+
|
| 130 |
+
ACADEMIC_TRANSITIONS = [
|
| 131 |
+
"Moreover,",
|
| 132 |
+
"Additionally,",
|
| 133 |
+
"Furthermore,",
|
| 134 |
+
"Hence,",
|
| 135 |
+
"Therefore,",
|
| 136 |
+
"Consequently,",
|
| 137 |
+
"Nonetheless,",
|
| 138 |
+
"Nevertheless,",
|
| 139 |
+
"In contrast,",
|
| 140 |
+
"On the other hand,",
|
| 141 |
+
"In addition,",
|
| 142 |
+
"As a result,",
|
| 143 |
+
]
|
| 144 |
+
|
| 145 |
+
|
| 146 |
+
def expand_contractions(sentence):
|
| 147 |
+
alt = "|".join(re.escape(k) for k in WHOLE_CONTRACTIONS.keys())
|
| 148 |
+
whole_pattern = rf"(?:(``)\s*)?(?P<word>(?:{alt}))(?:\s*(''))?"
|
| 149 |
+
|
| 150 |
+
def _replace_whole_with_quotes(match):
|
| 151 |
+
open_tok = match.group(1) or ""
|
| 152 |
+
word = match.group("word")
|
| 153 |
+
close_tok = match.group(3) or ""
|
| 154 |
+
key = word.lower()
|
| 155 |
+
repl = WHOLE_CONTRACTIONS.get(key, word)
|
| 156 |
+
if word and word[0].isupper():
|
| 157 |
+
repl = repl.capitalize()
|
| 158 |
+
return f"{open_tok}{repl}{close_tok}"
|
| 159 |
+
|
| 160 |
+
sentence = re.sub(
|
| 161 |
+
whole_pattern, _replace_whole_with_quotes, sentence, flags=re.IGNORECASE
|
| 162 |
+
)
|
| 163 |
+
|
| 164 |
+
tokens = word_tokenize(sentence)
|
| 165 |
+
out_tokens = []
|
| 166 |
+
for t in tokens:
|
| 167 |
+
lower_t = t.lower()
|
| 168 |
+
replaced = False
|
| 169 |
+
for contr, expansion in SUFFIX_CONTRACTIONS.items():
|
| 170 |
+
if lower_t.endswith(contr):
|
| 171 |
+
base = lower_t[: -len(contr)]
|
| 172 |
+
new_t = base + expansion
|
| 173 |
+
if t and t[0].isupper():
|
| 174 |
+
new_t = new_t.capitalize()
|
| 175 |
+
out_tokens.append(new_t)
|
| 176 |
+
replaced = True
|
| 177 |
+
break
|
| 178 |
+
if not replaced:
|
| 179 |
+
out_tokens.append(t)
|
| 180 |
+
return " ".join(out_tokens)
|
| 181 |
+
|
| 182 |
+
|
| 183 |
+
def get_synonyms(word, pos):
|
| 184 |
+
wn_pos = None
|
| 185 |
+
if pos.startswith("ADJ"):
|
| 186 |
+
wn_pos = wordnet.ADJ
|
| 187 |
+
elif pos.startswith("NOUN"):
|
| 188 |
+
wn_pos = wordnet.NOUN
|
| 189 |
+
elif pos.startswith("ADV"):
|
| 190 |
+
wn_pos = wordnet.ADV
|
| 191 |
+
elif pos.startswith("VERB"):
|
| 192 |
+
wn_pos = wordnet.VERB
|
| 193 |
+
|
| 194 |
+
synonyms = set()
|
| 195 |
+
if wn_pos:
|
| 196 |
+
for syn in wordnet.synsets(word, pos=wn_pos):
|
| 197 |
+
for lemma in syn.lemmas():
|
| 198 |
+
lemma_name = lemma.name().replace("_", " ")
|
| 199 |
+
if lemma_name.lower() != word.lower():
|
| 200 |
+
synonyms.add(lemma_name)
|
| 201 |
+
return list(synonyms)
|
| 202 |
+
|
| 203 |
+
|
| 204 |
+
def replace_synonyms(sentence, p_syn=0.2):
    """Randomly swap content words in `sentence` for WordNet synonyms.

    Each ADJ/NOUN/VERB/ADV token that has WordNet coverage is replaced with
    probability `p_syn`; reference placeholders containing "[[REF_" are never
    touched. Tokens are re-joined with single spaces. Returns the input
    unchanged when the spaCy pipeline (`nlp`) is unavailable.
    """
    if not nlp:
        return sentence

    rewritten = []
    for token in nlp(sentence):
        replacement = token.text
        # Eligibility mirrors the replacement policy: skip reference markers,
        # then require a content-word POS with at least one WordNet synset.
        eligible = (
            "[[REF_" not in token.text
            and token.pos_ in ("ADJ", "NOUN", "VERB", "ADV")
            and wordnet.synsets(token.text)
        )
        if eligible and random.random() < p_syn:
            options = get_synonyms(token.text, token.pos_)
            if options:
                replacement = random.choice(options)
        rewritten.append(replacement)
    return " ".join(rewritten)
|
| 226 |
+
|
| 227 |
+
|
| 228 |
+
def add_academic_transition(sentence, p_transition=0.2):
    """With probability `p_transition`, prefix a random academic transition."""
    # Guard clause: most of the time the sentence passes through untouched.
    if random.random() >= p_transition:
        return sentence
    opener = random.choice(ACADEMIC_TRANSITIONS)
    return f"{opener} {sentence}"
|
| 233 |
+
|
| 234 |
+
|
| 235 |
+
def minimal_humanize_line(line, p_syn=0.2, p_trans=0.2):
    """Run the light-touch rewrite pipeline over one sentence/line.

    Steps: expand contractions, then probabilistic synonym swaps, then an
    optional academic transition prefix.
    """
    expanded = expand_contractions(line)
    with_synonyms = replace_synonyms(expanded, p_syn=p_syn)
    return add_academic_transition(with_synonyms, p_transition=p_trans)
|
| 240 |
+
|
| 241 |
+
|
| 242 |
+
def minimal_rewriting(text, p_syn=0.2, p_trans=0.2):
    """Humanize `text` sentence by sentence, joining the results with spaces."""
    return " ".join(
        minimal_humanize_line(sentence, p_syn=p_syn, p_trans=p_trans)
        for sentence in sent_tokenize(text)
    )
|
| 248 |
+
|
| 249 |
+
|
| 250 |
+
def preserve_linebreaks_rewrite(text, p_syn=0.2, p_trans=0.2):
    """Rewrite each line of `text` independently, keeping the line structure.

    Whitespace-only lines are emitted as empty strings so paragraph breaks
    survive the rewrite.
    """
    rewritten = [
        minimal_rewriting(line, p_syn=p_syn, p_trans=p_trans) if line.strip() else ""
        for line in text.splitlines()
    ]
    return "\n".join(rewritten)
|
utils/model_loaders.py
ADDED
|
@@ -0,0 +1,61 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Model loaders for the AI detection pipeline.
|
| 3 |
+
|
| 4 |
+
Uses `desklib/ai-text-detector-v1.01` — a DeBERTa-v3-large classifier that
|
| 5 |
+
currently tops the RAID benchmark for modern LLM detection (ChatGPT, Claude,
|
| 6 |
+
Gemini, Llama, Grok, etc). The model ships a custom head, so we load it via
|
| 7 |
+
the `DesklibAIDetectionModel` wrapper defined in `utils.desklib_model`.
|
| 8 |
+
"""
|
| 9 |
+
import logging
|
| 10 |
+
from functools import lru_cache
|
| 11 |
+
|
| 12 |
+
import torch
|
| 13 |
+
from transformers import AutoTokenizer
|
| 14 |
+
|
| 15 |
+
from utils.desklib_model import DesklibAIDetectionModel
|
| 16 |
+
|
| 17 |
+
logger = logging.getLogger(__name__)
|
| 18 |
+
|
| 19 |
+
DETECTOR_MODEL_ID = "desklib/ai-text-detector-v1.01"
|
| 20 |
+
|
| 21 |
+
|
| 22 |
+
@lru_cache(maxsize=1)
def load_detector_model():
    """Load and cache the desklib AI detector (DeBERTa-v3-large + custom head).

    Returns
    -------
    (model, tokenizer, device): the eval-mode detector, its tokenizer, and
    the torch device it lives on. The first call downloads ~1.75 GB into
    `~/.cache/huggingface`; later calls hit the in-process lru_cache.
    """
    # Prefer the strongest available accelerator: CUDA, then Apple MPS, CPU.
    if torch.cuda.is_available():
        backend = "cuda"
    elif torch.backends.mps.is_available():
        backend = "mps"
    else:
        backend = "cpu"
    device = torch.device(backend)

    logger.info("Loading detector %s on %s", DETECTOR_MODEL_ID, device)
    tokenizer = AutoTokenizer.from_pretrained(DETECTOR_MODEL_ID)
    model = DesklibAIDetectionModel.from_pretrained(DETECTOR_MODEL_ID)
    model.to(device)
    model.eval()
    logger.info("Detector ready")
    return model, tokenizer, device
|
| 44 |
+
|
| 45 |
+
|
| 46 |
+
@torch.no_grad()
def predict_ai_probability(text, model, tokenizer, device, max_len=768):
    """Score `text` with the detector; returns P(AI-generated) in [0, 1].

    The input is padded/truncated to `max_len` tokens, moved to `device`,
    and pushed through the model; the single-logit output is squashed with
    a sigmoid into a probability.
    """
    batch = tokenizer(
        text,
        padding="max_length",
        truncation=True,
        max_length=max_len,
        return_tensors="pt",
    )
    ids = batch["input_ids"].to(device)
    mask = batch["attention_mask"].to(device)

    result = model(input_ids=ids, attention_mask=mask)
    # Single-logit head: sigmoid maps the raw score to a probability.
    return torch.sigmoid(result["logits"]).item()
|
utils/pdf_utils.py
ADDED
|
@@ -0,0 +1,63 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# utils/pdf_utils.py
|
| 2 |
+
import fitz
|
| 3 |
+
from io import BytesIO
|
| 4 |
+
import nltk
|
| 5 |
+
from nltk.tokenize import sent_tokenize, word_tokenize
|
| 6 |
+
|
| 7 |
+
nltk.download('punkt', quiet=True)
|
| 8 |
+
|
| 9 |
+
def extract_text_from_pdf(pdf_bytes):
    """Extract plain text from every page of a PDF.

    Parameters
    ----------
    pdf_bytes : bytes
        The raw PDF payload, opened in-memory via PyMuPDF.

    Returns
    -------
    str
        Each page's text followed by a newline, concatenated in page order.
    """
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    try:
        # join() avoids quadratic string concatenation on large documents.
        return "".join(page.get_text("text") + "\n" for page in doc)
    finally:
        # Release the document handle even if text extraction raises.
        doc.close()
|
| 17 |
+
|
| 18 |
+
def word_count(text):
    """Number of NLTK word tokens in `text`."""
    tokens = word_tokenize(text)
    return len(tokens)
|
| 20 |
+
|
| 21 |
+
def generate_annotated_pdf(pdf_bytes, classification_map):
    """Highlight classified sentences in a PDF and prepend a color legend.

    `classification_map` maps sentence text -> label. "Human-written"
    sentences (and any unknown labels) get no highlight; the three AI-related
    labels are highlighted in their legend colors. Returns the annotated
    document as a BytesIO buffer.
    """
    HIGHLIGHT_COLORS = {
        "AI-generated": "#ffcccc",
        "AI-generated & AI-refined": "#ffe5cc",
        "Human-written & AI-refined": "#e6f2ff",
    }

    def _hex_to_unit_rgb(value):
        # "#rrggbb" -> (r, g, b) floats in [0, 1], as PyMuPDF expects.
        value = value.lstrip("#")
        return tuple(int(value[i:i + 2], 16) / 255.0 for i in (0, 2, 4))

    doc = fitz.open(stream=pdf_bytes, filetype="pdf")

    legend = (
        "Color Legend:\n"
        "• Red: AI-generated\n"
        "• Orange: AI-generated & AI-refined\n"
        "• Light Blue: Human-written & AI-refined\n\n"
        "Note: Sentences classified as 'Human-written' are not highlighted."
    )
    first_page = doc.new_page(pno=0)
    first_page.insert_text((72, 72), legend, fontsize=14, fontname="helv")

    for sentence, label in classification_map.items():
        color_hex = HIGHLIGHT_COLORS.get(label)
        if color_hex is None:
            # Covers "Human-written" and any unexpected labels.
            continue
        rgb = _hex_to_unit_rgb(color_hex)
        for page in doc:
            for rect in page.search_for(sentence):
                highlight = page.add_highlight_annot(rect)
                highlight.set_colors(stroke=rgb)
                highlight.update()

    pdf_out = doc.write()
    doc.close()
    return BytesIO(pdf_out)
|