Commit 325e5a1 · bughead committed · 1 Parent(s): 4cfceb1

Initial Humanzise backend deployment


FastAPI + desklib DeBERTa-v3 AI detector + rule-based humanizer.
Docker SDK, exposes port 7860, runs as non-root user (UID 1000).

.dockerignore ADDED
@@ -0,0 +1,45 @@
+ # Exclude everything the backend Docker image doesn't need.
+ # Keeps the HF Spaces build context small and fast.
+
+ # Frontend — deployed separately to Vercel
+ web/
+
+ # Local python env
+ venv/
+ __pycache__/
+ *.pyc
+ *.pyo
+ .pytest_cache/
+ .mypy_cache/
+ .ruff_cache/
+
+ # Model caches (let the container populate its own)
+ .cache/
+ ~/.cache/
+ *.safetensors
+ *.bin
+ *.ckpt
+
+ # Git / IDE / OS
+ .git/
+ .github/
+ .vscode/
+ .idea/
+ .DS_Store
+ Thumbs.db
+
+ # Docs / backups / logs
+ *.md
+ !README.md
+ *-backup-*.zip
+ *.log
+
+ # Upstream fork artifacts not needed in production
+ pages/
+ main.py
+ setup.sh
+ Procfile
+ vercel.json
+ nltk.txt
+ requirements-local.txt
+ *.ttf
.gitignore ADDED
@@ -0,0 +1,10 @@
+ __pycache__/
+ *.pyc
+ *.pyo
+ .DS_Store
+ Thumbs.db
+ .vscode/
+ .idea/
+ venv/
+ .env
+ .env.local
Dockerfile ADDED
@@ -0,0 +1,60 @@
+ # Humanzise backend — Docker image for Hugging Face Spaces (Docker SDK).
+ #
+ # HF Spaces requirements met here:
+ #   - Listens on 0.0.0.0:7860
+ #   - Runs as non-root user with UID 1000 (`user`)
+ #   - $HOME = /home/user so HF Hub cache persists under the user
+ #
+ # Build size strategy:
+ #   - CPU-only torch wheel (~500 MB instead of ~2 GB CUDA)
+ #   - --no-cache-dir on every pip install
+ #   - Slim Debian base
+
+ FROM python:3.11-slim
+
+ # System deps needed for occasional source builds
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     build-essential \
+     git \
+     && rm -rf /var/lib/apt/lists/*
+
+ # HF Spaces mandates a non-root user with UID 1000
+ RUN useradd --create-home --uid 1000 user
+ USER user
+ ENV HOME=/home/user \
+     PATH=/home/user/.local/bin:$PATH \
+     HF_HOME=/home/user/.cache/huggingface \
+     TRANSFORMERS_CACHE=/home/user/.cache/huggingface \
+     PYTHONDONTWRITEBYTECODE=1 \
+     PYTHONUNBUFFERED=1
+
+ WORKDIR /home/user/app
+
+ # Install CPU-only torch first so transformers picks it up and doesn't pull CUDA
+ RUN pip install --no-cache-dir --user --upgrade pip && \
+     pip install --no-cache-dir --user \
+     --index-url https://download.pytorch.org/whl/cpu \
+     torch
+
+ # Install the rest of the deps
+ COPY --chown=user:user requirements.txt .
+ RUN pip install --no-cache-dir --user -r requirements.txt
+
+ # Pre-download the small NLP models so cold requests don't pay the download tax
+ RUN python -m spacy download en_core_web_sm && \
+     python -c "import nltk; \
+     nltk.download('punkt', quiet=True); \
+     nltk.download('punkt_tab', quiet=True); \
+     nltk.download('averaged_perceptron_tagger', quiet=True); \
+     nltk.download('averaged_perceptron_tagger_eng', quiet=True); \
+     nltk.download('wordnet', quiet=True)"
+
+ # Copy application code
+ COPY --chown=user:user api ./api
+ COPY --chown=user:user utils ./utils
+
+ EXPOSE 7860
+
+ # The desklib model (~1.75 GB) downloads lazily on the first /detect request
+ # and is cached under $HF_HOME for the life of the container.
+ CMD ["uvicorn", "api.humanize_api:app", "--host", "0.0.0.0", "--port", "7860"]
LICENSE ADDED
@@ -0,0 +1,9 @@
+ MIT License
+
+ Copyright (c) 2025 DADA NANJESHA for project AI Content Detector & Humanizer
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
README.md CHANGED
@@ -1,12 +1,93 @@
  ---
- title: Humanzise Api
- emoji: 🐢
- colorFrom: purple
- colorTo: green
+ title: Humanzise API
+ emoji: 🪄
+ colorFrom: green
+ colorTo: indigo
  sdk: docker
+ app_port: 7860
  pinned: false
- license: mit
  short_description: Free AI text humanizer and detector
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Humanzise
+
+ Free, open-source **AI text humanizer** + **AI detector**. Paste any AI-generated text and rewrite it to sound more natural — or check how likely an existing text was written by AI.
+
+ - **Frontend**: Next.js 16 + shadcn/ui + Tailwind CSS (deployed on Vercel)
+ - **Backend**: FastAPI + PyTorch + DeBERTa-v3 detector (deployed on Hugging Face Spaces)
+ - **Detector model**: [`desklib/ai-text-detector-v1.01`](https://huggingface.co/desklib/ai-text-detector-v1.01) — current leader on the RAID benchmark
+ - **Humanizer**: rule-based pipeline (WordNet synonyms + contraction expansion + academic transitions + citation preservation)
+
+ ## Repository layout
+
+ ```
+ humanzise/
+ ├── api/                     FastAPI app (entry point: api.humanize_api:app)
+ │   └── humanize_api.py
+ ├── utils/                   Backend logic
+ │   ├── humanizer_core.py        Text humanization pipeline
+ │   ├── ai_detection_utils.py
+ │   ├── desklib_model.py         Custom DeBERTa-v3 wrapper for desklib weights
+ │   ├── model_loaders.py
+ │   └── pdf_utils.py             PDF text extraction
+ ├── web/                     Next.js frontend
+ │   └── src/
+ │       ├── app/
+ │       ├── components/
+ │       └── lib/
+ ├── Dockerfile               HF Spaces Docker image
+ ├── requirements.txt         Production deps (lean, CPU-only torch)
+ ├── requirements-local.txt   All dev deps
+ └── DEPLOY.md                Step-by-step deployment guide
+ ```
+
+ ## Running locally
+
+ ### Backend (Python 3.12)
+
+ ```bash
+ python -m venv venv
+ source venv/Scripts/activate   # or venv/bin/activate on macOS/Linux
+ pip install -r requirements-local.txt
+ python -m spacy download en_core_web_sm
+
+ python -m uvicorn api.humanize_api:app --reload --port 8000
+ ```
+
+ API docs (Swagger UI): http://localhost:8000/docs
+
+ ### Frontend (Node 20+)
+
+ ```bash
+ cd web
+ npm install
+ npm run dev
+ ```
+
+ Open http://localhost:3000. Set `NEXT_PUBLIC_API_BASE_URL` in `web/.env.local` if your backend isn't on `http://127.0.0.1:8000`.
+
+ ## API endpoints
+
+ | Method | Path | Description |
+ |---|---|---|
+ | `GET` | `/health` | Liveness probe |
+ | `POST` | `/humanize` | Rewrite AI text to sound more natural |
+ | `POST` | `/detect` | Score text for AI likelihood (desklib DeBERTa-v3) |
+ | `POST` | `/extract-file` | Extract text from uploaded PDF/TXT/MD |
+
+ All endpoints use JSON request/response; `/extract-file` uses `multipart/form-data`.
+
+ ## Deployment
+
+ The free deployment path is documented in [DEPLOY.md](./DEPLOY.md):
+
+ - **Frontend** → Vercel (free, `web/` subfolder)
+ - **Backend** → Hugging Face Spaces (Docker SDK, free 16 GB RAM)
+
+ ## Credits
+
+ Forked from [DadaNanjesha/AI-content-detector-Humanizer](https://github.com/DadaNanjesha/AI-content-detector-Humanizer) — original Streamlit app. This fork replaced the Streamlit UI with a Next.js frontend, modernized the backend, and swapped in the desklib detector.
+
+ ## License
+
+ MIT — see [LICENSE](./LICENSE).
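For reference, a minimal sketch of exercising the two main endpoints documented above from Python. It assumes the backend is running locally on port 8000 (as in the README's uvicorn command) and that the `requests` package is available in the calling environment; neither assumption is part of this commit.

```python
# Hedged client sketch for /humanize and /detect against a local backend.
import requests

BASE = "http://127.0.0.1:8000"
SAMPLE = "It's important to note that the results are significant."

# Rewrite a paragraph; p_syn / p_trans mirror the HumanizeRequest fields.
humanized = requests.post(
    f"{BASE}/humanize",
    json={"text": SAMPLE, "p_syn": 0.2, "p_trans": 0.2, "preserve_linebreaks": True},
).json()
print(humanized["humanized_text"], humanized["words_added"])

# Score the same text for AI likelihood (first call downloads the detector).
detected = requests.post(f"{BASE}/detect", json={"text": SAMPLE}).json()
print(detected["ai_score"], detected["human_score"])
```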
api/humanize_api.py ADDED
@@ -0,0 +1,163 @@
+ import re
+ from typing import Dict, Optional
+
+ from fastapi import FastAPI, File, HTTPException, UploadFile
+ from fastapi.middleware.cors import CORSMiddleware
+ from pydantic import BaseModel, Field
+
+ from utils.ai_detection_utils import classify_text_hf
+ from utils.pdf_utils import extract_text_from_pdf
+ from utils.humanizer_core import (
+     count_sentences,
+     count_words,
+     extract_citations,
+     minimal_rewriting,
+     preserve_linebreaks_rewrite,
+     restore_citations,
+ )
+
+
+ DESCRIPTION = """
+ AI Text Humanizer & Detector API
+
+ Provides server-side access to the project's text humanization and AI-detection
+ pipelines. The API is consumed by the Next.js frontend in /web.
+ """
+
+ tags_metadata = [
+     {"name": "humanize", "description": "Transform AI-generated text into human-like prose."},
+     {"name": "detect", "description": "Classify text as AI-generated or human-written."},
+ ]
+
+ app = FastAPI(
+     title="AI Text Humanizer API",
+     version="0.3",
+     description=DESCRIPTION,
+     openapi_tags=tags_metadata,
+ )
+
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+
+ class HumanizeRequest(BaseModel):
+     text: str = Field(..., description="The input text to humanize. Must be non-empty.")
+     p_syn: Optional[float] = Field(0.2, ge=0.0, le=1.0)
+     p_trans: Optional[float] = Field(0.2, ge=0.0, le=1.0)
+     preserve_linebreaks: Optional[bool] = Field(True)
+
+
+ class HumanizeResponse(BaseModel):
+     humanized_text: str
+     orig_word_count: int
+     orig_sentence_count: int
+     new_word_count: int
+     new_sentence_count: int
+     words_added: int
+     sentences_added: int
+
+
+ class DetectRequest(BaseModel):
+     text: str = Field(..., description="The input text to analyze.")
+
+
+ class DetectResponse(BaseModel):
+     percentages: Dict[str, float]
+     classification: Dict[str, str]
+     ai_score: float
+     human_score: float
+
+
+ @app.get("/health", tags=["humanize"], summary="Health check")
+ def health():
+     return {"status": "ok"}
+
+
+ @app.post("/humanize", response_model=HumanizeResponse, tags=["humanize"])
+ def humanize(req: HumanizeRequest):
+     text = req.text or ""
+     if not text.strip():
+         raise HTTPException(status_code=400, detail="`text` must be a non-empty string")
+
+     orig_wc = count_words(text)
+     orig_sc = count_sentences(text)
+
+     no_refs_text, placeholders = extract_citations(text)
+
+     if req.preserve_linebreaks:
+         rewritten = preserve_linebreaks_rewrite(no_refs_text, p_syn=req.p_syn, p_trans=req.p_trans)
+     else:
+         rewritten = minimal_rewriting(no_refs_text, p_syn=req.p_syn, p_trans=req.p_trans)
+
+     final_text = restore_citations(rewritten, placeholders)
+     final_text = re.sub(r"[ \t]+([.,;:!?])", r"\1", final_text)
+     final_text = re.sub(r"(\()[ \t]+", r"\1", final_text)
+     final_text = re.sub(r"[ \t]+(\))", r"\1", final_text)
+     final_text = re.sub(r"[ \t]{2,}", " ", final_text)
+     final_text = re.sub(r"``\s*(.+?)\s*''", r'"\1"', final_text)
+
+     new_wc = count_words(final_text)
+     new_sc = count_sentences(final_text)
+
+     return {
+         "humanized_text": final_text,
+         "orig_word_count": orig_wc,
+         "orig_sentence_count": orig_sc,
+         "new_word_count": new_wc,
+         "new_sentence_count": new_sc,
+         "words_added": new_wc - orig_wc,
+         "sentences_added": new_sc - orig_sc,
+     }
+
+
+ @app.post("/extract-file", tags=["humanize"], summary="Extract text from uploaded file")
+ async def extract_file(file: UploadFile = File(...)):
+     """Accept a PDF, TXT or MD file and return its plain-text contents."""
+     if not file.filename:
+         raise HTTPException(status_code=400, detail="No file provided")
+
+     content = await file.read()
+     name = file.filename.lower()
+
+     try:
+         if name.endswith(".pdf"):
+             text = extract_text_from_pdf(content)
+         elif name.endswith((".txt", ".md")):
+             text = content.decode("utf-8", errors="ignore")
+         else:
+             raise HTTPException(
+                 status_code=400,
+                 detail="Unsupported file type. Use .pdf, .txt, or .md",
+             )
+     except HTTPException:
+         raise
+     except Exception as exc:
+         raise HTTPException(status_code=500, detail=f"Failed to extract: {exc}")
+
+     return {"text": text, "filename": file.filename}
+
+
+ @app.post("/detect", response_model=DetectResponse, tags=["detect"])
+ def detect(req: DetectRequest):
+     text = req.text or ""
+     if not text.strip():
+         raise HTTPException(status_code=400, detail="`text` must be a non-empty string")
+
+     classification_map, percentages, mean_ai_prob = classify_text_hf(text)
+
+     # Use the raw mean probability as the headline score — it's a more honest
+     # signal than bucket-counting (which collapses to 0 for borderline text).
+     ai_score = round(mean_ai_prob * 100, 2)
+     human_score = round(100 - ai_score, 2)
+
+     return {
+         "percentages": percentages,
+         "classification": classification_map,
+         "ai_score": ai_score,
+         "human_score": human_score,
+     }
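The `/extract-file` route above takes `multipart/form-data` rather than JSON. A minimal client sketch follows; the filename is a placeholder and `requests` is assumed to be available locally, neither of which is part of this commit.

```python
# Hedged sketch of uploading a PDF to /extract-file on a local backend.
import requests

with open("notes.pdf", "rb") as fh:  # placeholder path
    resp = requests.post(
        "http://127.0.0.1:8000/extract-file",
        files={"file": ("notes.pdf", fh, "application/pdf")},  # field name matches UploadFile param
    )
resp.raise_for_status()
payload = resp.json()  # {"text": "...", "filename": "notes.pdf"}
print(payload["filename"], len(payload["text"]))
```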
requirements.txt ADDED
Binary file (1.15 kB).
 
utils/__init__.py ADDED
File without changes
utils/ai_detection_utils.py ADDED
@@ -0,0 +1,62 @@
+ """
+ AI text detection powered by the desklib DeBERTa-v3 classifier.
+
+ Scores the FULL text and each sentence. Returns the per-sentence bucket
+ breakdown the frontend expects PLUS the honest raw mean probability.
+ """
+ import nltk
+ from nltk.tokenize import sent_tokenize
+
+ from utils.model_loaders import load_detector_model, predict_ai_probability
+
+ nltk.download("punkt", quiet=True)
+
+
+ def classify_text_hf(text, threshold_ai=0.75, threshold_mid=0.4, threshold_soft=0.15):
+     """Classify the input text.
+
+     Returns:
+         classification_map: dict[sentence] -> label bucket
+         percentages: dict[bucket] -> percentage of sentences
+         mean_ai_probability: float 0..1 (full-text score)
+
+     The full-text probability is also used as the headline AI score because
+     detectors are more reliable on full paragraphs than individual sentences.
+     """
+     model, tokenizer, device = load_detector_model()
+
+     # Overall score: run the full text through the model once
+     full_prob = predict_ai_probability(text, model, tokenizer, device)
+
+     sentences = sent_tokenize(text) or [text]
+     classification_map = {}
+     counts = {
+         "AI-generated": 0,
+         "AI-generated & AI-refined": 0,
+         "Human-written": 0,
+         "Human-written & AI-refined": 0,
+     }
+
+     for sentence in sentences:
+         if not sentence.strip():
+             continue
+         prob = predict_ai_probability(sentence, model, tokenizer, device)
+
+         if prob >= threshold_ai:
+             label = "AI-generated"
+         elif prob >= threshold_mid:
+             label = "AI-generated & AI-refined"
+         elif prob >= threshold_soft:
+             label = "Human-written & AI-refined"
+         else:
+             label = "Human-written"
+
+         classification_map[sentence] = label
+         counts[label] += 1
+
+     total = sum(counts.values())
+     percentages = {
+         cat: round((count / total) * 100, 2) if total > 0 else 0
+         for cat, count in counts.items()
+     }
+     return classification_map, percentages, full_prob
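Calling the classifier directly from a Python shell at the repo root might look like the sketch below; the sample text is illustrative, and the first call pays the ~1.75 GB model download described in `model_loaders`.

```python
# Hedged sketch of using classify_text_hf outside FastAPI.
from utils.ai_detection_utils import classify_text_hf

sample = (
    "Artificial intelligence systems generate fluent text. "
    "However, their output often follows predictable patterns."
)
classification_map, percentages, mean_ai_prob = classify_text_hf(sample)

for sentence, label in classification_map.items():
    print(f"{label:35s} {sentence}")
print(percentages)                   # bucket -> % of sentences
print(round(mean_ai_prob * 100, 2))  # full-text AI probability, as /detect reports
```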
utils/desklib_model.py ADDED
@@ -0,0 +1,45 @@
+ """
+ Custom model class for the desklib AI text detector.
+
+ The repo ships `model.safetensors` containing a DeBERTa-v3-large backbone plus
+ a single-logit classifier head. There's no modeling code in the repo, so we
+ recreate the architecture here verbatim from the README and call
+ `from_pretrained()` on THIS class (not `AutoModelForSequenceClassification`)
+ to load the weights.
+
+ Source: https://huggingface.co/desklib/ai-text-detector-v1.01
+ """
+ import torch
+ import torch.nn as nn
+ from transformers import AutoConfig, AutoModel, PreTrainedModel
+
+
+ class DesklibAIDetectionModel(PreTrainedModel):
+     config_class = AutoConfig
+
+     def __init__(self, config):
+         super().__init__(config)
+         self.model = AutoModel.from_config(config)
+         self.classifier = nn.Linear(config.hidden_size, 1)
+         self.init_weights()
+
+     def forward(self, input_ids, attention_mask=None, labels=None):
+         outputs = self.model(input_ids, attention_mask=attention_mask)
+         last_hidden_state = outputs[0]
+
+         # Mean pooling over non-padding tokens
+         mask = attention_mask.unsqueeze(-1).expand(last_hidden_state.size()).float()
+         summed = torch.sum(last_hidden_state * mask, dim=1)
+         counts = torch.clamp(mask.sum(dim=1), min=1e-9)
+         pooled = summed / counts
+
+         logits = self.classifier(pooled)
+         loss = None
+         if labels is not None:
+             loss_fct = nn.BCEWithLogitsLoss()
+             loss = loss_fct(logits.view(-1), labels.float())
+
+         out = {"logits": logits}
+         if loss is not None:
+             out["loss"] = loss
+         return out
utils/humanizer_core.py ADDED
@@ -0,0 +1,261 @@
+ """
+ Pure humanization helpers (no Streamlit).
+
+ Extracted from the original pages/humanize_text.py so the FastAPI backend and
+ any frontend can import these functions without pulling in Streamlit.
+ """
+ import logging
+ import random
+ import re
+ import ssl
+ import warnings
+
+ import nltk
+ import spacy
+ from nltk.corpus import wordnet
+ from nltk.tokenize import sent_tokenize, word_tokenize
+
+ warnings.filterwarnings("ignore", category=FutureWarning)
+
+ logger = logging.getLogger(__name__)
+
+
+ def download_nltk_resources():
+     try:
+         _create_unverified_https_context = ssl._create_unverified_context
+     except AttributeError:
+         pass
+     else:
+         ssl._create_default_https_context = _create_unverified_https_context
+
+     resources = [
+         "punkt",
+         "averaged_perceptron_tagger",
+         "punkt_tab",
+         "wordnet",
+         "averaged_perceptron_tagger_eng",
+     ]
+     for r in resources:
+         nltk.download(r, quiet=True)
+
+
+ download_nltk_resources()
+
+ try:
+     nlp = spacy.load("en_core_web_sm")
+ except OSError:
+     logger.warning(
+         "spaCy en_core_web_sm model not found. Install with: python -m spacy download en_core_web_sm"
+     )
+     nlp = None
+
+
+ CITATION_REGEX = re.compile(
+     r"\(\s*[A-Za-z&\-,\.\s]+(?:et al\.\s*)?,\s*\d{4}(?:,\s*(?:pp?\.\s*\d+(?:-\d+)?))?\s*\)"
+ )
+
+
+ def count_words(text):
+     return len(word_tokenize(text))
+
+
+ def count_sentences(text):
+     return len(sent_tokenize(text))
+
+
+ def extract_citations(text):
+     refs = CITATION_REGEX.findall(text)
+     placeholder_map = {}
+     replaced_text = text
+     for i, r in enumerate(refs, start=1):
+         placeholder = f"[[REF_{i}]]"
+         placeholder_map[placeholder] = r
+         replaced_text = replaced_text.replace(r, placeholder, 1)
+     return replaced_text, placeholder_map
+
+
+ PLACEHOLDER_REGEX = re.compile(r"\[\s*\[\s*REF_(\d+)\s*\]\s*\]")
+
+
+ def restore_citations(text, placeholder_map):
+     def replace_placeholder(match):
+         idx = match.group(1)
+         key = f"[[REF_{idx}]]"
+         return placeholder_map.get(key, match.group(0))
+
+     return PLACEHOLDER_REGEX.sub(replace_placeholder, text)
+
+
+ WHOLE_CONTRACTIONS = {
+     "can't": "cannot",
+     "won't": "will not",
+     "shan't": "shall not",
+     "ain't": "is not",
+     "i'm": "i am",
+     "it's": "it is",
+     "we're": "we are",
+     "they're": "they are",
+     "you're": "you are",
+     "he's": "he is",
+     "she's": "she is",
+     "that's": "that is",
+     "there's": "there is",
+     "what's": "what is",
+     "who's": "who is",
+     "let's": "let us",
+     "didn't": "did not",
+     "doesn't": "does not",
+     "don't": "do not",
+     "couldn't": "could not",
+     "shouldn't": "should not",
+     "wouldn't": "would not",
+     "isn't": "is not",
+     "aren't": "are not",
+     "weren't": "were not",
+     "hasn't": "has not",
+     "haven't": "have not",
+     "hadn't": "had not",
+ }
+
+ SUFFIX_CONTRACTIONS = {
+     "n't": " not",
+     "'re": " are",
+     "'s": " is",
+     "'ll": " will",
+     "'ve": " have",
+     "'d": " would",
+     "'m": " am",
+ }
+
+ ACADEMIC_TRANSITIONS = [
+     "Moreover,",
+     "Additionally,",
+     "Furthermore,",
+     "Hence,",
+     "Therefore,",
+     "Consequently,",
+     "Nonetheless,",
+     "Nevertheless,",
+     "In contrast,",
+     "On the other hand,",
+     "In addition,",
+     "As a result,",
+ ]
+
+
+ def expand_contractions(sentence):
+     alt = "|".join(re.escape(k) for k in WHOLE_CONTRACTIONS.keys())
+     whole_pattern = rf"(?:(``)\s*)?(?P<word>(?:{alt}))(?:\s*(''))?"
+
+     def _replace_whole_with_quotes(match):
+         open_tok = match.group(1) or ""
+         word = match.group("word")
+         close_tok = match.group(3) or ""
+         key = word.lower()
+         repl = WHOLE_CONTRACTIONS.get(key, word)
+         if word and word[0].isupper():
+             repl = repl.capitalize()
+         return f"{open_tok}{repl}{close_tok}"
+
+     sentence = re.sub(
+         whole_pattern, _replace_whole_with_quotes, sentence, flags=re.IGNORECASE
+     )
+
+     tokens = word_tokenize(sentence)
+     out_tokens = []
+     for t in tokens:
+         lower_t = t.lower()
+         replaced = False
+         for contr, expansion in SUFFIX_CONTRACTIONS.items():
+             if lower_t.endswith(contr):
+                 base = lower_t[: -len(contr)]
+                 new_t = base + expansion
+                 if t and t[0].isupper():
+                     new_t = new_t.capitalize()
+                 out_tokens.append(new_t)
+                 replaced = True
+                 break
+         if not replaced:
+             out_tokens.append(t)
+     return " ".join(out_tokens)
+
+
+ def get_synonyms(word, pos):
+     wn_pos = None
+     if pos.startswith("ADJ"):
+         wn_pos = wordnet.ADJ
+     elif pos.startswith("NOUN"):
+         wn_pos = wordnet.NOUN
+     elif pos.startswith("ADV"):
+         wn_pos = wordnet.ADV
+     elif pos.startswith("VERB"):
+         wn_pos = wordnet.VERB
+
+     synonyms = set()
+     if wn_pos:
+         for syn in wordnet.synsets(word, pos=wn_pos):
+             for lemma in syn.lemmas():
+                 lemma_name = lemma.name().replace("_", " ")
+                 if lemma_name.lower() != word.lower():
+                     synonyms.add(lemma_name)
+     return list(synonyms)
+
+
+ def replace_synonyms(sentence, p_syn=0.2):
+     if not nlp:
+         return sentence
+
+     doc = nlp(sentence)
+     new_tokens = []
+     for token in doc:
+         if "[[REF_" in token.text:
+             new_tokens.append(token.text)
+             continue
+         if token.pos_ in ["ADJ", "NOUN", "VERB", "ADV"] and wordnet.synsets(token.text):
+             if random.random() < p_syn:
+                 synonyms = get_synonyms(token.text, token.pos_)
+                 if synonyms:
+                     new_tokens.append(random.choice(synonyms))
+                 else:
+                     new_tokens.append(token.text)
+             else:
+                 new_tokens.append(token.text)
+         else:
+             new_tokens.append(token.text)
+     return " ".join(new_tokens)
+
+
+ def add_academic_transition(sentence, p_transition=0.2):
+     if random.random() < p_transition:
+         transition = random.choice(ACADEMIC_TRANSITIONS)
+         return f"{transition} {sentence}"
+     return sentence
+
+
+ def minimal_humanize_line(line, p_syn=0.2, p_trans=0.2):
+     line = expand_contractions(line)
+     line = replace_synonyms(line, p_syn=p_syn)
+     line = add_academic_transition(line, p_transition=p_trans)
+     return line
+
+
+ def minimal_rewriting(text, p_syn=0.2, p_trans=0.2):
+     lines = sent_tokenize(text)
+     out_lines = [
+         minimal_humanize_line(ln, p_syn=p_syn, p_trans=p_trans) for ln in lines
+     ]
+     return " ".join(out_lines)
+
+
+ def preserve_linebreaks_rewrite(text, p_syn=0.2, p_trans=0.2):
+     """Rewrite text while preserving original line breaks."""
+     lines = text.splitlines()
+     out_lines = []
+     for ln in lines:
+         if not ln.strip():
+             out_lines.append("")
+         else:
+             out_lines.append(
+                 minimal_rewriting(ln, p_syn=p_syn, p_trans=p_trans)
+             )
+     return "\n".join(out_lines)
utils/model_loaders.py ADDED
@@ -0,0 +1,61 @@
+ """
+ Model loaders for the AI detection pipeline.
+
+ Uses `desklib/ai-text-detector-v1.01` — a DeBERTa-v3-large classifier that
+ currently tops the RAID benchmark for modern LLM detection (ChatGPT, Claude,
+ Gemini, Llama, Grok, etc). The model ships a custom head, so we load it via
+ the `DesklibAIDetectionModel` wrapper defined in `utils.desklib_model`.
+ """
+ import logging
+ from functools import lru_cache
+
+ import torch
+ from transformers import AutoTokenizer
+
+ from utils.desklib_model import DesklibAIDetectionModel
+
+ logger = logging.getLogger(__name__)
+
+ DETECTOR_MODEL_ID = "desklib/ai-text-detector-v1.01"
+
+
+ @lru_cache(maxsize=1)
+ def load_detector_model():
+     """Load the desklib AI detector (DeBERTa-v3-large + custom head).
+
+     Returns (model, tokenizer, device). First call downloads ~1.75 GB
+     and caches it under `~/.cache/huggingface`. Subsequent calls return
+     the cached in-process instance.
+     """
+     if torch.cuda.is_available():
+         device = torch.device("cuda")
+     elif torch.backends.mps.is_available():
+         device = torch.device("mps")
+     else:
+         device = torch.device("cpu")
+
+     logger.info("Loading detector %s on %s", DETECTOR_MODEL_ID, device)
+     tokenizer = AutoTokenizer.from_pretrained(DETECTOR_MODEL_ID)
+     model = DesklibAIDetectionModel.from_pretrained(DETECTOR_MODEL_ID)
+     model.to(device)
+     model.eval()
+     logger.info("Detector ready")
+     return model, tokenizer, device
+
+
+ @torch.no_grad()
+ def predict_ai_probability(text, model, tokenizer, device, max_len=768):
+     """Return probability (0..1) that `text` is AI-generated."""
+     encoded = tokenizer(
+         text,
+         padding="max_length",
+         truncation=True,
+         max_length=max_len,
+         return_tensors="pt",
+     )
+     input_ids = encoded["input_ids"].to(device)
+     attention_mask = encoded["attention_mask"].to(device)
+
+     outputs = model(input_ids=input_ids, attention_mask=attention_mask)
+     logits = outputs["logits"]
+     return torch.sigmoid(logits).item()
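A one-off scoring sketch against the loader above: the first call downloads the weights, later calls hit the `lru_cache`'d in-process instance. The sample string is illustrative.

```python
# Hedged sketch of scoring a single string with the cached detector.
from utils.model_loaders import load_detector_model, predict_ai_probability

model, tokenizer, device = load_detector_model()  # cached after the first call
prob = predict_ai_probability(
    "This paragraph was drafted by a language model.", model, tokenizer, device
)
print(f"AI probability: {prob:.3f}")
```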
utils/pdf_utils.py ADDED
@@ -0,0 +1,63 @@
+ # utils/pdf_utils.py
+ import fitz
+ from io import BytesIO
+ import nltk
+ from nltk.tokenize import sent_tokenize, word_tokenize
+
+ nltk.download('punkt', quiet=True)
+
+ def extract_text_from_pdf(pdf_bytes):
+     """Extract text from all pages of a PDF."""
+     doc = fitz.open(stream=pdf_bytes, filetype="pdf")
+     all_text = ""
+     for page in doc:
+         all_text += page.get_text("text") + "\n"
+     doc.close()
+     return all_text
+
+ def word_count(text):
+     return len(word_tokenize(text))
+
+ def generate_annotated_pdf(pdf_bytes, classification_map):
+     """Generate an annotated PDF with color-coded highlights for AI text."""
+     doc = fitz.open(stream=pdf_bytes, filetype="pdf")
+     legend_text = (
+         "Color Legend:\n"
+         "• Red: AI-generated\n"
+         "• Orange: AI-generated & AI-refined\n"
+         "• Light Blue: Human-written & AI-refined\n\n"
+         "Note: Sentences classified as 'Human-written' are not highlighted."
+     )
+     legend_page = doc.new_page(pno=0)
+     legend_page.insert_text((72, 72), legend_text, fontsize=14, fontname="helv")
+
+     def hex_to_rgb_float(hex_color):
+         hex_color = hex_color.lstrip('#')
+         r = int(hex_color[0:2], 16) / 255.0
+         g = int(hex_color[2:4], 16) / 255.0
+         b = int(hex_color[4:6], 16) / 255.0
+         return (r, g, b)
+
+     COLOR_MAPPING = {
+         "AI-generated": "#ffcccc",
+         "AI-generated & AI-refined": "#ffe5cc",
+         "Human-written & AI-refined": "#e6f2ff"
+     }
+
+     for sentence, label in classification_map.items():
+         if label == "Human-written":
+             continue
+         color_hex = COLOR_MAPPING.get(label)
+         if not color_hex:
+             continue
+         color = hex_to_rgb_float(color_hex)
+         for page in doc:
+             rects = page.search_for(sentence)
+             for rect in rects:
+                 annot = page.add_highlight_annot(rect)
+                 annot.set_colors(stroke=color)
+                 annot.update()
+
+     out_bytes = doc.write()
+     doc.close()
+     return BytesIO(out_bytes)
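The annotation helper above is not wired to an API endpoint in this commit, but a standalone sketch of how it composes with the detector could look like the following. The input path is a placeholder and the first classifier call pays the model download.

```python
# Hedged sketch: extract text from a PDF, classify sentences, write a
# highlighted copy next to the original.
from utils.pdf_utils import extract_text_from_pdf, generate_annotated_pdf
from utils.ai_detection_utils import classify_text_hf

with open("paper.pdf", "rb") as fh:          # placeholder path
    pdf_bytes = fh.read()

text = extract_text_from_pdf(pdf_bytes)
classification_map, _, _ = classify_text_hf(text)

annotated = generate_annotated_pdf(pdf_bytes, classification_map)  # BytesIO
with open("paper_annotated.pdf", "wb") as out:
    out.write(annotated.getvalue())
```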