DeepActionPotential committed on Jun 9

Commit e46d899 (verified) · 1 Parent(s): 26cbd99

Upload folder using huggingface_hub

Files changed (14)

.gitattributes +3 -0
LICENCE +21 -0
README.md +100 -19
__pycache__/utils.cpython-311.pyc +0 -0
ai_vs_human_text_fine_tuned_classifier.ipynb +0 -0
app.py +54 -0
images/1.png +3 -0
images/2.png +3 -0
images/FineTextTector.mp4 +3 -0
models/best_model.joblib +3 -0
requirements.txt +7 -3
run.py +2 -0
styles.css +67 -0
utils.py +50 -0

.gitattributes CHANGED

@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+images/1.png filter=lfs diff=lfs merge=lfs -text
+images/2.png filter=lfs diff=lfs merge=lfs -text
+images/FineTextTector.mp4 filter=lfs diff=lfs merge=lfs -text

LICENCE ADDED

	@@ -0,0 +1,21 @@

+MIT License
+Copyright (c) 2025 Eslam Tarek
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.

README.md CHANGED

@@ -1,19 +1,100 @@
----
-title: FineTextTector
-emoji: 🚀
-colorFrom: red
-colorTo: red
-sdk: docker
-app_port: 8501
-tags:
-- streamlit
-pinned: false
-short_description: A fine-tuned classifier for text
----
-# Welcome to Streamlit!
-Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
-If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
-forums](https://discuss.streamlit.io).

+# AI vs Human Text Fine-Tuned Classifier
+## About the Project
+This project aims to develop a robust machine learning model capable of distinguishing between human-written and AI-generated text. With the rapid advancement of large language models (LLMs) such as ChatGPT and Gemini, the ability to identify the origin of a text has become crucial in various domains, including academic integrity, content moderation, misinformation detection, and authorship verification. The project leverages state-of-the-art natural language processing (NLP) techniques and transfer learning to build a binary classifier that can accurately predict whether a given text was authored by a human or generated by an AI.
+The workflow encompasses comprehensive exploratory data analysis (EDA), advanced text preprocessing, model selection and fine-tuning, and thorough evaluation. The final model is designed to be easily deployable and accessible for real-world applications.
+## About the Dataset
+The dataset used in this project is sourced from Kaggle: [AI vs Human Text Dataset](https://www.kaggle.com/datasets/shanegerami/ai-vs-human-text). It contains a large collection of text samples, each labeled as either human-written or AI-generated. The dataset is well-suited for binary classification tasks and provides a diverse range of topics and writing styles, making it ideal for training and evaluating models that need to generalize across different types of content.
+- **Features:**
+  - `text`: The actual text sample.
+  - `generated`: Label indicating the source (0 for human, 1 for AI).
+The dataset is split into training, validation, and test sets to ensure unbiased evaluation and robust model performance.
+## Notebook Summary
+The main notebook, `ai_vs_human_text_fine_tuned_classifier.ipynb`, guides users through the entire process of building the classifier:
+1. **Problem Definition:** Outlines the motivation and objectives.
+2. **Exploratory Data Analysis (EDA):** Visualizes class distributions, text lengths, lexical richness, punctuation usage, and stopword ratios to uncover patterns and differences between human and AI texts.
+3. **Text Preprocessing:** Applies normalization, stopword removal, noise filtering (removing URLs, emails, hashtags, mentions, numbers), and filters out outlier texts based on length.
+4. **Model Selection:** Utilizes transfer learning with the `distilbert/distilroberta-base` model, enhanced with LoRA (Low-Rank Adaptation) for efficient fine-tuning.
+5. **Training:** Fine-tunes the model on a subset of the data, using stratified splits and advanced training arguments for optimal performance.
+6. **Evaluation:** Assesses the model using accuracy, precision, recall, and F1-score on a held-out test set.
+7. **Deployment:** Demonstrates how to push the trained model and tokenizer to Hugging Face Hub for sharing and reuse.
+## Model Results
+### Preprocessing
+- **Lowercasing and Stripping:** All text is converted to lowercase and stripped of extra whitespace.
+- **Punctuation and Stopword Removal:** Punctuation is removed, and stopwords are filtered out to focus on meaningful content.
+- **Noise Filtering:** Regular expressions are used to remove URLs, emails, hashtags, mentions, and numbers.
+- **Outlier Filtering:** Texts that are extremely short or long (based on quantiles) are removed to ensure consistent input lengths for the model.
+- **Deduplication:** Duplicate texts are dropped to prevent data leakage.
+### Training
+- **Model Architecture:** The project uses `distilbert/distilroberta-base`, a distilled version of RoBERTa, known for its efficiency and strong performance on text classification tasks.
+- **LoRA Fine-Tuning:** LoRA (Low-Rank Adaptation) is applied to reduce the number of trainable parameters, making the fine-tuning process more memory- and compute-efficient without sacrificing accuracy.
+- **Training Arguments:** The model is trained for 2 epochs with early stopping, regular evaluation, and checkpointing. Batch sizes and learning rates are carefully chosen for stability and speed.
+### Evaluation
+- **Metrics:** The model is evaluated using accuracy, precision, recall, and F1-score. These metrics provide a comprehensive view of the classifier's performance, especially in distinguishing between the two classes.
+- **Results:** The fine-tuned model demonstrates strong performance, with high accuracy and balanced precision/recall, indicating its effectiveness in real-world scenarios.
+## How to Install
+Follow these steps to set up the environment using Python's built-in `venv`:
+```bash
+# Clone the repository
+git clone https://github.com/DeepActionPotential/FineTextTector
+cd FineTextTector
+# Create a virtual environment
+python -m venv venv
+# Activate the virtual environment
+# On Windows:
+venv\Scripts\activate
+# On macOS/Linux:
+source venv/bin/activate
+# Install required packages
+pip install -r requirements.txt
+```
+## How to Use the Software
+- [Demo video](images/FineTextTector.mp4)
+- ![Demo image 1](images/1.png)
+- ![Demo image 2](images/2.png)
+## Technologies Used
+- **Transformers (Hugging Face):** Core library for model loading, tokenization, and training. Enables transfer learning with state-of-the-art NLP models.
+- **Datasets (Hugging Face):** Efficient data handling, splitting, and preprocessing.
+- **PEFT (Parameter-Efficient Fine-Tuning):** Implements LoRA for memory- and compute-efficient model adaptation.
+- **Optuna:** Automated hyperparameter optimization to fine-tune model performance.
+- **Scikit-learn:** Data splitting, metrics calculation, and utility functions.
+- **Seaborn & Matplotlib:** Data visualization for EDA and result interpretation.
+- **NLTK:** Stopword lists and basic NLP utilities.
+- **Python venv:** Isolated environment management for reproducible installations.
+These technologies collectively enable efficient, scalable, and reproducible development of advanced NLP models.
+## License
+This project is licensed under the MIT License. See the [LICENCE](LICENCE) file for details.
+---
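The four evaluation metrics named in the README (accuracy, precision, recall, F1-score) can be computed directly from true and predicted labels. The sketch below is a minimal illustration, assuming the dataset's label convention (0 for human, 1 for AI) and treating AI as the positive class. The function name `binary_metrics` and the sample labels are hypothetical and are not taken from the notebook or from the project's results.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard every division so empty or one-sided inputs do not raise.
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels for illustration only: 0 = human, 1 = AI.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting precision and recall alongside accuracy matters here because a false positive (human text flagged as AI) and a false negative carry different costs in settings such as academic integrity checks.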

__pycache__/utils.cpython-311.pyc ADDED

Binary file (2.35 kB).

ai_vs_human_text_fine_tuned_classifier.ipynb ADDED

The diff for this file is too large to render.

app.py ADDED

	@@ -0,0 +1,54 @@

+import nltk
+import streamlit as st
+nltk.download('stopwords', quiet=True)  # must run before importing utils, which loads the stopword list at import time
+from utils import load_model, preprocess_text
+model = load_model('./models/best_model.joblib')
+min_words_number = 100
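+def check_generated_text(text):
+    """Return True if the model predicts label 0 (human-written), False for label 1 (AI-generated)."""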
+    filtered_text = preprocess_text(text)
+    prediction = model.predict([filtered_text])
+    return not int(prediction[0])
+# Load styles
+with open("styles.css") as f:
+    st.markdown(f"<style>{f.read()}</style>", unsafe_allow_html=True)
+# Title
+st.title("Generated Text Checker")
+# Initialize session state
+if "check_clicked" not in st.session_state:
+    st.session_state.check_clicked = False
+# Use a form to isolate the check action
+with st.form("text_check_form"):
+    user_input = st.text_area(
+        "Enter text to check",
+        height=400,
+        placeholder=f"Paste your generated text here... it should be at least {min_words_number} words"
+    )
+    submitted = st.form_submit_button("Check text")
+# Handle form submission
+if submitted:
+    st.session_state.check_clicked = True
+# Only run check when button is clicked
+if st.session_state.check_clicked:
+    with st.spinner("Checking text..."):
+        current_length = len(user_input.split())
+        if current_length >= min_words_number:
+            result = check_generated_text(user_input)
+            if result:
+                st.info("✅ The text appears to be human-written!")
+            else:
+                st.info("🤖 The text appears to be AI-generated.")
+        else:
+            st.warning(f"Please enter at least {min_words_number} words.")
+    # Reset check state
+    st.session_state.check_clicked = False
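The decision path in `app.py` (word-count gate, preprocessing, prediction, label mapping) can be separated from the Streamlit UI so it is easy to test. The following is a minimal sketch under stated assumptions: `StubModel` and `classify_text` are hypothetical names that do not exist in the repository, and the stub stands in for the joblib model, of which only a `predict` method taking a list of strings is assumed.

```python
MIN_WORDS = 100  # same threshold as min_words_number in app.py


class StubModel:
    """Hypothetical stand-in for the joblib model; always predicts one label."""

    def __init__(self, label):
        self.label = label

    def predict(self, texts):
        return [self.label for _ in texts]


def classify_text(text, model, preprocess=lambda s: s, min_words=MIN_WORDS):
    """Return 'too_short', 'human', or 'ai' for the given text."""
    # Gate on the raw word count, as app.py does before calling the model.
    if len(text.split()) < min_words:
        return "too_short"
    prediction = model.predict([preprocess(text)])
    # Label convention from the README: 0 = human, 1 = AI.
    return "human" if int(prediction[0]) == 0 else "ai"


sample = " ".join(["word"] * 120)
print(classify_text(sample, StubModel(0)))           # human
print(classify_text(sample, StubModel(1)))           # ai
print(classify_text("too few words", StubModel(1)))  # too_short
```

Returning a plain string keeps the UI layer to a simple mapping from result to message, and the real `preprocess_text` can be passed in through the `preprocess` argument.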

images/1.png ADDED

Git LFS Details

SHA256: 58d27f2956d5eabc5e8e5e8f412028b7fb00c592f19b70e0145c0fdc4458e5f3
Pointer size: 131 Bytes
Size of remote file: 107 kB

images/2.png ADDED

Git LFS Details

SHA256: 65b94cfcbe227db816ad5a96a927734c36692d67a1ee1d34fa929386d7d4ba1c
Pointer size: 131 Bytes
Size of remote file: 705 kB

images/FineTextTector.mp4 ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ef4863509b18e08f527035c08b345cd452b378a42518333b3d1566f38a15d6f8
+size 4457655

models/best_model.joblib ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:946bba1b68f35d1e95e5245105fc4d1a991f8293df5c3407f7cfa723d359d400
+size 436909018
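The two pointer files above are not the binaries themselves: a Git LFS pointer is a short text file of `key value` lines giving the spec version, the content hash, and the size in bytes of the real object. A minimal sketch of reading one is shown below; `parse_lfs_pointer` is a hypothetical helper written for illustration, not part of Git LFS or of this repository.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into version, hash algorithm, oid, and size."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line is "<key> <value>"; split on the first space only.
        key, _, value = line.partition(" ")
        fields[key] = value
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algorithm": algorithm,
        "oid": digest,
        "size": int(fields["size"]),
    }


# Pointer content for models/best_model.joblib, copied from the diff above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:946bba1b68f35d1e95e5245105fc4d1a991f8293df5c3407f7cfa723d359d400
size 436909018
"""
info = parse_lfs_pointer(pointer)
size_mb = info["size"] / 1_000_000
print(f"{info['algorithm']} object, about {size_mb:.1f} MB")
```

Read this way, the model file is roughly 437 MB, which is why it is tracked through LFS rather than stored directly in the Git history.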

requirements.txt CHANGED

@@ -1,3 +1,7 @@
-altair
-pandas
-streamlit

+streamlit>=1.29.1
+nltk==3.8.1
+scikit-learn==1.3.2
+joblib==1.3.2
+pandas==2.1.4
+numpy==1.26.2
+python-dotenv==1.0.0

run.py ADDED

	@@ -0,0 +1,2 @@


+import subprocess
+subprocess.run(["streamlit", "run", "app.py"])
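`run.py` relies on a `streamlit` executable being on the `PATH`, which may not hold if the virtual environment is not activated. One common alternative, sketched here as an assumption and not as the author's method, is to launch Streamlit as a module of the interpreter that is already running. `build_streamlit_command` and `main` are hypothetical names.

```python
import subprocess
import sys


def build_streamlit_command(script="app.py"):
    """Build a command that runs Streamlit with the current interpreter."""
    # "python -m streamlit run <script>" uses the Streamlit installed
    # for this interpreter, regardless of what is on PATH.
    return [sys.executable, "-m", "streamlit", "run", script]


def main():
    # check=True raises CalledProcessError if Streamlit exits with a non-zero status.
    subprocess.run(build_streamlit_command(), check=True)


command = build_streamlit_command()
print(command[1:])  # ['-m', 'streamlit', 'run', 'app.py']
```

Calling `main()` would start the app; it is left uncalled here so the sketch can be imported or inspected without launching a server.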

styles.css ADDED

	@@ -0,0 +1,67 @@

+#MainMenu, header, footer {
+    visibility: hidden;
+}
+.stApp {
+    background-color: #343541 !important;
+    color: #ECECEF !important;
+}
+.stTextArea>div>div>textarea {
+    background-color: #40414F !important;
+    color: #ECECEF !important;
+    border-radius: 8px !important;
+    padding: 16px !important;
+    border: 1px solid #565869 !important;
+    font-size: 16px !important;
+    min-height: 300px !important;
+}
+.stTextArea>label {
+    color: #ECECEF !important;
+    font-size: 18px !important;
+}
+.stButton>button {
+    background-color: #19C37D !important;
+    color: white !important;
+    border: none !important;
+    border-radius: 8px !important;
+    padding: 12px 24px !important;
+    font-size: 16px !important;
+    font-weight: 500 !important;
+    transition: background-color 0.3s ease !important;
+}
+.stButton>button:hover {
+    background-color: #15A46C !important;
+    color: white !important;
+}
+.stAlert {
+    border-radius: 8px !important;
+    padding: 16px !important;
+}
+.stAlert [data-testid="stMarkdownContainer"] {
+    color: #ECECEF !important;
+}
+.stAlert.st-emotion-cache-1hyeoxa {
+    background-color: rgba(25, 195, 125, 0.1) !important;
+    border: 1px solid #19C37D !important;
+}
+.stAlert.st-emotion-cache-1d3z3hw {
+    background-color: rgba(239, 65, 70, 0.1) !important;
+    border: 1px solid #EF4146 !important;
+}
+.stTitle {
+    color: #ECECEF !important;
+    text-align: center !important;
+    margin-bottom: 32px !important;
+}

utils.py ADDED

	@@ -0,0 +1,50 @@

+import joblib
+import re
+import string
+from nltk.corpus import stopwords
+def load_model(model_path):
+    """
+    Load a joblib model
+    Args:
+    - model_path (str): path to the model
+    Returns:
+    - model: loaded model
+    """
+    model = joblib.load(model_path)
+    return model
+# Set of English stopwords
+stop_words = set(stopwords.words('english'))
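+def preprocess_text(text: str) -> str:
+    """Lowercase the text, collapse whitespace, and remove punctuation, stopwords, and noise."""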
+    # Step 1: Lowercase
+    text = text.lower()
+    # Step 2: Strip extra whitespace
+    text = re.sub(r'\s+', ' ', text.strip())
+    # Step 3: Remove punctuation
+    text = text.translate(str.maketrans('', '', string.punctuation))
+    # Step 4: Remove stopwords
+    text = ' '.join(word for word in text.split() if word not in stop_words)
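+    # Step 5: Remove noise (URLs, emails, hashtags, mentions, numbers, non-printables)
+    # Note: step 3 already removed ASCII punctuation, so the email, hashtag, mention,
+    # and `www\.` patterns cannot match at this point; `http\S+` and `\d+` still apply.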
+    text = re.sub(r'http\S+|www\.\S+', '', text)       # URLs
+    text = re.sub(r'\S+@\S+\.\S+', '', text)           # Emails
+    text = re.sub(r'#[A-Za-z0-9_]+', '', text)         # Hashtags
+    text = re.sub(r'@[A-Za-z0-9_]+', '', text)         # Mentions
+    text = re.sub(r'\d+', '', text)                    # Numbers
+    text = ''.join(ch for ch in text if ch.isprintable())  # Non-printables
+    return text
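To see what `preprocess_text` produces in practice, it helps to trace one input through the same sequence of steps. The sketch below mirrors the order used in `utils.py` but swaps NLTK's English stopword list for a tiny hard-coded set so it runs offline; that set, the function name `preprocess_sketch`, and the sample sentence are assumptions made for illustration.

```python
import re
import string

# Tiny stand-in for NLTK's English stopword list (assumption, keeps the sketch offline).
STOP_WORDS = {"the", "is", "at", "me", "a", "and"}


def preprocess_sketch(text):
    """Apply the same steps, in the same order, as preprocess_text in utils.py."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text.strip())
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = " ".join(w for w in text.split() if w not in STOP_WORDS)
    text = re.sub(r"http\S+|www\.\S+", "", text)   # URLs
    text = re.sub(r"\S+@\S+\.\S+", "", text)       # Emails
    text = re.sub(r"#[A-Za-z0-9_]+", "", text)     # Hashtags
    text = re.sub(r"@[A-Za-z0-9_]+", "", text)     # Mentions
    text = re.sub(r"\d+", "", text)                # Numbers
    return "".join(ch for ch in text if ch.isprintable())


result = preprocess_sketch(
    "Email me at Jane.Doe@example.com and see https://example.com #NLP 2024"
)
print(result.split())  # ['email', 'janedoeexamplecom', 'see', 'nlp']
```

Because punctuation is stripped before the noise patterns run, the email address survives as a single merged token and the hashtag keeps its word, while the URL (which still begins with `http`) and the number are removed. The output can also contain leftover double or trailing spaces, which is why the example compares tokens and not the raw string.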