Upload folder using huggingface_hub
- .gitattributes +3 -0
- LICENCE +21 -0
- README.md +100 -19
- __pycache__/utils.cpython-311.pyc +0 -0
- ai_vs_human_text_fine_tuned_classifier.ipynb +0 -0
- app.py +54 -0
- images/1.png +3 -0
- images/2.png +3 -0
- images/FineTextTector.mp4 +3 -0
- models/best_model.joblib +3 -0
- requirements.txt +7 -3
- run.py +2 -0
- styles.css +67 -0
- utils.py +50 -0
.gitattributes
CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+images/1.png filter=lfs diff=lfs merge=lfs -text
+images/2.png filter=lfs diff=lfs merge=lfs -text
+images/FineTextTector.mp4 filter=lfs diff=lfs merge=lfs -text
LICENCE
ADDED
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2025 Eslam Tarek
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
README.md
CHANGED
@@ -1,19 +1,100 @@
+# AI vs Human Text Fine-Tuned Classifier
+
+## About the Project
+
+This project aims to develop a robust machine learning model capable of distinguishing between human-written and AI-generated text. With the rapid advancement of large language models (LLMs) such as ChatGPT and Gemini, the ability to identify the origin of a text has become crucial in various domains, including academic integrity, content moderation, misinformation detection, and authorship verification. The project leverages state-of-the-art natural language processing (NLP) techniques and transfer learning to build a binary classifier that can accurately predict whether a given text was authored by a human or generated by an AI.
+
+The workflow encompasses comprehensive exploratory data analysis (EDA), advanced text preprocessing, model selection and fine-tuning, and thorough evaluation. The final model is designed to be easily deployable and accessible for real-world applications.
+
+## About the Dataset
+
+The dataset used in this project is sourced from Kaggle: [AI vs Human Text Dataset](https://www.kaggle.com/datasets/shanegerami/ai-vs-human-text). It contains a large collection of text samples, each labeled as either human-written or AI-generated. The dataset is well-suited for binary classification tasks and provides a diverse range of topics and writing styles, making it ideal for training and evaluating models that need to generalize across different types of content.
+
+- **Features:**
+  - `text`: The actual text sample.
+  - `generated`: Label indicating the source (0 for human, 1 for AI).
+
+The dataset is split into training, validation, and test sets to ensure unbiased evaluation and robust model performance.
+
+## Notebook Summary
+
+The main notebook, `ai_vs_human_text_fine_tuned_classifier.ipynb`, guides users through the entire process of building the classifier:
+
+1. **Problem Definition:** Outlines the motivation and objectives.
+2. **Exploratory Data Analysis (EDA):** Visualizes class distributions, text lengths, lexical richness, punctuation usage, and stopword ratios to uncover patterns and differences between human and AI texts.
+3. **Text Preprocessing:** Applies normalization, stopword removal, noise filtering (removing URLs, emails, hashtags, mentions, numbers), and filters out outlier texts based on length.
+4. **Model Selection:** Utilizes transfer learning with the `distilbert/distilroberta-base` model, enhanced with LoRA (Low-Rank Adaptation) for efficient fine-tuning.
+5. **Training:** Fine-tunes the model on a subset of the data, using stratified splits and advanced training arguments for optimal performance.
+6. **Evaluation:** Assesses the model using accuracy, precision, recall, and F1-score on a held-out test set.
+7. **Deployment:** Demonstrates how to push the trained model and tokenizer to Hugging Face Hub for sharing and reuse.
+
+## Model Results
+
+### Preprocessing
+
+- **Lowercasing and Stripping:** All text is converted to lowercase and stripped of extra whitespace.
+- **Punctuation and Stopword Removal:** Punctuation is removed, and stopwords are filtered out to focus on meaningful content.
+- **Noise Filtering:** Regular expressions are used to remove URLs, emails, hashtags, mentions, and numbers.
+- **Outlier Filtering:** Texts that are extremely short or long (based on quantiles) are removed to ensure consistent input lengths for the model.
+- **Deduplication:** Duplicate texts are dropped to prevent data leakage.
+
+### Training
+
+- **Model Architecture:** The project uses `distilbert/distilroberta-base`, a distilled version of RoBERTa, known for its efficiency and strong performance on text classification tasks.
+- **LoRA Fine-Tuning:** LoRA (Low-Rank Adaptation) is applied to reduce the number of trainable parameters, making the fine-tuning process more memory- and compute-efficient without sacrificing accuracy.
+- **Training Arguments:** The model is trained for 2 epochs with early stopping, regular evaluation, and checkpointing. Batch sizes and learning rates are carefully chosen for stability and speed.
+
+### Evaluation
+
+- **Metrics:** The model is evaluated using accuracy, precision, recall, and F1-score. These metrics provide a comprehensive view of the classifier's performance, especially in distinguishing between the two classes.
+- **Results:** The fine-tuned model demonstrates strong performance, with high accuracy and balanced precision/recall, indicating its effectiveness in real-world scenarios.
+
+## How to Install
+
+Follow these steps to set up the environment using Python's built-in `venv`:
+
+```bash
+# Clone the repository
+git clone https://github.com/DeepActionPotential/FineTextTector
+cd FineTextTector
+
+# Create a virtual environment
+python -m venv venv
+
+# Activate the virtual environment
+# On Windows:
+venv\Scripts\activate
+# On macOS/Linux:
+source venv/bin/activate
+
+
+# Install required packages
+pip install -r requirements.txt
+```
+
+
+
+## How to Use the Software
+
+- [Demo video](images/FineTextTector.mp4)
+- 
+- 
+
+## Technologies Used
+
+- **Transformers (Hugging Face):** Core library for model loading, tokenization, and training. Enables transfer learning with state-of-the-art NLP models.
+- **Datasets (Hugging Face):** Efficient data handling, splitting, and preprocessing.
+- **PEFT (Parameter-Efficient Fine-Tuning):** Implements LoRA for memory- and compute-efficient model adaptation.
+- **Optuna:** Automated hyperparameter optimization to fine-tune model performance.
+- **Scikit-learn:** Data splitting, metrics calculation, and utility functions.
+- **Seaborn & Matplotlib:** Data visualization for EDA and result interpretation.
+- **NLTK:** Stopword lists and basic NLP utilities.
+- **Python venv:** Isolated environment management for reproducible installations.
+
+These technologies collectively enable efficient, scalable, and reproducible development of advanced NLP models.
+
+## License
+
+This project is licensed under the MIT License. See the [LICENCE](LICENCE) file for details.
+
+---
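The four metrics named in the README's Evaluation section can all be derived from confusion-matrix counts. The sketch below is illustrative only: the function name `binary_metrics` and the sample labels are assumptions, not taken from the notebook, and the positive class is assumed to be `1` (AI-generated), matching the `generated` column described in the README.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 0 = human, 1 = AI-generated.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

With three true positives, one false positive, one false negative and three true negatives, all four scores come out to 0.75 for this toy input.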
__pycache__/utils.cpython-311.pyc
ADDED
Binary file (2.35 kB).
ai_vs_human_text_fine_tuned_classifier.ipynb
ADDED
The diff for this file is too large to render.
app.py
ADDED
@@ -0,0 +1,54 @@
+import nltk
+import streamlit as st
+
+# Fetch the stopword list first: utils reads it at import time.
+nltk.download('stopwords')
+from utils import load_model, preprocess_text
+model = load_model('./models/best_model.joblib')
+min_words_number = 100
+
+def check_generated_text(text):
+    filtered_text = preprocess_text(text)
+    prediction = model.predict([filtered_text])
+    return not int(prediction[0])  # True when the label is 0 (human-written)
+
+# Load styles
+with open("styles.css") as f:
+    st.markdown(f"<style>{f.read()}</style>", unsafe_allow_html=True)
+
+# Title
+st.title("Generated Text Checker")
+
+# Initialize session state
+if "check_clicked" not in st.session_state:
+    st.session_state.check_clicked = False
+
+# Use a form to isolate the check action
+with st.form("text_check_form"):
+    user_input = st.text_area(
+        "Enter text to check",
+        height=400,
+        placeholder=f"Paste your generated text here... it should be at least {min_words_number} words"
+    )
+    submitted = st.form_submit_button("Check text")
+
+# Handle form submission
+if submitted:
+    st.session_state.check_clicked = True
+
+# Only run check when button is clicked
+if st.session_state.check_clicked:
+    with st.spinner("Checking text..."):
+        current_length = len(user_input.split())
+
+        if current_length >= min_words_number:
+            result = check_generated_text(user_input)
+            if result:
+                st.info("✅ The text appears to be human-written!")
+            else:
+                st.info("🤖 The text appears to be AI-generated.")
+        else:
+            st.warning(f"Please enter at least {min_words_number} words.")
+
+    # Reset check state
+    st.session_state.check_clicked = False
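The decision logic in `app.py` can be exercised without Streamlit or the large model file. The sketch below is a stand-in under stated assumptions: `classify` and `StubModel` are hypothetical names, the stub predicts from a keyword rather than learned weights, and the preprocessing step is skipped. It mirrors the two rules visible in the diff: inputs with fewer than `min_words_number` words are rejected, and a predicted label of `0` is reported as human-written.

```python
MIN_WORDS = 100  # same threshold as min_words_number in app.py


class StubModel:
    """Hypothetical stand-in for the joblib model; not the real classifier."""

    def predict(self, texts):
        # Pretend any text containing "delve" is AI-generated (label 1).
        return [1 if "delve" in t else 0 for t in texts]


def classify(text, model, min_words=MIN_WORDS):
    """Return 'too_short', 'human', or 'ai' for the given text."""
    if len(text.split()) < min_words:
        return "too_short"
    label = int(model.predict([text])[0])
    return "human" if label == 0 else "ai"


model = StubModel()
print(classify("just a few words", model))   # below the word threshold
print(classify("word " * 100, model))        # stub predicts label 0
print(classify("delve " * 100, model))       # stub predicts label 1
```

Keeping the threshold check and label mapping in a plain function like this makes them unit-testable independently of the UI.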
images/1.png
ADDED
images/2.png
ADDED
images/FineTextTector.mp4
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ef4863509b18e08f527035c08b345cd452b378a42518333b3d1566f38a15d6f8
+size 4457655
models/best_model.joblib
ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:946bba1b68f35d1e95e5245105fc4d1a991f8293df5c3407f7cfa723d359d400
+size 436909018
requirements.txt
CHANGED
@@ -1,3 +1,7 @@
+streamlit>=1.29.1
+nltk==3.8.1
+scikit-learn==1.3.2
+joblib==1.3.2
+pandas==2.1.4
+numpy==1.26.2
+python-dotenv==1.0.0
run.py
ADDED
@@ -0,0 +1,2 @@
+import subprocess
+subprocess.run(["streamlit", "run", "app.py"])
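`run.py` relies on a `streamlit` executable being on `PATH`. One common alternative, sketched here as an assumption rather than the project's approach, is to invoke Streamlit as a module through the current interpreter so that the active virtual environment is used. The helper names `build_command` and `main` are hypothetical.

```python
import subprocess
import sys


def build_command(script="app.py"):
    """Build a command that runs Streamlit with the current interpreter."""
    return [sys.executable, "-m", "streamlit", "run", script]


def main():
    # check=True raises CalledProcessError if Streamlit exits with an error.
    subprocess.run(build_command(), check=True)


# Only build and show the command here; call main() to actually launch the app.
command = build_command()
print(command)
```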
styles.css
ADDED
@@ -0,0 +1,67 @@
+
+
+#MainMenu, header, footer {
+    visibility: hidden;
+}
+
+
+.stApp {
+    background-color: #343541 !important;
+    color: #ECECEF !important;
+}
+
+.stTextArea>div>div>textarea {
+    background-color: #40414F !important;
+    color: #ECECEF !important;
+    border-radius: 8px !important;
+    padding: 16px !important;
+    border: 1px solid #565869 !important;
+    font-size: 16px !important;
+    min-height: 300px !important;
+}
+
+.stTextArea>label {
+    color: #ECECEF !important;
+    font-size: 18px !important;
+}
+
+.stButton>button {
+    background-color: #19C37D !important;
+    color: white !important;
+    border: none !important;
+    border-radius: 8px !important;
+    padding: 12px 24px !important;
+    font-size: 16px !important;
+    font-weight: 500 !important;
+    transition: background-color 0.3s ease !important;
+}
+
+.stButton>button:hover {
+    background-color: #15A46C !important;
+    color: white !important;
+}
+
+.stAlert {
+    border-radius: 8px !important;
+    padding: 16px !important;
+}
+
+.stAlert [data-testid="stMarkdownContainer"] {
+    color: #ECECEF !important;
+}
+
+.stAlert.st-emotion-cache-1hyeoxa {
+    background-color: rgba(25, 195, 125, 0.1) !important;
+    border: 1px solid #19C37D !important;
+}
+
+.stAlert.st-emotion-cache-1d3z3hw {
+    background-color: rgba(239, 65, 70, 0.1) !important;
+    border: 1px solid #EF4146 !important;
+}
+
+.stTitle {
+    color: #ECECEF !important;
+    text-align: center !important;
+    margin-bottom: 32px !important;
+}
utils.py
ADDED
@@ -0,0 +1,50 @@
+import re
+import string
+
+import joblib
+from nltk.corpus import stopwords
+
+
+def load_model(model_path):
+    """
+    Load a joblib model
+
+    Args:
+    - model_path (str): path to the model
+
+    Returns:
+    - model: loaded model
+    """
+    model = joblib.load(model_path)
+    return model
+
+
+# Set of English stopwords; the NLTK 'stopwords' corpus must already be
+# downloaded when this module is imported.
+stop_words = set(stopwords.words('english'))
+
+
+def preprocess_text(text: str) -> str:
+    """Lowercase the text, then strip punctuation, stopwords and noise."""
+    # Step 1: Lowercase
+    text = text.lower()
+
+    # Step 2: Strip extra whitespace
+    text = re.sub(r'\s+', ' ', text.strip())
+
+    # Step 3: Remove punctuation
+    text = text.translate(str.maketrans('', '', string.punctuation))
+
+    # Step 4: Remove stopwords
+    text = ' '.join(word for word in text.split() if word not in stop_words)
+
+    # Step 5: Remove noise (URLs, emails, hashtags, mentions, numbers, non-printables)
+    # Note: punctuation is already gone here, so '@', '#' and '.' can no longer match.
+    text = re.sub(r'http\S+|www\.\S+', '', text)  # URLs
+    text = re.sub(r'\S+@\S+\.\S+', '', text)  # Emails
+    text = re.sub(r'#[A-Za-z0-9_]+', '', text)  # Hashtags
+    text = re.sub(r'@[A-Za-z0-9_]+', '', text)  # Mentions
+    text = re.sub(r'\d+', '', text)  # Numbers
+    text = ''.join(ch for ch in text if ch.isprintable())  # Non-printables
+
+    return text
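Because `preprocess_text` strips punctuation before it applies the noise regexes, several of those patterns no longer find the characters they look for. The sketch below reproduces steps 1, 3 and 5 in isolation to show which tokens survive. It is an illustration under assumptions: stopword removal is skipped so that NLTK is not needed, and the helper name `strip_then_filter` and the sample string are made up for this example.

```python
import re
import string


def strip_then_filter(text):
    """Remove punctuation, then apply the noise regexes, in utils.py order."""
    text = text.lower()
    text = text.translate(str.maketrans('', '', string.punctuation))
    text = re.sub(r'http\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'\S+@\S+\.\S+', '', text)      # Emails
    text = re.sub(r'#[A-Za-z0-9_]+', '', text)    # Hashtags
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)    # Mentions
    text = re.sub(r'\d+', '', text)               # Numbers
    return text.split()


sample = "See https://example.com or mail bob@example.com #nlp @alice 42"
tokens = strip_then_filter(sample)
print(tokens)
# → ['see', 'or', 'mail', 'bobexamplecom', 'nlp', 'alice']
```

The URL is still dropped because its `http` prefix survives, but the email address, hashtag and mention pass through with only their punctuation removed. Running the noise regexes before the punctuation step would let those patterns match; however, a model trained on the current ordering would then receive differently shaped input, so any reordering would likely call for re-evaluation or retraining.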