DeepActionPotential committed
Commit e46d899 · verified · 1 Parent(s): 26cbd99

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ images/1.png filter=lfs diff=lfs merge=lfs -text
+ images/2.png filter=lfs diff=lfs merge=lfs -text
+ images/FineTextTector.mp4 filter=lfs diff=lfs merge=lfs -text
LICENCE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2025 Eslam Tarek
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
README.md CHANGED
@@ -1,19 +1,100 @@
- ---
- title: FineTextTector
- emoji: 🚀
- colorFrom: red
- colorTo: red
- sdk: docker
- app_port: 8501
- tags:
- - streamlit
- pinned: false
- short_description: A fine-tuned classifier for text
- ---
-
- # Welcome to Streamlit!
-
- Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
-
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
- forums](https://discuss.streamlit.io).
+ # AI vs Human Text Fine-Tuned Classifier
+
+ ## About the Project
+
+ This project develops a robust machine learning model that distinguishes human-written from AI-generated text. With the rapid advancement of large language models (LLMs) such as ChatGPT and Gemini, identifying the origin of a text has become crucial in domains including academic integrity, content moderation, misinformation detection, and authorship verification. The project leverages state-of-the-art natural language processing (NLP) techniques and transfer learning to build a binary classifier that predicts whether a given text was authored by a human or generated by an AI.
+
+ The workflow encompasses exploratory data analysis (EDA), text preprocessing, model selection and fine-tuning, and thorough evaluation. The final model is designed to be easily deployable and accessible for real-world applications.
+
+ ## About the Dataset
+
+ The dataset is sourced from Kaggle: [AI vs Human Text Dataset](https://www.kaggle.com/datasets/shanegerami/ai-vs-human-text). It contains a large collection of text samples, each labeled as human-written or AI-generated. The dataset covers a diverse range of topics and writing styles, making it well suited for training and evaluating models that must generalize across different kinds of content.
+
+ - **Features:**
+   - `text`: The actual text sample.
+   - `generated`: Label indicating the source (0 for human, 1 for AI).
+
+ The dataset is split into training, validation, and test sets to ensure unbiased evaluation and robust model performance (a split sketch follows).
+
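The commit doesn't include the splitting code itself; a minimal sketch of a stratified train/validation/test split with scikit-learn, assuming the Kaggle CSV is saved as `AI_Human.csv` (the filename and the 80/10/10 ratios are assumptions, not taken from the notebook):

```python
# Hedged sketch: stratified 80/10/10 split; filename and ratios are assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("AI_Human.csv")  # expects columns: text, generated

train_df, temp_df = train_test_split(
    df, test_size=0.2, stratify=df["generated"], random_state=42
)
val_df, test_df = train_test_split(
    temp_df, test_size=0.5, stratify=temp_df["generated"], random_state=42
)
```

Stratifying on `generated` keeps the human/AI class ratio identical across all three splits.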
+ ## Notebook Summary
+
+ The main notebook, `ai_vs_human_text_fine_tuned_classifier.ipynb`, guides users through the entire process of building the classifier:
+
+ 1. **Problem Definition:** Outlines the motivation and objectives.
+ 2. **Exploratory Data Analysis (EDA):** Visualizes class distributions, text lengths, lexical richness, punctuation usage, and stopword ratios to uncover patterns that separate human and AI texts.
+ 3. **Text Preprocessing:** Applies normalization, stopword removal, and noise filtering (URLs, emails, hashtags, mentions, numbers), and drops outlier texts based on length.
+ 4. **Model Selection:** Uses transfer learning with the `distilbert/distilroberta-base` model, enhanced with LoRA (Low-Rank Adaptation) for efficient fine-tuning.
+ 5. **Training:** Fine-tunes the model on a subset of the data, using stratified splits and carefully chosen training arguments.
+ 6. **Evaluation:** Assesses the model with accuracy, precision, recall, and F1-score on a held-out test set.
+ 7. **Deployment:** Pushes the trained model and tokenizer to the Hugging Face Hub for sharing and reuse (a sketch of this step follows the list).
+
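The deployment step uses the standard `push_to_hub` API from Transformers. A minimal sketch, where the checkpoint path and repo id are placeholders rather than the project's actual values:

```python
# Hedged sketch of step 7; the checkpoint path and repo id are placeholders.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("./results/best_checkpoint")
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilroberta-base")

# Requires prior authentication, e.g. `huggingface-cli login` or an HF_TOKEN env var.
model.push_to_hub("your-username/FineTextTector")
tokenizer.push_to_hub("your-username/FineTextTector")
```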
+ ## Model Results
+
+ ### Preprocessing
+
+ - **Lowercasing and Stripping:** All text is converted to lowercase and stripped of extra whitespace.
+ - **Punctuation and Stopword Removal:** Punctuation is removed, and stopwords are filtered out to focus on meaningful content.
+ - **Noise Filtering:** Regular expressions remove URLs, emails, hashtags, mentions, and numbers.
+ - **Outlier Filtering:** Texts that are extremely short or long (judged by length quantiles) are removed to keep input lengths consistent (see the sketch after this list).
+ - **Deduplication:** Duplicate texts are dropped to prevent data leakage across splits.
+
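A minimal sketch of the quantile-based length filter; the 1st/99th percentile cutoffs are assumptions, since the notebook's actual quantiles aren't shown in this commit:

```python
# Hedged sketch: drop texts whose word count falls outside assumed quantiles.
import pandas as pd

def filter_length_outliers(df: pd.DataFrame, lo: float = 0.01, hi: float = 0.99) -> pd.DataFrame:
    lengths = df["text"].str.split().str.len()
    lower, upper = lengths.quantile(lo), lengths.quantile(hi)
    return df[lengths.between(lower, upper)]
```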
+ ### Training
+
+ - **Model Architecture:** The project uses `distilbert/distilroberta-base`, a distilled version of RoBERTa known for its efficiency and strong performance on text classification tasks.
+ - **LoRA Fine-Tuning:** LoRA (Low-Rank Adaptation) reduces the number of trainable parameters, making fine-tuning more memory- and compute-efficient without sacrificing accuracy (see the sketch after this list).
+ - **Training Arguments:** The model is trained for 2 epochs with early stopping, regular evaluation, and checkpointing. Batch sizes and learning rates are chosen for stability and speed.
+
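A minimal sketch of this setup with PEFT and the Trainer API. The LoRA rank, alpha, batch size, and learning rate are illustrative values, not the notebook's actual ones, and `train_dataset`/`val_dataset` are assumed to be pre-tokenized datasets:

```python
# Hedged sketch: LoRA fine-tuning of distilroberta-base for binary classification.
# r, lora_alpha, batch size, and learning rate are assumed values; train_dataset
# and val_dataset are assumed pre-tokenized Hugging Face datasets with
# "input_ids", "attention_mask", and "labels" columns.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

model_name = "distilbert/distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,         # keeps the classification head trainable
    r=8, lora_alpha=16, lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections in RoBERTa-style models
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()      # typically ~1% of the full parameter count

args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=2,
    per_device_train_batch_size=16,
    learning_rate=2e-4,
    evaluation_strategy="epoch",        # `eval_strategy` in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,        # required by EarlyStoppingCallback
    metric_for_best_model="loss",
    greater_is_better=False,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,                # enables dynamic padding via the default collator
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)
trainer.train()
```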
+ ### Evaluation
+
+ - **Metrics:** The model is evaluated with accuracy, precision, recall, and F1-score, giving a balanced view of performance on both classes (one common way to compute them is sketched after this list).
+ - **Results:** The fine-tuned model demonstrates strong performance, with high accuracy and balanced precision/recall, indicating its effectiveness in real-world scenarios.
+
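The notebook's exact metric function isn't shown in this commit; one common Trainer-compatible way to compute these four metrics with scikit-learn is:

```python
# Hedged sketch of a Trainer-compatible metrics function.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary"
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Passed to the Trainer via `compute_metrics=compute_metrics`.
```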
+ ## How to Install
+
+ Follow these steps to set up the environment using Python's built-in `venv`:
+
+ ```bash
+ # Clone the repository
+ git clone https://github.com/DeepActionPotential/FineTextTector
+ cd FineTextTector
+
+ # Create a virtual environment
+ python -m venv venv
+
+ # Activate the virtual environment
+ # On Windows:
+ venv\Scripts\activate
+ # On macOS/Linux:
+ source venv/bin/activate
+
+ # Install required packages
+ pip install -r requirements.txt
+ ```
+
+ ## How to Use the Software
+
+ - [Demo video](images/FineTextTector.mp4)
+ - ![Demo image 1](images/1.png)
+ - ![Demo image 2](images/2.png)
+
+ ## Technologies Used
+
+ - **Transformers (Hugging Face):** Core library for model loading, tokenization, and training; enables transfer learning with state-of-the-art NLP models.
+ - **Datasets (Hugging Face):** Efficient data handling, splitting, and preprocessing.
+ - **PEFT (Parameter-Efficient Fine-Tuning):** Implements LoRA for memory- and compute-efficient model adaptation.
+ - **Optuna:** Automated hyperparameter optimization (a sketch of a typical search follows the list).
+ - **Scikit-learn:** Data splitting, metrics calculation, and utility functions.
+ - **Seaborn & Matplotlib:** Data visualization for EDA and result interpretation.
+ - **NLTK:** Stopword lists and basic NLP utilities.
+ - **Python venv:** Isolated environment management for reproducible installations.
+
+ These technologies collectively enable efficient, scalable, and reproducible development of advanced NLP models.
+
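How Optuna typically plugs into this kind of workflow, as a sketch; the search space and the `train_and_evaluate` helper are hypothetical, not taken from the notebook:

```python
# Hedged sketch of an Optuna search over two common hyperparameters.
# `train_and_evaluate` is a hypothetical helper that trains the model with the
# given settings and returns validation F1.
import optuna

def objective(trial: optuna.Trial) -> float:
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-4, log=True)
    lora_r = trial.suggest_categorical("lora_r", [4, 8, 16])
    return train_and_evaluate(learning_rate=learning_rate, lora_r=lora_r)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)
```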
+ ## License
+
+ This project is licensed under the MIT License. See the [LICENCE](LICENCE) file for details.
+
+ ---
__pycache__/utils.cpython-311.pyc ADDED
Binary file (2.35 kB).
 
ai_vs_human_text_fine_tuned_classifier.ipynb ADDED
The diff for this file is too large to render.
 
app.py ADDED
@@ -0,0 +1,54 @@
+ import streamlit as st
+ from utils import load_model, preprocess_text
+ import nltk
+
+ nltk.download('stopwords')
+ model = load_model('./models/best_model.joblib')
+
+ min_words_number = 100
+
+ def check_generated_text(text):
+     filtered_text = preprocess_text(text)
+     prediction = model.predict([filtered_text])
+     # The model outputs 1 for AI-generated text; invert so True means human-written.
+     return not int(prediction[0])
+
+ # Load styles
+ with open("styles.css") as f:
+     st.markdown(f"<style>{f.read()}</style>", unsafe_allow_html=True)
+
+ # Title
+ st.title("Generated Text Checker")
+
+ # Initialize session state
+ if "check_clicked" not in st.session_state:
+     st.session_state.check_clicked = False
+
+ # Use a form to isolate the check action
+ with st.form("text_check_form"):
+     user_input = st.text_area(
+         "Enter text to check",
+         height=400,
+         placeholder=f"Paste your generated text here... it should be at least {min_words_number} words"
+     )
+     submitted = st.form_submit_button("Check text")
+
+ # Handle form submission
+ if submitted:
+     st.session_state.check_clicked = True
+
+ # Only run the check when the button was clicked
+ if st.session_state.check_clicked:
+     with st.spinner("Checking text..."):
+         current_length = len(user_input.split())
+
+         if current_length >= min_words_number:
+             result = check_generated_text(user_input)
+             if result:
+                 st.info("✅ The text appears to be human-written!")
+             else:
+                 st.info("🤖 The text appears to be AI-generated.")
+         else:
+             st.warning(f"Please enter at least {min_words_number} words.")
+
+     # Reset check state
+     st.session_state.check_clicked = False
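`app.py` calls `model.predict` on a raw string, which implies `models/best_model.joblib` holds a full scikit-learn pipeline (vectorizer plus classifier) rather than a bare estimator. The commit doesn't show how the artifact was built; a minimal sketch of a compatible one, with TfidfVectorizer and LogisticRegression as assumed components:

```python
# Hedged sketch of a pipeline compatible with app.py's load_model/predict calls.
# TfidfVectorizer + LogisticRegression are assumptions; the actual contents of
# models/best_model.joblib are not shown in this commit. train_texts and
# train_labels are hypothetical lists of preprocessed strings and 0/1 labels.
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=50_000)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(train_texts, train_labels)
joblib.dump(pipeline, "models/best_model.joblib")
```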
images/1.png ADDED

Git LFS Details

  • SHA256: 58d27f2956d5eabc5e8e5e8f412028b7fb00c592f19b70e0145c0fdc4458e5f3
  • Pointer size: 131 Bytes
  • Size of remote file: 107 kB
images/2.png ADDED

Git LFS Details

  • SHA256: 65b94cfcbe227db816ad5a96a927734c36692d67a1ee1d34fa929386d7d4ba1c
  • Pointer size: 131 Bytes
  • Size of remote file: 705 kB
images/FineTextTector.mp4 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ef4863509b18e08f527035c08b345cd452b378a42518333b3d1566f38a15d6f8
+ size 4457655
models/best_model.joblib ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:946bba1b68f35d1e95e5245105fc4d1a991f8293df5c3407f7cfa723d359d400
+ size 436909018
requirements.txt CHANGED
@@ -1,3 +1,7 @@
- altair
- pandas
- streamlit
+ streamlit>=1.29.1
+ nltk==3.8.1
+ scikit-learn==1.3.2
+ joblib==1.3.2
+ pandas==2.1.4
+ numpy==1.26.2
+ python-dotenv==1.0.0
run.py ADDED
@@ -0,0 +1,2 @@
+ import subprocess
+ subprocess.run(["streamlit", "run", "app.py"])
styles.css ADDED
@@ -0,0 +1,67 @@
+ /* Hide Streamlit's default menu, header, and footer */
+ #MainMenu, header, footer {
+     visibility: hidden;
+ }
+
+ /* Dark app background */
+ .stApp {
+     background-color: #343541 !important;
+     color: #ECECEF !important;
+ }
+
+ .stTextArea>div>div>textarea {
+     background-color: #40414F !important;
+     color: #ECECEF !important;
+     border-radius: 8px !important;
+     padding: 16px !important;
+     border: 1px solid #565869 !important;
+     font-size: 16px !important;
+     min-height: 300px !important;
+ }
+
+ .stTextArea>label {
+     color: #ECECEF !important;
+     font-size: 18px !important;
+ }
+
+ .stButton>button {
+     background-color: #19C37D !important;
+     color: white !important;
+     border: none !important;
+     border-radius: 8px !important;
+     padding: 12px 24px !important;
+     font-size: 16px !important;
+     font-weight: 500 !important;
+     transition: background-color 0.3s ease !important;
+ }
+
+ .stButton>button:hover {
+     background-color: #15A46C !important;
+     color: white !important;
+ }
+
+ .stAlert {
+     border-radius: 8px !important;
+     padding: 16px !important;
+ }
+
+ .stAlert [data-testid="stMarkdownContainer"] {
+     color: #ECECEF !important;
+ }
+
+ /* Note: st-emotion-cache-* class names are auto-generated and may change
+    between Streamlit versions. */
+ .stAlert.st-emotion-cache-1hyeoxa {
+     background-color: rgba(25, 195, 125, 0.1) !important;
+     border: 1px solid #19C37D !important;
+ }
+
+ .stAlert.st-emotion-cache-1d3z3hw {
+     background-color: rgba(239, 65, 70, 0.1) !important;
+     border: 1px solid #EF4146 !important;
+ }
+
+ .stTitle {
+     color: #ECECEF !important;
+     text-align: center !important;
+     margin-bottom: 32px !important;
+ }
utils.py ADDED
@@ -0,0 +1,50 @@
+ import re
+ import string
+
+ import joblib
+ from nltk.corpus import stopwords
+
+
+ def load_model(model_path):
+     """
+     Load a joblib model.
+
+     Args:
+     - model_path (str): path to the model
+
+     Returns:
+     - model: loaded model
+     """
+     model = joblib.load(model_path)
+     return model
+
+
+ # Set of English stopwords
+ stop_words = set(stopwords.words('english'))
+
+
+ def preprocess_text(text: str) -> str:
+     # Step 1: Lowercase
+     text = text.lower()
+
+     # Step 2: Remove noise (URLs, emails, hashtags, mentions, numbers).
+     # This must run before punctuation removal, since the patterns rely on
+     # characters like ':', '@', '#', and '.' still being present.
+     text = re.sub(r'http\S+|www\.\S+', '', text)   # URLs
+     text = re.sub(r'\S+@\S+\.\S+', '', text)       # Emails
+     text = re.sub(r'#[A-Za-z0-9_]+', '', text)     # Hashtags
+     text = re.sub(r'@[A-Za-z0-9_]+', '', text)     # Mentions
+     text = re.sub(r'\d+', '', text)                # Numbers
+
+     # Step 3: Remove punctuation
+     text = text.translate(str.maketrans('', '', string.punctuation))
+
+     # Step 4: Remove stopwords
+     text = ' '.join(word for word in text.split() if word not in stop_words)
+
+     # Step 5: Drop non-printable characters and collapse extra whitespace
+     text = ''.join(ch for ch in text if ch.isprintable())
+     text = re.sub(r'\s+', ' ', text).strip()
+
+     return text
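A quick, illustrative usage check of the cleaning pipeline (the input string is made up, and the exact output depends on NLTK's stopword list):

```python
# Illustrative usage of preprocess_text; the example string is made up.
from utils import preprocess_text

raw = "Check https://example.com NOW!!! Contact me@mail.com #AI @bot 123 times."
print(preprocess_text(raw))
# -> "check contact times" (URL, email, hashtag, mention, digits, punctuation,
#    and stopwords such as "now" are all stripped)
```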