Space: rosa0003/smartpdf_Highlighter

rosa0003 committed on Oct 29, 2024

Commit 5d4f783 (verified) · 1 parent: 653035a

Upload 10 files

Files changed (11)

.gitattributes +3 -0
README.md +68 -13
app.py +73 -0
photos/after.jpg +3 -0
photos/app.png +0 -0
photos/before.jpg +3 -0
photos/example.jpg +3 -0
photos/icon.png +0 -0
requirements.txt +7 -0
src/__init__.py +5 -0
src/functions.py +271 -0

.gitattributes CHANGED

@@ -36,3 +36,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 smart-pdf-highlighter-main/photos/after.jpg filter=lfs diff=lfs merge=lfs -text
 smart-pdf-highlighter-main/photos/before.jpg filter=lfs diff=lfs merge=lfs -text
 smart-pdf-highlighter-main/photos/example.jpg filter=lfs diff=lfs merge=lfs -text

+photos/after.jpg filter=lfs diff=lfs merge=lfs -text
+photos/before.jpg filter=lfs diff=lfs merge=lfs -text
+photos/example.jpg filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -1,13 +1,68 @@
----
-title: Smartpdf Highlighter
-emoji: 🏆
-colorFrom: gray
-colorTo: indigo
-sdk: streamlit
-sdk_version: 1.39.0
-app_file: app.py
-pinned: false
-short_description: Evidenzia un pdf
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# Smart PDF Highlighter
+Welcome to Smart PDF Highlighter! This tool automatically finds and highlights the important parts of your PDFs. It uses deep learning sentence embeddings, clustering, and PageRank to pick out the most important sentences.
+## Overview
+![ScreenShot](./photos/app.png)
+The Smart PDF Highlighter functions with the following workflow:
+1. **User Interface**: Users interact with the Streamlit-based graphical user interface (GUI) to upload their PDF files.
+2. **PDF Processing**: Upon file upload, the tool processes the PDF content to identify important sentences.
+3. **Highlighting**: Important sentences are highlighted within the PDF, emphasizing key content.
+4. **Download**: Users can download the highlighted PDF for further reference.
+## Installation
+To use the Smart PDF Highlighter, follow these simple steps:
+1. **Clone the Repository:** Clone the repository to your local machine.
+    ```bash
+    git clone https://github.com/FzS92/smart-pdf-highlighter.git
+    cd smart-pdf-highlighter
+    ```
+2. **Create Virtual Environment:** Set up a Python 3.8 virtual environment and activate it.
+    ```bash
+    conda create -n smart-pdf-env python=3.8
+    conda activate smart-pdf-env
+    ```
+3. **Install Requirements:** Install the required dependencies.
+    ```bash
+    pip install -r requirements.txt
+    ```
+## Usage
+Follow these steps to run the Smart PDF Highlighter:
+1. **Run the Application:** Execute the `app.py` script to start the Streamlit application.
+    ```bash
+    streamlit run app.py
+    ```
+2. **Upload PDF:** Use the provided interface to upload your PDF file.
+3. **Highlighting:** Once the file is uploaded, the tool will automatically process it and generate a highlighted version.
+4. **Download:** Download the highlighted PDF using the provided download button.
+## Online Version
+Additionally, an online version of Smart PDF Highlighter is available with the following modifications:
+1. **LangChain Encoding**: Sentences are encoded through LangChain (powered by OpenAI) with the "text-embedding-3-small" model. This feature is currently free for users.
+2. **Backend Technology Change**: Instead of PyTorch, the online version uses NumPy for efficiency and runs on CPU on AWS.
+You can access the online version here: https://FzS92.github.io
+(You may get a warning from your browser because the site does not have its own domain or SSL certificate.)
+## Example
+Before Highlighting             |  After Highlighting
+:-------------------------:|:-------------------------:
+![Before Highlighting](./photos/before.jpg)  |  ![After Highlighting](./photos/after.jpg)
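
Review note on the README workflow: the "PDF Processing" step is described only at a high level. As a rough, dependency-free sketch of its first stage, the function below mirrors the sentence filter that `src/functions.py` in this commit applies before any embedding is computed. The function name and the `max_digit_ratio` parameter are illustrative assumptions; the defaults (10 words, 40% digits) are taken from the committed code.

```python
from typing import List


def split_into_sentences(
    text: str, min_words: int = 10, max_digit_ratio: float = 0.4
) -> List[str]:
    """Split on '.' and keep only chunks that look like real sentences.

    max_digit_ratio is a hypothetical parameter name; the committed code
    hard-codes the 0.4 cutoff.
    """
    sentences = []
    for chunk in text.split("."):
        chunk = chunk.strip()
        if not chunk:
            continue  # empty piece, e.g. after the final period
        if len(chunk.split()) < min_words:
            continue  # too short to be a meaningful sentence
        digit_ratio = sum(c.isdigit() for c in chunk) / len(chunk)
        if digit_ratio >= max_digit_ratio:
            continue  # mostly numbers, e.g. a table row
        sentences.append(chunk)
    return sentences


sample = (
    "Short one. "
    "This sentence has clearly more than ten words in it so it stays. "
    "1 2 3 4 5 6 7 8 9 10 11 12."
)
print(split_into_sentences(sample))
```

Because the split is on every `.`, decimals and abbreviations are also cut apart, so this filter is a heuristic rather than a full sentence tokenizer.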

app.py ADDED

	@@ -0,0 +1,73 @@

+"""
+Smart PDF Highlighter
+This script provides a Streamlit web application for automatically identifying and
+highlighting important content within PDF files. It utilizes AI techniques such as
+deep learning, clustering, and advanced algorithms such as PageRank to analyze text
+and intelligently select key sentences for highlighting.
+Author: Farzad Salajegheh
+Date: 2024
+"""
+import logging
+import time
+import streamlit as st
+from src import generate_highlighted_pdf
+logging.basicConfig(level=logging.INFO)
+logger = logging.getLogger(__name__)
+def main():
+    """Main function to run the PDF Highlighter tool."""
+    st.set_page_config(page_title="Smart PDF Highlighter", page_icon="./photos/icon.png")
+    st.title("Smart PDF Highlighter")
+    show_description()
+    uploaded_file = st.file_uploader("Upload a PDF file", type=["pdf"])
+    if uploaded_file is not None:
+        st.write("PDF file successfully uploaded.")
+        process_pdf(uploaded_file)
+def show_description():
+    """Display description of functionality and maximum limits."""
+    st.write("""Welcome to Smart PDF Highlighter! This tool automatically identifies
+        and highlights important content within your PDF files. It utilizes many
+        AI techniques such as deep learning and other advanced algorithms to
+        analyze the text and intelligently select key sentences for highlighting.""")
+    st.write("Maximum Limits: 40 pages, 2000 sentences.")
+def process_pdf(uploaded_file):
+    """Process the uploaded PDF file and generate highlighted PDF."""
+    st.write("Generating highlighted PDF...")
+    start_time = time.time()
+    with st.spinner("Processing..."):
+        result = generate_highlighted_pdf(uploaded_file)
+        if isinstance(result, str):
+            st.error(result)
+            logger.error("Error generating highlighted PDF: %s", result)
+            return
+        else:
+            file = result
+    end_time = time.time()
+    execution_time = end_time - start_time
+    st.success(
+        f"Highlighted PDF generated successfully in {execution_time:.2f} seconds."
+    )
+    st.write("Download the highlighted PDF:")
+    st.download_button(
+        label="Download",
+        data=file,
+        file_name="highlighted_pdf.pdf",
+    )
+if __name__ == "__main__":
+    main()
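
Review note on `process_pdf`: the function relies on `generate_highlighted_pdf` returning PDF bytes on success and a plain string on failure, and branches with `isinstance(result, str)`. The sketch below isolates that convention so it can be exercised without Streamlit or a real PDF. `classify_result` and `fake_generate` are hypothetical helpers, not part of this commit; only the page limit and error message text come from the committed code.

```python
from typing import Tuple, Union

PdfResult = Union[bytes, str]  # bytes on success, str error message otherwise


def classify_result(result: PdfResult) -> Tuple[str, PdfResult]:
    """Label a result the same way process_pdf branches on it."""
    if isinstance(result, str):
        return "error", result
    return "ok", result


def fake_generate(num_pages: int, max_page: int = 40) -> PdfResult:
    """Hypothetical stand-in for generate_highlighted_pdf (does no PDF work)."""
    if num_pages > max_page:
        return f"The PDF file exceeds the maximum limit of {max_page} pages."
    return b"%PDF-1.7 placeholder"


print(classify_result(fake_generate(3))[0])
print(classify_result(fake_generate(41))[0])
```

One consequence of this convention is that the `-> bytes` annotation on `generate_highlighted_pdf` understates what callers must handle; a `Union[bytes, str]` annotation or a raised exception would make the contract explicit.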

photos/after.jpg ADDED

Git LFS Details

SHA256: d18a404ec1b1ac7fb6f27081935d2450f5eea4ef0cde835740c690d39d7e8ee8
Pointer size: 132 Bytes
Size of remote file: 1.38 MB

photos/app.png ADDED

photos/before.jpg ADDED

Git LFS Details

SHA256: ecdad4bb068cc61fde509e147dd10423d1e9b656bea1eb520f95a9ea329ec05a
Pointer size: 132 Bytes
Size of remote file: 1.43 MB

photos/example.jpg ADDED

Git LFS Details

SHA256: 84efe7351e23b565a2ba32511dec6d8c257df92a36619f2ebdae5c2cc32a3934
Pointer size: 132 Bytes
Size of remote file: 1.41 MB

photos/icon.png ADDED

requirements.txt ADDED

	@@ -0,0 +1,7 @@

+PyMuPDF==1.23.25
+networkx==3.1
+numpy==1.24.4
+scikit_learn==1.3.2
+sentence_transformers==2.3.1
+streamlit==1.31.1
+torch==2.2.0

src/__init__.py ADDED

	@@ -0,0 +1,5 @@

+"""
+Import necessary modules.
+"""
+from .functions import generate_highlighted_pdf

src/functions.py ADDED

	@@ -0,0 +1,271 @@

+"""
+This module provides functions for generating a highlighted PDF with important sentences.
+The main function, `generate_highlighted_pdf`, takes an input PDF file and a pre-trained
+sentence embedding model as input.
+It splits the text of the PDF into sentences, computes sentence embeddings, builds a
+graph based on the cosine similarity between embeddings, and separately groups the
+sentences into clusters with K-means.
+The sentences are then ranked by PageRank score; the top-ranked sentences (by a ratio
+threshold) and the middle sentence of each cluster are selected as important.
+Finally, the selected sentences are highlighted in the PDF and the highlighted PDF content
+is returned.
+Other utility functions in this module include functions for loading a sentence embedding
+model, encoding sentences, computing similarity matrices, building graphs, ranking sentences,
+clustering sentence embeddings, and splitting text into sentences.
+Note: This module requires the PyMuPDF, networkx, numpy, torch, sentence_transformers, and
+sklearn libraries to be installed.
+"""
+import logging
+from typing import BinaryIO, List, Tuple
+import fitz  # PyMuPDF
+import networkx as nx
+import numpy as np
+import torch
+import torch.nn.functional as F
+from sentence_transformers import SentenceTransformer
+from sklearn.cluster import KMeans
+# Constants
+MAX_PAGE = 40
+MAX_SENTENCES = 2000
+PAGERANK_THRESHOLD_RATIO = 0.15
+NUM_CLUSTERS_RATIO = 0.05
+MIN_WORDS = 10
+# Logger configuration
+logging.basicConfig(level=logging.ERROR)
+logger = logging.getLogger(__name__)
+def load_sentence_model(revision: str = None) -> SentenceTransformer:
+    """
+    Load a pre-trained sentence embedding model.
+    Args:
+        revision (str): Optional parameter to specify the model revision.
+    Returns:
+        SentenceTransformer: A pre-trained sentence embedding model.
+    """
+    return SentenceTransformer("avsolatorio/GIST-Embedding-v0", revision=revision)
+def encode_sentence(model: SentenceTransformer, sentence: str) -> torch.Tensor:
+    """
+    Encode a sentence into a fixed-dimensional vector representation.
+    Args:
+        model (SentenceTransformer): A pre-trained sentence embedding model.
+        sentence (str): Input sentence.
+    Returns:
+        torch.Tensor: Encoded sentence vector.
+    """
+    model.eval()  # Set the model to evaluation mode
+    # Check if GPU is available
+    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+    with torch.no_grad():  # Disable gradient tracking
+        return model.encode(sentence, convert_to_tensor=True).to(device)
+def compute_similarity_matrix(embeddings: torch.Tensor) -> np.ndarray:
+    """
+    Compute the cosine similarity matrix between sentence embeddings.
+    Args:
+        embeddings (torch.Tensor): Sentence embeddings.
+    Returns:
+        np.ndarray: Cosine similarity matrix.
+    """
+    scores = F.cosine_similarity(
+        embeddings.unsqueeze(1), embeddings.unsqueeze(0), dim=-1
+    )
+    similarity_matrix = scores.cpu().numpy()
+    normalized_adjacency_matrix = similarity_matrix / similarity_matrix.sum(
+        axis=1, keepdims=True
+    )
+    return normalized_adjacency_matrix
+def build_graph(normalized_adjacency_matrix: np.ndarray) -> nx.DiGraph:
+    """
+    Build a directed graph from a normalized adjacency matrix.
+    Args:
+        normalized_adjacency_matrix (np.ndarray): Normalized adjacency matrix.
+    Returns:
+        nx.DiGraph: Directed graph.
+    """
+    return nx.DiGraph(normalized_adjacency_matrix)
+def rank_sentences(graph: nx.DiGraph, sentences: List[str]) -> List[Tuple[str, float]]:
+    """
+    Rank sentences based on PageRank scores.
+    Args:
+        graph (nx.DiGraph): Directed graph.
+        sentences (List[str]): List of sentences.
+    Returns:
+        List[Tuple[str, float]]: Ranked sentences with their PageRank scores.
+    """
+    pagerank_scores = nx.pagerank(graph)
+    ranked_sentences = sorted(
+        zip(sentences, pagerank_scores.values()),
+        key=lambda x: x[1],
+        reverse=True,
+    )
+    return ranked_sentences
+def cluster_sentences(
+    embeddings: torch.Tensor, num_clusters: int
+) -> Tuple[np.ndarray, np.ndarray]:
+    """
+    Cluster sentence embeddings using K-means clustering.
+    Args:
+        embeddings (torch.Tensor): Sentence embeddings.
+        num_clusters (int): Number of clusters.
+    Returns:
+        Tuple[np.ndarray, np.ndarray]: Cluster assignments and cluster centers.
+    """
+    kmeans = KMeans(n_clusters=num_clusters, random_state=42)
+    cluster_assignments = kmeans.fit_predict(embeddings.cpu())
+    cluster_centers = kmeans.cluster_centers_
+    return cluster_assignments, cluster_centers
+def get_middle_sentence(cluster_indices: np.ndarray, sentences: List[str]) -> List[str]:
+    """
+    Get the middle sentence from each cluster.
+    Args:
+        cluster_indices (np.ndarray): Cluster assignments.
+        sentences (List[str]): List of sentences.
+    Returns:
+        List[str]: Middle sentences from each cluster.
+    """
+    middle_indices = [
+        int(np.median(np.where(cluster_indices == i)[0]))
+        for i in range(max(cluster_indices) + 1)
+    ]
+    middle_sentences = [sentences[i] for i in middle_indices]
+    return middle_sentences
+def split_text_into_sentences(text: str, min_words: int = MIN_WORDS) -> List[str]:
+    """
+    Split text into sentences.
+    Args:
+        text (str): Input text.
+        min_words (int): Minimum number of words for a valid sentence.
+    Returns:
+        List[str]: List of sentences.
+    """
+    sentences = []
+    for s in text.split("."):
+        s = s.strip()
+        # Filter out short sentences and sentences whose characters are 40% or more digits
+        if (
+            s
+            and len(s.split()) >= min_words
+            and (sum(c.isdigit() for c in s) / len(s)) < 0.4
+        ):
+            sentences.append(s)
+    return sentences
+def extract_text_from_pages(doc):
+    """Generator to yield text per page from the PDF, for memory efficiency for large PDFs."""
+    for page_num in range(len(doc)):
+        yield doc[page_num].get_text()
+def generate_highlighted_pdf(
+    input_pdf_file: BinaryIO, model=load_sentence_model()
+) -> bytes:
+    """
+    Generate a highlighted PDF with important sentences.
+    Args:
+        input_pdf_file: Input PDF file object.
+        model (SentenceTransformer): Pre-trained sentence embedding model.
+    Returns:
+        bytes: Highlighted PDF content, or a str error message if a page or sentence limit is exceeded.
+    """
+    with fitz.open(stream=input_pdf_file.read(), filetype="pdf") as doc:
+        num_pages = doc.page_count
+        if num_pages > MAX_PAGE:
+            # It will show the error message for the user.
+            return f"The PDF file exceeds the maximum limit of {MAX_PAGE} pages."
+        sentences = []
+        for page_text in extract_text_from_pages(doc):  # Memory efficient
+            sentences.extend(split_text_into_sentences(page_text))
+        len_sentences = len(sentences)
+        logger.info("Extracted %d sentences", len_sentences)
+        if len_sentences > MAX_SENTENCES:
+            # It will show the error message for the user.
+            return (
+                f"The PDF file exceeds the maximum limit of {MAX_SENTENCES} sentences."
+            )
+        embeddings = encode_sentence(model, sentences)
+        similarity_matrix = compute_similarity_matrix(embeddings)
+        graph = build_graph(similarity_matrix)
+        ranked_sentences = rank_sentences(graph, sentences)
+        pagerank_threshold = int(len(ranked_sentences) * PAGERANK_THRESHOLD_RATIO) + 1
+        top_pagerank_sentences = [
+            sentence[0] for sentence in ranked_sentences[:pagerank_threshold]
+        ]
+        num_clusters = int(len_sentences * NUM_CLUSTERS_RATIO) + 1
+        cluster_assignments, _ = cluster_sentences(embeddings, num_clusters)
+        center_sentences = get_middle_sentence(cluster_assignments, sentences)
+        important_sentences = list(set(top_pagerank_sentences + center_sentences))
+        for i in range(num_pages):
+            try:
+                page = doc[i]
+                for sentence in important_sentences:
+                    rects = page.search_for(sentence)
+                    colors = (fitz.pdfcolor["yellow"], fitz.pdfcolor["green"])
+                    for j, rect in enumerate(rects):  # j avoids shadowing the page index i
+                        color = colors[j % 2]
+                        annot = page.add_highlight_annot(rect)
+                        annot.set_colors(stroke=color)
+                        annot.update()
+            except Exception as e:
+                logger.error(f"Error processing page {i}: {e}")
+        output_pdf = doc.write()
+    return output_pdf
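
Review note on the ranking step: `rank_sentences` delegates to `nx.pagerank`, and `generate_highlighted_pdf` then keeps the top `int(n * PAGERANK_THRESHOLD_RATIO) + 1` sentences. As a minimal sketch of that idea, assuming a dense, non-negative similarity matrix with no all-zero rows, the code below runs a plain power iteration in pure Python. It is a simplified stand-in rather than a reimplementation of NetworkX (for example, it has no convergence tolerance or dangling-node handling), and the names `pagerank` and `top_sentences` are illustrative.

```python
from typing import List

PAGERANK_THRESHOLD_RATIO = 0.15  # same ratio as in src/functions.py


def pagerank(
    weights: List[List[float]], damping: float = 0.85, iterations: int = 100
) -> List[float]:
    """Power-iteration PageRank; weights[i][j] is the edge weight i -> j."""
    n = len(weights)
    # Row-normalize so each row is a probability distribution over targets.
    rows = [[w / sum(row) for w in row] for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1.0 - damping) / n
            + damping * sum(scores[i] * rows[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


def top_sentences(sentences: List[str], weights: List[List[float]]) -> List[str]:
    """Return the highest-scoring sentences, using the commit's ratio rule."""
    scores = pagerank(weights)
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = int(len(sentences) * PAGERANK_THRESHOLD_RATIO) + 1
    return [sentences[i] for i in order[:keep]]


# Toy similarity matrix: sentence "b" is similar to both "a" and "c".
similarity = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.8],
    [0.1, 0.8, 1.0],
]
print(top_sentences(["a", "b", "c"], similarity))
```

In the toy matrix the middle sentence is the one most similar to the others, so it receives the highest score and is the single sentence kept for a three-sentence input.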