gamingflexer committed on Jan 30, 2024

Commit

cc93c45

1 Parent(s): 8d58e30

Documentation Updated

Files changed (2)

README.md +51 -11
docs.md +20 -0

README.md CHANGED

@@ -10,33 +10,71 @@ pinned: true
 # Arxiv Plagiarism Checker LLM
-## end-end-mlops-huggingface
-[![Sync to Hugging Face hub](https://github.com/gamingflexer/arxiv-plagiarism-checker-llm/actions/workflows/main.yml/badge.svg)](https://github.com/gamingflexer/arxiv-plagiarism-checker-llm/actions/workflows/main.yml)
 Check an arXiv author's papers for plagiarism just by entering the author's name
-## Demo UI
-![Demo Image](images/demo_ui.png)
-## Docs
-- Miro RoadMap [Link](https://miro.com/app/board/uXjVN8HgXk8=/)
 ### Research Points
 - Notion [Link](https://gamingflexer.notion.site/Arxiv-983d173f46c1426caa9dab319f4ddb3d?pvs=4)
 ### Top Plagiarism Checkers API
 - **[ProWritingAid API V2](https://cloud.prowritingaid.com/analysis/swagger/ui/index) - Free Plan**
 - **[Unicheck](https://unicheck.com/plagiarism-checker-api) - Request Demo**
-- **Copyleaks** - Free Plan
-- EDEN AI - https://www.edenai.co/feature/plagiarism-detection
 ----
 ## Requirements
@@ -48,11 +86,13 @@ Arxiv author's plagiarism check just by entering the arxiv author
 ## Installation
 ```bash
 ```
 ## Usage
-```python
 ```

 # Arxiv Plagiarism Checker LLM
+## Demo Link
+[Demo Link](https://huggingface.co/spaces/asach/arxiv-plagiarism-checker-Ilm)
+[![Sync to Hugging Face hub](https://github.com/gamingflexer/arxiv-plagiarism-checker-llm/actions/workflows/main.yml/badge.svg)](https://github.com/gamingflexer/arxiv-plagiarism-checker-llm/actions/workflows/main.yml)
 Check an arXiv author's papers for plagiarism just by entering the author's name
+## Docs & Working
+INPUT - Author's Name
+OUTPUT - Plagiarism Check Results
+### Tech Stack
+- Gradio
+- ChromaDB
+- SERP API
+- OpenAI GPT Embeddings & LLM Models
+1. We collected the 2023 and 2024 data from the arXiv GCP cloud, then used text-embedding-3-large to generate embeddings for the documents. This amounts to about 10GB.
+2. Document text extraction is done at 2 levels, together with metadata:
+- Document Level
+- Paragraph Level
+- Metadata
+Metadata example:
+```json
+{
+  "id": "2106.09680",
+  "title": "Accuracy, Interpretability, and Differential Privacy via Explainable Boosting",
+  "summary": "We show that adding differential privacy to Explainable Boosting Machines\n(EBMs), a recent method for training interpretable ML models, yields\nstate-of-the-art accuracy while protecting privacy. Our experiments on multiple\nclassification and regression datasets show that DP-EBM models suffer\nsurprisingly little accuracy loss even with strong differential privacy\nguarantees. In addition to high accuracy, two other benefits of applying DP to\nEBMs are: a) trained models provide exact global and local interpretability,\nwhich is often important in settings where differential privacy is needed; and\nb) the models can be edited after training without loss of privacy to correct\nerrors which DP noise may have introduced.",
+  "source": "http://arxiv.org/pdf/2106.09680",
+  "authors": "Harsha Nori Rich Caruana Zhiqi Bu Judy Hanwen Shen Janardhan Kulkarni",
+  "references": ""
+}
+```
+3. Embeddings are generated for the documents and paragraphs using OpenAI models.
+4. The author is then searched via the Google SERP API, and the top 10 documents are compared individually with the document embeddings.
+5. For the retrieved documents and the top 3 similar papers on the topic from the Google SERP API:
+    - Metadata and text are extracted
+6. Once extracted, unique lines and paragraphs are identified and then compared using an LLM (GPT-4 Preview model, 128K).
+7. Unique lines are then compared with the document embeddings, and paragraphs with the paragraph embeddings.
+8. The top 3 similar texts and their respective documents are then returned to the user as plagiarised content.
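
Steps 7 and 8 above amount to a nearest-neighbour lookup over embedding vectors. The following is only a minimal sketch of that idea, assuming cosine similarity as the comparison metric (the README does not name one). The function names `cosine_similarity` and `top_k_similar`, and the toy 3-dimensional vectors, are hypothetical stand-ins; the real project stores OpenAI embeddings in ChromaDB.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, corpus, k=3):
    """Return the k (score, doc_id) pairs most similar to query_vec.

    corpus maps a document id to its embedding vector.
    """
    scored = (
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in corpus.items()
    )
    return heapq.nlargest(k, scored)


# Toy vectors standing in for real embeddings (values are made up).
corpus = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.9, 0.1, 0.0],
    "doc_c": [0.0, 1.0, 0.0],
    "doc_d": [0.0, 0.0, 1.0],
}
matches = top_k_similar([1.0, 0.0, 0.0], corpus, k=3)
for score, doc_id in matches:
    print(f"{doc_id}: {score:.3f}")
```

A vector store would normally perform this search with an approximate index rather than the linear scan shown here; the scan just makes the ranking logic explicit.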
 ### Research Points
+- Miro RoadMap [Link](https://miro.com/app/board/uXjVN8HgXk8=/)
 - Notion [Link](https://gamingflexer.notion.site/Arxiv-983d173f46c1426caa9dab319f4ddb3d?pvs=4)
 ### Top Plagiarism Checkers API
 - **[ProWritingAid API V2](https://cloud.prowritingaid.com/analysis/swagger/ui/index) - Free Plan**
 - **[Unicheck](https://unicheck.com/plagiarism-checker-api) - Request Demo**
+- **Copyleaks - Request Demo**
+- **[EDEN AI](https://www.edenai.co/feature/plagiarism-detection) - Free Plan**
 ----
 ## Requirements
 ## Installation
 ```bash
+pip install -r requirements.txt
 ```
 ## Usage
+We use a Gradio app to implement the plagiarism checker.
+```bash
+python app.py  # or: gradio app.py
 ```

docs.md CHANGED

@@ -1,3 +1,21 @@
 ### 1. Data Input:
 - **Input Data:** Collect a diverse dataset of academic papers, articles, or textual content from various sources.
@@ -78,3 +96,5 @@
 - **Large Language Models:**
   - Fine-tune or use pre-trained models for enhanced context understanding.
   - Hugging Face Transformers library for accessing pre-trained models.

+- 2024_main_document_lvl
+- 2024_main_paragraph_lvl
+- 2023_main_document_lvl
+- 2023_main_paragraph_lvl
+- Convert PDFs to embeddings
+  - Para
+  - Docs
+- HNSW - K-means fast search
+- K-means graphs based on the topics
+- Check for similarity within our own DB
+  - Para
+  - Docs
+- Get the most important ones
+- Get the unique sentences, like the title & other content ?? - the LLM will figure this out
+- Search Google using the unique searches --> get the top 3 and do the same check again --> result
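
The notes above distinguish document-level from paragraph-level entries. As a rough sketch only, the split could look like the following. It assumes that names such as `2024_main_document_lvl` are collection names, that paragraphs are separated by blank lines, and that paragraph ids are formed by appending a `-p<index>` suffix; none of these are confirmed by the notes. `build_records` and the paper id `0000.00000` are hypothetical.

```python
def build_records(paper_id, text, year):
    """Split one paper into a document-level record and paragraph-level records."""
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    doc_record = {
        "collection": f"{year}_main_document_lvl",
        "id": paper_id,
        "text": "\n\n".join(paragraphs),
    }
    para_records = [
        {
            "collection": f"{year}_main_paragraph_lvl",
            "id": f"{paper_id}-p{i}",  # hypothetical id scheme
            "text": para,
        }
        for i, para in enumerate(paragraphs)
    ]
    return doc_record, para_records


sample_text = "First paragraph.\n\nSecond paragraph.\n\n\n\nThird paragraph."
doc, paras = build_records("0000.00000", sample_text, 2024)
print(doc["collection"], len(paras))
```

Each record would then be embedded and stored separately, so that a query can be matched against whole papers or against individual paragraphs.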
 ### 1. Data Input:
 - **Input Data:** Collect a diverse dataset of academic papers, articles, or textual content from various sources.
 - **Large Language Models:**
   - Fine-tune or use pre-trained models for enhanced context understanding.
   - Hugging Face Transformers library for accessing pre-trained models.
+- Fingerprinting Concept
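
The notes do not say which fingerprinting scheme is meant. One common interpretation, sketched here purely as an assumption, is to hash overlapping word k-grams (shingles) and compare the resulting sets with Jaccard similarity. The names `fingerprint` and `jaccard`, the shingle size `k=5`, and the sample strings are all hypothetical.

```python
import hashlib
import re


def fingerprint(text, k=5):
    """Return a set of short hashes, one per k-word shingle of the text."""
    # Normalize: lowercase and keep only alphanumeric word tokens.
    words = re.findall(r"[a-z0-9]+", text.lower())
    shingles = (" ".join(words[i:i + k]) for i in range(len(words) - k + 1))
    return {hashlib.sha1(s.encode("utf-8")).hexdigest()[:16] for s in shingles}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


original = "We show that adding differential privacy to explainable boosting machines yields strong accuracy"
copied = "we show that adding differential privacy to explainable boosting machines gives good results"
unrelated = "Graph neural networks learn useful representations of nodes and edges"

overlap_copied = jaccard(fingerprint(original), fingerprint(copied))
overlap_unrelated = jaccard(fingerprint(original), fingerprint(unrelated))
print(overlap_copied, overlap_unrelated)
```

Unlike embeddings, this kind of fingerprint only catches near-verbatim reuse, so it would complement the embedding comparison rather than replace it.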