gamingflexer committed on
Commit
8b67ee1
1 Parent(s): ff20e95

Dataset Download instructions

  1. README.md +17 -0
README.md CHANGED
@@ -25,6 +25,23 @@ OUTPUT - Plagiarism Check Results
 
  You can get MIT authors List from here - [Link](https://dspace.mit.edu/handle/1721.1/7582/browse?rpp=100&sort_by=-1&type=author&offset=100&etal=-1&order=ASC)
 
+ ## Dataset & Embeddings
+
+ We used the arXiv dataset for the years 2023 and 2024, then generated embeddings for the documents with OpenAI Embeddings.
+
+ - Install gsutil - [Link](https://cloud.google.com/storage/docs/gsutil_install)
+
+ ```bash
+
+ # All PDFs from a single year (2019); -r is needed to copy whole folders
+ gsutil cp -r gs://arxiv-dataset/arxiv/arxiv/pdf/19*/ ./papers_from_2019/
+
+ # Single file
+ gsutil cp gs://arxiv-dataset/arxiv/arxiv/pdf/2310/2310.00001v1.pdf .
+
+
+ ```
+
  ### Tech Stack
 
  - Gradio
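
The added README section says the project uses papers from 2023 and 2024, while its example command targets 2019. A minimal sketch of how the same bucket layout could be scripted for those two years is shown here. The helper names (`arxiv_pdf_url`, `arxiv_year_prefixes`, `download_years`) and the `BUCKET_PDF_ROOT` variable are hypothetical and not part of the commit; the sketch assumes the `pdf/YYMM/` folder layout visible in the README's own commands and that `gsutil` is installed.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the commit) for building arXiv bucket paths.
# Assumes the pdf/YYMM/<id>.pdf layout shown in the README commands.

BUCKET_PDF_ROOT="gs://arxiv-dataset/arxiv/arxiv/pdf"

# Print the bucket URL for one paper ID such as 2310.00001v1.
# The folder name is the YYMM part of the ID (the text before the first dot).
arxiv_pdf_url() {
  local paper_id="$1"
  local yymm="${paper_id%%.*}"
  printf '%s/%s/%s.pdf\n' "$BUCKET_PDF_ROOT" "$yymm" "$paper_id"
}

# Print the twelve monthly folder URLs for a two-digit year such as 23.
arxiv_year_prefixes() {
  local yy="$1" month
  for month in 01 02 03 04 05 06 07 08 09 10 11 12; do
    printf '%s/%s%s/\n' "$BUCKET_PDF_ROOT" "$yy" "$month"
  done
}

# Download every monthly folder for the given two-digit years
# into ./papers_from_20YY/. Needs gsutil on PATH.
# -m runs copies in parallel; -r recurses into each folder.
download_years() {
  local yy prefix
  for yy in "$@"; do
    mkdir -p "./papers_from_20${yy}"
    for prefix in $(arxiv_year_prefixes "$yy"); do
      gsutil -m cp -r "$prefix" "./papers_from_20${yy}/" \
        || echo "skipped ${prefix} (missing folder or copy error)" >&2
    done
  done
}
```

After sourcing the file, `download_years 23 24` would attempt both years, and `gsutil cp "$(arxiv_pdf_url 2310.00001v1)" .` would fetch the single paper from the README example. Months that are not present in the bucket are reported and skipped rather than stopping the loop.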