yiyixin committed
Commit da90bef • 1 Parent(s): bbc4e4c
Files changed (1): README.md +60 -8
README.md CHANGED
@@ -1,13 +1,65 @@
  ---
- title: {{title}}
- emoji: {{emoji}}
- colorFrom: {{colorFrom}}
- colorTo: {{colorTo}}
- sdk: {{sdk}}
- sdk_version: {{sdkVersion}}
- app_file: app.py
  pinned: false
  ---
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
  ---
+ title: ChatPaper
+ emoji: 📕
+ colorFrom: pink
+ colorTo: purple
+ sdk: docker
+ sdk_version: 20.10.23
+ app_file: frontend.py
  pinned: false
+ license: gpl-3.0
  ---
 
+ # ChatPaper
+
+ Yet another paper reading assistant, similar to [ChatPDF](https://www.chatpdf.com/).
+
+ ## Setup
+
+ 1. Install dependencies (tested on Python 3.9):
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 2. Set up a local GROBID server:
+
+ ```bash
+ bash serve_grobid.sh
+ ```
+
+ 3. Start the backend:
+
+ ```bash
+ python backend.py --port 5000 --host localhost
+ ```
+
+ 4. Start the frontend:
+
+ ```bash
+ streamlit run frontend.py --server.port 8502 --server.address localhost
+ ```
+
+ ## Demo Example
+
+ - Prepare an [OpenAI API key](https://platform.openai.com/account/api-keys) and then upload a PDF to start chatting with the paper.
+
+ ![image-20230318232056584](https://s2.loli.net/2023/03/19/SbsuLQJpdqePoZV.png)
+
+ ## Implementation Details
+
+ - Greedy Dynamic Context: Because of the model's max token limit, we select only the most relevant paragraphs of the PDF for each user query. The chatbot's input and output are split into four parts: system_prompt (S), dynamic_source (D), user_query (Q), and model_answer (A). On each query, we first rank all paragraphs with a sentence-embedding model, scoring the similarity between the query embedding and each paragraph embedding; we then compose dynamic_source greedily, appending the most relevant paragraphs while maintaining D <= MAX_TOKEN_LIMIT - Q - S - A - SOME_OVERHEAD.
+
+ - Context Truncating: When the context grows too long, we simply pop out the earliest QA pair.
+
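The two mechanisms above can be sketched in Python. This is a minimal illustration, not the repo's actual code: `similarity` is a word-overlap stand-in for the sentence-embedding model, and token counts are approximated by whitespace splitting.

```python
def similarity(query: str, paragraph: str) -> float:
    # Stand-in for cosine similarity between sentence embeddings:
    # fraction of query words that also appear in the paragraph.
    q = set(query.lower().split())
    p = set(paragraph.lower().split())
    return len(q & p) / max(len(q), 1)

def build_dynamic_source(paragraphs, query, budget):
    # Greedy Dynamic Context: rank paragraphs by relevance to the query,
    # then append the most relevant ones while D stays within the budget
    # (budget = MAX_TOKEN_LIMIT - Q - S - A - SOME_OVERHEAD).
    ranked = sorted(paragraphs, key=lambda p: similarity(query, p), reverse=True)
    picked, used = [], 0
    for para in ranked:
        cost = len(para.split())
        if used + cost > budget:
            break
        picked.append(para)
        used += cost
    return "\n\n".join(picked)

def truncate_history(history, budget):
    # Context Truncating: pop the earliest (question, answer) pair until
    # the remaining history fits the budget.
    def size(h):
        return sum(len(text.split()) for qa in h for text in qa)
    while history and size(history) > budget:
        history.pop(0)
    return history
```

In the real pipeline the ranking would use sentence-transformer embeddings (see References) and a proper tokenizer for counting.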
+ ## TODO
+
+ - [ ] **Context Condense**: how to deal with long context? Maybe we can tune a soft prompt to condense the context.
+ - [ ] **Popping context out based on similarity**
+
+ ## References
+
+ 1. SciPDF Parser: https://github.com/titipata/scipdf_parser
+ 2. St-chat: https://github.com/AI-Yash/st-chat
+ 3. Sentence-transformers: https://github.com/UKPLab/sentence-transformers
+ 4. ChatGPT Chatbot Wrapper: https://github.com/acheong08/ChatGPT