lfoppiano committed on
Commit
041e0aa
1 Parent(s): 6eee84d

prepare for the new release

Files changed (3)
  1. CHANGELOG.md +20 -0
  2. README.md +10 -8
  3. streamlit_app.py +2 -3
CHANGELOG.md CHANGED
@@ -4,6 +4,26 @@ All notable changes to this project will be documented in this file.
 
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
+ ## [0.4.0] - 2024-06-24
+
+ ### Added
+ + Add selection of embedding functions
+ + Add selection of text from the pdf viewer (provided by https://github.com/lfoppiano/streamlit-pdf-viewer)
+ + Added an experimental feature for calculating the coefficient that relates the question to the embedding database
+ + Added the data availability statement in the searchable text
+
+ ### Changed
+ + Removed obsolete and non-working models zephyr and mistral v0.1
+ + The underlying library was refactored to make it easier to maintain
+ + Removed the native PDF viewer
+ + Updated langchain and streamlit to the latest versions
+ + Removed conversational memory, which was causing more problems than benefits
+ + Rearranged the interface to get more space
+
+ ### Fixed
+ + Updated and removed models that were not working
+ + Fixed problems with langchain and other libraries
+
  ## [0.3.4] - 2023-12-26
 
  ### Added
README.md CHANGED
@@ -21,17 +21,14 @@ https://lfoppiano-document-qa.hf.space/
  ## Introduction
 
  Question/Answering on scientific documents using LLMs: ChatGPT-3.5-turbo, GPT4, GPT4-Turbo, Mistral-7b-instruct and Zephyr-7b-beta.
- The streamlit application demonstrates the implementation of a RAG (Retrieval Augmented Generation) on scientific documents, that we are developing at NIMS (National Institute for Materials Science), in Tsukuba, Japan.
+ The streamlit application demonstrates the implementation of a RAG (Retrieval Augmented Generation) on scientific documents.
  **Different from most of the projects**, we focus on scientific articles and we extract text from a structured document.
  We target only the full-text using [Grobid](https://github.com/kermitt2/grobid), which provides cleaner results than a raw PDF2Text converter (comparable with most other solutions).
 
  Additionally, this frontend provides the visualisation of named entities on LLM responses to extract <span style="color:yellow">physical quantities, measurements</span> (with [grobid-quantities](https://github.com/kermitt2/grobid-quantities)) and <span style="color:blue">materials</span> mentions (with [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors)).
 
- The conversation is kept in memory by a buffered sliding window memory (top 4 more recent messages) and the messages are injected in the context as "previous messages".
-
  (The image on the right was generated with https://huggingface.co/spaces/stabilityai/stable-diffusion)
 
-
  [<img src="https://img.youtube.com/vi/M4UaYs5WKGs/hqdefault.jpg" height="300" align="right"
  />](https://www.youtube.com/embed/M4UaYs5WKGs)

@@ -46,6 +43,9 @@ The conversation is kept in memory by a buffered sliding window memory (top 4 mo
 
  ## Documentation
 
+ ### Embedding selection
+ In the latest version it is possible to select both the embedding functions and the LLMs. There are some limitations: OpenAI embeddings cannot be used with open-source models, and vice versa.
+
  ### Context size
  Allows changing the number of blocks from the original document that are considered for responding.
  The default size of each block is 250 tokens (which can be changed before uploading the first document).

@@ -61,8 +61,9 @@ Larger blocks will result in a larger context less constrained around the questi
 
  ### Query mode
  Indicates whether the question is sent to the LLM (Language Model) or to the vector storage.
- - LLM (default) enables question/answering related to the document content.
- - Embeddings: the response will consist of the raw text from the document related to the question (based on the embeddings). This mode helps to test why sometimes the answers are not satisfying or incomplete.
+ - **LLM** (default) enables question/answering related to the document content.
+ - **Embeddings**: the response will consist of the raw text from the document related to the question (based on the embeddings). This mode helps to test why sometimes the answers are not satisfying or incomplete.
+ - **Question coefficient** (experimental): provides a coefficient that indicates how close the question is to the retrieved context.
 
  ### NER (Named Entities Recognition)
  This feature is specifically crafted for people working with scientific documents in materials science.

@@ -102,8 +103,9 @@ To install the library with Pypi:
 
  ## Acknowledgement
 
- This project is developed at the [National Institute for Materials Science](https://www.nims.go.jp) (NIMS) in Japan in collaboration with [Guillaume Lambard](https://github.com/GLambard) and the [Lambard-ML-Team](https://github.com/Lambard-ML-Team).
- Contributed by [Pedro Ortiz Suarez](https://github.com/pjox), [Tomoya Mato](https://github.com/t29mato).
+ The project was initiated at the [National Institute for Materials Science](https://www.nims.go.jp) (NIMS) in Japan.
+ Currently, the development is possible thanks to [ScienciLAB](https://www.sciencialab.com).
+ This project received contributions from [Guillaume Lambard](https://github.com/GLambard) and the [Lambard-ML-Team](https://github.com/Lambard-ML-Team), [Pedro Ortiz Suarez](https://github.com/pjox), and [Tomoya Mato](https://github.com/t29mato).
  Thanks also to [Patrice Lopez](https://www.science-miner.com), the author of [Grobid](https://github.com/kermitt2/grobid).
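The experimental "question coefficient" in the README diff above is described only loosely (a score of how close the question is to the retrieved context). One plausible way to compute such a score, offered here as a hypothetical sketch and not necessarily this project's actual implementation, is the mean cosine similarity between the question embedding and the embeddings of the retrieved chunks:

```python
from math import sqrt

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def question_coefficient(question_emb, context_embs):
    # Average similarity between the question and each retrieved chunk:
    # values near 1 suggest the retrieved context closely matches the
    # question; values near 0 suggest the retrieval drifted away from it.
    sims = [cosine_similarity(question_emb, c) for c in context_embs]
    return sum(sims) / len(sims)
```

With embeddings normalised this way, the coefficient is bounded and easy to threshold when diagnosing unsatisfying answers in the Embeddings query mode.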
streamlit_app.py CHANGED
@@ -299,7 +299,7 @@ with right_column:
  )
 
  placeholder = st.empty()
- messages = st.container(height=300, border=False)
+ messages = st.container(height=300)
 
  question = st.chat_input(
      "Ask something about the article",

@@ -483,6 +483,5 @@ with left_column:
      input=st.session_state['binary'],
      annotation_outline_size=2,
      annotations=st.session_state['annotations'],
-     render_text=True,
-     height=700
+     render_text=True
  )
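For context, the `annotations` value passed to the viewer in the hunk above is, in streamlit-pdf-viewer, a list of plain dictionaries describing boxes drawn over the PDF pages. A minimal sketch of building one such entry; the field names are assumptions based on the library's documentation and should be verified against the installed version:

```python
def make_annotation(page, x, y, width, height, color="red"):
    """Build one entry for streamlit-pdf-viewer's `annotations` list.

    The field names ("page", "x", "y", "width", "height", "color") are
    assumptions taken from the library's documentation; check them
    against the installed streamlit-pdf-viewer version.
    """
    return {
        "page": page,
        "x": x,
        "y": y,
        "width": width,
        "height": height,
        "color": color,
    }

# e.g. stored in st.session_state['annotations'] before calling pdf_viewer(...)
annotations = [make_annotation(1, 10.5, 20.0, 100.0, 12.0)]
```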