lfoppiano committed on
Commit
041e0aa
1 Parent(s): 6eee84d

prepare for the new release

Files changed (3)
  1. CHANGELOG.md +20 -0
  2. README.md +10 -8
  3. streamlit_app.py +2 -3
CHANGELOG.md CHANGED
@@ -4,6 +4,26 @@ All notable changes to this project will be documented in this file.
 
  The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/).
 
+ ## [0.4.0] - 2024-06-24
+
+ ### Added
+ + Add selection of embedding functions
+ + Add selection of text from the pdf viewer (provided by https://github.com/lfoppiano/streamlit-pdf-viewer)
+ + Added an experimental feature for calculating the coefficient that relates the question to the embedding database
+ + Added the data availability statement in the searchable text
+
+ ### Changed
+ + Removed obsolete and non-working models zephyr and mistral v0.1
+ + The underlying library was refactored to make it easier to maintain
+ + Removed the native PDF viewer
+ + Updated langchain and streamlit to the latest versions
+ + Removed conversational memory, which was causing more problems than benefits
+ + Rearranged the interface to get more space
+
+ ### Fixed
+ + Updated and removed models that were not working
+ + Fixed problems with langchain and other libraries
+
  ## [0.3.4] - 2023-12-26
 
  ### Added
README.md CHANGED
@@ -21,17 +21,14 @@ https://lfoppiano-document-qa.hf.space/
  ## Introduction
 
  Question/Answering on scientific documents using LLMs: ChatGPT-3.5-turbo, GPT4, GPT4-Turbo, Mistral-7b-instruct and Zephyr-7b-beta.
- The streamlit application demonstrates the implementation of a RAG (Retrieval Augmented Generation) on scientific documents, that we are developing at NIMS (National Institute for Materials Science), in Tsukuba, Japan.
+ The streamlit application demonstrates the implementation of a RAG (Retrieval Augmented Generation) on scientific documents.
  **Different from most of the projects**, we focus on scientific articles and we extract text from a structured document.
  We target only the full-text using [Grobid](https://github.com/kermitt2/grobid), which provides cleaner results than a raw PDF2Text converter (comparable with most other solutions).
 
  Additionally, this frontend provides the visualisation of named entities on LLM responses to extract <span style="color:yellow">physical quantities, measurements</span> (with [grobid-quantities](https://github.com/kermitt2/grobid-quantities)) and <span style="color:blue">materials</span> mentions (with [grobid-superconductors](https://github.com/lfoppiano/grobid-superconductors)).
 
- The conversation is kept in memory by a buffered sliding window memory (top 4 more recent messages) and the messages are injected in the context as "previous messages".
-
  (The image on the right was generated with https://huggingface.co/spaces/stabilityai/stable-diffusion)
 
-
  [<img src="https://img.youtube.com/vi/M4UaYs5WKGs/hqdefault.jpg" height="300" align="right"
  />](https://www.youtube.com/embed/M4UaYs5WKGs)

@@ -46,6 +43,9 @@ The conversation is kept in memory by a buffered sliding window memory (top 4 mo
 
  ## Documentation
 
+ ### Embedding selection
+ In the latest version it is possible to select both the embedding functions and the LLMs. There are some limitations: OpenAI embeddings cannot be used with open-source models, and vice versa.
+
  ### Context size
  Allows changing the number of blocks from the original document that are considered for responding.
  The default size of each block is 250 tokens (which can be changed before uploading the first document).

@@ -61,8 +61,9 @@ Larger blocks will result in a larger context less constrained around the questi
 
  ### Query mode
  Indicates whether the question is sent to the LLM (Language Model) or to the vector storage.
- - LLM (default) enables question/answering related to the document content.
- - Embeddings: the response will consist of the raw text from the document related to the question (based on the embeddings). This mode helps to test why sometimes the answers are not satisfying or incomplete.
+ - **LLM** (default) enables question/answering related to the document content.
+ - **Embeddings**: the response will consist of the raw text from the document related to the question (based on the embeddings). This mode helps to test why sometimes the answers are not satisfying or incomplete.
+ - **Question coefficient** (experimental): provides a coefficient that indicates how close the question is to the retrieved context.
 
  ### NER (Named Entities Recognition)
  This feature is specifically crafted for people working with scientific documents in materials science.

@@ -102,8 +103,9 @@ To install the library with Pypi:
 
  ## Acknowledgement
 
- This project is developed at the [National Institute for Materials Science](https://www.nims.go.jp) (NIMS) in Japan in collaboration with [Guillaume Lambard](https://github.com/GLambard) and the [Lambard-ML-Team](https://github.com/Lambard-ML-Team).
- Contributed by [Pedro Ortiz Suarez](https://github.com/pjox), [Tomoya Mato](https://github.com/t29mato).
+ The project was initiated at the [National Institute for Materials Science](https://www.nims.go.jp) (NIMS) in Japan.
+ Currently, the development is possible thanks to [ScienciLAB](https://www.sciencialab.com).
+ This project received contributions from [Guillaume Lambard](https://github.com/GLambard) and the [Lambard-ML-Team](https://github.com/Lambard-ML-Team), [Pedro Ortiz Suarez](https://github.com/pjox), and [Tomoya Mato](https://github.com/t29mato).
  Thanks also to [Patrice Lopez](https://www.science-miner.com), the author of [Grobid](https://github.com/kermitt2/grobid).
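The experimental "question coefficient" in the README diff above is described only loosely (a score of how close the question is to the retrieved context). One plausible way to compute such a score, offered here as a hypothetical sketch and not necessarily this project's actual implementation, is the mean cosine similarity between the question embedding and the embeddings of the retrieved chunks:

```python
from math import sqrt

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def question_coefficient(question_emb, context_embs):
    # Average similarity between the question and each retrieved chunk:
    # values near 1 suggest the retrieved context closely matches the
    # question; values near 0 suggest the retrieval drifted away from it.
    sims = [cosine_similarity(question_emb, c) for c in context_embs]
    return sum(sims) / len(sims)
```

With embeddings normalised this way, the coefficient is bounded and easy to threshold when diagnosing unsatisfying answers in the Embeddings query mode.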
streamlit_app.py CHANGED
@@ -299,7 +299,7 @@ with right_column:
  )
 
  placeholder = st.empty()
- messages = st.container(height=300, border=False)
+ messages = st.container(height=300)
 
  question = st.chat_input(
      "Ask something about the article",

@@ -483,6 +483,5 @@ with left_column:
      input=st.session_state['binary'],
      annotation_outline_size=2,
      annotations=st.session_state['annotations'],
-     render_text=True,
-     height=700
+     render_text=True
  )
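For context, the `annotations` value passed to the viewer in the hunk above is, in streamlit-pdf-viewer, a list of plain dictionaries describing boxes drawn over the PDF pages. A minimal sketch of building one such entry; the field names are assumptions based on the library's documentation and should be verified against the installed version:

```python
def make_annotation(page, x, y, width, height, color="red"):
    """Build one entry for streamlit-pdf-viewer's `annotations` list.

    The field names ("page", "x", "y", "width", "height", "color") are
    assumptions taken from the library's documentation; check them
    against the installed streamlit-pdf-viewer version.
    """
    return {
        "page": page,
        "x": x,
        "y": y,
        "width": width,
        "height": height,
        "color": color,
    }

# e.g. stored in st.session_state['annotations'] before calling pdf_viewer(...)
annotations = [make_annotation(1, 10.5, 20.0, 100.0, 12.0)]
```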