Space: MachineLearningReply/search_mlReply

@@ -1,12 +1,111 @@
 ---
-title: Rag Search
-emoji: 👀
-colorFrom: gray
-colorTo: red
 sdk: streamlit
-sdk_version: 1.28.2
 app_file: app.py
 pinned: false
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: Haystack Search Pipeline with Streamlit
+emoji: 👑
+colorFrom: indigo
+colorTo: indigo
 sdk: streamlit
+sdk_version: 1.23.0
 app_file: app.py
 pinned: false
 ---
+# Template Streamlit App for Haystack Search Pipelines
+This template [Streamlit](https://docs.streamlit.io/) app is set up for simple [Haystack search applications](https://docs.haystack.deepset.ai/docs/semantic_search). The template is ready to do QA with **Retrieval Augmented Generation** or **Extractive QA**.
+See the ['How to use this template'](#how-to-use-this-template) instructions below to create a simple UI for your own Haystack search pipelines.
+Below you will also find instructions on how you could [push this to Hugging Face Spaces 🤗](#pushing-to-hugging-face-spaces-).
+## Installation and Running
+To run the bare application which does _nothing_:
+1. Install requirements: `pip install -r requirements.txt`
+2. Run the streamlit app: `streamlit run app.py`
+This will start up the app on `localhost:8501` where you will find a simple search bar. Before you start editing, you'll notice that the app will only show you instructions on what to edit.
+### Optional Configurations
+You can pass optional configurations to set the:
+-  `--task` you want to start the app with: `rag` or `extractive` (default: rag)
+-  `--store` you want to use: `inmemory`, `opensearch`, `weaviate` or `milvus` (default: inmemory)
+-  `--name` you want to have for the app. (default: 'My Search App')
+E.g.:
+```bash
+streamlit run app.py -- --store opensearch --task extractive --name 'My Opensearch Documentation Search'
+```
+In a `.env` file, include all the config settings that you would like to use based on:
+- The DocumentStore of your choice
+- The Extractive/Generative model of your choice
+While `utils/config.py` creates default values for some configurations, others, such as `OPENAI_KEY`, have to be set in the `.env` file.
+Example `.env`
+```
+OPENAI_KEY=YOUR_KEY
+EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L12-v2
+GENERATIVE_MODEL=text-davinci-003
+```
+## How to use this template
+1. Create a new repository from this template or simply open it in a codespace to start playing around 💙
+2. Make sure your `requirements.txt` file includes the Haystack and Streamlit versions you would like to use.
+3. Change the code in `utils/haystack.py` if you would like a different pipeline.
+4. Create a `.env` file with all of your configuration settings.
+5. Make any UI edits you'd like to and [share with the Haystack community](https://haystack.deepset.ai/community)
+6. Run the app as shown in [installation and running](#installation-and-running)
+### Repo structure
+- `./utils`: This is where we have 3 files:
+    - `config.py`: This file extracts all of the configuration settings from a `.env` file. For some config settings, it uses default values. An example of this is in [this demo project](https://github.com/TuanaCelik/should-i-follow/blob/main/utils/config.py).
+    - `haystack.py`: Here you will find some functions already set up for you to start creating your Haystack search pipeline. It includes 2 main functions called `start_haystack()` which is what we use to create a pipeline and cache it, and `query()` which is the function called by `app.py` once a user query is received.
+    - `ui.py`: Use this file for any UI and initial value setups.
+- `app.py`: This is the main Streamlit application file that we will run. In its current state it has a simple search bar, a 'Run' button, and a response that you can highlight answers with.
+### What to edit?
+There are default pipelines both in `start_haystack_extractive()` and `start_haystack_rag()`
+- Change the pipelines to use the embedding models, extractive or generative models as you need.
+- If using the `rag` task, change the `default_prompt_template` to use one of our available ones on [PromptHub](https://prompthub.deepset.ai) or create your own `PromptTemplate`
+## Pushing to Hugging Face Spaces 🤗
+Below is an example GitHub action that will let you push your Streamlit app straight to the Hugging Face Hub as a Space.
+A few things to pay attention to:
+1. Create a New Space on Hugging Face with the Streamlit SDK.
+2. Create a Hugging Face token on your HF account.
+3. Create a secret on your GitHub repo called `HF_TOKEN` and put your Hugging Face token here.
+4. If you're using DocumentStores or APIs that require some keys/tokens, make sure these are provided as a secret for your HF Space too!
+5. This README is set up to tell HF Spaces that it uses Streamlit and that the app runs from `app.py`. Edit the frontmatter of this README to set the title, emoji, etc. you want.
+6. Create a file in `.github/workflows/hf_sync.yml`. Here's an example that you can change with your own information, and an [example workflow](https://github.com/TuanaCelik/should-i-follow/blob/main/.github/workflows/hf_sync.yml) working for the [Should I Follow demo](https://huggingface.co/spaces/deepset/should-i-follow)
+```yaml
+name: Sync to Hugging Face hub
+on:
+  push:
+    branches: [main]
+  # to run this workflow manually from the Actions tab
+  workflow_dispatch:
+jobs:
+  sync-to-hub:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v2
+        with:
+          fetch-depth: 0
+          lfs: true
+      - name: Push to hub
+        env:
+          HF_TOKEN: ${{ secrets.HF_TOKEN }}
+        run: git push --force https://{YOUR_HF_USERNAME}:$HF_TOKEN@{YOUR_HF_SPACE_REPO} main
+```
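
The README above passes app options after `--` on the `streamlit run` command line. The sketch below mirrors the two options that `utils/config.py` in this commit defines (`--store` and `--name`) and parses the README's example arguments. It uses `parse_known_args`, which is an assumption made here for illustration (the app itself calls `parse_args`), to show that `--task` is not among the defined options and would be reported as unrecognized.

```python
import argparse

# Mirrors the options defined in utils/config.py in this commit.
parser = argparse.ArgumentParser(description="Search app options")
parser.add_argument(
    "--store",
    choices=("inmemory", "weaviate", "milvus", "opensearch"),
    default="inmemory",
)
parser.add_argument("--name", default="My Search App")

# The arguments from the README example, as they arrive after `--`.
argv = [
    "--store", "opensearch",
    "--task", "extractive",
    "--name", "My Opensearch Documentation Search",
]

# parse_known_args returns unrecognized arguments instead of exiting;
# parse_args would stop with an "unrecognized arguments" error here.
args, unknown = parser.parse_known_args(argv)
print(args.store, args.name, unknown)
```

With the parser as committed, the task is chosen in the sidebar instead, so the example command would need `--task extractive` dropped to start cleanly.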

app.py ADDED

	@@ -0,0 +1,249 @@

+import pydantic
+module_file_path = pydantic.__file__
+module_file_path = module_file_path.split('pydantic')[0] + 'haystack'
+import os
+import fileinput
+def replace_string_in_files(folder_path, old_str, new_str):
+    for subdir, dirs, files in os.walk(folder_path):
+        for file in files:
+            file_path = os.path.join(subdir, file)
+            # Check if the file is a text file (you can modify this condition based on your needs)
+            if file.endswith(".txt") or file.endswith(".py"):
+                # Open the file in place for editing
+                with fileinput.FileInput(file_path, inplace=True) as f:
+                    for line in f:
+                        # Replace the old string with the new string
+                        print(line.replace(old_str, new_str), end='')
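+with open('change_log.txt','r') as f:
+    status = f.readlines()
+# Patch Haystack's pydantic imports only once; the marker file records that it was done
+if not status or status[-1].strip() != 'changed':
+    replace_string_in_files(module_file_path, 'from pydantic', 'from pydantic.v1')
+    with open('change_log.txt','w') as f:
+        f.write('changed')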
+from operator import index
+import streamlit as st
+import logging
+import os
+from annotated_text import annotation
+from json import JSONDecodeError
+from markdown import markdown
+from utils.config import parser
+from utils.haystack import start_document_store, query, initialize_pipeline, start_preprocessor_node, start_retriever, start_reader
+from utils.ui import reset_results, set_initial_state
+import pandas as pd
+import haystack
+# Whether the file upload should be enabled or not
+DISABLE_FILE_UPLOAD = bool(os.getenv("DISABLE_FILE_UPLOAD"))
+# Define a function to handle file uploads
+def upload_files():
+    uploaded_files = st.sidebar.file_uploader(
+            "upload", type=["pdf", "txt", "docx"], accept_multiple_files=True, label_visibility="hidden"
+        )
+    return uploaded_files
+# Define a function to process a single file
+def process_file(data_file, preprocesor, document_store):
+    # read file and add content
+    file_contents = data_file.read().decode("utf-8")
+    docs = [{
+        'content': str(file_contents),
+        'meta': {'name': str(data_file.name)}
+    }]
+    try:
+        names = [item.meta.get('name') for item in document_store.get_all_documents()]
+        #if args.store == 'inmemory':
+        # doc = converter.convert(file_path=files, meta=None)
+        if data_file.name in names:
+            print(f"{data_file.name} already processed")
+        else:
+            print(f'preprocessing uploaded doc {data_file.name}.......')
+            #print(data_file.read().decode("utf-8"))
+            preprocessed_docs = preprocesor.process(docs)
+            print('writing to document store.......')
+            document_store.write_documents(preprocessed_docs)
+            print('updating embeddings.......')
+            document_store.update_embeddings(retriever)
+    except Exception as e:
+        print(e)
+try:
+    args = parser.parse_args()
+    preprocesor = start_preprocessor_node()
+    document_store = start_document_store(type=args.store)
+    retriever = start_retriever(document_store)
+    reader = start_reader()
+    st.set_page_config(
+        page_title="MLReplySearch",
+        layout="centered",
+        page_icon=":shark:",
+        menu_items={
+            'Get Help': 'https://www.extremelycoolapp.com/help',
+            'Report a bug': "https://www.extremelycoolapp.com/bug",
+            'About': "# This is a header. This is an *extremely* cool app!"
+        }
+    )
+    st.sidebar.image("ml_logo.png", use_column_width=True)
+    # Sidebar for Task Selection
+    st.sidebar.header('Options:')
+    # OpenAI Key Input
+    openai_key = st.sidebar.text_input("Enter OpenAI Key:", type="password")
+    if openai_key:
+        task_options = ['Extractive', 'Generative']
+    else:
+        task_options = ['Extractive']
+    task_selection = st.sidebar.radio('Select the task:', task_options)
+    # Check the task and initialize pipeline accordingly
+    if task_selection == 'Extractive':
+        pipeline_extractive = initialize_pipeline("extractive", document_store, retriever, reader)
+    elif task_selection == 'Generative' and openai_key:  # Check for openai_key to ensure user has entered it
+        pipeline_rag = initialize_pipeline("rag", document_store, retriever, reader, openai_key=openai_key)
+    set_initial_state()
+    st.write('# ' + args.name)
+    # File upload block
+    if not DISABLE_FILE_UPLOAD:
+        st.sidebar.write("## File Upload:")
+        #data_files = st.sidebar.file_uploader(
+        #    "upload", type=["pdf", "txt", "docx"], accept_multiple_files=True, label_visibility="hidden"
+        #)
+        data_files = upload_files()
+        if data_files is not None:
+            for data_file in data_files:
+                # Upload file
+                if data_file:
+                    try:
+                        #raw_json = upload_doc(data_file)
+                        # Call the process_file function for each uploaded file
+                        if args.store == 'inmemory':
+                            processed_data = process_file(data_file, preprocesor, document_store)
+                        st.sidebar.write(str(data_file.name) + " &nbsp;&nbsp; ✅ ")
+                    except Exception as e:
+                        st.sidebar.write(str(data_file.name) + " &nbsp;&nbsp; ❌ ")
+                        st.sidebar.write("_This file could not be parsed, see the logs for more information._")
+    if "question" not in st.session_state:
+        st.session_state.question = ""
+    # Search bar
+    question = st.text_input("", value=st.session_state.question, max_chars=100, on_change=reset_results)
+    run_pressed = st.button("Run")
+    run_query = (
+        run_pressed or question != st.session_state.question #or task_selection != st.session_state.task
+    )
+    # Get results for query
+    if run_query and question:
+        if task_selection == 'Extractive':
+            reset_results()
+            st.session_state.question = question
+            with st.spinner("🔎 &nbsp;&nbsp; Running your pipeline"):
+                try:
+                    st.session_state.results_extractive = query(pipeline_extractive, question)
+                    st.session_state.task = task_selection
+                except JSONDecodeError as je:
+                    st.error(
+                        "👓 &nbsp;&nbsp; An error occurred reading the results. Is the document store working?"
+                    )
+                except Exception as e:
+                    logging.exception(e)
+                    st.error("🐞 &nbsp;&nbsp; An error occurred during the request.")
+        elif task_selection == 'Generative':
+            reset_results()
+            st.session_state.question = question
+            with st.spinner("🔎 &nbsp;&nbsp; Running your pipeline"):
+                try:
+                    st.session_state.results_generative = query(pipeline_rag, question)
+                    st.session_state.task = task_selection
+                except JSONDecodeError as je:
+                    st.error(
+                        "👓 &nbsp;&nbsp; An error occurred reading the results. Is the document store working?"
+                    )
+                except Exception as e:
+                    if "API key is invalid" in str(e):
+                        logging.exception(e)
+                        st.error("🐞 &nbsp;&nbsp; incorrect API key provided. You can find your API key at https://platform.openai.com/account/api-keys.")
+                    else:
+                        logging.exception(e)
+                        st.error("🐞 &nbsp;&nbsp; An error occurred during the request.")
+    # Display results
+    if (st.session_state.results_extractive or st.session_state.results_generative) and run_query:
+        # Handle Extractive Answers
+        if task_selection == 'Extractive':
+            results = st.session_state.results_extractive
+            st.subheader("Extracted Answers:")
+            if 'answers' in results:
+                answers = results['answers']
+                threshold = 0.2
+                higher_than_threshold = any(ans.score > threshold for ans in answers)
+                if not higher_than_threshold:
+                    st.markdown(f"<span style='color:red'>Please note that none of the answers achieved a score higher than {int(threshold * 100)}%, which probably means that the desired answer is not in the searched documents.</span>", unsafe_allow_html=True)
+                for count, answer in enumerate(answers):
+                    if answer.answer:
+                        text, context = answer.answer, answer.context
+                        start_idx = context.find(text)
+                        end_idx = start_idx + len(text)
+                        score = round(answer.score, 3)
+                        st.markdown(f"**Answer {count + 1}:**")
+                        st.markdown(
+                            context[:start_idx] + str(annotation(body=text, label=f'SCORE {score}', background='#964448', color='#ffffff')) + context[end_idx:],
+                            unsafe_allow_html=True,
+                        )
+                    else:
+                        st.info(
+                            "🤔 &nbsp;&nbsp; Haystack is unsure whether any of the documents contain an answer to your question. Try to reformulate it!"
+                        )
+        # Handle Generative Answers
+        elif task_selection == 'Generative':
+            results = st.session_state.results_generative
+            st.subheader("Generated Answer:")
+            if 'results' in results:
+                st.markdown("**Answer:**")
+                st.write(results['results'][0])
+        # Handle Retrieved Documents
+        if 'documents' in results:
+            retrieved_documents = results['documents']
+            st.subheader("Retriever Results:")
+            data = []
+            for i, document in enumerate(retrieved_documents):
+                # Truncate the content
+                truncated_content = (document.content[:150] + '...') if len(document.content) > 150 else document.content
+                data.append([i + 1, document.meta['name'], truncated_content])
+            # Convert data to DataFrame and display using Streamlit
+            df = pd.DataFrame(data, columns=['Ranked Context', 'Document Name', 'Content'])
+            st.table(df)
+except SystemExit as e:
+    os._exit(e.code)
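
The extractive branch of `app.py` locates the answer inside its context with `context.find(text)` and slices around it. `str.find` returns `-1` when the substring is absent, and in that case the slices would cut the context at the wrong place. The source does not say whether the reader can ever return an answer that is missing from its context, so the following is only a defensive sketch; `split_around_answer` is a hypothetical helper, not part of the app.

```python
from typing import Tuple


def split_around_answer(context: str, answer: str) -> Tuple[str, str, str]:
    """Return (before, match, after) for highlighting.

    When the answer is empty or not found, the whole context is returned
    as `before` and the other two parts are empty strings.
    """
    start = context.find(answer) if answer else -1
    if start == -1:
        # Nothing to highlight: show the context unchanged.
        return context, "", ""
    end = start + len(answer)
    return context[:start], context[start:end], context[end:]


before, match, after = split_around_answer("Pi is a super dog", "super")
print(repr(before), repr(match), repr(after))
```

The caller would wrap only a non-empty `match` in `annotation(...)` and otherwise render the plain context.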

change_log.txt ADDED

	@@ -0,0 +1 @@


+unchanged

ml_logo.png ADDED

requirements.txt ADDED

	@@ -0,0 +1,7 @@

+safetensors==0.3.3.post1
+farm-haystack[inference,weaviate,opensearch]==1.20.0
+milvus-haystack
+streamlit==1.23.0
+markdown
+st-annotated-text
+datasets

utils/config.py ADDED

	@@ -0,0 +1,41 @@

+import argparse
+import os
+from dotenv import load_dotenv
+load_dotenv()
+parser = argparse.ArgumentParser(description='Options for the Haystack search app')
+document_store_choices = ('inmemory', 'weaviate', 'milvus', 'opensearch')
+parser.add_argument('--store', choices=document_store_choices, default='inmemory', help='DocumentStore selection (default: %(default)s)')
+parser.add_argument('--name', default="My Search App")
+model_configs = {
+    'EMBEDDING_MODEL': os.getenv("EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L12-v2"),
+    'GENERATIVE_MODEL': os.getenv("GENERATIVE_MODEL", "gpt-4"),
+    'EXTRACTIVE_MODEL': os.getenv("EXTRACTIVE_MODEL", "deepset/roberta-base-squad2"),
+    'OPENAI_KEY': os.getenv("OPENAI_KEY"),
+    'COHERE_KEY': os.getenv("COHERE_KEY"),
+}
+document_store_configs = {
+# Weaviate Config
+'WEAVIATE_HOST':  os.getenv("WEAVIATE_HOST", "http://localhost"),
+'WEAVIATE_PORT': os.getenv("WEAVIATE_PORT", 8080),
+'WEAVIATE_INDEX': os.getenv("WEAVIATE_INDEX", "Document"),
+'WEAVIATE_EMBEDDING_DIM': os.getenv("WEAVIATE_EMBEDDING_DIM", 768),
+# OpenSearch Config
+'OPENSEARCH_SCHEME': os.getenv("OPENSEARCH_SCHEME",  "https"),
+'OPENSEARCH_USERNAME': os.getenv("OPENSEARCH_USERNAME", "admin"),
+'OPENSEARCH_PASSWORD': os.getenv("OPENSEARCH_PASSWORD", "admin"),
+'OPENSEARCH_HOST': os.getenv("OPENSEARCH_HOST", "localhost"),
+'OPENSEARCH_PORT': os.getenv("OPENSEARCH_PORT", 9200),
+'OPENSEARCH_INDEX':  os.getenv("OPENSEARCH_INDEX", "document"),
+'OPENSEARCH_EMBEDDING_DIM': os.getenv("OPENSEARCH_EMBEDDING_DIM", 768),
+# Milvus Config
+'MILVUS_URI': os.getenv("MILVUS_URI", "http://localhost:19530/default"),
+'MILVUS_INDEX':  os.getenv("MILVUS_INDEX", "document"),
+'MILVUS_EMBEDDING_DIM': os.getenv("MILVUS_EMBEDDING_DIM", 768),
+}
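
`os.getenv` returns a string whenever the variable is set and the given default otherwise, so settings such as `WEAVIATE_PORT` above are an `int` by default but a `str` once they come from `.env`. Likewise, `bool(os.getenv("DISABLE_FILE_UPLOAD"))` in `app.py` is true for any non-empty value, including `"false"`. The sketch below shows one way to normalize both cases; `env_int`, `env_flag`, and the set of accepted true values are hypothetical choices, not part of the committed code.

```python
import os

# Assumed spellings that count as "enabled"; adjust to taste.
_TRUE_VALUES = {"1", "true", "yes", "on"}


def env_int(name: str, default: int) -> int:
    """Read an integer setting; fall back to `default` when unset or empty."""
    raw = os.getenv(name)
    return int(raw) if raw else default


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean setting, treating only known spellings as true."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUE_VALUES


os.environ["WEAVIATE_PORT"] = "9090"
os.environ["DISABLE_FILE_UPLOAD"] = "false"
print(env_int("WEAVIATE_PORT", 8080), env_flag("DISABLE_FILE_UPLOAD"))
```

Whether a given document store accepts a string port or embedding dimension is not stated in the source, so converting at the configuration boundary is the cautious option.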

utils/haystack.py ADDED

	@@ -0,0 +1,120 @@

+import streamlit as st
+from utils.config import document_store_configs, model_configs
+from haystack import Pipeline
+from haystack.schema import Answer
+from haystack.document_stores import BaseDocumentStore
+from haystack.document_stores import InMemoryDocumentStore, OpenSearchDocumentStore, WeaviateDocumentStore
+from haystack.nodes import EmbeddingRetriever, FARMReader, PromptNode, PreProcessor
+from milvus_haystack import MilvusDocumentStore
+#Use this file to set up your Haystack pipeline and querying
+@st.cache_resource(show_spinner=False)
+def start_preprocessor_node():
+    print('initializing preprocessor node')
+    processor = PreProcessor(
+        clean_empty_lines= True,
+        clean_whitespace=True,
+        clean_header_footer=True,
+        #remove_substrings=None,
+        split_by="word",
+        split_length=100,
+        split_respect_sentence_boundary=True,
+        #split_overlap=0,
+        #max_chars_check= 10_000
+    )
+    return processor
+    #return docs
+@st.cache_resource(show_spinner=False)
+def start_document_store(type: str):
+    #This function starts the documents store of your choice based on your command line preference
+    print('initializing document store')
+    if type == 'inmemory':
+        document_store = InMemoryDocumentStore(use_bm25=True, embedding_dim=384)
+        '''
+        documents = [
+            {
+                'content': "Pi is a super dog",
+                'meta': {'name': "pi.txt"}
+            },
+            {
+                'content': "The revenue of siemens is 5 milion Euro",
+                'meta': {'name': "siemens.txt"}
+            },
+        ]
+        document_store.write_documents(documents)
+        '''
+    elif type == 'opensearch':
+        document_store = OpenSearchDocumentStore(scheme = document_store_configs['OPENSEARCH_SCHEME'],
+                                                 username = document_store_configs['OPENSEARCH_USERNAME'],
+                                                 password = document_store_configs['OPENSEARCH_PASSWORD'],
+                                                 host = document_store_configs['OPENSEARCH_HOST'],
+                                                 port = document_store_configs['OPENSEARCH_PORT'],
+                                                 index = document_store_configs['OPENSEARCH_INDEX'],
+                                                 embedding_dim = document_store_configs['OPENSEARCH_EMBEDDING_DIM'])
+    elif type == 'weaviate':
+        document_store = WeaviateDocumentStore(host = document_store_configs['WEAVIATE_HOST'],
+                                                port = document_store_configs['WEAVIATE_PORT'],
+                                                index = document_store_configs['WEAVIATE_INDEX'],
+                                                embedding_dim = document_store_configs['WEAVIATE_EMBEDDING_DIM'])
+    elif type == 'milvus':
+        document_store = MilvusDocumentStore(uri = document_store_configs['MILVUS_URI'],
+                                            index = document_store_configs['MILVUS_INDEX'],
+                                            embedding_dim = document_store_configs['MILVUS_EMBEDDING_DIM'],
+                                            return_embedding=True)
+    return document_store
+# cached to make index and models load only at start
+@st.cache_resource(show_spinner=False)
+def start_retriever(_document_store: BaseDocumentStore):
+    print('initializing retriever')
+    retriever = EmbeddingRetriever(document_store=_document_store,
+                                   embedding_model=model_configs['EMBEDDING_MODEL'],
+                                   top_k=5)
+    #
+    #_document_store.update_embeddings(retriever)
+    return retriever
+@st.cache_resource(show_spinner=False)
+def start_reader():
+    print('initializing reader')
+    reader = FARMReader(model_name_or_path=model_configs['EXTRACTIVE_MODEL'])
+    return reader
+# cached to make index and models load only at start
+@st.cache_resource(show_spinner=False)
+def start_haystack_extractive(_document_store: BaseDocumentStore, _retriever: EmbeddingRetriever, _reader: FARMReader):
+    print('initializing pipeline')
+    pipe = Pipeline()
+    pipe.add_node(component=_retriever, name="Retriever", inputs=["Query"])
+    pipe.add_node(component= _reader, name="Reader", inputs=["Retriever"])
+    return pipe
+@st.cache_resource(show_spinner=False)
+def start_haystack_rag(_document_store: BaseDocumentStore, _retriever: EmbeddingRetriever, openai_key):
+    prompt_node = PromptNode(default_prompt_template="deepset/question-answering",
+                             model_name_or_path=model_configs['GENERATIVE_MODEL'],
+                             api_key=openai_key)
+    pipe = Pipeline()
+    pipe.add_node(component=_retriever, name="Retriever", inputs=["Query"])
+    pipe.add_node(component=prompt_node, name="PromptNode", inputs=["Retriever"])
+    return pipe
+#@st.cache_data(show_spinner=True)
+def query(_pipeline, question):
+    params = {}
+    results = _pipeline.run(question, params=params)
+    return results
+def initialize_pipeline(task, document_store, retriever, reader, openai_key = ""):
+    if task == 'extractive':
+        return start_haystack_extractive(document_store, retriever, reader)
+    elif task == 'rag':
+        return start_haystack_rag(document_store, retriever, openai_key)
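
`initialize_pipeline` above returns `None` without any error when `task` is neither `'extractive'` nor `'rag'`, which would only surface later when the pipeline is queried. A minimal sketch of a stricter lookup follows. The builder functions are hypothetical stand-ins for `start_haystack_extractive` and `start_haystack_rag` so that the example runs without Haystack installed, and `initialize_pipeline_strict` is an illustrative name.

```python
from typing import Callable, Dict


# Hypothetical stand-ins for the cached pipeline builders in this file.
def build_extractive(**kwargs) -> str:
    return "extractive-pipeline"


def build_rag(**kwargs) -> str:
    return "rag-pipeline"


PIPELINE_BUILDERS: Dict[str, Callable[..., str]] = {
    "extractive": build_extractive,
    "rag": build_rag,
}


def initialize_pipeline_strict(task: str, **kwargs) -> str:
    """Look up the builder for `task` and fail loudly on an unknown name."""
    try:
        builder = PIPELINE_BUILDERS[task]
    except KeyError:
        expected = sorted(PIPELINE_BUILDERS)
        raise ValueError(f"unknown task {task!r}; expected one of {expected}") from None
    return builder(**kwargs)


print(initialize_pipeline_strict("rag", openai_key="YOUR_KEY"))
```

A table of builders also keeps the set of valid task names in one place, which makes it easier to keep the sidebar options and the pipeline code in sync.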

utils/ui.py ADDED

	@@ -0,0 +1,16 @@

+import streamlit as st
+def set_state_if_absent(key, value):
+    if key not in st.session_state:
+        st.session_state[key] = value
+def set_initial_state():
+    set_state_if_absent("question", "Ask something here?")
+    set_state_if_absent("results_extractive", None)
+    set_state_if_absent("results_generative", None)
+    set_state_if_absent("task", None)
+def reset_results(*args):
+    st.session_state.results_extractive = None
+    st.session_state.results_generative = None
+    st.session_state.task = None