G.Hemanth Sai committed on
Commit 32d9382 • 1 Parent(s): e1b10aa
.gitattributes CHANGED
@@ -31,3 +31,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *tar.gz filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED
@@ -0,0 +1,11 @@
+ #editor specific files
+ .vscode
+ .idea
+
+ #cache files
+ __pycache__
+ tempCodeRunnerFile.py
+
+ # models
+ models/s2v_old
+ models/._s2v_old
README.md CHANGED
@@ -1,13 +1,85 @@
- ---
- title: IntelligentQuestionGenerator
- emoji: 📊
- colorFrom: pink
- colorTo: indigo
- sdk: streamlit
- sdk_version: 1.10.0
- app_file: app.py
- pinned: false
- license: apache-2.0
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Internship-IVIS-labs
+
+ - The *Intelligent Question Generator* app is an easy-to-use interface built in Streamlit that uses [KeyBERT](https://github.com/MaartenGr/KeyBERT), [Sense2vec](https://github.com/explosion/sense2vec), and [T5](https://huggingface.co/ramsrigouthamg/t5_paraphraser).
+ - It uses a minimal keyword extraction technique that leverages multiple NLP embeddings and relies on [Transformers](https://huggingface.co/transformers/) 🤗 to create keywords/keyphrases that are most similar to a document.
+ - [sense2vec](https://github.com/explosion/sense2vec) (Trask et al., 2015) is a twist on word2vec that lets you learn more interesting and detailed word vectors (see the sketch below).
+
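+ A minimal sketch of how these pieces fit together (illustrative only; it assumes the packages from `requirements.txt` are installed and the sense2vec vectors have been extracted to `models/s2v_old`):
+ ```python
+ from keybert import KeyBERT
+ from sense2vec import Sense2Vec
+
+ # Embed the document and pull out the most similar unigram keywords
+ kw_model = KeyBERT(model="distilbert-base-nli-mean-tokens")
+ doc = "Space exploration is a very exciting field of research."
+ print(kw_model.extract_keywords(doc, keyphrase_ngram_range=(1, 1), top_n=5))
+
+ # Expand a keyword into related phrases that share the same sense tag
+ s2v = Sense2Vec().from_disk("models/s2v_old")
+ sense = s2v.get_best_sense("space exploration")
+ if sense is not None:
+     print(s2v.most_similar(sense, n=5))
+ ```
+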
+ ## Repository Breakdown
+ ### src Directory
+ ---
+ - `src/Pipeline/QAhaystack.py`: This file contains the code for question answering using [haystack](https://haystack.deepset.ai/overview/intro).
+ - `src/Pipeline/QuestGen.py`: This file contains the code for question generation.
+ - `src/Pipeline/Reader.py`: This file contains the code for reading the document.
+ - `src/Pipeline/TextSummarization.py`: This file contains the code for text summarization.
+ - `src/PreviousVersionCode/context.py`: This file contains the code for finding the context of a paragraph.
+ - `src/PreviousVersionCode/QuestionGenerator.py`: This file contains the first attempt at question generation.
+
+ ## Installation
+ ```shell
+ $ git clone https://github.com/HemanthSai7/Internship-IVIS-labs.git
+ ```
+ ```shell
+ $ cd Internship-IVIS-labs
+ ```
+ ```shell
+ pip install -r requirements.txt
+ ```
+ - When running the app locally for the first time, you need to uncomment the lines in `src/Pipeline/QuestGen.py` that download the models to the `models` directory.
+
+ ```shell
+ streamlit run app.py
+ ```
+ - Once the app is running, you can access it at http://localhost:8501:
+ ```shell
+ You can now view your Streamlit app in your browser.
+
+   Local URL: http://localhost:8501
+   Network URL: http://192.168.0.103:8501
+ ```
+
+ ## Tech Stack Used
+ ![image](https://img.shields.io/badge/Sense2vec-EF546D?style=for-the-badge&logo=Explosion.ai&logoColor=white)
+ ![image](https://img.shields.io/badge/Spacy-09A3D5?style=for-the-badge&logo=spaCy&logoColor=white)
+ ![image](https://img.shields.io/badge/Haystack-03AF9D?style=for-the-badge&logo=Haystackh&logoColor=white)
+ ![image](https://img.shields.io/badge/Python-3776AB?style=for-the-badge&logo=python&logoColor=white)
+ ![image](https://img.shields.io/badge/PyTorch-D04139?style=for-the-badge&logo=pytorch&logoColor=white)
+ ![image](https://img.shields.io/badge/Numpy-013243?style=for-the-badge&logo=numpy&logoColor=white)
+ ![image](https://img.shields.io/badge/Pandas-130654?style=for-the-badge&logo=pandas&logoColor=white)
+ ![image](https://img.shields.io/badge/matplotlib-b2feb0?style=for-the-badge&logo=matplotlib&logoColor=white)
+ ![image](https://img.shields.io/badge/scikit_learn-F7931E?style=for-the-badge&logo=scikit-learn&logoColor=white)
+ ![image](https://img.shields.io/badge/Streamlit-EA6566?style=for-the-badge&logo=streamlit&logoColor=white)
+
+ ## Timeline
+ ### Week 1-2:
+ #### Tasks
+ - [x] Understanding and brushing up the concepts of NLP.
+ - [x] Extracting images and text from a pdf file and storing the text in a text file.
+ - [x] Exploring various open source tools for generating questions from a given text.
+ - [x] Reading papers related to the project (BERT, T5, RoBERTa, etc.).
+ - [x] Summarizing the text extracted from the pdf file using the T5-base pre-trained model.
+
+ ### Week 3-4:
+ #### Tasks
+ - [x] Understanding the concept of QA systems.
+ - [x] Created a basic script for generating questions from the text.
+ - [x] Created a basic script for finding the context of a paragraph.
+
+ ### Week 5-6:
+ #### Tasks
+ - [x] Understanding how Transformer models work for NLP tasks such as question answering and question generation.
+ - [x] Understanding how to use the Haystack library for QA systems.
+ - [x] Understanding how to use the Haystack library for question generation.
+ - [x] Preprocessed the document for Haystack QA for better results.
+
+ ### Week 7-8:
+ #### Tasks
+ - [x] Understanding how to generate questions intelligently.
+ - [x] Explored WordNet to find synonyms.
+ - [x] Used BertWSD to disambiguate the provided sentence.
+ - [x] Used KeyBERT to find the keywords in the document.
+ - [x] Used sense2vec to find better words with high relatedness to the generated keywords.
+
+ ### Week 9-10:
+ #### Tasks
+ - [x] Created a Streamlit app to demonstrate the project.
app.py ADDED
@@ -0,0 +1,222 @@
+ import re
+
+ import streamlit as st
+ import pandas as pd
+ import seaborn as sns
+ from keybert import KeyBERT
+
+ from src.Pipeline.TextSummarization import T5_Base
+ from src.Pipeline.QuestGen import sense2vec_get_words, get_question
+
+ st.title("❓ Intelligent Question Generator")
+ st.header("")
+
+ with st.expander("ℹ️ - About this app", expanded=True):
+     st.write(
+         """
+         - The *Intelligent Question Generator* app is an easy-to-use interface built in Streamlit that uses [KeyBERT](https://github.com/MaartenGr/KeyBERT), [Sense2vec](https://github.com/explosion/sense2vec), and [T5](https://huggingface.co/ramsrigouthamg/t5_paraphraser).
+         - It uses a minimal keyword extraction technique that leverages multiple NLP embeddings and relies on [Transformers](https://huggingface.co/transformers/) 🤗 to create keywords/keyphrases that are most similar to a document.
+         - [sense2vec](https://github.com/explosion/sense2vec) (Trask et al., 2015) is a twist on word2vec that lets you learn more interesting and detailed word vectors.
+         """
+     )
+
+     st.markdown("")
+
+ st.markdown("")
+ st.markdown("## 📌 Paste document ")
+
+ with st.form(key="my_form"):
+     ce, c1, ce, c2, c3 = st.columns([0.07, 2, 0.07, 5, 1])
+     with c1:
+         ModelType = st.radio(
+             "Choose your model",
+             ["DistilBERT (Default)", "BERT", "RoBERTa", "ALBERT", "XLNet"],
+             help="At present, you can choose only one model (DistilBERT) to embed your text. More to come!",
+         )
+
+         # Cache the embedding model so Streamlit reruns don't reload it.
+         @st.cache(allow_output_mutation=True)
+         def load_model(model):
+             return KeyBERT(model=model)
+
+         if ModelType == "DistilBERT (Default)":
+             kw_model = load_model("distilbert-base-nli-mean-tokens")
+         else:
+             # The remaining options are not wired up yet, so they fall back
+             # to the default DistilBERT embedder for now.
+             kw_model = load_model("distilbert-base-nli-mean-tokens")
+
+         top_N = st.slider(
+             "# of results",
+             min_value=1,
+             max_value=30,
+             value=10,
+             help="You can choose the number of keywords/keyphrases to display. Between 1 and 30, the default number is 10.",
+         )
+         min_Ngrams = st.number_input(
+             "Minimum Ngram",
+             min_value=1,
+             max_value=4,
+             help="""The minimum value for the ngram range.
+             *Keyphrase_ngram_range* sets the length of the resulting keywords/keyphrases. To extract keyphrases, simply set *keyphrase_ngram_range* to (1, 2) or higher, depending on the number of words you would like in the resulting keyphrases.""",
+         )
+
+         max_Ngrams = st.number_input(
+             "Maximum Ngram",
+             value=1,
+             min_value=1,
+             max_value=4,
+             help="""The maximum value for the keyphrase_ngram_range.
+             *Keyphrase_ngram_range* sets the length of the resulting keywords/keyphrases.
+             To extract keyphrases, simply set *keyphrase_ngram_range* to (1, 2) or higher, depending on the number of words you would like in the resulting keyphrases.""",
+         )
+
+         StopWordsCheckbox = st.checkbox(
+             "Remove stop words",
+             value=True,
+             help="Tick this box to remove stop words from the document (currently English only)",
+         )
+
+         use_MMR = st.checkbox(
+             "Use MMR",
+             value=True,
+             help="You can use Maximal Margin Relevance (MMR) to diversify the results. It creates keywords/keyphrases based on cosine similarity. Try high/low 'Diversity' settings below for interesting variations.",
+         )
+
+         Diversity = st.slider(
+             "Keyword diversity (MMR only)",
+             value=0.5,
+             min_value=0.0,
+             max_value=1.0,
+             step=0.1,
+             help="""The higher the setting, the more diverse the keywords. Note that the *Keyword diversity* slider only works if the *MMR* checkbox is ticked.""",
+         )
+
+     with c2:
+         doc = st.text_area(
+             "Paste your text below (max 500 words)",
+             height=510,
+         )
+
+         MAX_WORDS = 500
+         res = len(re.findall(r"\w+", doc))
+         if res > MAX_WORDS:
+             st.warning(
+                 "⚠️ Your text contains "
+                 + str(res)
+                 + " words."
+                 + " Only the first 500 words will be reviewed. Stay tuned as increased allowance is coming! 😊"
+             )
+             # Keep the first MAX_WORDS whitespace-separated words
+             # (the original sliced the first 500 *characters*).
+             doc = " ".join(doc.split()[:MAX_WORDS])
+
+         # base = T5_Base("t5-base", "cpu", 2048)
+         # doc = base.getSummary(doc)
+
+     submit_button = st.form_submit_button(label="✨ Get me the data!")
+
+ mmr = use_MMR
+ StopWords = "english" if StopWordsCheckbox else None
+
+ if min_Ngrams > max_Ngrams:
+     st.warning("min_Ngrams can't be greater than max_Ngrams")
+     st.stop()
+
+ # Use KeyBERT to extract the top keywords from the text.
+ keywords = kw_model.extract_keywords(
+     doc,
+     keyphrase_ngram_range=(min_Ngrams, max_Ngrams),
+     use_mmr=mmr,
+     stop_words=StopWords,
+     top_n=top_N,
+     diversity=Diversity,
+ )
+
+ st.markdown("## 🎈 Results ")
+ st.header("")
+
+ df = (
+     pd.DataFrame(keywords, columns=["Keyword/Keyphrase", "Relevancy"])
+     .sort_values(by="Relevancy", ascending=False)
+     .reset_index(drop=True)
+ )
+ df.index += 1
+
+ # Add styling
+ cmGreen = sns.light_palette("green", as_cmap=True)
+ cmRed = sns.light_palette("red", as_cmap=True)
+ df = df.style.background_gradient(
+     cmap=cmGreen,
+     subset=["Relevancy"],
+ )
+
+ c1, c2, c3 = st.columns([1, 3, 1])
+
+ format_dictionary = {
+     "Relevancy": "{:.2%}",
+ }
+ df = df.format(format_dictionary)
+
+ with c2:
+     st.table(df)
+
+ with st.expander("Note about Quantitative Relevancy"):
+     st.markdown(
+         """
+         - The relevancy score is a quantitative measure of how relevant the keyword/keyphrase is to the document. It is calculated using cosine similarity. The higher the score, the more relevant the keyword/keyphrase is to the document.
+         - So if you see a keyword/keyphrase with a high relevancy score, it is a good keyword/keyphrase to use in question answering, generation, summarization, and other NLP tasks.
+         """
+     )
+
+ with st.form(key="ques_form"):
+     ice, ic1, ice, ic2, ic3 = st.columns([0.07, 2, 0.07, 5, 0.07])
+     with ic1:
+         TopN = st.slider(
+             "Top N sense2vec results",
+             value=20,
+             min_value=0,
+             max_value=50,
+             step=1,
+             help="""Get the n most similar terms.""",
+         )
+
+     with ic2:
+         input_keyword = st.text_input("Paste any keyword generated above")
+         keywrd_button = st.form_submit_button(label="✨ Get me the questions!")
+
+     if keywrd_button:
+         st.markdown("## 🎈 Questions ")
+         ext_keywrds = sense2vec_get_words(TopN, input_keyword)
+         if len(ext_keywrds) < 1:
+             st.warning("Sorry, questions couldn't be generated")
+
+         for answer in ext_keywrds:
+             sentence_for_T5 = " ".join(doc.split())
+             ques = get_question(sentence_for_T5, answer)
+             ques = ques.replace("<pad>", "").replace("</s>", "").replace("<s>", "")
+             st.markdown(f"> #### {ques} ")
models/s2v_reddit_2015_md.tar.gz ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5afb7c665d7833e54b04dfaf181500acca0327b5509e5e1f8ccb3b5986f53713
+ size 600444501
requirements.txt ADDED
@@ -0,0 +1,11 @@
+ streamlit
+ numpy
+ pandas
+ seaborn
+ scikit-learn
+ PyPDF2
+ PyMuPDF  # imported as `fitz` in src/Pipeline/Reader.py (the PyPI package named `fitz` is a different project)
+ transformers
+ spacy
+ keybert
+ sense2vec
src/Pipeline/QAhaystack.py ADDED
@@ -0,0 +1,158 @@
+ import re
+ import logging
+
+ from haystack.document_stores import ElasticsearchDocumentStore
+ from haystack.utils import launch_es, print_answers
+ from haystack.nodes import FARMReader, TransformersReader, BM25Retriever
+ from haystack.pipelines import ExtractiveQAPipeline
+ from haystack.nodes import TextConverter, PDFToTextConverter, PreProcessor
+ from haystack.utils import convert_files_to_docs, fetch_archive_from_http
+ from src.Pipeline.Reader import PdfReader, ExtractedText
+
+ launch_es()  # Launches an Elasticsearch instance on your local machine
+
+ # Install the latest release of Haystack in your own environment:
+ #! pip install farm-haystack
+
+ """Install the latest main of Haystack"""
+ # !pip install --upgrade pip
+ # !pip install git+https://github.com/deepset-ai/haystack.git#egg=farm-haystack[colab,ocr]
+
+ # For Colab/Linux based machines:
+ # !wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-linux-4.04.tar.gz
+ # !tar -xvf xpdf-tools-linux-4.04.tar.gz && sudo cp xpdf-tools-linux-4.04/bin64/pdftotext /usr/local/bin
+
+ # For macOS machines:
+ # !wget --no-check-certificate https://dl.xpdfreader.com/xpdf-tools-mac-4.03.tar.gz
+ # !tar -xvf xpdf-tools-mac-4.03.tar.gz && sudo cp xpdf-tools-mac-4.03/bin64/pdftotext /usr/local/bin
+
+ # Run this script from the root of the project.
+ # In Colab / no-Docker environments: start Elasticsearch from source
+ # ! wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-linux-x86_64.tar.gz -q
+ # ! tar -xzf elasticsearch-7.9.2-linux-x86_64.tar.gz
+ # ! chown -R daemon:daemon elasticsearch-7.9.2
+
+ # import os
+ # from subprocess import Popen, PIPE, STDOUT
+
+ # es_server = Popen(
+ #     ["elasticsearch-7.9.2/bin/elasticsearch"], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1)  # as daemon
+ # )
+ # # wait until ES has started
+ # ! sleep 30
+
+ logging.basicConfig(format="%(levelname)s - %(name)s - %(message)s", level=logging.WARNING)
+ logging.getLogger("haystack").setLevel(logging.INFO)
+
+ class Connection:
+     def __init__(self, host="localhost", username="", password="", index="document"):
+         """
+         host: Elasticsearch host. If no host is provided, the default host "localhost" is used.
+         port: Elasticsearch port. If no port is provided, the default port 9200 is used.
+         username: Elasticsearch username. If no username is provided, no username is used.
+         password: Elasticsearch password. If no password is provided, no password is used.
+         index: Elasticsearch index. If no index is provided, the default index "document" is used.
+         """
+         self.host = host
+         self.username = username
+         self.password = password
+         self.index = index
+
+     def get_connection(self):
+         document_store = ElasticsearchDocumentStore(host=self.host, username=self.username, password=self.password, index=self.index)
+         return document_store
+
+ class QAHaystack:
+     def __init__(self, filename):
+         self.filename = filename
+
+     def preprocessing(self, data):
+         """
+         Preprocess the data: a simple function that lower-cases the text and removes extra whitespace.
+         """
+         # NOTE: the converted documents below are built but not used further yet.
+         converter = TextConverter(remove_numeric_tables=True, valid_languages=["en"])
+         doc_txt = converter.convert(file_path=ExtractedText(self.filename, 'data.txt').save(4, 6), meta=None)[0]
+
+         converter = PDFToTextConverter(remove_numeric_tables=True, valid_languages=["en"])
+         doc_pdf = converter.convert(file_path="data/tutorial8/manibook.pdf", meta=None)[0]
+
+         preprocess_text = data.lower()  # lowercase
+         preprocess_text = re.sub(r'\s+', ' ', preprocess_text)  # remove extra spaces
+         return preprocess_text
+
+     def convert_to_document(self, data):
+         """
+         Write the data to a text file. This is required since the haystack library expects the data to be in a text file, which is then converted to Documents.
+         """
+         data = self.preprocessing(data)
+         with open(self.filename, 'w') as f:
+             f.write(data)
+
+         # Read the data back from the text file.
+         with open(self.filename, 'r') as f:
+             data = f.read()
+         data = data.split("\n")
+
+         """
+         DocumentStores expect Documents in dictionary form, like that below. They are loaded using DocumentStore.write_documents()
+
+         dicts=[
+             {
+                 'content': DOCUMENT_TEXT_HERE,
+                 'meta':{'name': DOCUMENT_NAME,...}
+             },...
+         ]
+
+         (Optionally, you can also add more key-value pairs here that will be indexed as fields in Elasticsearch and can be accessed later for filtering or shown in the responses of the Pipeline.)
+         """
+         data_json = [{
+             'content': paragraph,
+             'meta': {
+                 'name': self.filename
+             }
+         } for paragraph in data
+         ]
+
+         document_store = Connection().get_connection()
+         document_store.write_documents(data_json)
+         return document_store
+
+ class Pipeline:
+     def __init__(self, filename, retriever=BM25Retriever, reader=FARMReader):
+         self.reader = reader
+         self.retriever = retriever
+         self.filename = filename
+
+     def get_prediction(self, data, query):
+         """
+         Retrievers help narrow down the scope for the Reader to smaller units of text where a given question could be answered. They use a simple but fast algorithm.
+         Here we use Elasticsearch's default BM25 algorithm; other retrievers are worth checking out as well.
+         """
+         retriever = self.retriever(document_store=QAHaystack(self.filename).convert_to_document(data))
+
+         """
+         Readers scan the texts returned by retrievers in detail and extract the k best answers. They are based on powerful but slower deep learning models. Haystack currently supports Readers based on the FARM and Transformers frameworks.
+         """
+         reader = self.reader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
+
+         """
+         With a Haystack Pipeline you can stick together your building blocks into a search pipeline. Under the hood, Pipelines are Directed Acyclic Graphs (DAGs) that you can easily customize for your own use cases. To speed things up, Haystack also comes with a few predefined Pipelines. One of them is the ExtractiveQAPipeline, which combines a retriever and a reader to answer our questions.
+         """
+         pipe = ExtractiveQAPipeline(reader, retriever)
+
+         # Run the pipeline to get a prediction for the query.
+         prediction = pipe.run(query=query, params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 5}})
+         return prediction
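+
+ # A hypothetical usage sketch (assumes an Elasticsearch instance is reachable
+ # and `data` holds the raw document text):
+ #   prediction = Pipeline("data.txt").get_prediction(data, query="Who is the author?")
+ #   print_answers(prediction, details="minimum")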
src/Pipeline/QuestGen.py ADDED
@@ -0,0 +1,94 @@
+ """Download important files for the pipeline. Uncomment the following lines if you are running this script for the first time."""
+ # !wget https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz
+ # !tar -xvf s2v_reddit_2015_md.tar.gz
+
+ # If the tar file is already downloaded, don't download it again.
+ import os
+ import urllib.request
+ import tarfile
+
+ os.makedirs("models", exist_ok=True)  # make sure the target directory exists
+ if not os.path.exists("models/s2v_reddit_2015_md.tar.gz"):
+     print("Downloading Sense2Vec model")
+     urllib.request.urlretrieve(r"https://github.com/explosion/sense2vec/releases/download/v1.0.0/s2v_reddit_2015_md.tar.gz", filename=r"models/s2v_reddit_2015_md.tar.gz")
+ else:
+     print("Sense2Vec model already downloaded")
+
+ reddit_s2v = "models/s2v_reddit_2015_md.tar.gz"
+ extract_s2v = "models"
+ extract_s2v_folder = reddit_s2v.replace(".tar.gz", "")
+ if not os.path.isdir(extract_s2v_folder):
+     with tarfile.open(reddit_s2v, 'r:gz') as tar:
+         tar.extractall("models/")  # the archive unpacks to models/s2v_old
+ else:
+     print("Already extracted")
+
+ """Import required libraries"""
+
+ import warnings
+ warnings.filterwarnings('ignore')
+
+ from transformers import T5ForConditionalGeneration, T5Tokenizer
+
+ import streamlit as st
+ from sense2vec import Sense2Vec
+
+ @st.cache(allow_output_mutation=True)
+ def cache_models(paths2v, pathT5cond, pathT5):
+     s2v = Sense2Vec().from_disk(paths2v)
+     question_model = T5ForConditionalGeneration.from_pretrained(pathT5cond)
+     question_tokenizer = T5Tokenizer.from_pretrained(pathT5)
+     return (s2v, question_model, question_tokenizer)
+
+ s2v, question_model, question_tokenizer = cache_models("models/s2v_old", 'ramsrigouthamg/t5_squad_v1', 't5-base')
+
+ """Filter out words with the same sense using the sense2vec sense tags"""
+
+ def filter_same_sense_words(original, wordlist):
+     filtered_words = []
+     base_sense = original.split('|')[1]
+     for eachword in wordlist:
+         if eachword[0].split('|')[1] == base_sense:
+             filtered_words.append(eachword[0].split('|')[0].replace("_", " ").title().strip())
+     return filtered_words
+
+ def sense2vec_get_words(topn, input_keyword):
+     word = input_keyword
+     required_keywords = []
+     output = []
+     try:
+         sense = s2v.get_best_sense(word)
+         most_similar = s2v.most_similar(sense, n=topn)
+         for i in range(len(most_similar)):
+             required_keywords.append(most_similar[i])
+         output = filter_same_sense_words(sense, required_keywords)
+         print(f"Similar: {output}")
+     except Exception:
+         output = []
+
+     return output
+
+ """T5 question generation (reuses the cached model and tokenizer loaded above)"""
+
+ def get_question(sentence, answer):
+     text = f"context: {sentence} answer: {answer} </s>"
+     max_len = 256
+     encoding = question_tokenizer.encode_plus(text, max_length=max_len, pad_to_max_length=True, return_tensors="pt")
+
+     input_ids, attention_mask = encoding["input_ids"], encoding["attention_mask"]
+
+     outs = question_model.generate(input_ids=input_ids,
+                                    attention_mask=attention_mask,
+                                    early_stopping=True,
+                                    num_beams=5,
+                                    num_return_sequences=1,
+                                    no_repeat_ngram_size=2,
+                                    max_length=200)
+
+     dec = [question_tokenizer.decode(ids) for ids in outs]
+
+     Question = dec[0].replace("question:", "")
+     Question = Question.strip()
+     return Question
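+
+ # Example (hypothetical input): generate a question whose answer is "Paris"
+ # print(get_question("Paris is the capital of France.", "Paris"))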
src/Pipeline/Reader.py ADDED
@@ -0,0 +1,58 @@
+ import PyPDF2
+ import fitz  # provided by the PyMuPDF package
+
+ class PdfReader:
+     def __init__(self, filename):
+         self.filename = filename
+
+     def total_pages(self):
+         with open(self.filename, 'rb') as f:
+             pdf_reader = PyPDF2.PdfFileReader(f)
+             return pdf_reader.numPages
+
+     def read(self):
+         with open(self.filename, 'rb') as f:
+             pdf_reader = PyPDF2.PdfFileReader(f)
+             num_pages = pdf_reader.numPages
+             count = 0
+             text = ''
+             while count < num_pages:
+                 text += pdf_reader.getPage(count).extractText()
+                 count += 1
+             return text
+
+     def read_pages(self, start_page, end_page):
+         with open(self.filename, 'rb') as f:
+             pdf_reader = PyPDF2.PdfFileReader(f)
+             text = ''
+             for page in range(start_page, end_page):
+                 text += pdf_reader.getPage(page).extractText()
+             return text
+
+     def extract_images(self):
+         doc = fitz.open(self.filename)
+         for page_index in range(len(doc)):
+             for img in doc.get_page_images(page_index):
+                 xref = img[0]
+                 pix = fitz.Pixmap(doc, xref)
+                 if pix.n < 5:  # GRAY or RGB
+                     pix.save(f"{xref}.png")
+                 else:  # convert CMYK to RGB first
+                     pix1 = fitz.Pixmap(fitz.csRGB, pix)
+                     pix1.save(f"{xref}.png")
+                     pix1 = None
+                 pix = None
+
+ class ExtractedText(PdfReader):
+     def __init__(self, filename, output_filename):
+         super().__init__(filename)
+         self.output_filename = output_filename
+
+     def save(self, start_page, end_page):
+         with open(self.filename, 'rb') as f:
+             pdf_reader = PyPDF2.PdfFileReader(f)
+             text = ''
+             for page in range(start_page, end_page):
+                 text += pdf_reader.getPage(page).extractText()
+         with open(self.output_filename, 'w', encoding='utf-8') as f:
+             f.write(text)
+         return self.output_filename  # return the path so callers (e.g. QAhaystack) can chain it
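+
+ # Hypothetical usage: dump the text of pages 4-6 of a PDF into data.txt
+ # path = ExtractedText("input.pdf", "data.txt").save(4, 6)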
src/Pipeline/TextSummarization.py ADDED
@@ -0,0 +1,50 @@
+ import torch
+ from transformers import T5Tokenizer, T5ForConditionalGeneration
+ import random
+ import numpy as np
+ from nltk.tokenize import sent_tokenize  # requires the nltk 'punkt' data; run nltk.download('punkt') once
+
+ class T5_Base:
+     def __init__(self, path, device, model_max_length):
+         self.device = torch.device(device)
+         self.model = T5ForConditionalGeneration.from_pretrained(path).to(self.device)
+         self.tokenizer = T5Tokenizer.from_pretrained(path, model_max_length=model_max_length)
+
+     def set_seed(self, seed):
+         random.seed(seed)
+         np.random.seed(seed)
+         torch.manual_seed(seed)
+         torch.cuda.manual_seed_all(seed)
+
+     def preprocess(self, data):
+         preprocess_text = data.strip().replace('\n', ' ')
+         return preprocess_text
+
+     def post_process(self, data):
+         # Capitalize each sentence and stitch the summary back together.
+         final = ""
+         for sent in sent_tokenize(data):
+             sent = sent.capitalize()
+             final += sent + " "
+         return final
+
+     def getSummary(self, data):
+         data = self.preprocess(data)
+         t5_prepared_Data = "summarize: " + data
+         tokenized_text = self.tokenizer.encode_plus(t5_prepared_Data, max_length=512, pad_to_max_length=False, truncation=True, return_tensors='pt').to(self.device)
+         input_ids, attention_mask = tokenized_text['input_ids'], tokenized_text['attention_mask']
+         summary_ids = self.model.generate(input_ids=input_ids,
+                                           attention_mask=attention_mask,
+                                           early_stopping=True,
+                                           num_beams=3,
+                                           num_return_sequences=1,
+                                           no_repeat_ngram_size=2,
+                                           min_length=75,
+                                           max_length=300)
+
+         output = [self.tokenizer.decode(ids, skip_special_tokens=True) for ids in summary_ids]
+         summary = output[0]
+         summary = self.post_process(summary)
+         summary = summary.strip()
+         return summary
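+ # Hypothetical usage: summarize a long paragraph on CPU with the t5-base checkpoint
+ # base = T5_Base("t5-base", "cpu", 2048)
+ # print(base.getSummary(long_text))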
src/PreviousVersionCode/QuestionGenerator.py ADDED
@@ -0,0 +1,127 @@
+ # from TextSummarization import T5_Base  # unused legacy import from an earlier iteration
+
+ import spacy
+ import torch
+ from transformers import T5ForConditionalGeneration, T5Tokenizer, BertTokenizer, BertModel, AutoTokenizer
+ from sentence_transformers import SentenceTransformer
+ from sklearn.feature_extraction.text import CountVectorizer
+ from sklearn.metrics.pairwise import cosine_similarity
+
+ """
+ spacy.load() returns a language model object containing all components and data needed to process text. It is usually called nlp. Calling the nlp object on a string of text returns a processed Doc.
+ """
+ nlp = spacy.load("en_core_web_sm")  # spaCy's trained pipeline model
+
+ from warnings import filterwarnings as filt
+ filt('ignore')
+
+ class QuestionGenerator:
+     def __init__(self, path, device, model_max_length):
+         self.model = T5ForConditionalGeneration.from_pretrained(path)
+         self.tokenizer = AutoTokenizer.from_pretrained(path, model_max_length=model_max_length)
+         self.device = torch.device(device)
+
+     def preprocess(self, data):
+         preprocess_text = data.strip().replace('\n', '')
+         return preprocess_text
+
+     def gen_question(self, data, answer):
+         data = self.preprocess(data)
+         t5_prepared_data = f'context: {data} answer: {answer}'
+         encoding = self.tokenizer.encode_plus(t5_prepared_data, max_length=512, pad_to_max_length=True, truncation=True, return_tensors='pt').to(self.device)
+         input_ids, attention_mask = encoding['input_ids'], encoding['attention_mask']
+         output = self.model.generate(input_ids,
+                                      attention_mask=attention_mask,
+                                      num_beams=4,
+                                      num_return_sequences=1,
+                                      no_repeat_ngram_size=2,
+                                      min_length=30,
+                                      max_length=512,
+                                      early_stopping=True)
+
+         dec = [self.tokenizer.decode(ids, skip_special_tokens=True) for ids in output]
+         Question = dec[0].replace("question:", "").strip()
+         return Question
+
+ class KeywordGenerator:
+     def __init__(self, path, device):
+         self.bert_model = BertModel.from_pretrained(path)
+         self.bert_tokenizer = BertTokenizer.from_pretrained(path)
+         self.sentence_model = SentenceTransformer('distilbert-base-nli-mean-tokens')
+         self.device = torch.device(device)
+
+     def get_embedding(self, txt):
+         """
+         Token Embedding
+         txt = '[CLS] ' + doc + ' [SEP]', where CLS (used for classification tasks) is the token for the start of the sentence, SEP is the token for the end of the sentence, and doc is the document to be encoded.
+         Ex: Sentence A: Paris is a beautiful city.
+             Sentence B: I love Paris.
+         tokens = [[cls], paris, is, a, beautiful, city, [sep], i, love, paris]
+         Before feeding the tokens to BERT, we convert them into embeddings using an embedding layer called the token embedding layer.
+         """
+         tokens = self.bert_tokenizer.tokenize(txt)
+         token_idx = self.bert_tokenizer.convert_tokens_to_ids(tokens)
+
+         """
+         Segment Embedding
+         Segment embeddings are used to distinguish between the two given sentences. The segment embedding layer returns only either of two embeddings, EA (embedding of sentence A) or EB (embedding of sentence B): if the input token belongs to sentence A, then EA, else EB.
+         """
+         segment_ids = [1] * len(token_idx)  # segment ids for the document: a list of 1s of length len(token_idx)
+
+         torch_token = torch.tensor([token_idx])
+         torch_segment = torch.tensor([segment_ids])
+         return self.bert_model(torch_token, torch_segment)[-1].detach().numpy()
+
+     def get_posTags(self, context):
+         """Return the POS tags of the words in the context, using spaCy's POS tagger."""
+         doc = nlp(context)
+         doc_pos = [document.pos_ for document in doc]
+         return doc_pos, context.split()
+
+     def get_sentence(self, context):
+         """Return the sentences in the context, using spaCy's sentence tokenizer."""
+         doc = nlp(context)
+         return list(doc.sents)
+
+     def get_vector(self, doc):
+         """
+         Machines cannot understand characters and words, so when dealing with text data we need to represent it in numbers. CountVectorizer is a method to convert text to numerical data.
+         """
+         stop_words = "english"  # the list of stop words to remove from the text
+         n_gram_range = (1, 1)  # the n-gram range: (1,1)->unigrams, (1,2)->unigrams and bigrams, (2,2)->bigrams only, etc.
+         df = CountVectorizer(stop_words=stop_words, ngram_range=n_gram_range).fit([doc])
+         return df.get_feature_names()  # the list of words in the text
+
+     def get_key_words(self, context, module_type='t'):
+         """
+         module_type: 't' for BERT token embeddings; anything else uses the sentence-transformer embeddings
+         """
+         keywords = []
+         top_n = 5
+         for txt in self.get_sentence(context):
+             keyword = self.get_vector(str(txt))
+             print(f'vectors: {keyword}')
+             if module_type == 't':
+                 doc_embedding = self.get_embedding(str(txt))
+                 keyword_embedding = self.get_embedding(' '.join(keyword))
+             else:
+                 doc_embedding = self.sentence_model.encode([str(txt)])
+                 keyword_embedding = self.sentence_model.encode(keyword)
+
+             distances = cosine_similarity(doc_embedding, keyword_embedding)
+             print(distances)
+             keywords += [(keyword[index], str(txt)) for index in distances.argsort()[0][-top_n:]]
+
+         return keywords
+
+ txt = """Enter text"""
+ for ans, context in KeywordGenerator('bert-base-uncased', 'cpu').get_key_words(txt, 'st'):
+     print(QuestionGenerator('ramsrigouthamg/t5_squad_v1', 'cpu', 512).gen_question(context, ans))
+     print()
src/PreviousVersionCode/context.py ADDED
@@ -0,0 +1,379 @@
+ # -*- coding: utf-8 -*-
+ """context
+
+ Automatically generated by Colaboratory.
+
+ Original file is located at
+     https://colab.research.google.com/drive/1qLh1aASQj5HIENPZpHQltTuShZny_567
+ """
+
+ # !pip install -q transformers
+
+ # Import important libraries
+ # Commented out IPython magic to ensure Python compatibility.
+ import os
+ import json
+ import wandb
+ from pprint import pprint
+
+ import torch
+ from torch.utils.data import Dataset
+ from torch.utils.data import DataLoader
+ from transformers import AdamW
+ from tqdm.notebook import tqdm
+ from transformers import BertForQuestionAnswering, BertTokenizer, BertTokenizerFast
+
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ import pandas as pd
+ # %matplotlib inline
+
+ # Connect to wandb
+ wandb.login()
+
+ # Sweep configuration
+ PROJECT_NAME = "context"
+ ENTITY = None
+
+ sweep_config = {
+     'method': 'random'
+ }
+
+ # Set metric information --> we want to maximize the validation accuracy.
+ metric = {
+     'name': 'Validation accuracy',
+     'goal': 'maximize'
+ }
+ sweep_config['metric'] = metric
+
+ # Set all other hyperparameters
+ parameters_dict = {
+     'epochs': {
+         'values': [1]
+     },
+     'optimizer': {
+         'values': ['sgd', 'adam']
+     },
+     'momentum': {
+         'distribution': 'uniform',
+         'min': 0.5,
+         'max': 0.99
+     },
+     'batch_size': {
+         'distribution': 'q_log_uniform_values',
+         'q': 8,
+         'min': 16,
+         'max': 256
+     }
+ }
+ sweep_config['parameters'] = parameters_dict
+
+ # Print the configuration of the sweep
+ pprint(sweep_config)
+
+ # Initialize the sweep
+ sweep_id = wandb.sweep(sweep_config, project=PROJECT_NAME, entity=ENTITY)
+
+ # Mount Google Drive to save the model
+ from google.colab import drive
+ drive.mount('/content/drive')
+
+ if not os.path.exists('/content/drive/MyDrive/BERT-SQuAD'):
+     os.mkdir('/content/drive/MyDrive/BERT-SQuAD')
+
+ # Download SQuAD 2.0 data
+ # !wget -nc https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json
+ # !wget -nc https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json
+
+ """Load the training dataset and take a look at it"""
+ with open('train-v2.0.json', 'rb') as f:
+     squad = json.load(f)
+
+ # Each 'data' dict has two keys (title and paragraphs)
+ squad['data'][150]['paragraphs'][0]['context']
+
+ """Read a SQuAD-style file into parallel lists"""
+ def read_data(path):
+     with open(path, 'rb') as f:
+         squad = json.load(f)
+
+     contexts = []
+     questions = []
+     answers = []
+     for group in squad['data']:
+         for passage in group['paragraphs']:
+             context = passage['context']
+             for qna in passage['qas']:
+                 question = qna['question']
+                 for answer in qna['answers']:
+                     contexts.append(context)
+                     questions.append(question)
+                     answers.append(answer)
+     return contexts, questions, answers
+
+ # Put the contexts, questions and answers for training and validation into the appropriate lists.
+ """
+ The answers are dictionaries with the answer text and an integer which indicates the start index of the answer in the context.
+ """
+ train_contexts, train_questions, train_answers = read_data('train-v2.0.json')
+ valid_contexts, valid_questions, valid_answers = read_data('dev-v2.0.json')
+ # print(train_contexts[:10])
+
+ # Compute the end index of each answer in its context
+ def end_idx(answers, contexts):
+     for answer, context in zip(answers, contexts):
+         gold_text = answer['text']
+         start_idx = answer['answer_start']
+         end_idx = start_idx + len(gold_text)
+
+         # Sometimes SQuAD answers are off by a character or two, so we fix this
+         if context[start_idx:end_idx] == gold_text:
+             answer['answer_end'] = end_idx
+         elif context[start_idx - 1:end_idx - 1] == gold_text:
+             answer['answer_start'] = start_idx - 1
+             answer['answer_end'] = end_idx - 1  # when the gold label is off by one character
+         elif context[start_idx - 2:end_idx - 2] == gold_text:
+             answer['answer_start'] = start_idx - 2
+             answer['answer_end'] = end_idx - 2  # when the gold label is off by two characters
+
+ # add_token_positions below relies on 'answer_end', so compute it first.
+ end_idx(train_answers, train_contexts)
+ end_idx(valid_answers, valid_contexts)
+
+ """Tokenization"""
+ tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
+ train_encodings = tokenizer(train_contexts, train_questions, truncation=True, padding=True)
+ valid_encodings = tokenizer(valid_contexts, valid_questions, truncation=True, padding=True)
+
+ # print(train_encodings.keys()) ---> dict_keys(['input_ids', 'token_type_ids', 'attention_mask'])
+
+ # Map character-level answer positions to token positions
+ def add_token_positions(encodings, answers):
+     start_positions = []
+     end_positions = []
+     for i in range(len(answers)):
+         start_positions.append(encodings.char_to_token(i, answers[i]['answer_start']))
+         end_positions.append(encodings.char_to_token(i, answers[i]['answer_end']))
+
+         # If start position is None, the answer passage has been truncated
+         if start_positions[-1] is None:
+             start_positions[-1] = tokenizer.model_max_length
+         if end_positions[-1] is None:
+             end_positions[-1] = tokenizer.model_max_length
+
+     encodings.update({'start_positions': start_positions, 'end_positions': end_positions})
+
+ add_token_positions(train_encodings, train_answers)
+ add_token_positions(valid_encodings, valid_answers)
+
+ """Dataloader for the training dataset"""
+ class DatasetRetriever(Dataset):
+     def __init__(self, encodings):
+         self.encodings = encodings
+
+     def __getitem__(self, idx):
+         return {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
+
+     def __len__(self):
+         return len(self.encodings.input_ids)
+
+ # Wrap the train and validation encodings
+ train_dataset = DatasetRetriever(train_encodings)
+ valid_dataset = DatasetRetriever(valid_encodings)
+ train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
+ valid_loader = DataLoader(valid_dataset, batch_size=16)
+ model = BertForQuestionAnswering.from_pretrained("bert-base-uncased")
+ device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
+
+ # Training and testing loop
+ def pipeline():
+     optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
+
+     with wandb.init(config=None):
+         config = wandb.config  # epochs come from the sweep config (config.epochs)
+         model.to(device)
+
+         # Train the model
+         model.train()
+         for epoch in range(config.epochs):
+             loop = tqdm(train_loader, leave=True)
+             for batch in loop:
+                 optimizer.zero_grad()
+                 input_ids = batch['input_ids'].to(device)
+                 attention_mask = batch['attention_mask'].to(device)
+                 start_positions = batch['start_positions'].to(device)
+                 end_positions = batch['end_positions'].to(device)
+                 outputs = model(input_ids, attention_mask=attention_mask, start_positions=start_positions, end_positions=end_positions)
+                 loss = outputs[0]
+                 loss.backward()
+                 optimizer.step()
+
+                 loop.set_description(f'Epoch {epoch+1}')
+                 loop.set_postfix(loss=loss.item())
+                 wandb.log({'Training loss': loss})
+
+         # Set the model to evaluation phase
+         model.eval()
+         acc = []
+         for batch in tqdm(valid_loader):
+             with torch.no_grad():
+                 input_ids = batch['input_ids'].to(device)
+                 attention_mask = batch['attention_mask'].to(device)
+                 start_true = batch['start_positions'].to(device)
+                 end_true = batch['end_positions'].to(device)
+
+                 outputs = model(input_ids, attention_mask=attention_mask)
+
+                 start_pred = torch.argmax(outputs['start_logits'], dim=1)
+                 end_pred = torch.argmax(outputs['end_logits'], dim=1)
+
+                 acc.append(((start_pred == start_true).sum() / len(start_pred)).item())
+                 acc.append(((end_pred == end_true).sum() / len(end_pred)).item())
+
+         acc = sum(acc) / len(acc)
+
+         print("\n\nT/P\tanswer_start\tanswer_end\n")
+         for i in range(len(start_true)):
+             print(f"true\t{start_true[i]}\t{end_true[i]}\n"
+                   f"pred\t{start_pred[i]}\t{end_pred[i]}\n")
+         wandb.log({'Validation accuracy': acc})
+
+ # Run the pipeline
+ wandb.agent(sweep_id, pipeline, count=4)
+
+ """Save the model so we don't have to train it again"""
+ model_path = '/content/drive/MyDrive/BERT-SQuAD'
+ model.save_pretrained(model_path)
+ tokenizer.save_pretrained(model_path)
+
+ """Load the model"""
+ model_path = '/content/drive/MyDrive/BERT-SQuAD'
+ model = BertForQuestionAnswering.from_pretrained(model_path)
+ tokenizer = BertTokenizerFast.from_pretrained(model_path)
+ device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
+ model = model.to(device)
+
+ # Get predictions
+ def get_prediction(context, question):
+     inputs = tokenizer.encode_plus(question, context, return_tensors='pt').to(device)
+     outputs = model(**inputs)
+     answer_start = torch.argmax(outputs[0])    # start position of the answer
+     answer_end = torch.argmax(outputs[1]) + 1  # end position of the answer
+     answer = tokenizer.convert_tokens_to_string(  # convert the tokens back to a string
+         tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][answer_start:answer_end]))
+     return answer
+
+ """
+ Question testing
+
+ Official SQuAD evaluation script -->
+ https://colab.research.google.com/github/fastforwardlabs/ff14_blog/blob/master/_notebooks/2020-06-09-Evaluating_BERT_on_SQuAD.ipynb#scrollTo=MzPlHgWEBQ8D
+ """
+
+ def normalize_text(s):
+     """Removing articles and punctuation, and standardizing whitespace are all typical text processing steps."""
+     import string, re
+     def remove_articles(text):
+         regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
+         return re.sub(regex, " ", text)
+     def white_space_fix(text):
+         return " ".join(text.split())
+     def remove_punc(text):
+         exclude = set(string.punctuation)
+         return "".join(ch for ch in text if ch not in exclude)
+     def lower(text):
+         return text.lower()
+
+     return white_space_fix(remove_articles(remove_punc(lower(s))))
+
+ def exact_match(prediction, truth):
+     return bool(normalize_text(prediction) == normalize_text(truth))
+
+ def compute_f1(prediction, truth):
+     pred_tokens = normalize_text(prediction).split()
+     truth_tokens = normalize_text(truth).split()
+
+     # If either the prediction or the truth is no-answer, then f1 = 1 if they agree, 0 otherwise
+     if len(pred_tokens) == 0 or len(truth_tokens) == 0:
+         return int(pred_tokens == truth_tokens)
+
+     common_tokens = set(pred_tokens) & set(truth_tokens)
+
+     # If there are no common tokens then f1 = 0
+     if len(common_tokens) == 0:
+         return 0
+
+     prec = len(common_tokens) / len(pred_tokens)
+     rec = len(common_tokens) / len(truth_tokens)
+
+     return round(2 * (prec * rec) / (prec + rec), 2)
+
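+ # Worked example (hypothetical strings): prediction "the cost is high" vs truth
+ # "cost high" normalizes to ["cost", "is", "high"] and ["cost", "high"];
+ # 2 common tokens give precision 2/3 and recall 2/2, so F1 = 2*(2/3*1)/(2/3+1) = 0.8.
+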
+ def question_answer(context, question, answer):
+     prediction = get_prediction(context, question)
+     em_score = exact_match(prediction, answer)
+     f1_score = compute_f1(prediction, answer)
+
+     print(f'Question: {question}')
+     print(f'Prediction: {prediction}')
+     print(f'True Answer: {answer}')
+     print(f'Exact match: {em_score}')
+     print(f'F1 score: {f1_score}\n')
+
+ context = """Space exploration is a very exciting field of research. It is the
+ frontier of Physics and no doubt will change the understanding of science.
+ However, it does come at a cost. A normal space shuttle costs about 1.5 billion dollars to make.
+ The annual budget of NASA, which is a premier space exploring organization is about 17 billion.
+ So the question that some people ask is that whether it is worth it."""
+
+ questions = ["What will change the understanding of science?",
+              "What is the main idea in the paragraph?"]
+
+ answers = ["Space Exploration",
+            "The cost of space exploration is too high"]
+
+ """
+ VISUALISATION IN PROGRESS
+
+ for question, answer in zip(questions, answers):
+     question_answer(context, question, answer)
+
+ # Visualize the start scores
+ plt.rcParams["figure.figsize"] = (20, 10)
+ ax = sns.barplot(x=token_labels, y=start_scores)
+ ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha="center")
+ ax.grid(True)
+ plt.title("Start word scores")
+ plt.show()
+
+ # Visualize the end scores
+ plt.rcParams["figure.figsize"] = (20, 10)
+ ax = sns.barplot(x=token_labels, y=end_scores)
+ ax.set_xticklabels(ax.get_xticklabels(), rotation=90, ha="center")
+ ax.grid(True)
+ plt.title("End word scores")
+ plt.show()
+
+ # Visualize both the scores
+ scores = []
+ for (i, token_label) in enumerate(token_labels):
+     # Add the token's start score as one row.
+     scores.append({'token_label': token_label,
+                    'score': start_scores[i],
+                    'marker': 'start'})
+
+     # Add the token's end score as another row.
+     scores.append({'token_label': token_label,
+                    'score': end_scores[i],
+                    'marker': 'end'})
+
+ df = pd.DataFrame(scores)
+ group_plot = sns.catplot(x="token_label", y="score", hue="marker", data=df,
+                          kind="bar", height=6, aspect=4)
+
+ group_plot.set_xticklabels(ax.get_xticklabels(), rotation=90, ha="center")
+ group_plot.ax.grid(True)
+ """