# Indexing and pipeline creation
This notebook is inspired by the ["Build Your First QA System" tutorial](https://haystack.deepset.ai/tutorials/first-qa-system) from the Haystack documentation.

Here we use a collection of articles about Twin Peaks to answer a variety of questions about that awesome TV series!

The following steps are performed:
* load and preprocess data
* create document store and write documents
* initialize retriever and generate document embeddings
* initialize reader
* compose and try Question Answering pipeline
* save and export index

## Preliminary operations

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# install dependencies
! pip install farm-haystack[faiss-gpu]==1.4.0

## Load and preprocess data

In [12]:
import glob
import json

In [4]:
DATA_DIRECTORY = '/content/drive/MyDrive/Colab Notebooks/wklp/data'

docs = []

# build one Haystack document dict per JSON file
for json_file in glob.glob(f'{DATA_DIRECTORY}/*.json'):
    with open(json_file, 'r') as fin:
        json_content = json.load(fin)

    doc = {'content': json_content['text'],
           'meta': {'name': json_content['name'],
                    'url': json_content['url']}}
    docs.append(doc)

In [5]:
len(docs)

1087

In [6]:
docs[5]

{'content': "Pete Lindstrom\nPete Lindstrom was a citizen of Twin Peaks, Washington who was killed in the Blizzard of 1889.\nHis death was witnessed by Knut Zimmerman, who reported that wind had plunged a candle from the Annual Candlelighting and Christmas Tree Ceremony into the back of Lindstrom's head, killing him.",
 'meta': {'name': 'Pete_Lindstrom',
  'url': 'https://twinpeaks.fandom.com/wiki/Pete_Lindstrom'}}
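
The loading loop above can also be wrapped in a small reusable function. The sketch below is only an illustration: `load_docs` is a hypothetical helper name, the sample article is made up, and the explicit UTF-8 encoding is an assumption about how the JSON files were written.

```python
import glob
import json
import os
import tempfile


def load_docs(data_directory):
    """Read every *.json file in data_directory into a Haystack-style dict."""
    docs = []
    # sorted() makes the document order reproducible across runs
    for json_file in sorted(glob.glob(os.path.join(data_directory, "*.json"))):
        # assumption: the files are UTF-8 encoded
        with open(json_file, "r", encoding="utf-8") as fin:
            json_content = json.load(fin)
        docs.append({
            "content": json_content["text"],
            "meta": {"name": json_content["name"], "url": json_content["url"]},
        })
    return docs


# Tiny demo with one made-up article (not taken from the real dataset)
with tempfile.TemporaryDirectory() as tmp_dir:
    sample = {
        "name": "Sample_Page",
        "url": "https://example.com/Sample_Page",
        "text": "Sample Page\nThis is a placeholder article.",
    }
    with open(os.path.join(tmp_dir, "sample.json"), "w", encoding="utf-8") as fout:
        json.dump(sample, fout)
    demo_docs = load_docs(tmp_dir)

print(len(demo_docs), demo_docs[0]["meta"]["name"])
```

In the notebook, `load_docs(DATA_DIRECTORY)` would be expected to return the same list as the loop above, apart from the ordering.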

In [8]:
# preprocess documents, splitting them into chunks of about 200 words

from haystack.nodes import PreProcessor

processor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=200,
    split_respect_sentence_boundary=True,
    split_overlap=0,
    language='en'
)
preprocessed_docs = processor.process(docs)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


100%|██████████| 1087/1087 [00:01<00:00, 717.14docs/s]


In [9]:
print(preprocessed_docs[5])


<Document: id=3f6b71a59e1226326e53871d05393810, content='Pete Lindstrom
Pete Lindstrom was a citizen of Twin Peaks, Washington who was killed in the Blizzard ...'>


In [10]:
len(preprocessed_docs)

2825
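
Splitting is why 1087 articles turn into 2825 shorter documents. To give an intuition for word-based splitting that respects sentence boundaries, here is a deliberately simplified sketch in plain Python. It is not the `PreProcessor` implementation: `split_by_words` is a hypothetical helper, and the regex sentence splitter is a naive stand-in for the NLTK tokenizer downloaded above.

```python
import re


def split_by_words(text, split_length=200):
    """Greedily pack whole sentences into chunks of at most split_length words.

    A single sentence longer than split_length is kept as its own chunk.
    """
    # Naive sentence splitter: break after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if adding this sentence would exceed the limit
        if current and current_len + n_words > split_length:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "One two three. Four five six seven. Eight nine."
small_chunks = split_by_words(text, split_length=5)
larger_chunks = split_by_words(text, split_length=7)
print(small_chunks)
print(larger_chunks)
```

With a limit of 5 words every sentence ends up in its own chunk, while a limit of 7 lets the first two sentences share one.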

## Create document store ([FAISS](https://github.com/facebookresearch/faiss)) and write documents



In [14]:
from haystack.document_stores import FAISSDocumentStore

# these settings must match the embedding model used by the Embedding Retriever:
# dot-product similarity and 768-dimensional vectors
document_store = FAISSDocumentStore(
    similarity="dot_product",
    embedding_dim=768)

In [15]:
# write documents
document_store.write_documents(preprocessed_docs)


Writing Documents:   0%|          | 0/2825 [00:00<?, ?it/s]

## Initialize retriever (Embedding Retriever) and generate document embeddings


In [16]:
from haystack.nodes import EmbeddingRetriever

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/multi-qa-mpnet-base-dot-v1",
    model_format="sentence_transformers"
)

# generate embeddings
document_store.update_embeddings(retriever)

INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.nodes.retriever.dense -  Init retriever using embeddings of model sentence-transformers/multi-qa-mpnet-base-dot-v1



INFO - haystack.document_stores.faiss -  Updating embeddings for 2811 docs...


Updating Embedding:   0%|          | 0/2811 [00:00<?, ? docs/s]

Batches:   0%|          | 0/88 [00:00<?, ?it/s]
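
Once every document has an embedding, retrieval with `similarity="dot_product"` boils down to scoring each document vector against the query vector and keeping the best ones. The toy sketch below shows the idea with 3-dimensional vectors in plain Python. The vectors and the `retrieve` helper are made up for illustration; the real store uses 768-dimensional embeddings and a FAISS index rather than a Python loop.

```python
def dot(u, v):
    """Dot product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_embedding, doc_embeddings, top_k=2):
    """Return the top_k (doc_id, score) pairs ranked by dot product."""
    scored = [(doc_id, dot(query_embedding, emb))
              for doc_id, emb in doc_embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up embeddings, only to show the ranking
doc_embeddings = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.5, 0.5, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query_embedding = [2.0, 1.0, 0.0]

top_docs = retrieve(query_embedding, doc_embeddings, top_k=2)
print(top_docs)
```

The `top_k` argument here plays the same role as the retriever's `top_k` parameter used later in the pipeline.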

## Initialize reader

In [17]:
from haystack.nodes import FARMReader


In [18]:
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2",
                    use_gpu=True)

INFO - haystack.modeling.utils -  Using devices: CUDA:0
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.model.language_model -  LOADING MODEL
INFO - haystack.modeling.model.language_model -  Could not find deepset/roberta-base-squad2 locally.
INFO - haystack.modeling.model.language_model -  Looking on Transformers Model Hub (in local cache and online)...


INFO - haystack.modeling.model.language_model -  Loaded deepset/roberta-base-squad2



INFO - haystack.modeling.utils -  Using devices: CUDA
INFO - haystack.modeling.utils -  Number of GPUs: 1
INFO - haystack.modeling.infer -  Got ya 2 parallel workers to do inference ...
INFO - haystack.modeling.infer -   0     0  
INFO - haystack.modeling.infer -  /w\   /w\ 
INFO - haystack.modeling.infer -  /'\   / \ 
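
An extractive reader such as this one does not generate text: it picks a span of the retrieved passage. Models fine-tuned on SQuAD-style data typically produce a start score and an end score for every token, and the answer is the span with the best combined score. The sketch below illustrates that selection step only. The tokens and scores are invented, `best_span` is a hypothetical helper, and the actual `FARMReader` logic is more involved (for instance, it also handles unanswerable questions).

```python
def best_span(start_scores, end_scores, max_answer_len=10):
    """Return (score, start, end) of the span maximizing start + end score."""
    best = (float("-inf"), 0, 0)
    for start, start_score in enumerate(start_scores):
        # Only consider ends at or after the start, within max_answer_len tokens
        last = min(start + max_answer_len, len(end_scores))
        for end in range(start, last):
            score = start_score + end_scores[end]
            if score > best[0]:
                best = (score, start, end)
    return best


# Invented per-token scores for a toy passage
tokens = ["Twin", "Peaks", "is", "in", "Washington", "State"]
start_scores = [0.1, 0.0, 0.0, 0.2, 2.5, 0.3]
end_scores = [0.0, 0.2, 0.0, 0.0, 1.0, 2.0]

score, start, end = best_span(start_scores, end_scores)
answer = " ".join(tokens[start:end + 1])
print(answer, score)
```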


## Compose and try Question Answering pipeline (retriever + reader)

In [19]:
from haystack.pipelines import ExtractiveQAPipeline


In [20]:
pipe = ExtractiveQAPipeline(reader, retriever)


In [21]:
import time
from haystack.utils import print_answers

In [22]:
start_time = time.time()

prediction = pipe.run(
    query="Where is Twin Peaks",
    params={"Retriever": {"top_k": 10},
            "Reader": {"top_k": 5}}
)

end_time = time.time()

print()
print(end_time - start_time)
print_answers(prediction, details="medium")


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00,  6.74 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 28.05 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 16.98 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 25.82 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 24.78 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 10.44 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 19.66 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 38.78 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 47.32 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 31.77 Batches/s]


1.4476244449615479

Query: Where is Twin Peaks
Answers:
[   {   'answer': 'Washington',
        'context': 'Highway J\n'
                   'Highway J was a highway that ran through Twin Peaks, '
                   'Washington. Notable buildings\n'
                   "Gentleman Jim's\n"
                   "Horne's Department Store\n"
                   'Pine View Motel ',
        'score': 0.942712664604187},
    {   'answer': 'Washington',
        'context': 'Chapel-in-the-Woods\n'
                   'Chapel-in-the-Woods was a chapel in Twin Peaks, '
                   'Washington. Hank Jennings and Norma Jennings as well as Ed '
                   'Hurley and Nadine Hurle',
        'score': 0.7930099964141846},
    {   'answer': 'northeastern Washington State',
        'context': 'eriff Harry S. Truman\n'
                   'Twin Peaks was a small logging town in northeastern '
                   'Washington State, five miles south of the Canadian border '
                   'and twe
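
The same answer string ("Washington") shows up more than once because it was extracted from different passages. If only distinct answers are of interest, they can be merged while keeping the best score. The sketch below works on plain dicts shaped like the printed output; `merge_duplicate_answers` is a hypothetical helper, the objects returned by the pipeline may need converting to dicts first, and the third score is a placeholder because it is cut off in the output above.

```python
def merge_duplicate_answers(answers):
    """Keep the highest-scoring entry for each distinct answer string."""
    best = {}
    for item in answers:
        # Compare answers case-insensitively and ignore surrounding spaces
        key = item["answer"].strip().lower()
        if key not in best or item["score"] > best[key]["score"]:
            best[key] = item
    return sorted(best.values(), key=lambda a: a["score"], reverse=True)


answers = [
    {"answer": "Washington", "score": 0.942712664604187},
    {"answer": "Washington", "score": 0.7930099964141846},
    # placeholder score: the real value is truncated in the notebook output
    {"answer": "northeastern Washington State", "score": 0.5},
]

merged = merge_duplicate_answers(answers)
for item in merged:
    print(f"{item['score']:.3f}  {item['answer']}")
```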




In [23]:
start_time = time.time()

prediction = pipe.run(
    query="Who is Mike",
    params={"Retriever": {"top_k": 10},
            "Reader": {"top_k": 5}}
)

end_time = time.time()

print()
print(end_time - start_time)
print_answers(prediction, details="medium")


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

  start_indices = flat_sorted_indices // max_seq_len
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 15.02 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 11.34 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 28.71 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 14.61 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 22.03 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 29.09 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 27.55 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 27.18 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 28.01 Batches/s]
Inferencing Samples: 100%|██████████| 1/1 [00:00<00:00, 34.82 Batches/s]


1.180633544921875

Query: Who is Mike
Answers:
[   {   'answer': 'inhabiting spirit',
        'context': 's. Cooper refused to give him his medicine and he changed '
                   'into the inhabiting spirit, Mike. He explained Gerard as '
                   'being his host and described BOB as',
        'score': 0.6887995302677155},
    {   'answer': 'his name is Mike and that he lived above a convenience '
                  'store with a man named BOB',
        'context': 'walk with me," and tells them that his name is Mike and '
                   'that he lived above a convenience store with a man named '
                   'BOB. He says that he was in the eleva',
        'score': 0.3988475129008293},
    {   'answer': 'one-armed man',
        'context': 'duos in the series named "Mike" and "Bob," the other being '
                   'Mike (the one-armed man) and BOB. Co-author Mark Frost '
                   'stated that Mike and Bobby remained ',
        'score': 0.35072641
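
Both query cells repeat the same `start_time` / `end_time` bookkeeping. A small context manager can take care of it. This is just a convenience sketch: `timed` is a hypothetical helper, and `time.sleep` stands in for the `pipe.run(...)` call so that the example runs on its own.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with timed("demo query", timings):
    time.sleep(0.01)  # stand-in for pipe.run(query=..., params=...)

print(f"{timings['demo query']:.3f} s")
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic clock meant for measuring short intervals.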




## Save and export index


In [24]:
import shutil
import glob

In [25]:
document_store.save("my_faiss_index.faiss")

In [26]:
OUT_DIR = '/content/drive/MyDrive/Colab Notebooks/wklp/'

In [27]:
# '*faiss*.*' already matches names starting with 'faiss', so a single pattern
# is enough and avoids copying faiss_document_store.db twice
for f in glob.glob('*faiss*.*'):
    print(f)
    shutil.copy(f, OUT_DIR)

my_faiss_index.faiss
my_faiss_index.json
faiss_document_store.db
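
Before reusing the exported index elsewhere, it can help to check that all the files listed above made it to the target directory. The sketch below does that with the standard library only. `missing_index_files` is a hypothetical helper, and the file names are taken from the output of the copy step.

```python
import os
import tempfile


def missing_index_files(directory, index_name="my_faiss_index"):
    """Return the expected index files that are absent from directory."""
    expected = [
        f"{index_name}.faiss",      # FAISS index
        f"{index_name}.json",       # config saved next to the index
        "faiss_document_store.db",  # SQLite database with texts and metadata
    ]
    return [name for name in expected
            if not os.path.isfile(os.path.join(directory, name))]


# Demo in a temporary directory: create two of the three files on purpose
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ("my_faiss_index.faiss", "my_faiss_index.json"):
        open(os.path.join(tmp_dir, name), "w").close()
    missing = missing_index_files(tmp_dir)

print(missing)
```

When nothing is missing, the store can usually be restored with `FAISSDocumentStore.load(index_path="my_faiss_index.faiss")`. This assumes the `.db` file sits in the working directory, since the default SQLite URL is a relative path.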
