---
title: Who killed Laura Palmer?
emoji: πŸ—»πŸ—»
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.2.0
app_file: app.py
pinned: false
license: apache-2.0
---

# Who killed Laura Palmer?   [![Generic badge](https://img.shields.io/badge/πŸ€—-Open%20in%20Spaces-blue.svg)](https://huggingface.co/spaces/anakin87/who-killed-laura-palmer) [![Generic badge](https://img.shields.io/github/stars/anakin87/who-killed-laura-palmer?label=Github&style=social)](https://github.com/anakin87/who-killed-laura-palmer)
[<img src="./data/readme_images/spaces_logo.png" align="center" style="display: block; margin-left: auto; margin-right: auto; max-width: 70%;">](https://huggingface.co/spaces/anakin87/who-killed-laura-palmer)



## πŸ—»πŸ—» Twin Peaks Question Answering system

WKLP is a simple Question Answering system, based on data crawled from [Twin Peaks Wiki](https://twinpeaks.fandom.com/wiki/Twin_Peaks_Wiki). It is built using [πŸ” Haystack](https://github.com/deepset-ai/haystack), an awesome open-source framework for building search systems that work intelligently over large document collections.

  - [Project architecture 🧱](#project-architecture-)
  - [What can I learn from this project? πŸ“š](#what-can-i-learn-from-this-project-)
  - [Repository structure πŸ“](#repository-structure-)
  - [Installation πŸ’»](#installation-)
  - [Possible improvements ✨](#possible-improvements-)
---

## Project architecture 🧱

[![Project architecture](./data/readme_images/project_architecture.png)](#) 

* Crawler: implemented using [Scrapy](https://github.com/scrapy/scrapy) and [fandom-py](https://github.com/NikolajDanger/fandom-py)
* Question Answering pipelines: created with [Haystack](https://github.com/deepset-ai/haystack) (see the sketch after this list)
* Web app: developed with [Streamlit](https://github.com/streamlit/streamlit)
* Free hosting: [Hugging Face Spaces](https://huggingface.co/spaces)
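
For orientation, the Question Answering pipeline can be sketched in a few lines of Haystack 1.x. This is a hedged, minimal sketch rather than the project's actual code: the in-memory document store, the retriever model, the sample document and the `top_k` values are assumptions; the reader model is the one discussed under Possible improvements.

```python
# Minimal sketch of an extractive QA pipeline with Haystack 1.x.
# Document store, retriever model and sample data are assumptions, not the project's actual choices.
from haystack import Document
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import EmbeddingRetriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

document_store = InMemoryDocumentStore(embedding_dim=384)
document_store.write_documents(
    [Document(content="Laura Palmer is the homecoming queen of Twin Peaks High School.")]
)

retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # assumed retriever model
    model_format="sentence_transformers",
)
document_store.update_embeddings(retriever)

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)

pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)
prediction = pipeline.run(
    query="Who is Laura Palmer?",
    params={"Retriever": {"top_k": 10}, "Reader": {"top_k": 3}},
)
for answer in prediction["answers"]:
    print(answer.answer, answer.score)
```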

---

## What can I learn from this project? πŸ“š
- How to quickly ⌚ build a modern Question Answering system using [πŸ” Haystack](https://github.com/deepset-ai/haystack)
- How to generate questions based on your documents (sketched right after this list)
- How to build a nice [Streamlit](https://github.com/streamlit/streamlit) web app to show your QA system (sketched below, after the preview)
- How to optimize the web app to πŸš€ deploy in [πŸ€— Spaces](https://huggingface.co/spaces)
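
The question generation step can be sketched with Haystack's `QuestionGenerator` node; again a minimal, self-contained example, not the actual notebook code (the sample document is illustrative):

```python
# Sketch of automatic question generation with Haystack 1.x (illustrative).
from haystack import Document
from haystack.nodes import QuestionGenerator
from haystack.pipelines import QuestionGenerationPipeline

question_generator = QuestionGenerator()  # defaults to a T5-based question-generation model
pipeline = QuestionGenerationPipeline(question_generator)

doc = Document(content="Laura Palmer is a fictional character in the Twin Peaks franchise.")
result = pipeline.run(documents=[doc])
# "generated_questions" holds one entry per input document (per the Haystack question generation tutorial)
print(result["generated_questions"])
```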

[![Web app preview](./data/readme_images/webapp.png)](https://huggingface.co/spaces/anakin87/who-killed-laura-palmer)
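
The web app itself is essentially a text input wired to the QA pipeline. A stripped-down Streamlit sketch follows; the tiny in-memory index, the TF-IDF retriever and the layout are illustrative, not the actual `app.py`:

```python
# Stripped-down sketch of a Streamlit front end for the QA pipeline (illustrative, not the actual app.py).
import streamlit as st
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import FARMReader, TfidfRetriever
from haystack.pipelines import ExtractiveQAPipeline


@st.cache(allow_output_mutation=True)  # keep the heavy models in memory across Streamlit reruns
def load_pipeline() -> ExtractiveQAPipeline:
    # Tiny in-memory index so the sketch is self-contained; the real app uses the crawled wiki data.
    document_store = InMemoryDocumentStore()
    document_store.write_documents(
        [{"content": "Laura Palmer is the homecoming queen of Twin Peaks High School."}]
    )
    retriever = TfidfRetriever(document_store=document_store)
    reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)
    return ExtractiveQAPipeline(reader=reader, retriever=retriever)


st.title("Who killed Laura Palmer?")
question = st.text_input("Ask a question about Twin Peaks", value="Who is Laura Palmer?")

if st.button("Search"):
    prediction = load_pipeline().run(
        query=question, params={"Retriever": {"top_k": 3}, "Reader": {"top_k": 3}}
    )
    for answer in prediction["answers"]:
        st.markdown(f"**{answer.answer}** (score: {answer.score:.2f})")
        st.write(answer.context)
```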

## Repository structure πŸ“
- [app.py](./app.py): Streamlit web app
- [app_utils folder](./app_utils/): Python modules used in the web app
- [crawler folder](./crawler/): Twin Peaks crawler, developed with Scrapy and fandom-py
- [notebooks folder](./notebooks/): Jupyter/Colab notebooks to create the search pipeline and generate questions (using Haystack)
- [data folder](./data/): all necessary data
- [presentations folder](./presentations/): video presentation and slides (PyCon Italy 2022)

Within each folder, you can find more in-depth explanations.

## Installation πŸ’»
To install this project locally, follow these steps:
- `git clone https://github.com/anakin87/who-killed-laura-palmer`
- `cd who-killed-laura-palmer`
- `pip install -r requirements.txt`

To run the web app, simply type: `streamlit run app.py`

## Possible improvements ✨
### Project structure
- The project is optimized for deployment on Hugging Face Spaces and consists of an all-in-one Streamlit web app. In more structured production environments, I suggest splitting the software into three parts (see the sketch after this list):
  - Haystack backend API (as explained in [the official documentation](https://haystack.deepset.ai/components/rest-api))
  - Document store service
  - Streamlit web app
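
In such a split, the Streamlit app would call the Haystack backend over HTTP instead of loading the models itself. A hedged sketch, assuming the standard `/query` endpoint of the Haystack REST API and a locally running backend (the URL is an assumption):

```python
# Sketch of querying a separately deployed Haystack REST API from the web app.
# Endpoint path and payload follow the Haystack 1.x REST API docs; the URL is an assumption.
import requests

API_URL = "http://localhost:8000/query"  # assumed address of the Haystack backend

payload = {
    "query": "Who killed Laura Palmer?",
    "params": {"Retriever": {"top_k": 10}, "Reader": {"top_k": 3}},
}
response = requests.post(API_URL, json=payload, timeout=30)
response.raise_for_status()

for answer in response.json()["answers"]:
    print(answer["answer"], answer["score"])
```
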
### Reader
- The reader model (`deepset/roberta-base-squad2`) is a good compromise between speed and accuracy when running on CPU. There are certainly better (and more computationally expensive) models, as described in the [Haystack documentation](https://haystack.deepset.ai/pipeline_nodes/reader).
- You could also prepare a Twin Peaks QA dataset and fine-tune the reader model to improve accuracy, as explained in this [Haystack tutorial](https://haystack.deepset.ai/tutorials/fine-tuning-a-model) (see the sketch below).
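
For reference, fine-tuning boils down to a call to `FARMReader.train` on a SQuAD-format dataset; in this sketch the data directory and file name are hypothetical:

```python
# Sketch of fine-tuning the reader on a custom SQuAD-format dataset (paths and file names are hypothetical).
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
reader.train(
    data_dir="data/qa_dataset",           # folder containing the annotated dataset (hypothetical)
    train_filename="twin_peaks_qa.json",  # SQuAD-format file, e.g. exported from an annotation tool
    n_epochs=2,
    save_dir="models/roberta-twin-peaks",
)

# The fine-tuned model can then be loaded like any other reader
new_reader = FARMReader(model_name_or_path="models/roberta-twin-peaks")
```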