---
title: Who killed Laura Palmer?
emoji: 🗻🗻
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.2.0
app_file: app.py
pinned: false
license: apache-2.0
---

# Who killed Laura Palmer?   [![Generic badge](https://img.shields.io/badge/🤗-Open%20in%20Spaces-blue.svg)](https://huggingface.co/spaces/anakin87/who-killed-laura-palmer) [![Generic badge](https://img.shields.io/github/stars/anakin87/who-killed-laura-palmer?label=Github&style=social)](https://github.com/anakin87/who-killed-laura-palmer)
[<img src="./data/readme_images/spaces_logo.png" align="center" style="display: block; margin-left: auto; margin-right: auto; max-width: 70%;">](https://huggingface.co/spaces/anakin87/who-killed-laura-palmer)



## 🗻🗻 Twin Peaks Question Answering system

WKLP is a simple Question Answering system, based on data crawled from [Twin Peaks Wiki](https://twinpeaks.fandom.com/wiki/Twin_Peaks_Wiki). It is built using [🔍 Haystack](https://github.com/deepset-ai/haystack), an awesome open-source framework for building search systems that work intelligently over large document collections.

  - [Project architecture 🧱](#project-architecture-)
  - [What can I learn from this project? 📚](#what-can-i-learn-from-this-project-)
  - [Repository structure 📁](#repository-structure-)
  - [Installation 💻](#installation-)
  - [Possible improvements ✨](#possible-improvements-)
---

## Project architecture 🧱

[![Project architecture](./data/readme_images/project_architecture.png)](#) 

* Crawler: implemented using [Scrapy](https://github.com/scrapy/scrapy) and [fandom-py](https://github.com/NikolajDanger/fandom-py)
* Question Answering pipelines: created with [Haystack](https://github.com/deepset-ai/haystack) (a minimal sketch follows this list)
* Web app: developed with [Streamlit](https://github.com/streamlit/streamlit)
* Free hosting: [Hugging Face Spaces](https://huggingface.co/spaces)
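
To make the Question Answering pipelines component more concrete, here is a minimal sketch of an extractive QA pipeline. It assumes Haystack 1.x and an in-memory document store; the actual pipelines built in the notebooks may use a different retriever and document store.

```python
# Minimal extractive QA pipeline sketch (Haystack 1.x) -- illustrative only:
# the real project indexes documents crawled from the Twin Peaks Wiki.
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import FARMReader, TfidfRetriever
from haystack.pipelines import ExtractiveQAPipeline

# Index a toy document (in the real project, the crawler output is written here)
document_store = InMemoryDocumentStore()
document_store.write_documents([
    {"content": "Laura Palmer is found dead, wrapped in plastic, at the beginning of Twin Peaks.",
     "meta": {"name": "Laura Palmer"}},
])

retriever = TfidfRetriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)
pipeline = ExtractiveQAPipeline(reader=reader, retriever=retriever)

prediction = pipeline.run(
    query="Who is Laura Palmer?",
    params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 3}},
)
for answer in prediction["answers"]:
    print(answer.answer, "-", answer.context)
```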

---

## What can I learn from this project? 📚
- How to quickly ⌚ build a modern Question Answering system using [🔍 Haystack](https://github.com/deepset-ai/haystack)
- How to generate questions based on your documents
- How to build a nice [Streamlit](https://github.com/streamlit/streamlit) web app to show your QA system (a minimal sketch follows the preview below)
- How to optimize the web app for 🚀 deployment on [🤗 Spaces](https://huggingface.co/spaces)

[![Web app preview](./data/readme_images/webapp.png)](https://huggingface.co/spaces/anakin87/who-killed-laura-palmer)
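
Below is a minimal sketch of how a Streamlit front end can wrap such a pipeline, assuming the Streamlit 1.x caching API (`st.cache`) and the same toy pipeline as in the sketch above; the real [app.py](./app.py) is more elaborate.

```python
# Minimal Streamlit front end sketch -- the real app.py is more elaborate.
import streamlit as st
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import FARMReader, TfidfRetriever
from haystack.pipelines import ExtractiveQAPipeline


@st.cache(allow_output_mutation=True)  # cache the heavy objects across reruns (Streamlit 1.x API)
def load_pipeline():
    document_store = InMemoryDocumentStore()
    document_store.write_documents(
        [{"content": "Laura Palmer is found dead, wrapped in plastic, at the beginning of Twin Peaks."}]
    )
    retriever = TfidfRetriever(document_store=document_store)
    reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=False)
    return ExtractiveQAPipeline(reader=reader, retriever=retriever)


st.title("Who killed Laura Palmer?")
question = st.text_input("Ask a question about Twin Peaks")

if question:
    prediction = load_pipeline().run(
        query=question, params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 3}}
    )
    for answer in prediction["answers"]:
        st.markdown(f"**{answer.answer}**")
        st.write(answer.context)
```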

## Repository structure 📁
- [app.py](./app.py): Streamlit web app
- [app_utils folder](./app_utils/): Python modules used in the web app
- [crawler folder](./crawler/): Twin Peaks crawler, developed with Scrapy and fandom-py
- [notebooks folder](./notebooks/): Jupyter/Colab notebooks to create the search pipeline and generate questions using Haystack (a question generation sketch follows this section)
- [data folder](./data/): all necessary data
- [presentations folder](./presentations/): video presentation and slides (PyCon Italy 2022)

Within each folder, you can find more in-depth explanations.
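
As a taste of the question generation notebook, here is a minimal sketch assuming Haystack 1.x's `QuestionGenerator` node; the model choice and any filtering or post-processing in the actual notebook may differ.

```python
# Minimal question generation sketch (Haystack 1.x) -- illustrative only.
from haystack.nodes import QuestionGenerator
from haystack.pipelines import QuestionGenerationPipeline
from haystack.schema import Document

docs = [Document(content="Laura Palmer's body is discovered on the riverbank near the Packard Sawmill.")]

question_generator = QuestionGenerator()  # downloads a seq2seq model on first use
qg_pipeline = QuestionGenerationPipeline(question_generator)

result = qg_pipeline.run(documents=docs)
print(result["generated_questions"])  # generated questions, grouped per input document
```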

## Installation 💻
To install this project locally, follow these steps:
- `git clone https://github.com/anakin87/who-killed-laura-palmer`
- `cd who-killed-laura-palmer`
- `pip install -r requirements.txt`

To run the web app, simply type: `streamlit run app.py`

## Possible improvements ✨
### Project structure
- The project is optimized for deployment on Hugging Face Spaces and consists of an all-in-one Streamlit web app. In more structured production environments, I suggest splitting the software into three parts:
  - Haystack backend API, as explained in [the official documentation](https://haystack.deepset.ai/components/rest-api) (a client sketch follows this list)
  - Document store service
  - Streamlit web app
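
In that setup, the Streamlit front end would call the Haystack REST API over HTTP instead of loading models in-process. A minimal client-side sketch, assuming the default `/query` endpoint of the Haystack REST API running locally (host and port are placeholders to adapt to your deployment):

```python
# Query a separately deployed Haystack REST API -- endpoint and port are assumptions
# based on the default rest_api configuration.
import requests

API_URL = "http://localhost:8000/query"  # adjust to your deployment

payload = {
    "query": "Who killed Laura Palmer?",
    "params": {"Retriever": {"top_k": 5}, "Reader": {"top_k": 3}},
}
response = requests.post(API_URL, json=payload, timeout=30)
response.raise_for_status()

for answer in response.json()["answers"]:
    print(answer["answer"], "-", answer.get("context", ""))
```
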
### Reader
- The reader model (`deepset/roberta-base-squad2`) is a good compromise between speed and accuracy when running on CPU. There are certainly better (and more computationally expensive) models, as you can read in the [Haystack documentation](https://haystack.deepset.ai/pipeline_nodes/reader).
- You can also consider preparing a Twin Peaks QA dataset and fine-tuning the reader model to get better accuracy, as explained in this [Haystack tutorial](https://haystack.deepset.ai/tutorials/fine-tuning-a-model) and sketched below.
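
A minimal fine-tuning sketch, assuming Haystack 1.x's `FARMReader.train` and SQuAD-format annotation files; the data directory and file names below are placeholders for a hypothetical Twin Peaks QA dataset.

```python
# Fine-tune the reader on a custom SQuAD-format dataset (Haystack 1.x).
# File names below are placeholders for a hypothetical Twin Peaks QA dataset.
from haystack.nodes import FARMReader

reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2", use_gpu=True)
reader.train(
    data_dir="data/qa_dataset",               # folder containing the annotation files
    train_filename="twin_peaks_train.json",   # SQuAD-format training set (placeholder name)
    dev_filename="twin_peaks_dev.json",       # optional held-out set (placeholder name)
    n_epochs=2,
    save_dir="models/roberta-twin-peaks",
)
```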