Ramon Meffert
metadata
title: Speech_Language_Processing_Jurafsky_Martin
emoji: 📚
colorFrom: yellow
colorTo: blue
sdk: gradio
sdk_version: 2.9.0
app_file: app.py
pinned: true

NLP FlashCards

DEMO

View the demo on Hugging Face Spaces:

Dependencies

Make sure you have the following tools installed:

  • Poetry for Python package management;
  • Docker for running ElasticSearch;
  • Git LFS for downloading binary files that are too large for a regular Git repository.

Then, run the following commands:

poetry install
docker pull docker.elastic.co/elasticsearch/elasticsearch:8.1.1
docker network create elastic
docker run --name es01 --net elastic -p 9200:9200 -p 9300:9300 -it docker.elastic.co/elasticsearch/elasticsearch:8.1.1

Once the last command has started the container, a generated password for the elastic user should show up in the terminal output (you might have to scroll up a bit). Copy this password, then copy the .env.example file to .env and replace the <password> placeholder in .env with the copied password.

Next, run the following command from the root of the repository:

docker cp es01:/usr/share/elasticsearch/config/certs/http_ca.crt .
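
As a rough illustration of how the two artifacts from this setup fit together (the password in .env and the copied http_ca.crt), the sketch below parses .env-style text and sends an authenticated HTTPS request to the container using only the Python standard library. The variable name ELASTIC_PASSWORD and all helper names are assumptions made for this example; check .env.example for the key the project actually uses.

```python
import base64
import ssl
import urllib.request


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def basic_auth_header(user: str, password: str) -> str:
    """Build an HTTP Basic Authorization header value."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def ping_elasticsearch(password: str, ca_path: str = "http_ca.crt") -> int:
    """Return the HTTP status of GET https://localhost:9200/ (needs the container)."""
    # Trust the CA certificate copied out of the es01 container.
    context = ssl.create_default_context(cafile=ca_path)
    request = urllib.request.Request(
        "https://localhost:9200/",
        headers={"Authorization": basic_auth_header("elastic", password)},
    )
    with urllib.request.urlopen(request, context=context) as response:
        return response.status
```

With the container running, calling ping_elasticsearch with the password read from .env is expected to return 200; a wrong password raises urllib.error.HTTPError instead.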

Running

To make sure we're using the dependencies managed by Poetry, run poetry shell before executing any of the following commands. Alternatively, replace any call like python file.py with poetry run python file.py (but we suggest the shell option, since it is much more convenient).

Training

N/A for now

Using the QA system

⚠️ Important ⚠️ If you want to run an ElasticSearch query, make sure the docker container is running! You can check this by running docker container ls. If your container shows up (it's named es01 if you followed these instructions), it's running. If not, you can run docker start es01 to start it, or start it from Docker Desktop.
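
If you prefer to check this from Python, the following minimal sketch tests whether anything is listening on port 9200, the port published by the docker run command above. The helper name is_port_open is made up for this example, and an open port only shows that the container accepts connections, not that ElasticSearch has finished starting.

```python
import socket


def is_port_open(host: str = "localhost", port: int = 9200, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
```

If is_port_open() returns False, start the container with docker start es01 before running a query with the ElasticSearch retriever.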

To query the QA system, run any query as follows:

python query.py "Why can dot product be used as a similarity metric?"

By default, only the best answer is returned, along with its location in the book. If you want more answers (say, a top 5), supply the --top=5 option. The default retriever uses FAISS, but you can switch to ElasticSearch with the --retriever=es option. You can also pick a language model using the --lm option, which accepts either dpr (Dense Passage Retrieval) or longformer. The language model generates the embeddings for FAISS and also generates the answer.
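
To give an intuition for what an embedding-based retriever with a --top option does, here is a small sketch in plain Python that ranks passages by the dot product between a query embedding and each passage embedding. It is not the code used in this repository: the real system relies on FAISS and on embeddings from a language model, whereas the function names and the toy two-dimensional vectors here are invented for illustration.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    k: int = 1,
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the k highest dot-product scores."""
    scored = [(index, dot(query_vec, vec)) for index, vec in enumerate(passage_vecs)]
    # Higher dot product means the passage embedding points in a more similar direction.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings: passage 0 is closest to the query, passage 2 is furthest.
passages = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
query = [1.0, 0.2]
best_two = top_k(query, passages, k=2)
```

Here best_two holds the two best-scoring passages in descending order of score, which mirrors what --top=2 asks of the retriever.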

CLI overview

To get an overview of all available options, run python query.py --help. The options are also printed below.

usage: query.py [-h] [--top int] [--retriever {faiss,es}] [--lm {dpr,longformer}] str

positional arguments:
  str                   The question to feed to the QA system

options:
  -h, --help            show this help message and exit
  --top int, -t int     The number of answers to retrieve
  --retriever {faiss,es}, -r {faiss,es}
                        The retrieval method to use
  --lm {dpr,longformer}, -l {dpr,longformer}
                        The language model to use for the FAISS retriever
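
For reference, an argparse setup along the following lines would produce a command-line interface like the one printed above. This is a reconstruction from the help text, not a copy of query.py: the destination name query and the default for --lm are assumptions, while the defaults for --top and --retriever follow the behaviour described earlier (one answer, FAISS).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the options shown in the help text."""
    parser = argparse.ArgumentParser(prog="query.py")
    parser.add_argument(
        "query", type=str, metavar="str",
        help="The question to feed to the QA system",
    )
    parser.add_argument(
        "--top", "-t", type=int, metavar="int", default=1,
        help="The number of answers to retrieve",
    )
    parser.add_argument(
        "--retriever", "-r", choices=["faiss", "es"], default="faiss",
        help="The retrieval method to use",
    )
    parser.add_argument(
        "--lm", "-l", choices=["dpr", "longformer"],
        default="dpr",  # assumed default; the help text does not state one
        help="The language model to use for the FAISS retriever",
    )
    return parser
```

Calling build_parser().parse_args(["--top=5", "--retriever=es", "Why can dot product be used as a similarity metric?"]) then yields a namespace with top set to 5 and retriever set to es.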

Check out the configuration reference at https://huggingface.co/docs/hub/spaces#reference