Speech_Language_Processing_Jurafsky_Martin / README.old.md

GGroenendaal

add experiment code

b06298d about 2 years ago

nlp-flashcard-project

Todo 2

Context preprocessing
- Filter out formulas and similar noise
- Split into sentences...?
Try more language models
Elasticsearch
CLI for answering questions

Extras

Hugging Face Spaces demo
Question generation for fine-tuning
Fine-tune a language model

Todo for progress meeting

Load the data / prepare the repo
Proof of concept with UnifiedQA
Standard QA model with the dataset
Collect and read papers
Review earlier work for inspiration on a research direction

Overview

Most QA systems consist of two components:

A retriever. Given the question, it fetches the k most relevant pieces of context, e.g. with tf-idf.
A model that produces the answer. What exactly you use here depends on the type of question answering:
- For extractive QA you use a reader;
- For generative QA you use a generator.
Both are built on a language model.
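
As a rough illustration of the retriever step, here is a minimal tf-idf sketch in plain Python. It is not the retriever used in this project: the function names (`tokenize`, `build_index`, `retrieve`), the smoothed idf formula, and the three toy passages are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit (crude on purpose).
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def tfidf_vector(tokens, idf):
    # Raw term count times idf; terms never seen in the passages are dropped.
    counts = Counter(tokens)
    return {term: count * idf[term] for term, count in counts.items() if term in idf}


def build_index(passages):
    # Document frequency: in how many passages does each term occur?
    tokenized = [tokenize(p) for p in passages]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(passages)
    # Smoothed idf (an assumption): terms present in every passage keep a positive weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    vectors = [tfidf_vector(tokens, idf) for tokens in tokenized]
    return idf, vectors


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, passages, k=2):
    # Score every passage against the question and return the k best (score, passage) pairs.
    idf, vectors = build_index(passages)
    query = tfidf_vector(tokenize(question), idf)
    scored = [(cosine(query, vec), passage) for vec, passage in zip(vectors, passages)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy passages written for this example, not taken from the dataset.
passages = [
    "Perplexity is the inverse probability of the test set, normalized by the number of words.",
    "The paired bootstrap test compares the chrF scores of two MT systems.",
    "Sentence alignment finds sentences that are translations of each other.",
]

results = retrieve("What is the perplexity of a language model?", passages, k=2)
for score, passage in results:
    print(f"{score:.2f}  {passage}")
```

In a full system, the k passages returned by `retrieve` would be passed, together with the question, to the reader or generator. Note that common words such as "the" and "of" still contribute to the score here, which is one reason a lexical retriever can rank loosely related passages fairly high.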

Useful info

Hugging Face QA tutorial: https://huggingface.co/docs/transformers/tasks/question_answering#finetune-with-tensorflow
Overview of open-domain question answering techniques: https://lilianweng.github.io/posts/2020-10-29-odqa/

Base model

So far there is only a retriever that, given a question, fetches the top-k relevant documents. It reaches high similarity scores for many questions, but the documents it retrieves are usually not very relevant.

poetry shell
cd base_model
poetry run python main.py

Example

"What is the perplexity of a language model?"

Result 1 (score: 74.10):
Figure 10.17 A sample alignment between sentences in English and French, with sentences extracted from Antoine de Saint-Exupery's Le Petit Prince and a hypothetical translation. Sentence alignment takes sentences e 1 , ..., e n , and f 1 , ..., f n and finds minimal sets of sentences that are translations of each other, including single sentence mappings like (e 1 ,f 1 ), (e 4 -f 3 ), (e 5 -f 4 ), (e 6 -f 6 ) as well as 2-1 alignments (e 2 /e 3 ,f 2 ), (e 7 /e 8 -f 7 ), and null alignments (f 5 ).

Result 2 (score: 74.23):
Character or word overlap-based metrics like chrF (or BLEU, or etc.) are mainly used to compare two systems, with the goal of answering questions like: did the new algorithm we just invented improve our MT system? To know if the difference between the chrF scores of two MT systems is a significant difference, we use the paired bootstrap test, or the similar randomization test.

Result 3 (score: 74.43):
The model thus predicts the class negative for the test sentence.

Result 4 (score: 74.95):
Translating from languages with extensive pro-drop, like Chinese or Japanese, to non-pro-drop languages like English can be difficult since the model must somehow identify each zero and recover who or what is being talked about in order to insert the proper pronoun.

Result 5 (score: 76.22):
Similarly, a recent challenge set, the WinoMT dataset (Stanovsky et al., 2019) shows that MT systems perform worse when they are asked to translate sentences that describe people with non-stereotypical gender roles, like "The doctor asked the nurse to help her in the operation".