nlp-flashcard-project
Todo for the progress meeting
- Load the data / get the repo ready
- Proof of concept with UnifiedQA
- Standard QA model with the dataset
- Collect and read papers
- Look at prior work, get inspiration for a research direction
Overview
Most QA systems consist of two components:
1. A retriever, which uses the question to fetch the k most relevant pieces of context, e.g. with tf-idf.
2. A model that generates the answer. What you use here depends on the style of question answering:
- For extractive QA you use a reader;
- For generative QA you use a generator.
Both are built on top of a language model.
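A minimal sketch of the two answering styles, given some already-retrieved context. The model names and the hard-coded context are placeholders for illustration, not decisions made for this project:

```python
from transformers import pipeline

question = "What is the perplexity of a language model?"
context = (
    "Perplexity is the inverse probability of the test set, "
    "normalized by the number of words."
)

# Extractive QA: a reader selects an answer span inside the retrieved context.
reader = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
print(reader(question=question, context=context)["answer"])

# Generative QA: a generator writes free-form text conditioned on question + context.
# UnifiedQA expects "question \n context" (with a literal backslash-n) as input.
generator = pipeline("text2text-generation", model="allenai/unifiedqa-t5-small")
print(generator(f"{question} \\n {context}")[0]["generated_text"])
```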
Useful info
- Huggingface QA tutorial: https://huggingface.co/docs/transformers/tasks/question_answering#finetune-with-tensorflow
- Overview of open-domain question answering techniques: https://lilianweng.github.io/posts/2020-10-29-odqa/
Base model
So far this is only a retriever that fetches the top-k relevant documents for a given question. It returns high similarity scores for many questions, but the documents it retrieves are usually not very relevant. A rough sketch of such a retriever follows the run commands below.
poetry shell
cd base_model
poetry run python main.py
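The actual implementation in base_model/main.py is not reproduced here; the following is only a stand-in sketch using tf-idf (the example paragraphs are made up, and the real script may use a different index and scoring, so the scores will not match the example output below):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus; in the real setup this would be the textbook paragraphs.
paragraphs = [
    "Perplexity is the inverse probability of the test set, normalized by the number of words.",
    "Sentence alignment finds minimal sets of sentences that are translations of each other.",
    "The paired bootstrap test compares the chrF scores of two MT systems.",
]

# Build a tf-idf index over the corpus once.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(paragraphs)

def retrieve(question: str, k: int = 3):
    """Return the top-k passages with their similarity scores."""
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_vectors)[0]
    ranked = scores.argsort()[::-1][:k]
    return [(paragraphs[i], scores[i]) for i in ranked]

for passage, score in retrieve("What is the perplexity of a language model?"):
    print(f"(score: {score:.2f}) {passage}")
```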
Example
"What is the perplexity of a language model?"
Result 1 (score: 74.10):
Figure 10.17 A sample alignment between sentences in English and French, with sentences extracted from Antoine de Saint-Exupery's Le Petit Prince and a hypothetical translation. Sentence alignment takes sentences e 1 , ..., e n , and f 1 , ..., f n and finds minimal sets of sentences that are translations of each other, including single sentence mappings like (e 1 ,f 1 ), (e 4 -f 3 ), (e 5 -f 4 ), (e 6 -f 6 ) as well as 2-1 alignments (e 2 /e 3 ,f 2 ), (e 7 /e 8 -f 7 ), and null alignments (f 5 ).
Result 2 (score: 74.23):
Character or word overlap-based metrics like chrF (or BLEU, or etc.) are mainly used to compare two systems, with the goal of answering questions like: did the new algorithm we just invented improve our MT system? To know if the difference between the chrF scores of two MT systems is a significant difference, we use the paired bootstrap test, or the similar randomization test.
Result 3 (score: 74.43):
The model thus predicts the class negative for the test sentence.
Result 4 (score: 74.95):
Translating from languages with extensive pro-drop, like Chinese or Japanese, to non-pro-drop languages like English can be difficult since the model must somehow identify each zero and recover who or what is being talked about in order to insert the proper pronoun.
Result 5 (score: 76.22):
Similarly, a recent challenge set, the WinoMT dataset (Stanovsky et al., 2019) shows that MT systems perform worse when they are asked to translate sentences that describe people with non-stereotypical gender roles, like "The doctor asked the nurse to help her in the operation".