# nlp-flashcard-project

## Todo 2

- [ ] Contexts preprocessing
  - [ ] Filter out formulas and the like
  - [ ] Split into sentences...?
- [ ] Try more language models
- [ ] Elasticsearch
- [ ] CLI for answering questions

### Extra things

- [ ] Huggingface spaces demo
- [ ] Question generation for finetuning
- [ ] Finetune the language model

## Todo for progress meeting

- [ ] Load the data / prepare the repo
- [ ] Proof of concept with UnifiedQA
- [ ] Standard QA model with the dataset
- [ ] Collect and read papers
- [ ] Look at earlier work, get inspiration for a research direction

## Overview

Most QA systems consist of two components:

- A retriever, which uses the question to fetch the _k_ most relevant pieces of context, e.g. with `tf-idf`.
- A model that generates the answer. What you use here depends on the kind of question answering:
  - For **extractive QA** you use a reader;
  - For **generative QA** you use a generator.

  Both are built on top of a language model.

## Useful info

- Huggingface QA tutorial:
- Overview of open-domain question answering techniques:

## Base model

So far there is only a retriever that, given a question, fetches the top-k most relevant documents. It does reach high similarity scores for many questions, but the documents it retrieves are usually not very relevant.

```bash
poetry shell
cd base_model
poetry run python main.py
```

### Example

"What is the perplexity of a language model?"

> Result 1 (score: 74.10):
> Figure 10 .17 A sample alignment between sentences in English and French, with sentences extracted from Antoine de Saint-Exupery's Le Petit Prince and a hypothetical translation. Sentence alignment takes sentences e 1 , ..., e n , and f 1 , ..., f n and finds minimal sets of sentences that are translations of each other, including single sentence mappings like (e 1 ,f 1 ), (e 4 -f 3 ), (e 5 -f 4 ), (e 6 -f 6 ) as well as 2-1 alignments (e 2 /e 3 ,f 2 ), (e 7 /e 8 -f 7 ), and null alignments (f 5 ).
>
> Result 2 (score: 74.23):
> Character or word overlap-based metrics like chrF (or BLEU, or etc.) are mainly used to compare two systems, with the goal of answering questions like: did the new algorithm we just invented improve our MT system? To know if the difference between the chrF scores of two MT systems is a significant difference, we use the paired bootstrap test, or the similar randomization test.
>
> Result 3 (score: 74.43):
> The model thus predicts the class negative for the test sentence.
>
> Result 4 (score: 74.95):
> Translating from languages with extensive pro-drop, like Chinese or Japanese, to non-pro-drop languages like English can be difficult since the model must somehow identify each zero and recover who or what is being talked about in order to insert the proper pronoun.
>
> Result 5 (score: 76.22):
> Similarly, a recent challenge set, the WinoMT dataset (Stanovsky et al., 2019) shows that MT systems perform worse when they are asked to translate sentences that describe people with non-stereotypical gender roles, like "The doctor asked the nurse to help her in the operation".

## Setting up Elasticsearch
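As a starting point, below is a minimal sketch of creating an index and loading context passages. It assumes Elasticsearch 8.x is running locally on the default port and that the official `elasticsearch` Python client is installed (`poetry add elasticsearch`); the index name `contexts`, the mapping, and the example passages are placeholders rather than the project's actual setup.

```python
"""Index context passages into a local Elasticsearch instance.

Sketch only: assumes Elasticsearch 8.x on http://localhost:9200 and the
official `elasticsearch` Python client. The index name, mapping, and
passages below are placeholders.
"""
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
INDEX = "contexts"  # placeholder index name

# Create the index with a plain text field; Elasticsearch scores `text`
# queries with BM25 by default, so no extra similarity config is needed.
if not es.indices.exists(index=INDEX):
    es.indices.create(
        index=INDEX,
        mappings={"properties": {"text": {"type": "text"}}},
    )

# Placeholder passages; in the project these would be the preprocessed contexts.
passages = [
    "Perplexity is the inverse probability of the test set, normalized by the number of words.",
    "Sentence alignment finds minimal sets of sentences that are translations of each other.",
]

for i, passage in enumerate(passages):
    es.index(index=INDEX, id=str(i), document={"text": passage})

# Make the newly indexed documents searchable immediately.
es.indices.refresh(index=INDEX)
```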
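Retrieving the top-k passages for a question is then a single BM25 `match` query. The sketch below makes the same assumptions as above; `retrieve` is a hypothetical helper, not an existing function in this repo.

```python
"""Query the placeholder `contexts` index for the top-k passages.

Same assumptions as the indexing sketch above; `retrieve` is a hypothetical
helper, not part of the current codebase.
"""
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")


def retrieve(question: str, k: int = 5) -> list[tuple[float, str]]:
    """Return (BM25 score, passage) pairs for the k best-matching passages."""
    response = es.search(
        index="contexts",
        query={"match": {"text": question}},
        size=k,
    )
    return [
        (hit["_score"], hit["_source"]["text"])
        for hit in response["hits"]["hits"]
    ]


if __name__ == "__main__":
    for score, passage in retrieve("What is the perplexity of a language model?"):
        print(f"({score:.2f}) {passage}")
```

If this works out, it could stand in for the current `base_model` retriever, with BM25 scores taking the place of the similarity scores shown in the example above.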