
Types of Question Answering
    - extractive question answering (encoder-only models, e.g. BERT)
        - pose questions about a document and identify the answers as spans of text in the document itself
    - generative question answering (encoder-decoder models, e.g. T5/BART)
        - open-ended questions that require synthesizing information
    - retrieval-based / community question answering
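To make the extractive case concrete: the model scores every token as a possible answer start and answer end, and the prediction is the span maximizing start score + end score. A minimal sketch with toy logits (the values are made up for illustration):

```python
import numpy as np

# Toy start/end logits for a 6-token context, shaped like the output of an
# extractive QA head (e.g. BERT with a span-classification layer on top).
start_logits = np.array([0.1, 0.2, 5.0, 0.3, 0.1, 0.2])
end_logits   = np.array([0.1, 0.2, 0.3, 0.4, 6.0, 0.2])

# Score every valid span (start <= end) and keep the best one.
best_score, best_span = -np.inf, (0, 0)
for start in range(len(start_logits)):
    for end in range(start, len(end_logits)):
        score = start_logits[start] + end_logits[end]
        if score > best_score:
            best_score, best_span = score, (start, end)

print(best_span)  # (2, 4): tokens 2..4 form the predicted answer span
```

Real implementations add a top-k cutoff and a maximum answer length, but the span-scoring idea is the same.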



First approach - translate a dataset, fine-tune a model
!Not really feasible: it would need a lot of human review to correctly determine the answer start token after translation

    1. Translate an English QA dataset into Hungarian
        - SQuAD - reading comprehension based on Wikipedia articles
        - ~100.000 question/answer pairs
    2. Fine-tune a model and evaluate it on this dataset
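The alignment problem above can be shown in a few lines: SQuAD stores each answer as text plus a character offset into the context, and translating context and answer independently invalidates that offset. The Hungarian strings below are hypothetical translations, chosen to illustrate the failure mode:

```python
# SQuAD stores each answer as text plus a character offset into the context.
en_context = "Budapest is the capital city of Hungary."
en_answer = {"text": "the capital city of Hungary", "answer_start": 12}
assert en_context[12:12 + len(en_answer["text"])] == en_answer["text"]

# After machine translation (hypothetical outputs), the original offset is
# meaningless, and even substring search can fail because the answer and
# the context are translated independently and may disagree on word order
# or inflection.
hu_context = "Budapest Magyarország fővárosa."    # hypothetical translation
hu_answer_text = "Magyarország fővárosának"       # translated in isolation
print(hu_context.find(hu_answer_text))  # -1: the span must be re-aligned by hand
```

This is why the approach ends up requiring per-example human correction rather than a purely automatic pipeline.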


Second approach - fine-tune a multilingual model
!The MQA format differs from SQuAD's, so AutoModelForQuestionAnswering cannot be used directly

    1. Use a Hungarian dataset
        - MQA - multilingual QA pairs parsed from Common Crawl
            - FAQ - 878.385 pairs (2.415 domains)
            - CQA - 27.639 pairs (171 domains)
    2. Fine-tune and evaluate a model on this dataset
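The format mismatch is easy to see side by side. The records below are illustrative (field names simplified, not the exact dataset schema): a SQuAD example carries a context and character offsets, while an MQA FAQ pair carries only question and answer text.

```python
# Illustrative records; field names are simplified, not the exact schema.
squad_record = {
    "context": "Budapest is the capital of Hungary.",
    "question": "What is the capital of Hungary?",
    "answers": {"text": ["Budapest"], "answer_start": [0]},
}

mqa_record = {  # FAQ-style pair scraped from Common Crawl
    "question": "Mi Magyarország fővárosa?",
    "answer": "Budapest.",
}

# Span-prediction heads train on character offsets into a context; an MQA
# pair has neither a context nor offsets, so extractive fine-tuning does
# not apply, and a retrieval setup is the natural fit.
print("answer_start" in mqa_record)  # False
```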
        
        
    Possible steps:
        - Use an existing pre-trained model (Hungarian, Romanian, or multilingual) to generate embeddings
            - Select a model:
                - multilingual models that include Hungarian:
                    - distiluse-base-multilingual-cased-v2 (400MB)
                    - paraphrase-multilingual-MiniLM-L12-v2 (400MB) - fastest
                    - paraphrase-multilingual-mpnet-base-v2 (900MB) - best performing
                - huBERT
        - Select a dataset
            - use the MQA Hungarian subset
            - use Hungarian Wikipedia pages, split into passages
                - DBpedia shortened abstracts - ~500.000
        - Pre-compute embeddings for all answers/paragraphs
        - Compute the embedding for each incoming query
            - compare similarity between the query embedding and the precomputed embeddings
            - return the top-3 answers/questions
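    The retrieval steps above can be sketched with plain numpy; the random matrix stands in for precomputed sentence-transformer embeddings over the MQA answers (names and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for precomputed answer embeddings (in practice: the output of a
# multilingual sentence transformer over all MQA answers/paragraphs).
corpus_emb = rng.normal(size=(1000, 384))            # 1000 answers, dim 384
corpus_emb /= np.linalg.norm(corpus_emb, axis=1, keepdims=True)

def top_k(query_emb: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k most similar answers by cosine similarity."""
    q = query_emb / np.linalg.norm(query_emb)
    scores = corpus_emb @ q                          # cosine similarities
    return np.argsort(scores)[::-1][:k]

# A query whose embedding lies close to answer 42 should retrieve it first.
query = corpus_emb[42] + 0.01 * rng.normal(size=384)
print(top_k(query))  # index 42 ranked first
```

    For a corpus this small, brute-force dot products are fine; at larger scale an approximate nearest-neighbour index (e.g. FAISS) would replace the argsort.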
    
    Alternative steps:
        - Train a sentence transformer on the Hungarian / Romanian subsets
        - Use the trained sentence transformer to generate embeddings
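    A common objective for training a sentence transformer on question/answer pairs like MQA is in-batch multiple-negatives ranking: each question's paired answer is the positive, every other answer in the batch is a negative. A minimal numpy sketch of that loss (the scale factor and dimensions are illustrative):

```python
import numpy as np

def mnr_loss(q_emb: np.ndarray, a_emb: np.ndarray, scale: float = 20.0) -> float:
    """Multiple-negatives ranking loss: for each question i, answer i is
    the positive and the other answers in the batch are negatives; the
    loss is cross-entropy over each row of the similarity matrix."""
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    a = a_emb / np.linalg.norm(a_emb, axis=1, keepdims=True)
    sim = scale * (q @ a.T)                        # (batch, batch) similarities
    sim -= sim.max(axis=1, keepdims=True)          # numerical stability
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))

rng = np.random.default_rng(1)
answers = rng.normal(size=(8, 64))
questions = answers + 0.05 * rng.normal(size=(8, 64))  # well-aligned pairs
print(mnr_loss(questions, answers))  # small: each question ranks its own answer first
```

    In practice this is what sentence-transformers' MultipleNegativesRankingLoss implements with gradients; the sketch only shows the forward computation.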