Spaces: somosnlp-hackathon-2022/BioMedIA


avacaondata committed on Apr 3, 2022

Commit 93a0690 • 1 Parent(s): 94533e3

modificado art

Files changed (1)

article_app.py +52 -5


@@ -4,9 +4,50 @@ article = """
 <p style="text-align: justify;"> This app is developed by the aforementioned members of <a href="https://www.iic.uam.es/">IIC - Instituto de Ingeniería del Conocimiento</a> as part of the <a href="https://somosnlp.org/hackathon">Somos PLN Hackathon 2022.</a>
-The objective of this app is to expand the existing tools regarding long form question answering in Spanish. In fact, multiple novel methods in this language
-have been introduced to build this app.
-The reason for including audio as a possible input and always as an output is because we wanted to make the App much more accessible to people that cannot read or write.
 Below you can find all the pieces that form the system. This section is deliberately brief so that the user can get a broad view of how the app works internally, and then explore each model and dataset, where they will find much more information on each piece of the system.
 <ol>
@@ -26,7 +67,10 @@ Apart from those, this system could not respond in less than a minute on CPU if
 </ul>
 Using this strategy we managed to reduce passage retrieval time to milliseconds. This is key because large generative language models like the ones we use are already slow on CPU, so we offset that cost by cutting retrieval time.
-On the other hand, we uploaded, and in some cases created, datasets in Spanish to be able to build such a system.
 <ol>
     <li><a href="https://hf.co/datasets/IIC/spanish_biomedical_crawled_corpus">Spanish Biomedical Crawled Corpus</a>. Used for finding answers to questions about biomedicine. (More info in the link.)</li>
@@ -34,6 +78,8 @@ On the other hand, we uploaded, and in some cases created, datasets in Spanish t
     <li><a href="https://hf.co/datasets/squad_es">SQUADES</a>. Used to train the DPR models. (More info in the link.)</li>
     <li><a href="https://hf.co/datasets/IIC/bioasq22_es">BioAsq22-Spanish</a>. Used to train the DPR models. (More info in the link.)</li>
     <li><a href="https://hf.co/datasets/PlanTL-GOB-ES/SQAC">SQAC (Spanish Question Answering Corpus)</a>. Used to train the DPR models. (More info in the link.)</li>
 </ol>
 <h3>
@@ -57,7 +103,8 @@ description = """
     <img src="https://drive.google.com/uc?export=view&id=1HOzvvgDLFNTK7tYAY1dRzNiLjH41fZks"  style="max-width: 100%; max-height: 10%; height: 250px; object-fit: fill">
 </a>
 <h1 font-family: Georgia, serif;> BioMedIA: Abstractive Question Answering for the BioMedical Domain in Spanish </h1>
-<p> This application uses an advanced search system to retrieve texts relevant to your question, then uses all that information to try to condense it into a coherent, self-contained explanation. More details and example questions in the section below.
 Team members:
 <ul>
     <li>Alejandro Vaca Serrano: <a href="https://huggingface.co/avacaondata">@avacaondata</a></li>

 <p style="text-align: justify;"> This app is developed by the aforementioned members of <a href="https://www.iic.uam.es/">IIC - Instituto de Ingeniería del Conocimiento</a> as part of the <a href="https://somosnlp.org/hackathon">Somos PLN Hackathon 2022.</a>
+<h3 font-family: Georgia, serif;>
+Objectives and Motivation
+</h3>
+It has recently become clear that biomedical research is essential to the sustainability of society. Because so much information on this topic is available on the Internet,
+we thought it would be possible to build a large database of biomedical texts, retrieve the documents most relevant to a given question and, from that information, generate a concise answer that conveys the documents' content while being self-explanatory.
+Biomedical researchers and professionals could use such a tool to quickly identify the key points for answering a question, accelerating their research. It would also put important health-related information in everyone's hands, which we think can have
+a very positive impact on society. Health is a hot topic today, but it should always be among our top priorities, so providing quick and easy access to understandable answers that turn complex information into simple explanations is, in our opinion, a step in the right direction.
+We identified the need for strong, intelligent information retrieval systems. Imagine a Siri that could generate coherent answers to your questions instead of simply running a Google search for you. That is the technology we envision, and we would like to bring the Spanish
+NLP community a small step closer to it.
+The main technical objective of this app is to expand the existing tools for long-form question answering in Spanish by introducing new generative methods together with a complete architecture of well-performing models, which produced interesting results on the variety of examples we tried.
+In fact, multiple methods that are novel for Spanish were introduced to build this app.
+Most existing systems rely on Sentence Transformers for passage retrieval (which we wanted to improve on by creating a Dense Passage Retrieval model in Spanish) and use extractive question answering methods. This means that the user needs to look
+through the top answers and then mentally compose a final answer that contains all of that information. To the best of our knowledge, this is the first time Dense Passage Retrieval models have been trained in Spanish with large datasets, and the first time a generative question answering model in Spanish
+has been released.
+The first obstacle we found was the scarcity of datasets for the task, made worse by the domain gap to the biomedical domain. We overcame it by using translation models from Transformers (specified in each dataset) to translate BioAsq
+into Spanish, and by doing the same with LFQA (more info in the attached datasets). BioAsq is a large English question answering dataset for the biomedical domain, containing more than 35k question-answer-context triplets for training. We then used our translated version of BioAsq
+together with SQAC (15k triplets) and SQUAD-ES (87.5k training triplets), which also has a portion related to the biomedical domain. This was very useful for training extractive QA models to share with the community (you can find some at https://huggingface.co/IIC),
+but also for building a Dense Passage Retrieval (DPR) dataset to train a DPR model, which is key for our app: without near-perfect supporting passages for a question, the generative model will not produce a reliable answer.
+The fragility of the solution we devised, and also its most beautiful side when it works, is that every piece must work well for the final answer to be correct. If our Speech2Text system is not
+good enough, the transcribed text will reach the DPR corrupted, no relevant documents will be retrieved, and the answer will be poor. Similarly, if the DPR is not correctly trained and cannot identify the relevant passages for a query, the result will be bad.
+This also served as motivation, since the technical difficulty would be well worth it if the system worked.
+Regarding Speech2Text, there were existing solutions trained on Common Voice; however, there were no Spanish models trained on large datasets like MultiLibrispeech-es, which we used following the results reported in Meta's paper (more info in the linked wav2vec2 model above). We also decided
+to train the large version of wav2vec2, since the other available ASR models were 300M-parameter models, so we wanted to improve on model size as well as on the dataset used. We obtained a WER of 0.073, which is arguably low compared to the rest of the existing models on ASR
+datasets in Spanish. Further research is needed to compare all of these models, but that was out of scope for this project.
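For readers unfamiliar with the WER figure quoted above, here is a minimal Python sketch of how word error rate is commonly computed: the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The function name `word_error_rate` and the example sentences are assumptions for illustration; the commit does not show how the reported 0.073 was measured.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts, not taken from any evaluation set.
    print(word_error_rate("el paciente tiene fiebre alta",
                          "el paciente tiene fiebre"))
```

Under this definition, a WER of 0.073 would correspond to roughly 7 word-level errors per 100 reference words.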
+Another contribution we wanted to make with this project was a well-performing ranker in Spanish. This component runs after the DPR and ranks the retrieved passages by relevance to the query so that the top passages can be selected. Although there are multilingual open source solutions, there are no Spanish monolingual models for this task.
+For that, we trained a CrossEncoder on an automatic Transformers translation of <a href="https://microsoft.github.io/msmarco/">MS Marco</a>, which has around 200k query-passage pairs if we take the 1 positive to 4 negatives ratio from the papers. MS Marco is the dataset typically used in English to train cross-encoders for ranking.
+Finally, there are no generative question answering datasets in Spanish. For that reason we used LFQA, as mentioned above. It has over 400k data instances, which we also translated with Transformers.
+Translating it was demanding: the passages were too long for the maximum sequence length of the translation model, and there were 400k x 3 (answer, question, passages) texts to translate.
+We solved those problems with intelligent text splitting and reconstruction and an efficient configuration for the translation process. Thanks to this dataset we could train 2 generative models, drawing on our expertise in generative language models to train them effectively.
+We included audio as a possible input and output because we wanted to make the app much more accessible to everyone. With this app we want to put biomedical knowledge in Spanish within everyone's reach.
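The text splitting and reconstruction step mentioned above for translating long passages could look roughly like the following Python sketch. It is not the authors' implementation. The helper names, the use of word counts as a stand-in for the translation model's token limit, and the `translate` callable are all assumptions made for illustration.

```python
import re
from typing import Callable, List


def split_into_chunks(text: str, max_words: int) -> List[str]:
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words is kept as its own chunk.
    """
    if not text.strip():
        return []
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks


def translate_long_text(text: str,
                        translate: Callable[[str], str],
                        max_words: int = 100) -> str:
    """Translate chunk by chunk, then rebuild the passage in order."""
    return " ".join(translate(chunk)
                    for chunk in split_into_chunks(text, max_words))


if __name__ == "__main__":
    # str.upper stands in for a real translation model call.
    print(translate_long_text("One two. Three four. Five six.",
                              str.upper, max_words=4))
```

Splitting on sentence boundaries keeps each chunk self-contained for the translation model, and joining the outputs in order restores the original passage structure.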
+<h3 font-family: Georgia, serif;>
+System Architecture
+</h3>
 Below you can find all the pieces that form the system. This section is deliberately brief so that the user can get a broad view of how the app works internally, and then explore each model and dataset, where they will find much more information on each piece of the system.
 <ol>
 </ul>
 Using this strategy we managed to reduce passage retrieval time to milliseconds. This is key because large generative language models like the ones we use are already slow on CPU, so we offset that cost by cutting retrieval time.
+<h3 font-family: Georgia, serif;>
+Datasets used and created
+</h3>
+We uploaded, and in some cases created, datasets in Spanish to be able to build such a system.
 <ol>
     <li><a href="https://hf.co/datasets/IIC/spanish_biomedical_crawled_corpus">Spanish Biomedical Crawled Corpus</a>. Used for finding answers to questions about biomedicine. (More info in the link.)</li>
     <li><a href="https://hf.co/datasets/squad_es">SQUADES</a>. Used to train the DPR models. (More info in the link.)</li>
     <li><a href="https://hf.co/datasets/IIC/bioasq22_es">BioAsq22-Spanish</a>. Used to train the DPR models. (More info in the link.)</li>
     <li><a href="https://hf.co/datasets/PlanTL-GOB-ES/SQAC">SQAC (Spanish Question Answering Corpus)</a>. Used to train the DPR models. (More info in the link.)</li>
+    <li><a href="https://huggingface.co/datasets/IIC/msmarco_es">MSMARCO-ES</a>. Used to train the Spanish CrossEncoder for the ranker.</li>
+    <li><a href="https://huggingface.co/datasets/multilingual_librispeech">MultiLibrispeech</a>. Used to train the Speech2Text model in Spanish. (More info in the link.)</li>
 </ol>
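The 1 positive to 4 negatives ratio mentioned for the ranker's training data can be illustrated with a minimal Python sketch. The function name `build_ranking_pairs`, the random choice of negatives, and the toy data are assumptions for illustration. The commit does not say how negatives were selected from MSMARCO-ES.

```python
import random
from typing import Dict, List, Tuple


def build_ranking_pairs(positives: Dict[str, str],
                        passages: List[str],
                        negatives_per_positive: int = 4,
                        seed: int = 0) -> List[Tuple[str, str, int]]:
    """Return (query, passage, label) examples: label 1 for the relevant
    passage and label 0 for each sampled non-relevant passage."""
    rng = random.Random(seed)
    examples: List[Tuple[str, str, int]] = []
    for query, positive in positives.items():
        examples.append((query, positive, 1))
        # Assumption: negatives are drawn at random from the other passages.
        candidates = [p for p in passages if p != positive]
        k = min(negatives_per_positive, len(candidates))
        for negative in rng.sample(candidates, k):
            examples.append((query, negative, 0))
    return examples


if __name__ == "__main__":
    # Hypothetical toy corpus.
    passages = [f"passage {i}" for i in range(6)]
    pairs = build_ranking_pairs({"example query": "passage 0"}, passages)
    for query, passage, label in pairs:
        print(label, query, "|", passage)
```

Each query contributes five labelled pairs under this ratio, which a cross-encoder can then be trained on as a binary relevance task.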
 <h3>
     <img src="https://drive.google.com/uc?export=view&id=1HOzvvgDLFNTK7tYAY1dRzNiLjH41fZks"  style="max-width: 100%; max-height: 10%; height: 250px; object-fit: fill">
 </a>
 <h1 font-family: Georgia, serif;> BioMedIA: Abstractive Question Answering for the BioMedical Domain in Spanish </h1>
+<p> This application uses an advanced search system to retrieve texts relevant to your question, then uses all that information to try to condense it into a coherent, self-contained explanation. More details and example questions in the section below.
+It runs entirely on CPU.
 Team members:
 <ul>
     <li>Alejandro Vaca Serrano: <a href="https://huggingface.co/avacaondata">@avacaondata</a></li>