avacaondata committed on
Commit
26e66bf
1 Parent(s): 9f01174

Added contact details, fixed links, improved wording...

Files changed (1)
  1. article_app.py +23 -11
article_app.py CHANGED
@@ -53,11 +53,11 @@ Below you can find all the pieces that form the system. This section is minimali
53
  <img src="https://drive.google.com/uc?export=view&id=1_iUdUMPR5u1p9767YVRbCZkobt_fOozD">
54
 
55
  <ol>
56
- <li><a href="https://hf.co/IIC/wav2vec2-spanish-multilibrispeech">Speech2Text</a>: For this we fine-tuned a multilingual Wav2Vec2, as explained in the attached link. We use this model to process audio questions.</li>
57
- <li><a href="https://hf.co/IIC/dpr-spanish-passage_encoder-allqa-base">Dense Passage Retrieval (DPR) for Context</a>: Dense Passage Retrieval is a methodology <a href="https://arxiv.org/abs/2004.04906">developed by Facebook</a> which is currently the SoTA for Passage Retrieval, that is, the task of getting the most relevant passages to answer a given question. You can find details about how it was trained on the link attached to the name. </li>
58
- <li><a href="https://hf.co/IIC/dpr-spanish-question_encoder-allqa-base">Dense Passage Retrieval (DPR) for Question</a>: It is actually part of the same thing as the above. For more details, go to the attached link.</li>
59
  <li><a href="https://hf.co/sentence-transformers/distiluse-base-multilingual-cased-v1">Sentence Encoder Ranker</a>: Reranks the candidate contexts retrieved by DPR and selects the top 5 passages for the generative model to read; it is the final filter before the generative model. We human-checked the answers under 3 different configurations (that's us seriously playing with our toy), as the generated answers depended heavily on this piece of the puzzle. The first option, before we trained our own cross-encoder, was to use a <a href="https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1">multilingual sentence transformer</a>, trained on multilingual MS Marco. This worked reasonably well, although it was noticeable that it wasn't specialized in Spanish. We then tried our own CrossEncoder, trained on our <a href="https://huggingface.co/datasets/IIC/msmarco_es">translated version of MS Marco to Spanish</a>. It worked better than the sentence transformer. Then it occurred to us, by looking at their rank distributions for the same passages, that by multiplying their similarity scores element by element we might obtain a less biased ranking of the documents, so that only the documents both rankers agree are important appear at the top. We tried this and it showed much better results, so we kept both systems and multiply their similarities afterwards.</li>
60
- <li><a href="https://hf.co/IIC/mt5-base-lfqa-es">Generative Long-Form Question Answering Model</a>: For this we used either mT5 (the one attached) or <a href="https://hf.co/IIC/mbart-large-lfqa-es">mBART</a>. This generative model receives the most relevant passages and uses them to generate an answer to the question. In the attached link there are more details about how we trained it etc.</li>
61
  <li><a href="https://huggingface.co/facebook/tts_transformer-es-css10">Text2Speech</a>: For this we used Meta's text2speech service on Huggingface, as text2speech classes are not yet implemented on the main branch of Transformers. This piece was a must to provide a voice to voice service so that it's almost fully accessible. As future work, as soon as text2speech classes are implemented on transformers, we will train our own models to replace this piece.</li>
62
  </ol>
63
 
@@ -75,13 +75,13 @@ Datasets used and created
75
  We uploaded, and in some cases created, datasets in Spanish to be able to build such a system.
76
 
77
  <ol>
78
- <li><a href="https://hf.co/datasets/IIC/spanish_biomedical_crawled_corpus">Spanish Biomedical Crawled Corpus</a>. Used for finding answers to questions about biomedicine. (More info in the link.)</li>
79
- <li><a href="https://hf.co/datasets/IIC/lfqa_spanish">LFQA_Spanish</a>. Used for training the generative model. (More info in the link.)</li>
80
- <li><a href="https://hf.co/datasets/squad_es">SQUADES</a>. Used to train the DPR models. (More info in the link.)</li>
81
- <li><a href="https://hf.co/datasets/IIC/bioasq22_es">BioAsq22-Spanish</a>. Used to train the DPR models. (More info in the link.)</li>
82
- <li><a href="https://hf.co/datasets/PlanTL-GOB-ES/SQAC">SQAC (Spanish Question Answering Corpus)</a>. Used to train the DPR models. (More info in the link.)</li>
83
- <li><a href="https://huggingface.co/datasets/IIC/msmarco_es">MSMARCO-ES</a>. Used to train CrossEncoder in Spanish for Ranker.</li>
84
- <li><a href="https://huggingface.co/datasets/multilingual_librispeech">MultiLibrispeech</a>. Used to train the Speech2Text model in Spanish. (More info in the link.)</li>
85
  </ol>
86
 
87
  <h3>
@@ -93,6 +93,18 @@ We uploaded, and in some cases created, datasets in Spanish to be able to build
93
  <li><a href="https://www.un.org/sustainabledevelopment/es/education/">Quality education</a>: by offering the world an advanced information-query system, we help complement and improve the existing quality systems of the biomedical world, since students have a system for learning about this field by interacting, through our models, with a large knowledge base on the subject.</li>
94
  <li><a href="https://www.un.org/sustainabledevelopment/es/inequality/">Reduced inequalities</a>: By building an end-to-end voice-to-voice system in which using the keyboard would not be necessary (*), we promote the accessibility of the tool. The aim is for people who cannot read or write, or who have difficulty doing so, to have the opportunity to interact with BioMedIA. We saw the need to make this system as flexible as possible, so that it would be easy to interact with regardless of any physical difficulties or limitations people might have. By including voice output, those with vision problems can also receive answers to their questions. This reduces inequalities in access to the tool for people with any of these impairments. Moreover, by creating a free tool for accessing knowledge, available anywhere in the world with Internet access, we reduce inequalities in access to information. </li>
95
  </ol>
96
  </p>
97
 
98
  (*) Note that in the current demo of the system the user needs to perform minimal keyboard and mouse interaction. This is due to a design limitation of Huggingface Spaces. Nevertheless, the technologies developed would allow integration into a purely voice-driven interaction system.
 
53
  <img src="https://drive.google.com/uc?export=view&id=1_iUdUMPR5u1p9767YVRbCZkobt_fOozD">
54
 
55
  <ol>
56
+ <li><a href="https://hf.co/IIC/wav2vec2-spanish-multilibrispeech">Speech2Text</a>: For this we fine-tuned a multilingual Wav2Vec2, as explained in the attached link. We use this model to process audio questions. More info: https://hf.co/IIC/wav2vec2-spanish-multilibrispeech</li>
57
+ <li><a href="https://hf.co/IIC/dpr-spanish-passage_encoder-allqa-base">Dense Passage Retrieval (DPR) for Context</a>: Dense Passage Retrieval is a methodology <a href="https://arxiv.org/abs/2004.04906">developed by Facebook</a> which is currently the SoTA for Passage Retrieval, that is, the task of getting the most relevant passages to answer a given question. You can find details about how it was trained here: https://hf.co/IIC/dpr-spanish-passage_encoder-allqa-base. </li>
58
+ <li><a href="https://hf.co/IIC/dpr-spanish-question_encoder-allqa-base">Dense Passage Retrieval (DPR) for Question</a>: It is actually part of the same thing as the above. For more details, go to https://hf.co/IIC/dpr-spanish-question_encoder-allqa-base .</li>
59
  <li><a href="https://hf.co/sentence-transformers/distiluse-base-multilingual-cased-v1">Sentence Encoder Ranker</a>: Reranks the candidate contexts retrieved by DPR and selects the top 5 passages for the generative model to read; it is the final filter before the generative model. We human-checked the answers under 3 different configurations (that's us seriously playing with our toy), as the generated answers depended heavily on this piece of the puzzle. The first option, before we trained our own cross-encoder, was to use a <a href="https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1">multilingual sentence transformer</a>, trained on multilingual MS Marco. This worked reasonably well, although it was noticeable that it wasn't specialized in Spanish. We then tried our own CrossEncoder, trained on our <a href="https://huggingface.co/datasets/IIC/msmarco_es">translated version of MS Marco to Spanish</a>. It worked better than the sentence transformer. Then it occurred to us, by looking at their rank distributions for the same passages, that by multiplying their similarity scores element by element we might obtain a less biased ranking of the documents, so that only the documents both rankers agree are important appear at the top. We tried this and it showed much better results, so we kept both systems and multiply their similarities afterwards.</li>
60
+ <li><a href="https://hf.co/IIC/mt5-base-lfqa-es">Generative Long-Form Question Answering Model</a>: For this we used either mT5 (the one attached) or <a href="https://hf.co/IIC/mbart-large-lfqa-es">mBART</a>. This generative model receives the most relevant passages and uses them to generate an answer to the question. In https://hf.co/IIC/mt5-base-lfqa-es and https://hf.co/IIC/mbart-large-lfqa-es there are more details about how we trained it etc.</li>
61
  <li><a href="https://huggingface.co/facebook/tts_transformer-es-css10">Text2Speech</a>: For this we used Meta's text2speech service on Huggingface, as text2speech classes are not yet implemented on the main branch of Transformers. This piece was a must to provide a voice to voice service so that it's almost fully accessible. As future work, as soon as text2speech classes are implemented on transformers, we will train our own models to replace this piece.</li>
62
  </ol>
63
 
 
75
  We uploaded, and in some cases created, datasets in Spanish to be able to build such a system.
76
 
77
  <ol>
78
+ <li><a href="https://hf.co/datasets/IIC/spanish_biomedical_crawled_corpus">Spanish Biomedical Crawled Corpus</a>. Used for finding answers to questions about biomedicine. (More info in https://hf.co/datasets/IIC/spanish_biomedical_crawled_corpus .)</li>
79
+ <li><a href="https://hf.co/datasets/IIC/lfqa_spanish">LFQA_Spanish</a>. Used for training the generative model. (More info in https://hf.co/datasets/IIC/lfqa_spanish )</li>
80
+ <li><a href="https://hf.co/datasets/squad_es">SQUADES</a>. Used to train the DPR models. (More info in https://hf.co/datasets/squad_es .)</li>
81
+ <li><a href="https://hf.co/datasets/IIC/bioasq22_es">BioAsq22-Spanish</a>. Used to train the DPR models. (More info in https://hf.co/datasets/IIC/bioasq22_es .)</li>
82
+ <li><a href="https://hf.co/datasets/PlanTL-GOB-ES/SQAC">SQAC (Spanish Question Answering Corpus)</a>. Used to train the DPR models. (More info in https://hf.co/datasets/PlanTL-GOB-ES/SQAC .)</li>
83
+ <li><a href="https://huggingface.co/datasets/IIC/msmarco_es">MSMARCO-ES</a>. Used to train CrossEncoder in Spanish for Ranker. (More info in https://huggingface.co/datasets/IIC/msmarco_es .)</li>
84
+ <li><a href="https://huggingface.co/datasets/multilingual_librispeech">MultiLibrispeech</a>. Used to train the Speech2Text model in Spanish. (More info in https://huggingface.co/datasets/multilingual_librispeech .)</li>
85
  </ol>
86
 
87
  <h3>
 
93
  <li><a href="https://www.un.org/sustainabledevelopment/es/education/">Quality education</a>: by offering the world an advanced information-query system, we help complement and improve the existing quality systems of the biomedical world, since students have a system for learning about this field by interacting, through our models, with a large knowledge base on the subject.</li>
94
  <li><a href="https://www.un.org/sustainabledevelopment/es/inequality/">Reduced inequalities</a>: By building an end-to-end voice-to-voice system in which using the keyboard would not be necessary (*), we promote the accessibility of the tool. The aim is for people who cannot read or write, or who have difficulty doing so, to have the opportunity to interact with BioMedIA. We saw the need to make this system as flexible as possible, so that it would be easy to interact with regardless of any physical difficulties or limitations people might have. By including voice output, those with vision problems can also receive answers to their questions. This reduces inequalities in access to the tool for people with any of these impairments. Moreover, by creating a free tool for accessing knowledge, available anywhere in the world with Internet access, we reduce inequalities in access to information. </li>
95
  </ol>
96
+
97
+ <h3>
98
+ Contact
99
+ </h3>
100
+
101
+ <ul>
102
+ <li>Alejandro Vaca Serrano. <a href="https://www.linkedin.com/in/alejandro-vaca-serrano/">LinkedIn</a> </li>
103
+ <li>David Betancur Sánchez. <a href="https://www.linkedin.com/in/david-betancur-s%C3%A1nchez-714a79154/">LinkedIn</a> </li>
104
+ <li>Alba Segurado. <a href="https://www.linkedin.com/in/alba-segurado-data-science/">LinkedIn.</a> </li>
105
+ <li>Álvaro Barbero Jiménez. <a href="https://twitter.com/albarjip">Twitter </a></li>
106
+ <li>Guillem García Subies. <a href="https://www.linkedin.com/in/guillemgsubies/">LinkedIn</a> </li>
107
+ </ul>
108
  </p>
109
 
110
  (*) Note that in the current demo of the system the user needs to perform minimal keyboard and mouse interaction. This is due to a design limitation of Huggingface Spaces. Nevertheless, the technologies developed would allow integration into a purely voice-driven interaction system.
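
Note on the Sentence Encoder Ranker item in this diff: the article text describes multiplying the similarity scores of two rankers element by element and passing the top 5 passages to the generative model. The sketch below illustrates that idea only; it is not the code in article_app.py. The function names, the min-max rescaling step (added here so that scores on different scales, or negative scores, multiply sensibly), and the sample values are all assumptions made for illustration.

```python
from typing import List, Sequence


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]. This step is an assumption of the sketch,
    not something the article states."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores equal: treat every passage as equally relevant.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_and_select(
    passages: Sequence[str],
    bi_encoder_scores: Sequence[float],
    cross_encoder_scores: Sequence[float],
    top_k: int = 5,
) -> List[str]:
    """Multiply the two rankers' scores per passage and keep the top_k passages."""
    if not (len(passages) == len(bi_encoder_scores) == len(cross_encoder_scores)):
        raise ValueError("passages and both score lists must have the same length")
    if not passages:
        return []
    # Element-wise product: a passage ranks high only if both rankers score it high.
    fused = [
        a * b
        for a, b in zip(min_max(bi_encoder_scores), min_max(cross_encoder_scores))
    ]
    order = sorted(range(len(passages)), key=lambda i: fused[i], reverse=True)
    return [passages[i] for i in order[:top_k]]


if __name__ == "__main__":
    # Hypothetical passages and scores, for illustration only.
    candidates = ["passage a", "passage b", "passage c"]
    print(fuse_and_select(candidates, [0.9, 0.1, 0.8], [0.2, 0.9, 0.7], top_k=2))
```

Because the product is small whenever either score is small, a passage that only one ranker favors drops down the list, which matches the article's observation that only documents both rankers agree on appear at the top.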