avacaondata committed on
Commit
e7cc6f5
1 Parent(s): f8cbd5a

added some juicy details

  1. article_app.py +3 -3
article_app.py CHANGED
@@ -58,10 +58,10 @@ Below you can find all the pieces that form the system. This section is minimali
  <li><a href="https://huggingface.co/IIC/dpr-spanish-question_encoder-allqa-base">Dense Passage Retrieval (DPR) for Question</a>: It is actually part of the same thing as the above. For more details, go to https://huggingface.co/IIC/dpr-spanish-question_encoder-allqa-base .</li>
  <li><a href="https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1">Sentence Encoder Ranker</a>: To rerank the candidate contexts retrieved by DPR for the generative model to see. This also selects the top 5 passages for the model to read, it is the final filter before the generative model. For this we used 3 different configurations to human-check (that's us seriously playing with our toy) the answer results, as generated answers depended much on this piece of the puzzle. The first option, before we trained our own crossencoder, was to use a <a href="https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1">multilingual sentence transformer</a>, trained on multilingual MS Marco. This worked more or less fine, although it was noticeable it wasn't specialized in Spanish. We then tried our own CrossEncoder, trained on our translated version of MS Marco to Spanish: https://huggingface.co/datasets/IIC/msmarco_es. It worked better than the sentence transformer. Then, it occured to us by looking at their ranks distributions for the same passages, that maybe by multiplying their similarity scores element by element, we could obtain a less biased rank for the documents, therefore only those documents both rankers agree are important appear at the top. We tried this and it showed much better results, so we left both systems with the posterior multiplication of similarities.</li>
  <li><a href="https://huggingface.co/IIC/mt5-base-lfqa-es">Generative Long-Form Question Answering Model</a>: For this we used either mT5 (the one attached) or <a href="https://huggingface.co/IIC/mbart-large-lfqa-es">mBART</a>. This generative model receives the most relevant passages and uses them to generate an answer to the question. In https://huggingface.co/IIC/mt5-base-lfqa-es and https://huggingface.co/IIC/mbart-large-lfqa-es there are more details about how we trained it etc.</li>
- <li><a href="https://huggingface.co/facebook/tts_transformer-es-css10">Text2Speech</a>: For this we used Meta's text2speech service on Huggingface, as text2speech classes are not yet implemented on the main branch of Transformers. This piece was a must to provide a voice to voice service so that it's almost fully accessible. As future work, as soon as text2speech classes are implemented on transformers, we will train our own models to replace this piece.</li>
+ <li><a href="https://huggingface.co/facebook/tts_transformer-es-css10">Text2Speech</a>: For this we used Meta's text2speech service on Huggingface, as text2speech classes are not yet implemented on the main branch of Transformers. This piece was a must to provide a voice to voice service so that it's almost fully accessible. As future work, as soon as text2speech classes are implemented in transformers, we will train our own models to replace this piece.</li>
  </ol>
 
- Apart from those, this system could not respond in less than a minute on CPU if we didn't use some indexing tricks on the dataset, by using <a href="https://github.com/facebookresearch/faiss">Faiss</a>. We need to look for relevant passages to answer the questions on over 1.5M of semi-long documents, which means that if we want to compare the question vector as encoded by DPR against all of that vectors, we have to perform over 1.5M comparisons. Instead of that, we created a FAISS index optimized for very fast search, configured as follows:
+ Apart from those, this system could not respond in less than a minute on CPU if we didn't use some indexing tricks on the dataset, by using <a href="https://github.com/facebookresearch/faiss">Faiss</a>. We need to look for relevant passages to answer the questions on over 1.5M of semi-long documents, which means that if we want to compare the question vector as encoded by DPR against all of those vectors, we have to perform over 1.5M comparisons. Instead of that, we created a FAISS index optimized for very fast search, configured as follows:
  <ul>
  <li> A dimensionality reduction method is applied to to represent each one of the 1.5M documents as a vector of 128 elements, which after some quantization algorithms requires only 32 bytes of memory per vector.</li>
  <li>Document vectors are clusted with k-means into about 5K clusters.</li>
@@ -118,7 +118,7 @@ description = """
  </a>
  <h1 font-family: Georgia, serif;> BioMedIA: Abstractive Question Answering for the BioMedical Domain in Spanish </h1>
  <p> Esta aplicación utiliza un avanzado sistema de búsqueda para obtener textos relevantes acerca de tu pregunta, usando toda esa información para tratar de condensarla en una explicación coherente y autocontenida. Más detalles y ejemplos de preguntas en la sección inferior.
- El sistema generativo puede tardar entre 20 y 50s en general, por lo que en esos ratos mientras esperas las respuestas, te invitamos a que bucees por el artículo que hemos dejado debajo de los ejemplos de la App, en el que podrás descubrir más detalles acerca de cómo funciona .
+ El sistema generativo puede tardar entre 20 y 50s en general, por lo que en esos ratos mientras esperas las respuestas, te invitamos a que bucees por el artículo que hemos dejado debajo de los ejemplos de la App, en el que podrás descubrir más detalles acerca de cómo funciona &#128214; &#129299;.
  Los miembros del equipo:
  <ul>
  <li>Alejandro Vaca Serrano: <a href="https://huggingface.co/avacaondata">@avacaondata</a></li>
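
Note on the reranking step described in the "Sentence Encoder Ranker" item of this diff: the article text says the two rankers' similarity scores are multiplied element by element, so that only passages both rankers rate highly reach the top 5. The following is a minimal sketch of that idea, not the code in article_app.py. The function name `fuse_and_rank`, its parameters, and the example scores are all hypothetical, and it assumes both score lists are non-negative and aligned with the passage list.

```python
from typing import List, Sequence, Tuple


def fuse_and_rank(
    passages: Sequence[str],
    bi_encoder_scores: Sequence[float],
    cross_encoder_scores: Sequence[float],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Multiply two rankers' scores element by element and keep the top_k passages.

    Hypothetical helper for illustration only.
    """
    if not (len(passages) == len(bi_encoder_scores) == len(cross_encoder_scores)):
        raise ValueError("passages and both score lists must have the same length")
    # Assumption: scores are non-negative (e.g. already mapped to [0, 1]).
    # With negative scores, two strongly negative values would multiply into
    # a large positive product and look like agreement.
    fused = [b * c for b, c in zip(bi_encoder_scores, cross_encoder_scores)]
    # Sort passages by fused score, highest first, and truncate to top_k.
    ranked = sorted(zip(passages, fused), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Made-up scores: p0 is liked by only one ranker, p2 by both.
passages = ["p0", "p1", "p2", "p3"]
bi = [0.9, 0.2, 0.8, 0.5]
cross = [0.1, 0.9, 0.7, 0.5]
top = fuse_and_rank(passages, bi, cross, top_k=2)
```

In this toy input, a passage that only one ranker scores highly (p0 or p1) ends up below a passage both rankers score moderately well (p3), which is the "both rankers agree" effect the article describes.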