avacaondata committed on
Commit
93171da
1 Parent(s): 851edbd

added the last (hopefully) changes

Files changed (1)
  1. article_app.py +7 -7
article_app.py CHANGED
@@ -7,13 +7,13 @@ article = """
 Objectives and Motivation
 </h3>
 
-It has been shown recently that the research in the Biomedical field is substantial for the sustainability of the society. There is so much information in the Internet about this topic,
+It has been shown recently that the research in the Biomedical field is substantial for the sustainability of society. There is so much information in the Internet about this topic,
 we thought it would be possible to have a big database of biomedical texts to retrieve the most relevant documents for a certain question and, with all that information, generate a concise answer that tries to convey the documents' information while being self-explanatory.
 With such a tool, Biomedical researchers or professionals could use it to quickly identify the key points for answering a question, therefore accelerating their research process. Also, we would put important health-related information in the hands of everyone, which we think can have
 a very good impact on society. Health is a hot topic today but should be always in the top of our priorities, therefore providing quick and easy access to understandable answers that convey complex information into simple explanations is, in our opinion, an action in the right direction.
 
 We identified the need for strong intelligent information retrieval systems. Imagine a Siri that could generate coherent answers for your questions, instead of simplistic google search for you. That is the technology we envision, to which we would like the Spanish community of
-NLP to get a little step closer.
+NLP to get a little step closer. Hackaton Somos NLP 2022 is actually intended to boost NLP tools in Spanish, as there is an imbalance between the amount of Spanish speakers and the percentage of Spanish models and datasets in the hub.
 
 The main technical objective of this app is to expand the existing tools regarding long form question answering in Spanish, by introducing new generative methods together with a complete architecture of good performing models, producing interesting results in a variety of examples tried.
 In fact, multiple novel methods in Spanish have been introduced to build this app.
@@ -29,21 +29,21 @@ but also for building a Dense Passage Retrieval (DPR) dataset to train a DPR mod
 
 The fragility of the solution we devised, and therefore also the most beautiful side of it when it works, is that every piece must work perfectly for the final answer to be correct. If our Speech2Text system is not
 good enough, the transcripted text will come corrupt to the DPR, therefore no relevant documents will be retrieved, and the answer will be poor. Similarly, if the DPR is not correctly trained and is not able to identify the relevant passages for a query, the result will be bad.
-This also served as a motivation, as the technical difficulty was completely worth it in cased it worked. Moreover, it would serve for us as a service to the NLP community in Spanish, as for building this app we would use much of what we learned from the private sector in building good performing systems
-relying on multiple models to deliver to the community top performing models for Question Answering related tasks, thus participating in the Open Source culture and expansion of knowledge. Another objective we had, then, was to give a practical example sample of good practices,
+This also served as a motivation, as the technical difficulty was completely worth it in case it worked. Moreover, it would serve for us as a service to the NLP community in Spanish. For building this app we would use much of what we learned from the private sector in building systems
+relying on multiple models, to deliver to the community top performing models for Question Answering related tasks, thus participating in the Open Source culture and expansion of knowledge. Another objective we had, then, was to give a practical example of good practices,
 which fits with the didactic character of both the organization and the Hackaton.
 
-Regarding the Speech2Text, there were existing solutions trained on Commonvoice; however, there were no Spanish models trained with big datasets like MultiLibrispeech-es, which we used following the results reported in Meta's paper (more info in the linked wav2vec2 model above). We also decided
+Regarding the Speech2Text, there were existing solutions trained on Commonvoice; however, there were no Spanish models trained with bigger datasets like MultiLibrispeech-es, which we used following the results reported in Meta's paper (more info in the linked wav2vec2 model above). We also decided
 to train the large version of wav2vec2, as the other ASR models that were available were 300M parameter models, therefore we also wanted to improve on that part, not only on the dataset used. We obtained a WER of 0.073, which is arguably low compared to the rest of the existing models on ASR
 datasets in Spanish. Further research should be made to compare all of these models, however this was out of the scope for this project.
 
 Another contribution we wanted to make with this project was a good performing ranker in Spanish. This is a piece we include after the DPR to select the top passages for a query to rank passages based on relevance to the query. Although there are multilingual open source solutions, there are no Spanish monolingual models in this regard.
 For that, we trained CrossEncoder, for which we automatically translated <a href="https://microsoft.github.io/msmarco/">MS Marco</a> with Transformer, which has around 200k query-passage pairs, if we take 1 positive to 4 negative rate from the papers. MS Marco is the dataset typically used in English to train crossencoders for ranking.
 
-Finally, there are certainly not generative question answering datasets in Spanish. For that reason, we used LFQA, as mentioned above. It has over 400k data instances, which we also translated with Transformers.
+Finally, there are no generative question answering datasets in Spanish. For that reason, we used LFQA, as mentioned above. It has over 400k data instances, which we also translated with Transformers.
 Our translation methods needed to work correclty, since the passages were too large for the max sequence length of the translation model and there were 400k x 3 (answer, question, passages) texts to translate.
 We solved those problems with intelligent text splitting and reconstruction and efficient configuration for the translation process. Thanks to this dataset we could train 2 generative models, for which we used our expertise on generative language models in order to train them effectively.
-The reason for including audio as a possible input and output is because we wanted to make the App much more accessible to everyone. With this App we want to put biomedical knowledge in Spanish within everyone's reach.
+The reason for including audio as a possible input and output is that we wanted to make the App much more accessible to everyone. With this App we want to put biomedical knowledge in Spanish within everyone's reach.
 
 <h3 font-family: Georgia, serif;>
 System Architecture
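Editor's note: the retrieve, re-rank, generate flow the article text describes can be sketched in miniature. The functions below are toy word-overlap stand-ins, not the app's actual models (the real pipeline uses wav2vec2 for speech, a trained DPR, the monolingual CrossEncoder, and an LFQA-trained generator, and the speech stage is omitted here); the sketch only shows how the stages compose.

```python
# Toy sketch of the article's pipeline: retrieval -> re-ranking -> answer.
# All scoring here is hypothetical word overlap, standing in for the real
# DPR, CrossEncoder, and generative models mentioned in the commit.

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Stand-in for DPR: rank passages by raw word overlap with the query."""
    q = set(query.lower().split())
    return sorted(corpus,
                  key=lambda p: len(q & set(p.lower().split())),
                  reverse=True)[:k]

def rerank(query: str, passages: list[str]) -> list[str]:
    """Stand-in for the CrossEncoder: overlap normalized by passage length."""
    q = set(query.lower().split())
    return sorted(passages,
                  key=lambda p: len(q & set(p.lower().split())) / max(len(p.split()), 1),
                  reverse=True)

def generate_answer(query: str, passages: list[str]) -> str:
    """Stand-in for the generative reader: echo the top-ranked passage."""
    return passages[0] if passages else ""

corpus = [
    "la aspirina reduce la fiebre y el dolor",
    "el futbol es un deporte popular",
    "la vacuna contra la gripe se administra cada anio",
]
query = "como reduce la aspirina el dolor"
top = rerank(query, retrieve(query, corpus))
print(generate_answer(query, top))
```

As in the article's fragility argument, each stage feeds the next: a bad retrieval stage starves the re-ranker, and a bad re-ranker starves the generator.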
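Editor's note: the "intelligent text splitting and reconstruction" the article mentions for translating LFQA passages can be illustrated with a stdlib-only sketch. This is an assumption about the approach, not the app's code: it splits at sentence boundaries under a word budget (the real constraint would be the translation model's token limit, so `max_words` is a hypothetical stand-in), translates each chunk, and stitches the outputs back together.

```python
import re

def split_for_translation(text: str, max_words: int = 50) -> list[str]:
    """Split a long passage at sentence boundaries so every chunk
    stays under the (hypothetical) length budget of the MT model."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for s in sentences:
        n = len(s.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(s)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

def translate_long(text: str, translate, max_words: int = 50) -> str:
    """Translate chunk by chunk and reconstruct the passage.
    `translate` is any chunk-level function (e.g. a wrapped MT model call)."""
    return " ".join(translate(c) for c in split_for_translation(text, max_words))
```

With an identity `translate`, reconstruction returns the original text, which is a cheap sanity check before plugging in a real model.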