anakin87 committed
Commit 16cd190 • 1 Parent(s): bcb986c

generate info page from readme

README.md CHANGED
@@ -10,23 +10,39 @@ pinned: false
 license: apache-2.0
 ---
 
-# Fact Checking rocks!   [![Generic badge](https://img.shields.io/badge/🤗-Open%20in%20Spaces-blue.svg)](https://huggingface.co/spaces/anakin87/fact-checking-rocks) [![Generic badge](https://img.shields.io/github/stars/anakin87/fact-checking-rocks?label=Github&style=social)](https://github.com/anakin87/fact-checking-rocks)
+# Fact Checking 🎸 Rocks!   [![Generic badge](https://img.shields.io/badge/🤗-Open%20in%20Spaces-blue.svg)](https://huggingface.co/spaces/anakin87/fact-checking-rocks) [![Generic badge](https://img.shields.io/github/stars/anakin87/fact-checking-rocks?label=Github&style=social)](https://github.com/anakin87/fact-checking-rocks)
 
 ## *Fact checking baseline combining dense retrieval and textual entailment*
 
 ### Idea 💡
-This project aims to show that a naive and simple baseline for fact checking can be built by combining dense retrieval and a textual entailment task (based on Natural Language Inference models).
+This project aims to show that a *naive and simple baseline* for fact checking can be built by combining dense retrieval and a textual entailment task (based on Natural Language Inference models).
+In a nutshell, the flow is as follows:
+* the user enters a factual statement
+* the relevant passages are retrieved from the knowledge base using dense retrieval
+* the system computes the text entailment between each relevant passage and the statement, using a Natural Language Inference model
+* the entailment scores are aggregated to produce a summary score.
 
-### System description
-This project is strongly based on [Haystack](https://github.com/deepset-ai/haystack), an open source NLP framework to realize search system. The main components of our system are an indexing pipeline and a search pipeline.
+### System description 🪄
+This project is strongly based on [🔎 Haystack](https://github.com/deepset-ai/haystack), an open-source NLP framework for building search systems. The main components of our system are an indexing pipeline and a search pipeline.
 
 #### Indexing pipeline
 * [Crawling](https://github.com/anakin87/fact-checking-rocks/blob/321ba7893bbe79582f8c052493acfda497c5b785/notebooks/get_wikipedia_data.ipynb): Crawl data from Wikipedia, starting from the page [List of mainstream rock performers](https://en.wikipedia.org/wiki/List_of_mainstream_rock_performers) and using the [python wrapper](https://github.com/goldsmith/Wikipedia)
-* [Indexing through Haystack](https://github.com/anakin87/fact-checking-rocks/blob/321ba7893bbe79582f8c052493acfda497c5b785/notebooks/indexing.ipynb)
-  * Preprocess the downloaded documents into chunks consisting of 2 sentences
-  * Chunks with less than 10 words are discarded, because not very informative
-  * Instantiate a [FAISS](https://github.com/facebookresearch/faiss) Document store and store the passages on it
-  * Create embeddings for the passages, using a Sentence Transformer model and save them in FAISS. It seems that the retrieval task will involve [*asymmetric semantic search*](https://www.sbert.net/examples/applications/semantic-search/README.html#symmetric-vs-asymmetric-semantic-search) (statements to be verified are usually shorter than inherent passages), therefore I choose the model `msmarco-distilbert-base-tas-b`.
-  * Save FAISS index
+* [Indexing](https://github.com/anakin87/fact-checking-rocks/blob/321ba7893bbe79582f8c052493acfda497c5b785/notebooks/indexing.ipynb)
+  * preprocess the downloaded documents into chunks consisting of 2 sentences
+  * chunks with fewer than 10 words are discarded, as they are not very informative
+  * instantiate a [FAISS](https://github.com/facebookresearch/faiss) Document store and store the passages in it
+  * create embeddings for the passages using a Sentence Transformer model and save them in FAISS. The retrieval task involves [*asymmetric semantic search*](https://www.sbert.net/examples/applications/semantic-search/README.html#symmetric-vs-asymmetric-semantic-search) (statements to be verified are usually shorter than the relevant passages), therefore I chose the model `msmarco-distilbert-base-tas-b`.
+  * save the FAISS index *(a code sketch of these steps follows)*
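For illustration, the indexing steps above could look roughly like this with the Haystack 1.x API. The embedding model name comes from the README; `raw_docs`, the FAISS parameters, and the file path are illustrative assumptions, not the project's exact code:

```python
# Illustrative sketch of the indexing pipeline (Haystack 1.x API);
# parameters and paths are assumptions, not the project's exact code.
from haystack.document_stores import FAISSDocumentStore
from haystack.nodes import EmbeddingRetriever, PreProcessor

raw_docs = [{"content": "..."}]  # documents downloaded from Wikipedia (placeholder)

# Split documents into chunks of 2 sentences each.
preprocessor = PreProcessor(
    split_by="sentence",
    split_length=2,
    split_respect_sentence_boundary=False,
)
chunks = preprocessor.process(raw_docs)
# Chunks with fewer than 10 words are not very informative: discard them.
chunks = [c for c in chunks if len(c.content.split()) >= 10]

# Store the passages in FAISS and embed them with the asymmetric-search model.
document_store = FAISSDocumentStore(embedding_dim=768, faiss_index_factory_str="Flat")
document_store.write_documents(chunks)
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="sentence-transformers/msmarco-distilbert-base-tas-b",
    model_format="sentence_transformers",
)
document_store.update_embeddings(retriever)
document_store.save("faiss_index.faiss")  # persist the index for the search pipeline
```

`update_embeddings` computes and stores a vector for every passage, so the saved index can later be reloaded by the search pipeline without re-embedding the corpus.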
 
 #### Search pipeline
+
+* the user enters a factual statement
+* compute the embedding of the user statement using the same Sentence Transformer (`msmarco-distilbert-base-tas-b`)
+* retrieve the K most relevant text passages stored in FAISS (along with their relevance scores)
+* **text entailment task**: compute the text entailment between each text passage (premise) and the user statement (hypothesis), using a Natural Language Inference model (`microsoft/deberta-v2-xlarge-mnli`). For every text passage, we have 3 scores (summing to 1): entailment, contradiction, neutral. *(For this task, I developed a custom Haystack node: `EntailmentChecker`; see the sketch after this list)*
+* aggregate the text entailment scores: compute their weighted average, using the relevance scores as weights. **Now it is possible to tell whether the knowledge base confirms, is neutral about, or disproves the user statement.**
+* *empirical consideration: if the first N documents (N<K) show strong evidence of entailment/contradiction (partial aggregate scores > 0.5), it is better not to consider the less relevant documents*
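Below is a simplified sketch of the entailment and aggregation steps. The project implements these inside its custom `EntailmentChecker` node, so the helper functions here (`entailment_scores`, `aggregate_entailment`) are hypothetical stand-ins:

```python
# Simplified sketch of entailment scoring and relevance-weighted aggregation.
# The helper names are hypothetical; the real logic lives in EntailmentChecker.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "microsoft/deberta-v2-xlarge-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def entailment_scores(premise: str, hypothesis: str) -> dict:
    """Return {entailment, contradiction, neutral} probabilities (summing to 1)."""
    inputs = tokenizer(premise, hypothesis, truncation=True, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits[0], dim=-1)
    # Read the label order from the model config instead of hardcoding it.
    return {model.config.id2label[i].lower(): p.item() for i, p in enumerate(probs)}

def aggregate_entailment(passages, statement, threshold=0.5):
    """Relevance-weighted average of per-passage entailment scores.

    `passages`: (text, relevance) pairs sorted by decreasing relevance,
    e.g. the top-K results returned by the dense retriever.
    """
    totals = {"entailment": 0.0, "contradiction": 0.0, "neutral": 0.0}
    weight_sum = 0.0
    aggregated = dict(totals)
    for text, relevance in passages:
        scores = entailment_scores(text, statement)
        for label in totals:
            totals[label] += relevance * scores[label]
        weight_sum += relevance
        aggregated = {label: total / weight_sum for label, total in totals.items()}
        # Empirical early stop: strong partial evidence of entailment or
        # contradiction means the less relevant documents can be skipped.
        if max(aggregated["entailment"], aggregated["contradiction"]) > threshold:
            break
    return aggregated
```

Whichever aggregated score dominates determines the verdict: a high entailment score means the knowledge base confirms the statement, a high contradiction score disproves it, and otherwise the result is neutral.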
+
+### Limits and possible improvements ✨
+
Rock_fact_checker.py CHANGED
@@ -30,7 +30,7 @@ def main():
     set_state_if_absent("raw_json", None)
     set_state_if_absent("random_statement_requested", False)
 
-    st.write("# Fact checking 🎸 Rocks!")
+    st.write("# Fact Checking 🎸 Rocks!")
     st.write()
     st.markdown(
         """
app_utils/frontend_utils.py CHANGED
@@ -11,7 +11,7 @@ entailment_html_messages = {
 
 def build_sidebar():
     sidebar="""
-    <h1 style='text-align: center'>Fact checking 🎸 Rocks!</h1>
+    <h1 style='text-align: center'>Fact Checking 🎸 Rocks!</h1>
     <div style='text-align: center'>
         <i>Fact checking baseline combining dense retrieval and textual entailment</i>
         <p><br/><a href='https://github.com/anakin87/fact-checking-rocks'>Github project</a> - Based on <a href='https://github.com/deepset-ai/haystack'>Haystack</a></p>
pages/Info.py CHANGED
@@ -1,4 +1,9 @@
 import streamlit as st
 from app_utils.frontend_utils import build_sidebar
 
-build_sidebar()
+build_sidebar()
+
+with open('README.md', 'r') as fin:
+    readme = fin.read().rpartition('---')[-1]
+
+st.markdown(readme, unsafe_allow_html=True)
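A note on the `rpartition` trick above: `fin.read().rpartition('---')[-1]` keeps only the text after the last occurrence of `---`, which strips the YAML front matter (the block delimited by `---` lines at the top of `README.md`, visible in the diff above) so that only the Markdown body is rendered on the Info page. This assumes `---` never appears later in the README body; a horizontal rule there would truncate the rendered text.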