anakin87 committed on
Commit
bcb986c
•
1 Parent(s): 321ba78

progress in readme

README.md CHANGED
@@ -11,3 +11,22 @@ license: apache-2.0
11
  ---
12
 
13
  # Fact Checking rocks!   [![Generic badge](https://img.shields.io/badge/🤗-Open%20in%20Spaces-blue.svg)](https://huggingface.co/spaces/anakin87/fact-checking-rocks) [![Generic badge](https://img.shields.io/github/stars/anakin87/fact-checking-rocks?label=Github&style=social)](https://github.com/anakin87/fact-checking-rocks)
14
+
15
+ ## *Fact checking baseline combining dense retrieval and textual entailment*
16
+
17
+ ### Idea 💡
18
+ This project aims to show that a simple, naive baseline for fact checking can be built by combining dense retrieval with textual entailment (based on Natural Language Inference models).
19
+
20
+ ### System description
21
+ This project relies heavily on [Haystack](https://github.com/deepset-ai/haystack), an open-source NLP framework for building search systems. The main components of our system are an indexing pipeline and a search pipeline.
22
+
23
+ #### Indexing pipeline
24
+ * [Crawling](https://github.com/anakin87/fact-checking-rocks/blob/321ba7893bbe79582f8c052493acfda497c5b785/notebooks/get_wikipedia_data.ipynb): Crawl data from Wikipedia, starting from the page [List of mainstream rock performers](https://en.wikipedia.org/wiki/List_of_mainstream_rock_performers) and using the [Python wrapper](https://github.com/goldsmith/Wikipedia)
25
+ * [Indexing through Haystack](https://github.com/anakin87/fact-checking-rocks/blob/321ba7893bbe79582f8c052493acfda497c5b785/notebooks/indexing.ipynb)
26
+   * Split the downloaded documents into chunks of 2 sentences
27
+   * Discard chunks with fewer than 10 words, because they are not very informative
28
+   * Instantiate a [FAISS](https://github.com/facebookresearch/faiss) document store and store the passages in it
29
+   * Create embeddings for the passages using a Sentence Transformer model and save them in FAISS. The retrieval task appears to involve [*asymmetric semantic search*](https://www.sbert.net/examples/applications/semantic-search/README.html#symmetric-vs-asymmetric-semantic-search) (statements to be verified are usually shorter than the relevant passages), so I chose the model `msmarco-distilbert-base-tas-b`.
30
+   * Save the FAISS index
31
+
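The chunking rule described in the indexing steps (2 sentences per chunk, chunks under 10 words discarded) can be sketched in plain Python. This is only an illustration, not the code from the indexing notebook: the function name `split_into_chunks`, its parameters, and the regex-based sentence splitter are assumptions made for this example, and the actual project delegates preprocessing to Haystack.

```python
import re


def split_into_chunks(text, sentences_per_chunk=2, min_words=10):
    """Group sentences into fixed-size chunks and drop very short chunks.

    Hypothetical helper for illustration only; the sentence splitting
    here is a naive regex, not the splitter used by Haystack.
    """
    # Split after '.', '!' or '?' when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks = []
    for start in range(0, len(sentences), sentences_per_chunk):
        chunk = " ".join(sentences[start:start + sentences_per_chunk])
        # Keep only chunks with at least `min_words` words.
        if len(chunk.split()) >= min_words:
            chunks.append(chunk)
    return chunks


if __name__ == "__main__":
    sample = (
        "Toto are an American rock band formed in 1977 in Los Angeles. "
        "The band has sold many records worldwide. "
        "They still tour."
    )
    for chunk in split_into_chunks(sample):
        print(chunk)
```

With the sample text, the first two sentences form one chunk that is kept, while the trailing three-word sentence forms a chunk of its own and is dropped as uninformative.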
32
+ #### Search pipeline
app_utils/frontend_utils.py CHANGED
@@ -10,11 +10,15 @@ entailment_html_messages = {
10
  }
11
 
12
  def build_sidebar():
13
-     st.sidebar.markdown('# Fact checking 🎸 Rocks!')
14
-     st.sidebar.markdown('*Fact checking baseline combining dense retrieval and textual entailment*')
15
-     st.sidebar.markdown('[Github project](https://github.com/anakin87/fact-checking-rocks) - Based on [Haystack](https://github.com/deepset-ai/haystack)')
16
-     st.sidebar.markdown('<small>Data crawled from [Wikipedia](https://en.wikipedia.org/wiki/List_of_mainstream_rock_performers).</small>', unsafe_allow_html=True)
17
-
13
+     sidebar = """
14
+ <h1 style='text-align: center'>Fact checking 🎸 Rocks!</h1>
15
+ <div style='text-align: center'>
16
+ <i>Fact checking baseline combining dense retrieval and textual entailment</i>
17
+ <p><br/><a href='https://github.com/anakin87/fact-checking-rocks'>Github project</a> - Based on <a href='https://github.com/deepset-ai/haystack'>Haystack</a></p>
18
+ <p><small><a href='https://en.wikipedia.org/wiki/List_of_mainstream_rock_performers'>Data crawled from Wikipedia</a></small></p>
19
+ </div>
20
+ """
21
+     st.sidebar.markdown(sidebar, unsafe_allow_html=True)
22
 
23
  def set_state_if_absent(key, value):
24
      if key not in st.session_state:
data/statements.txt CHANGED
@@ -45,4 +45,5 @@ The Cure made dark songs
45
  Cannibal Corpse is a pop punk band
46
  Slipknot wear masks
47
  Toto have sold many records
48
- The verve were a British band
48
+ The verve were a British band
49
+ Psychokiller is a hit by Talking Heads
pages/Info.py CHANGED
@@ -1 +1,4 @@
1
  import streamlit as st
2
+ from app_utils.frontend_utils import build_sidebar
3
+
4
+ build_sidebar()