import streamlit as st
import pandas as pd
import numpy as np
import pickle
import ast
from huggingface_hub import hf_hub_download
from sentence_transformers import SentenceTransformer, util
from langdetect import detect
import plotly.express as px
from collections import Counter
# sidebar
with st.sidebar:
    st.header("What is a Semantic Search Engine?")
    st.markdown("**[Semantic Search](https://medium.com/nlplanet/semantic-search-with-few-lines-of-code-490df1d53fd6)** retrieves documents from a corpus based on the meaning of a search query. The search engine looks not only for exact text matches, but also for **overlapping semantic meaning** (e.g. synonyms and paraphrases).")
    st.markdown("This is different from a **text-matching search engine**, which looks for exact text matches only.")
    st.header("How does semantic search work?")
    st.markdown("The idea behind semantic search is to [embed](https://machinelearningmastery.com/what-are-word-embeddings/) all the entries in your corpus, which can be sentences, paragraphs, or documents, into a **vector space**. At search time, the query is embedded into the same vector space and the **[closest vectors](https://en.wikipedia.org/wiki/Cosine_similarity)** from your corpus are retrieved.")
    st.header("Useful libraries")
    st.markdown("""
- [`sentence-transformers`](https://sbert.net/): makes it easy to use strong pre-trained models for semantic search and includes a fast implementation of nearest-neighbor search by cosine similarity.
- [`faiss`](https://github.com/facebookresearch/faiss): enables efficient similarity search and clustering of dense vectors.
""")
    st.header("Useful links")
    st.markdown("""
- [Semantic Search with Sentence Transformers](https://medium.com/nlplanet/semantic-search-with-few-lines-of-code-490df1d53fd6)
- [Sentence Transformers cheatsheet](https://medium.com/nlplanet/two-minutes-nlp-sentence-transformers-cheat-sheet-2e9865083e7a)
""")
    st.header("Who made this?")
    st.markdown("Hi, I'm Fabio Chiusano. You can contact me on [LinkedIn](https://www.linkedin.com/in/fabio-chiusano-b6a3b311b/).")
# main content
st.header("Semantic Search Engine on [Medium](https://medium.com/) articles")
st.markdown("This is a small demo project of a semantic search engine over a dataset of ~190k Medium articles.")
st_placeholder_loading = st.empty()
st_placeholder_loading.text('Loading medium articles data...')
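# Download the article metadata and the precomputed corpus embeddings from the
# Hugging Face Hub, and load the sentence-transformers model used to embed
# queries. `st.cache` keeps them in memory across Streamlit reruns.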
@st.cache(allow_output_mutation=True)
def load_data():
    # Article metadata (title, url, tags) without the full article text.
    df_articles = pd.read_csv(hf_hub_download("fabiochiu/medium-articles", repo_type="dataset", filename="medium_articles_no_text.csv"))
    # Precomputed sentence embeddings for the whole corpus.
    embeddings_path = hf_hub_download("fabiochiu/medium-articles", repo_type="dataset", filename="medium_articles_embeddings.pickle")
    with open(embeddings_path, "rb") as f:
        corpus_embeddings = pickle.load(f)
    embedder = SentenceTransformer('all-MiniLM-L6-v2')
    return df_articles, corpus_embeddings, embedder
df_articles, corpus_embeddings, embedder = load_data()
st_placeholder_loading.empty()
n_top_tags = 20
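# Build a bar chart with the frequencies of the most common article tags
# (cached, since the computation scans the whole dataset).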
@st.cache()
def load_chart_top_tags():
    # Count occurrences of the top `n_top_tags` tags across all articles.
    # Tags are stored as stringified Python lists, hence `ast.literal_eval`.
    all_tags = [tag for tags_list in df_articles["tags"] for tag in ast.literal_eval(tags_list)]
    d_tags_counter = Counter(all_tags)
    tags, frequencies = list(zip(*d_tags_counter.most_common(n=n_top_tags)))
    fig = px.bar(x=tags, y=frequencies)
    fig.update_xaxes(title="tags")
    fig.update_yaxes(title="frequencies")
    return fig
fig_top_tags = load_chart_top_tags()
# collapse option to see more info about the data
with st.expander("See more info about data"):
st.markdown("### Where can I find the data")
st.markdown("You can find the data as a Hugging Face dataset [here](https://huggingface.co/datasets/fabiochiu/medium-articles).")
st.markdown(f"### The {n_top_tags} most occurring tags and their frequencies")
st.plotly_chart(fig_top_tags, use_container_width=True)
st.markdown(f"### Dataset creation")
st.markdown("The articles have been scraped with Python and the [requests](https://pypi.org/project/requests/) library. Because of the scraping process, scraped articles are coming from a not uniform publication date distribution. This means that there are articles published in 2016 and in 2022, but the number of articles in this dataset published in 2016 is not the same as the number of articles published in 2022. In particular, there is a strong prevalence of articles published in 2020. Have a look at the [accompanying notebook](https://www.kaggle.com/code/fabiochiusano/medium-articles-simple-data-analysis) to see the distribution of the publication dates.")
# collapse option to see a comparison between different search engine types
with st.expander("Semantic search engine vs Text match search engine"):
st.markdown("""
Here's a brief comparison between them:
- Generally, a semantic search engine works better than a text-matching search engine, as the latter (1) looks for only exact text matches between the articles and the query after some [text normalization](https://towardsdatascience.com/text-normalization-for-natural-language-processing-nlp-70a314bfa646) and (2) it doesn't take into account synonyms, etc.
- The quality difference is higher if the corpus of articles is small (e.g. hundreds or thousands), because a text-matching search engine may return zero-or-few results for some queries, while a semantic search engine always returns an ordered list of articles.
- On the other hand, a semantic search engine needs all the documents in the corpus to be embedded (i.e. transformed into semantic vectors thanks to a machine learning model) as a setup step, but this has to be done only once so it's not really a problem.
- Using appropriate data structures that implement [fast approximate nearest neighbors algorithms](https://towardsdatascience.com/comprehensive-guide-to-approximate-nearest-neighbors-algorithms-8b94f057d6b6), both types of search engines can have low latencies.
""")
st_query = st.text_input("Write your query here", max_chars=100)
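# Search callback: embed the query with the same model used for the corpus,
# retrieve the most similar article embeddings by cosine similarity, and keep
# only results with an English, reasonably long title.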
def on_click_search():
    if st_query != "":
        query_embedding = embedder.encode(st_query, convert_to_tensor=True)
        top_k = 10
        # Over-fetch (top_k * 2) so that up to top_k results remain after filtering.
        hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=top_k * 2)[0]
        article_dicts = []
        for hit in hits:
            score = hit['score']
            article_row = df_articles.iloc[hit['corpus_id']]
            try:
                detected_lang = detect(article_row["title"])
            except Exception:
                # langdetect raises on titles it cannot classify (e.g. symbols only).
                detected_lang = ""
            if detected_lang == "en" and len(article_row["title"]) >= 10:
                article_dicts.append({
                    "title": article_row['title'],
                    "url": article_row['url'],
                    "score": score
                })
            if len(article_dicts) >= top_k:
                break
        st.session_state.article_dicts = article_dicts
        st.session_state.empty_query = False
    else:
        st.session_state.article_dicts = []
        st.session_state.empty_query = True
st.button("Search", on_click=on_click_search)
if st_query != "":
st.session_state.empty_query = False
on_click_search()
else:
st.session_state.empty_query = True
if not st.session_state.empty_query:
    st.markdown("### Results")
    st.markdown("*Scores in parentheses represent the similarity between the article and the query.*")
    for article_dict in st.session_state.article_dicts:
        st.markdown(f"""- [{article_dict['title'].capitalize()}]({article_dict['url']}) ({article_dict['score']:.2f})""")
elif st.session_state.empty_query and "article_dicts" in st.session_state:
    st.markdown("Please write a query and then press the search button.")