# -*- coding: utf-8 -*-
"""multilingual_Semantic_Search.ipynb
Automatically generated by Colaboratory.
Original file is located at
https://colab.research.google.com/drive/1Wg8tD1NJqY0lnvSnsZQhB66pAvxSu65h
# Multilingual Semantic Search
Language models give computers the ability to search by meaning and go beyond searching by matching keywords. This capability is called semantic search.
![Searching an archive using sentence embeddings](https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/basic-semantic-search-overview.png?3)
In this notebook, we'll build a simple semantic search engine. The applications of semantic search go beyond building a web search engine: it can power a private search engine for internal documents or records, or drive features like StackOverflow's "similar questions" suggestion.
1. Get the archive of questions
2. [Embed](https://docs.cohere.ai/embed-reference/) the archive
3. Search using an index and nearest neighbor search
4. Visualize the archive based on the embeddings
"""
# Install Cohere for embeddings, Umap to reduce embeddings to 2 dimensions,
# Altair for visualization, Annoy for approximate nearest neighbor search
!pip install cohere umap-learn altair annoy datasets tqdm
"""Get your Cohere API key by [signing up here](https://os.cohere.ai/register). Paste it in the cell below."""
#@title Import libraries (Run this cell to execute required code) {display-mode: "form"}
import cohere
import numpy as np
import re
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset
import umap
import altair as alt
from sklearn.metrics.pairwise import cosine_similarity
from annoy import AnnoyIndex
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', None)
"""You'll need your API key for this next cell. [Sign up to Cohere](https://os.cohere.ai/) and get one if you haven't yet."""
# Paste your API key here. Remember to not share publicly
api_key = 'YOUR_API_KEY'
# Create and retrieve a Cohere API key from os.cohere.ai
co = cohere.Client(api_key)
"""## 1. Get The Archive of Questions
The original notebook uses the [trec](https://www.tensorflow.org/datasets/catalog/trec) dataset of questions and their categories; this version instead loads a news-articles dataset from an Excel file.
"""
# Alternative: load the trec dataset instead
# dataset = load_dataset("trec", split="train")
# df = pd.DataFrame(dataset)[:1000]

# Load the news-articles dataset (pandas is already imported above)
df = pd.read_excel("/content/news_articles_dataset.xlsx")
df.head()
df.columns
# Combine the title and body columns into one text field
# (note: the column name 'Title ' contains a trailing space in the source file)
cols = ['Title ', 'News']
df['text'] = df[cols].apply(lambda row: ' \n '.join(row.values.astype(str)), axis=1)
df['text'].head()
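The join above prefixes each article body with its title, separated by `" \n "`. A minimal sketch of that step on a toy frame (the column names mirror the dataset's, including the trailing space in `'Title '`):

```python
import pandas as pd

# Toy stand-in for the news-articles dataframe
toy = pd.DataFrame({
    'Title ': ['Rains lash coastal towns'],
    'News': ['Heavy rainfall disrupted traffic on Monday.'],
})
cols = ['Title ', 'News']
# Same row-wise join used on the real dataframe
toy['text'] = toy[cols].apply(lambda row: ' \n '.join(row.values.astype(str)), axis=1)
print(toy['text'][0])
# Rains lash coastal towns 
#  Heavy rainfall disrupted traffic on Monday.
```

Embedding title and body together lets a query match on either field.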
"""## 2. Embed the archive
The next step is to embed the text of the articles.
![embedding archive texts](https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/semantic-search-embed-text-archive.png)
Embedding a thousand texts of this length should take about fifteen seconds.
"""
# Get the embeddings
embeds = co.embed(texts=list(df['text']),model="multilingual-22-12",truncate="LEFT").embeddings
# Check the dimensions of the embeddings
embeds = np.array(embeds)
print(embeds.shape)
# Inspect the first text and its embedding
print(df['text'][0])
print(embeds[0])
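Texts with similar meaning get embeddings that point in similar directions, which is what cosine similarity measures. A minimal numpy sketch of that comparison (the short vectors here are toy stand-ins for real embeddings):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine of the angle between two vectors:
    # 1.0 = same direction, 0.0 = orthogonal
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([2.0, 0.0, 2.0])  # same direction as v1, larger magnitude
v3 = np.array([0.0, 1.0, 0.0])  # orthogonal to v1

print(round(cosine_sim(v1, v2), 3))  # 1.0
print(round(cosine_sim(v1, v3), 3))  # 0.0
```

`sklearn.metrics.pairwise.cosine_similarity` (imported above) computes the same quantity for whole matrices at once.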
"""## 3. Search using an index and nearest neighbor search
![Building the search index from the embeddings](https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/semantic-search-index.png)
Let's now use [Annoy](https://github.com/spotify/annoy) to build an index that stores the embeddings in a way that is optimized for fast search. This approach scales well to a large number of texts (other options include [Faiss](https://github.com/facebookresearch/faiss), [ScaNN](https://github.com/google-research/google-research/tree/master/scann), and [PyNNDescent](https://github.com/lmcinnes/pynndescent)).
After building the index, we can use it to retrieve the nearest neighbors either of existing questions (section 3.1), or of new questions that we embed (section 3.2).
"""
# Create the search index, passing the size of the embeddings
search_index = AnnoyIndex(embeds.shape[1], 'angular')
# Add all the vectors to the search index
for i in range(len(embeds)):
    search_index.add_item(i, embeds[i])
search_index.build(10) # 10 trees
search_index.save('test.ann')
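Annoy's `'angular'` metric is the Euclidean distance between length-normalized vectors, so the index approximates a search by angle (direction), ignoring magnitude. As a sanity check, here is a brute-force sketch of the exact computation the index approximates, on toy 2-D vectors:

```python
import numpy as np

def brute_force_nns(vectors, query, k):
    # Normalize rows so Euclidean distance matches Annoy's "angular" metric
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    dists = np.linalg.norm(v - q, axis=1)
    idx = np.argsort(dists)[:k]
    return idx.tolist(), dists[idx].tolist()

vecs = np.array([[1.0, 0.0],   # points "east"
                 [0.9, 0.1],   # almost east
                 [0.0, 1.0]])  # points "north"
ids, dists = brute_force_nns(vecs, np.array([1.0, 0.0]), 2)
print(ids)  # [0, 1] -- the two vectors closest in angle to the query
```

This exact scan is O(n) per query; Annoy trades a little accuracy for much faster lookups on large archives.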
"""### 3.1. Find the neighbors of an example from the dataset
If we're only interested in measuring the distance between the questions in the dataset (no outside queries), a simple way is to calculate the distance between every pair of embeddings we have.
"""
# Choose an example (we'll retrieve others similar to it)
example_id = 5
# Retrieve nearest neighbors
similar_item_ids = search_index.get_nns_by_item(example_id, 10,
                                                include_distances=True)
# Format and print the text and distances
results = pd.DataFrame(data={'texts': df.iloc[similar_item_ids[0]]['text'],
                             'distance': similar_item_ids[1]}).drop(example_id)
print(f"Question:'{df.iloc[example_id]['text']}'\nNearest neighbors:")
results
"""### 3.2. Find the neighbors of a user query
We're not limited to searching using existing items. If we get a query, we can embed it and find its nearest neighbors from the dataset.
"""
# Example queries (the last one is Telugu, roughly "Balayya mass treat"):
# query = "skin care ayurveda"
# query = "how much money did skin care ayurveda raise"
# query = "semelso wife arrest"
# query = "avatar 2 movie collection"
# query = "బాలయ్య మాస్ ట్రీట్"
def multilingual_semantic_search(query):
    # Get the query's embedding
    query_embed = co.embed(texts=[query],
                           model="multilingual-22-12",
                           truncate="LEFT").embeddings
    # Retrieve the nearest neighbors
    similar_item_ids = search_index.get_nns_by_vector(query_embed[0], 10,
                                                      include_distances=True)
    # Format the results as a dataframe (handy for inspection)
    results = pd.DataFrame(data={'title': df.iloc[similar_item_ids[0]]['Title '],
                                 'news': df.iloc[similar_item_ids[0]]['News'],
                                 'distance': similar_item_ids[1]})
    # Build a plain-text response for display
    response = ""
    for i in similar_item_ids[0]:
        response += "Title: " + df.iloc[i]['Title '] + " \n " + "Short News: " + df.iloc[i]['News'] + "\n\n"
    print(response)
    return response
multilingual_semantic_search("is messi the best footballer of all time?")
!pip install gradio
import gradio as gr
# demo = gr.Interface(fn=multilingual_semantic_search, inputs="text", outputs="text")
with gr.Blocks() as demo:
    gr.Markdown("🌍 This app uses a multilingual semantic model from Cohere to 🚀 revolutionize the media and news industry in multilingual markets like India, allowing anyone to track 📰 regional news in real time without needing to translate or understand other regional languages. 🙌")
    name = gr.Textbox(label="Semantic search enabled! Search for news...")
    output = gr.Textbox(label="Semantic search results")
    greet_btn = gr.Button("Search")
    greet_btn.click(fn=multilingual_semantic_search, inputs=name, outputs=output)
demo.launch()
"""## 4. Visualizing the archive
Finally, let's plot out all the questions onto a 2D chart so you're able to visualize the semantic similarities of this dataset!
"""
#@title Plot the archive {display-mode: "form"}
# UMAP reduces the embeddings to 2 dimensions that we can plot
reducer = umap.UMAP(n_neighbors=20)
umap_embeds = reducer.fit_transform(embeds)
# Prepare the data to plot and interactive visualization
# using Altair
df_explore = pd.DataFrame(data={'text': df['text']})
df_explore['x'] = umap_embeds[:,0]
df_explore['y'] = umap_embeds[:,1]
# Plot
chart = alt.Chart(df_explore).mark_circle(size=60).encode(
    x=alt.X('x', scale=alt.Scale(zero=False)),
    y=alt.Y('y', scale=alt.Scale(zero=False)),
    tooltip=['text']
).properties(
    width=700,
    height=400
)
chart.interactive()
"""Hover over the points to read the text. Do you see some of the patterns in clustered points? Similar questions, or questions asking about similar topics?
This concludes this introductory guide to semantic search using sentence embeddings. As you continue building a search product, additional considerations arise (like handling long texts, or fine-tuning the embeddings for a specific use case).
We can't wait to see what you start building! Share your projects or find support at [community.cohere.ai](https://community.cohere.ai).
"""