# -*- coding: utf-8 -*-
"""multilingual_Semantic_Search.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1Wg8tD1NJqY0lnvSnsZQhB66pAvxSu65h

# Multilingual Semantic Search
Language models give computers the ability to search by meaning rather than by matching keywords. This capability is called semantic search.

![Searching an archive using sentence embeddings](https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/basic-semantic-search-overview.png?3)

In this notebook, we'll build a simple semantic search engine. The applications of semantic search go beyond web search: it can power a private search engine for internal documents or records, or features like Stack Overflow's "similar questions".

1. Get the archive of questions
2. [Embed](https://docs.cohere.ai/embed-reference/) the archive
3. Search using an index and nearest neighbor search
4. Visualize the archive based on the embeddings
"""

# Install Cohere for embeddings, Umap to reduce embeddings to 2 dimensions, 
# Altair for visualization, Annoy for approximate nearest neighbor search
!pip install cohere umap-learn altair annoy datasets tqdm

"""Get your Cohere API key by [signing up here](https://os.cohere.ai/register). Paste it in the cell below."""


#@title Import libraries (Run this cell to execute required code) {display-mode: "form"}

import cohere
import numpy as np
import re
import pandas as pd
from tqdm import tqdm
from datasets import load_dataset
import umap
import altair as alt
from sklearn.metrics.pairwise import cosine_similarity
from annoy import AnnoyIndex
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', None)

"""You'll need your API key for this next cell. [Sign up to Cohere](https://os.cohere.ai/) and get one if you haven't yet."""

# Paste your API key here. Remember not to share it publicly
api_key = 'YOUR_API_KEY'

# Create a Cohere client with the API key
co = cohere.Client(api_key)

"""## 1. Get The Archive of Questions
We'll use the [trec](https://www.tensorflow.org/datasets/catalog/trec) dataset which is made up of questions and their categories.
"""

# Load the news articles dataset from an Excel file
# (the original notebook pulled the trec dataset via load_dataset instead)
df = pd.read_excel("/content/news_articles_dataset.xlsx")

df.head()

df.columns

# Combine the title and news body into a single 'text' column
cols = ['Title ', 'News']  # note: 'Title ' has a trailing space in the source file's header
df['text'] = df[cols].apply(lambda row: ' \n '.join(row.values.astype(str)), axis=1)
df['text'].head()

"""## 2. Embed the archive
The next step is to embed the text of the questions.

![embedding archive texts](https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/semantic-search-embed-text-archive.png)

To get a thousand embeddings of this length should take about fifteen seconds.
"""

# Get the embeddings
embeds = co.embed(texts=list(df['text']),
                  model="multilingual-22-12",
                  truncate="LEFT").embeddings

# Check the dimensions of the embeddings
embeds = np.array(embeds)
print(embeds.shape)

"""## 3. Search using an index and nearest neighbor search
![Building the search index from the embeddings](https://github.com/cohere-ai/notebooks/raw/main/notebooks/images/semantic-search-index.png)
Let's now use [Annoy](https://github.com/spotify/annoy) to build an index that stores the embeddings in a way that is optimized for fast search. This approach scales well to a large number of texts (other options include [Faiss](https://github.com/facebookresearch/faiss), [ScaNN](https://github.com/google-research/google-research/tree/master/scann), and [PyNNDescent](https://github.com/lmcinnes/pynndescent)).

After building the index, we can use it to retrieve the nearest neighbors either of existing questions (section 3.1), or of new questions that we embed (section 3.2).
"""

# Create the search index, pass the size of embedding
search_index = AnnoyIndex(embeds.shape[1], 'angular')

# Add all the vectors to the search index
for i in range(len(embeds)):
    search_index.add_item(i, embeds[i])

search_index.build(10)  # 10 trees
search_index.save('test.ann')

"""### 3.1. Find the neighbors of an example from the dataset
If we're only interested in measuring the distance between the questions in the dataset (no outside queries), a simple way is to calculate the distance between every pair of embeddings we have.
"""

# Choose an example (we'll retrieve others similar to it)
example_id = 5

# Retrieve nearest neighbors
similar_item_ids = search_index.get_nns_by_item(example_id,10,
                                                include_distances=True)
# Format and print the text and distances
results = pd.DataFrame(data={'texts': df.iloc[similar_item_ids[0]]['text'], 
                             'distance': similar_item_ids[1]}).drop(example_id)

print(f"Question:'{df.iloc[example_id]['text']}'\nNearest neighbors:")
results

"""### 3.2. Find the neighbors of a user query
We're not limited to searching using existing items. If we get a query, we can embed it and find its nearest neighbors from the dataset.
"""

# query = "skin care ayurveda"
# query = "how much money did skin care ayurveda raise"
# query = "semelso wife arrest"
# query = "avatar 2 movie collection"
# query = "బాలయ్య మాస్ ట్రీట్"

def multilingual_semantic_search(query):
    # Get the query's embedding
    query_embed = co.embed(texts=[query],
                           model="multilingual-22-12",
                           truncate="LEFT").embeddings

    # Retrieve the nearest neighbors
    similar_item_ids = search_index.get_nns_by_vector(query_embed[0], 10,
                                                      include_distances=True)

    # Format the results as a readable string
    response = ""
    for i in similar_item_ids[0]:
        response += ("Title: " + df.iloc[i]['Title '] + " \n "
                     + "Short News: " + df.iloc[i]['News'] + "\n\n")

    print(response)
    return response

multilingual_semantic_search("is messi the best footballer of all time?")

!pip install gradio
import gradio as gr

with gr.Blocks() as demo:
    gr.Markdown("🌍 This app uses a multilingual semantic model from Cohere to 🚀 revolutionize the media and news industry in multilingual markets like India, allowing anyone to track 📰 regional news in real time without needing to translate or understand other regional languages. 🙌")
    query_box = gr.Textbox(label="Semantic search enabled! Search for news...")
    output = gr.Textbox(label="Semantic search results")
    search_btn = gr.Button("Search")
    search_btn.click(fn=multilingual_semantic_search, inputs=query_box, outputs=output)
demo.launch()


"""## 4. Visualizing the archive
Finally, let's plot out all the questions onto a 2D chart so you're able to visualize the semantic similarities of this dataset!
"""

#@title Plot the archive {display-mode: "form"}

# UMAP reduces the embeddings to 2 dimensions that we can plot
reducer = umap.UMAP(n_neighbors=20)
umap_embeds = reducer.fit_transform(embeds)
# Prepare the data to plot and interactive visualization
# using Altair
df_explore = pd.DataFrame(data={'text': df['text']})
df_explore['x'] = umap_embeds[:,0]
df_explore['y'] = umap_embeds[:,1]

# Plot
chart = alt.Chart(df_explore).mark_circle(size=60).encode(
    x=alt.X('x', scale=alt.Scale(zero=False)),
    y=alt.Y('y', scale=alt.Scale(zero=False)),
    tooltip=['text']
).properties(
    width=700,
    height=400
)
chart.interactive()

"""Hover over the points to read the text. Do you see some of the patterns in clustered points? Similar questions, or questions asking about similar topics?

This concludes this introductory guide to semantic search using sentence embeddings. As you continue the path of building a search product additional considerations arise (like dealing with long texts, or finetuning to better improve the embeddings for a specific use case). 


We can’t wait to see what you start building! Share your projects or find support at [community.cohere.ai](https://community.cohere.ai).

"""