Honglin Zhang committed
Commit c6a1f8c • 1 Parent(s): c4d8fc9

topic cluster and code cluster

README.md CHANGED
@@ -27,18 +27,19 @@ RepoSnipy is a neural search engine built with [streamlit](https://github.com/st
 
 Compared to the previous generation of [RepoSnipy](https://github.com/RepoAnalysis/RepoSnipy), the latest version adds the following new features:
 * It uses [RepoSim4Py](https://github.com/RepoMining/RepoSim4Py), which is based on the [RepoSim4Py pipeline](https://huggingface.co/Henry65/RepoSim4Py), to create multi-level embeddings for Python repositories.
-* Multi-level embeddings --- code, docstring, readme, requirement, and repository.
+* Multi-level embeddings --- code, doc, readme, requirement, and repository.
 * It uses the [SciBERT](https://arxiv.org/abs/1903.10676) model to analyse repository topics and to generate embeddings for them.
-* Transferring multiple topics into one cluster --- it uses a [KMeans](data/kmeans_model_scibert.pkl) model to analyse topic embeddings and to cluster repositories by topic.
-* It uses the [SimilarityCal](data/SimilarityCal_model_NO1.pt) model, a binary classifier that calculates cluster similarity from multi-level embeddings and cluster numbers.
+* Transferring multiple topics into one cluster --- it uses a KMeans model ([kmeans_model_topic_scibert](data/kmeans_model_topic_scibert.pkl)) to analyse topic embeddings and to cluster repositories by topic.
+* Clustering by code snippets --- it uses a KMeans model ([kmeans_model_code_unixcoder](data/kmeans_model_code_unixcoder.pkl)) to analyse code embeddings and to cluster repositories by code snippet.
+* It uses the [SimilarityCal](data/SimilarityCal_model_NO1.pt) model, a binary classifier that calculates cluster similarity from repository-level embeddings and a cluster number (topic or code).
 More concretely, the SimilarityCal model treats repository pairs from the same cluster as label 1 and all other pairs as label 0. Its input features are the concatenation of two repositories' embeddings, with the binary labels described above.
 Its outputs are scores expressing how similar or dissimilar two repositories are.
 
-We have created a [vector dataset](data/index.bin) (stored as a docarray index) of approximately 9700 GitHub Python repositories that have a license and over 300 stars as of February 2024. The generated clusters were put in a [json dataset](data/repo_clusters.json) (storing repo-cluster pairs as key-values).
+We have created a [vector dataset](data/index.bin) (stored as a docarray index) of approximately 9700 GitHub Python repositories that have a license and over 300 stars as of March 2024. The generated clusters were put in two json datasets ([repo_topic_clusters](data/repo_topic_clusters.json) and [repo_code_clusters](data/repo_code_clusters.json)), each storing repo-cluster pairs as key-values.
 
 
 ## Dataset
-As mentioned above, RepoSnipy needs the [vector](data/index.bin) and [json](data/repo_clusters.json) datasets, the [KMeans](data/kmeans_model_scibert.pkl) model, and the [SimilarityCal](data/SimilarityCal_model_NO1.pt) model when you start it up. For your convenience, we have uploaded them to the [data](data) folder of this repository.
+As mentioned above, RepoSnipy needs the [vector](data/index.bin) dataset, the cluster json datasets ([repo_topic_clusters](data/repo_topic_clusters.json) and [repo_code_clusters](data/repo_code_clusters.json)), the KMeans models ([kmeans_model_topic_scibert](data/kmeans_model_topic_scibert.pkl) and [kmeans_model_code_unixcoder](data/kmeans_model_code_unixcoder.pkl)), and the [SimilarityCal](data/SimilarityCal_model_NO1.pt) model when you start it up. For your convenience, we have uploaded them to the [data](data) folder of this repository.
 
 
 ## License
@@ -51,4 +52,9 @@ The model and the fine-tuning dataset used:
 
 * [UniXCoder](https://arxiv.org/abs/2203.03850)
 * [AdvTest](https://arxiv.org/abs/1909.09436)
-* [SciBERT](https://arxiv.org/abs/1903.10676)
+* [SciBERT](https://arxiv.org/abs/1903.10676)
+* [RepoSnipy (old version)](https://github.com/RepoAnalysis/RepoSnipy)
+* [RepoSnipy HuggingFace Spaces (old version)](https://huggingface.co/spaces/Lazyhope/RepoSnipy)
+* [RepoSim4Py](https://github.com/RepoMining/RepoSim4Py)
+* [SimilarityCal](https://github.com/RepoMining/SimilarityCal)
+* [RepoSnipy](https://github.com/RepoMining/RepoSnipy)
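The topic-clustering bullet above compresses a two-step flow: embed the lemmatized topic string with SciBERT (mean pooling over the last hidden state), then ask the pre-trained KMeans model for a cluster number. Below is a minimal sketch of the prediction step, assuming scikit-learn's KMeans interface and a 768-dimensional embedding; assign_topic_cluster is a hypothetical helper, not a function taken from app.py.

import joblib
import numpy as np

def assign_topic_cluster(topic_embedding: np.ndarray, kmeans_path: str) -> int:
    # Load the pre-trained KMeans model (e.g. data/kmeans_model_topic_scibert.pkl)
    kmeans = joblib.load(kmeans_path)
    # predict() expects a 2D array of shape (n_samples, n_features)
    return int(kmeans.predict(topic_embedding.reshape(1, -1))[0])

# Stand-in usage with a dummy embedding (SciBERT's hidden size is 768):
dummy_embedding = np.random.rand(768).astype(np.float32)
print(assign_topic_cluster(dummy_embedding, "data/kmeans_model_topic_scibert.pkl"))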
 
 
 
 
 
 
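The SimilarityCal paragraph above likewise reduces to a few tensor operations: concatenate two repositories' embeddings, run the binary classifier, and read the softmax probability of label 1 ("same cluster") as the similarity score. The sketch below only illustrates the shapes and the scoring step; the single linear layer is a stand-in assumption, the real architecture being the PairClassifier defined in common/pair_classifier.py.

import torch
import torch.nn as nn

EMB_DIM = 768  # assumed repository-embedding size, for illustration only

# Placeholder for the real PairClassifier: any module mapping the
# concatenated pair to two logits (label 0 = different, label 1 = same)
pair_classifier = nn.Linear(2 * EMB_DIM, 2)

e1 = torch.rand(EMB_DIM)  # query repository embedding
e2 = torch.rand(EMB_DIM)  # candidate repository embedding

pair = torch.cat([e1, e2]).unsqueeze(0)     # shape: (1, 2 * EMB_DIM)
logits = pair_classifier(pair)
score = torch.softmax(logits, dim=1)[:, 1]  # P(label 1), the similarity score
print(float(score))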
app.py CHANGED
@@ -12,13 +12,16 @@ from docarray import DocList
 from docarray.index import InMemoryExactNNIndex
 from transformers import pipeline
 from transformers import AutoTokenizer, AutoModel
-from data.repo_doc import RepoDoc
-from data.pair_classifier import PairClassifier
+from common.repo_doc import RepoDoc
+from common.pair_classifier import PairClassifier
 from nltk.stem import WordNetLemmatizer
 
 nltk.download("wordnet")
-KMEANS_MODEL_PATH = Path(__file__).parent.joinpath("data/kmeans_model_scibert.pkl")
+KMEANS_TOPIC_MODEL_PATH = Path(__file__).parent.joinpath("data/kmeans_model_topic_scibert.pkl")
+KMEANS_CODE_MODEL_PATH = Path(__file__).parent.joinpath("data/kmeans_model_code_unixcoder.pkl")
 SIMILARITY_CAL_MODEL_PATH = Path(__file__).parent.joinpath("data/SimilarityCal_model_NO1.pt")
+SCIBERT_MODEL_PATH = "allenai/scibert_scivocab_uncased"
+# SCIBERT_MODEL_PATH = Path(__file__).parent.joinpath("data/scibert_scivocab_uncased")  # Download locally
 device = (
     "cuda"
     if torch.cuda.is_available()
@@ -29,14 +32,13 @@ device = (
 
 # 1. Product environment
 # INDEX_PATH = Path(__file__).parent.joinpath("data/index.bin")
-# CLUSTER_PATH = Path(__file__).parent.joinpath("data/repo_clusters.json")
-SCIBERT_MODEL_PATH = "allenai/scibert_scivocab_uncased"
-
+# TOPIC_CLUSTER_PATH = Path(__file__).parent.joinpath("data/repo_topic_clusters.json")
+# CODE_CLUSTER_PATH = Path(__file__).parent.joinpath("data/repo_code_clusters.json")
 
 # 2. Developing environment
 INDEX_PATH = Path(__file__).parent.joinpath("data/index_test.bin")
-CLUSTER_PATH = Path(__file__).parent.joinpath("data/repo_clusters_test.json")
-# SCIBERT_MODEL_PATH = Path(__file__).parent.joinpath("data/scibert_scivocab_uncased") # Download locally
+TOPIC_CLUSTER_PATH = Path(__file__).parent.joinpath("data/repo_topic_clusters_test.json")
+CODE_CLUSTER_PATH = Path(__file__).parent.joinpath("data/repo_code_clusters_test.json")
 
 
 @st.cache_resource(show_spinner="Loading repositories basic information...")
@@ -60,16 +62,28 @@ def load_index():
     return InMemoryExactNNIndex[RepoDoc](index_file_path=INDEX_PATH), default_doc
 
 
-@st.cache_resource(show_spinner="Loading repositories clusters...")
-def load_repo_clusters():
+@st.cache_resource(show_spinner="Loading repositories topic clusters...")
+def load_repo_topic_clusters():
+    """
+    The function to load the repo-topic_clusters file
+    :return: a dictionary with the repo-topic_clusters
+    """
+    with open(TOPIC_CLUSTER_PATH, "r") as file:
+        repo_topic_clusters = json.load(file)
+
+    return repo_topic_clusters
+
+
+@st.cache_resource(show_spinner="Loading repositories code clusters...")
+def load_repo_code_clusters():
     """
-    The function to load the repo-clusters file
-    :return: a dictionary with the repo-clusters
+    The function to load the repo-code_clusters file
+    :return: a dictionary with the repo-code_clusters
     """
-    with open(CLUSTER_PATH, "r") as file:
-        repo_clusters = json.load(file)
+    with open(CODE_CLUSTER_PATH, "r") as file:
+        repo_code_clusters = json.load(file)
 
-    return repo_clusters
+    return repo_code_clusters
 
 
 @st.cache_resource(show_spinner="Loading RepoSim4Py pipeline model...")
@@ -99,16 +113,26 @@ def load_scibert_model():
     """
     tokenizer = AutoTokenizer.from_pretrained(SCIBERT_MODEL_PATH)
     scibert_model = AutoModel.from_pretrained(SCIBERT_MODEL_PATH).to(device)
+
     return tokenizer, scibert_model
 
 
-@st.cache_resource(show_spinner="Loading KMeans model...")
-def load_kmeans_model():
+@st.cache_resource(show_spinner="Loading KMeans model (topic clusters)...")
+def load_topic_kmeans_model():
+    """
+    The function to load the KMeans model (topic clusters)
+    :return: a KMeans model (topic clusters)
+    """
+    return joblib.load(KMEANS_TOPIC_MODEL_PATH)
+
+
+@st.cache_resource(show_spinner="Loading KMeans model (code clusters)...")
+def load_code_kmeans_model():
     """
-    The function to load KMeans model
-    :return: a KMeans model
+    The function to load the KMeans model (code clusters)
+    :return: a KMeans model (code clusters)
     """
-    return joblib.load(KMEANS_MODEL_PATH)
+    return joblib.load(KMEANS_CODE_MODEL_PATH)
 
 
 @st.cache_resource(show_spinner="Loading SimilarityCal model...")
@@ -117,6 +141,7 @@ def load_similaritycal_model():
     sim_cal_model.load_state_dict(torch.load(SIMILARITY_CAL_MODEL_PATH, map_location=device))
     sim_cal_model = sim_cal_model.to(device)
     sim_cal_model = sim_cal_model.eval()
+
     return sim_cal_model
 
 
@@ -130,6 +155,7 @@ def generate_scibert_embedding(tokenizer, scibert_model, text):
     outputs = scibert_model(**inputs)
     # Use mean pooling for sentence representation
     embeddings = outputs.last_hidden_state.mean(dim=1).cpu().detach().numpy()
+
     return embeddings
 
 
@@ -150,8 +176,10 @@ def run_pipeline_model(_model, repo_name, github_token):
     if not extracted_infos:
         return None
 
+    st_progress_bar = st.progress(0.0)
     with st.spinner(f"Generating embeddings for {repo_name}..."):
-        repo_info = _model.forward(extracted_infos)[0]
+        repo_info = _model.forward(extracted_infos, st_progress=st_progress_bar)[0]
+    st_progress_bar.empty()
 
     return repo_info
 
@@ -175,36 +203,50 @@ def run_index_search(index, query, search_field, limit):
     return search_results
 
 
-def run_cluster_search(repo_clusters, repo_name_list):
+def run_topic_cluster_search(repo_topic_clusters, repo_name_list):
+    """
+    The function to look up the topic cluster numbers for the given repositories.
+    :param repo_topic_clusters: dictionary with repo-topic_clusters
+    :param repo_name_list: list or array of repository names
+    :return: topic cluster number list
+    """
+    topic_clusters = []
+    for repo_name in repo_name_list:
+        topic_clusters.append(repo_topic_clusters[repo_name])
+
+    return topic_clusters
+
+
+def run_code_cluster_search(repo_code_clusters, repo_name_list):
     """
-    The function to search cluster number for such repositories.
-    :param repo_clusters: dictionary with repo-clusters
-    :param repo_name_list: list or array represent repository names
-    :return: cluster number list
+    The function to look up the code cluster numbers for the given repositories.
+    :param repo_code_clusters: dictionary with repo-code_clusters
+    :param repo_name_list: list or array of repository names
+    :return: code cluster number list
     """
-    clusters = []
+    code_clusters = []
     for repo_name in repo_name_list:
-        clusters.append(repo_clusters[repo_name])
-    return clusters
+        code_clusters.append(repo_code_clusters[repo_name])
+
+    return code_clusters
 
 
-def run_similaritycal_search(index, repo_clusters, model, query_doc, query_cluster_number, limit, same_cluster=True):
+def run_similaritycal_search(index, repo_clusters, model, query_doc, query_cluster_number, limit):
     """
     The function to run the SimilarityCal model.
     :param index: index file
-    :param repo_clusters: repo-clusters json file
+    :param repo_clusters: repo-clusters (topic_cluster or code_cluster) json file
     :param model: SimilarityCal model
     :param query_doc: query repo doc
-    :param query_cluster_number: query repo cluster number
+    :param query_cluster_number: query repo cluster number (code or topic)
    :param limit: limit
-    :param same_cluster: whether searching for same cluster
    :return: result dataframe
    """
    docs = index._docs
    input_embeddings_list = []
    result_dl = DocList[RepoDoc]()
    for doc in docs:
-        if same_cluster and query_cluster_number != repo_clusters[doc.name]:
+        if query_cluster_number != repo_clusters[doc.name]:
            continue
        if doc.name != query_doc.name:
            e1, e2 = (torch.Tensor(query_doc.repository_embedding),
@@ -219,17 +261,24 @@ def run_similaritycal_search(index, repo_clusters, model, query_doc, query_clust
     similarity_scores = softmax(model_output)[:, 1].cpu().detach().numpy()
     df = result_dl.to_dataframe()
     df["scores"] = similarity_scores
-    return df.sort_values(by='scores', ascending=False).reset_index(drop=True).head(limit)
+
+    sorted_df = df.sort_values(by='scores', ascending=False).reset_index(drop=True).head(limit)
+    sorted_df["rankings"] = sorted_df["scores"].rank(ascending=False).astype(int)
+    sorted_df.drop(columns="scores", inplace=True)
+
+    return sorted_df
 
 
 if __name__ == "__main__":
     # Loading dataset and models
     index, default_doc = load_index()
-    repo_clusters = load_repo_clusters()
+    repo_topic_clusters = load_repo_topic_clusters()
+    repo_code_clusters = load_repo_code_clusters()
     pipeline_model = load_pipeline_model()
     lemmatizer = WordNetLemmatizer()
     tokenizer, scibert_model = load_scibert_model()
-    kmeans = load_kmeans_model()
+    topic_kmeans = load_topic_kmeans_model()
+    code_kmeans = load_code_kmeans_model()
     sim_cal_model = load_similaritycal_model()
 
     # Setting the sidebar
@@ -254,8 +303,8 @@ if __name__ == "__main__":
 
     st.multiselect(
         label="Display columns",
-        options=["scores", "name", "topics", "cluster number", "stars", "license"],
-        default=["scores", "name", "topics", "cluster number", "stars", "license"],
+        options=["scores", "name", "topics", "code cluster", "topic cluster", "stars", "license"],
+        default=["scores", "name", "topics", "code cluster", "topic cluster", "stars", "license"],
         help="Select columns to display in the search results",
         key="display_columns",
     )
@@ -291,10 +340,11 @@ if __name__ == "__main__":
         records = index.filter({"name": {"$eq": repo_name}})
         # 1) Building the query information
         query_doc = default_doc.copy() if not records else records[0]
-        # 2) Recording the cluster number
-        cluster_number = -1 if not records else repo_clusters[repo_name]
+        # 2) Recording the topic and code cluster numbers
+        topic_cluster_number = -1 if not records else repo_topic_clusters[repo_name]
+        code_cluster_number = -1 if not records else repo_code_clusters[repo_name]
 
-        # Importance 1 ---- the case where repository information and the cluster number need updating
+        # Importance 1 ---- the case where repository information and cluster numbers need updating
         if st.session_state.update_index or not records:
             # 1) Updating repository information by using the RepoSim4Py pipeline
             repo_info = run_pipeline_model(pipeline_model, repo_name, st.session_state.github_token)
@@ -317,13 +367,18 @@ if __name__ == "__main__":
             query_doc.repository_embedding = None if np.all(repo_info["mean_repo_embedding"] == 0) else repo_info[
                 "mean_repo_embedding"].reshape(-1)
 
-            # 2) Updating the cluster number
+            # 2) Updating the topic cluster number
             topics_text = ' '.join(
                 [lemmatizer.lemmatize(topic.lower().replace('-', ' ')) for topic in query_doc.topics])
             topic_embeddings = generate_scibert_embedding(tokenizer, scibert_model, topics_text)
-            cluster_number = int(kmeans.predict(topic_embeddings)[0])
+            topic_cluster_number = int(topic_kmeans.predict(topic_embeddings)[0])
+
+            # 3) Updating the code cluster number
+            code_embeddings = np.zeros((768,),
+                                       dtype=np.float32) if query_doc.code_embedding is None else query_doc.code_embedding
+            code_cluster_number = int(code_kmeans.predict(code_embeddings.reshape(1, -1))[0])
 
-        # Importance 2 ---- update the index file and the repository clusters file
+        # Importance 2 ---- update the index file and the repository cluster (topic and code) files
         if st.session_state.update_index:
             if not query_doc.license:
                 st.warning(
@@ -337,19 +392,24 @@ if __name__ == "__main__":
                 )
             else:
                 index.index(query_doc)
-                repo_clusters[query_doc.name] = cluster_number
+                repo_topic_clusters[query_doc.name] = topic_cluster_number
+                repo_code_clusters[query_doc.name] = code_cluster_number
 
-                with st.spinner("Persisting the index and repository clusters..."):
+                with st.spinner("Persisting the index and repository clusters (topic and code)..."):
                     index.persist(str(INDEX_PATH))
-                    with open(CLUSTER_PATH, "w") as file:
-                        json.dump(repo_clusters, file, indent=4)
+                    with open(TOPIC_CLUSTER_PATH, "w") as file:
+                        json.dump(repo_topic_clusters, file, indent=4)
+                    with open(CODE_CLUSTER_PATH, "w") as file:
+                        json.dump(repo_code_clusters, file, indent=4)
                 st.success("Repository updated to the index!")
 
                 load_index.clear()
-                load_repo_clusters.clear()
+                load_repo_topic_clusters.clear()
+                load_repo_code_clusters.clear()
 
         st.session_state["query_doc"] = query_doc
-        st.session_state["cluster_number"] = cluster_number
+        st.session_state["topic_cluster_number"] = topic_cluster_number
+        st.session_state["code_cluster_number"] = code_cluster_number
 
     # 2. Repository cannot be matched
     else:
@@ -358,7 +418,8 @@ if __name__ == "__main__":
     # Starting to query
     if "query_doc" in st.session_state:
         query_doc = st.session_state.query_doc
-        cluster_number = st.session_state.cluster_number
+        topic_cluster_number = st.session_state.topic_cluster_number
+        code_cluster_number = st.session_state.code_cluster_number
         limit = st.session_state.search_results_limit
 
         # Showing the query repository information
@@ -368,7 +429,8 @@ if __name__ == "__main__":
             {
                 "name": query_doc.name,
                 "topics": query_doc.topics,
-                "cluster number": cluster_number,
+                "topic cluster": topic_cluster_number,
+                "code cluster": code_cluster_number,
                 "stars": query_doc.stars,
                 "license": query_doc.license,
             }
@@ -377,15 +439,18 @@ if __name__ == "__main__":
         )
 
         display_columns = st.session_state.display_columns
-        code_sim_tab, doc_sim_tab, readme_sim_tab, requirement_sim_tab, repo_sim_tab, cluster_tab, same_cluster_tab = st.tabs(
+        modified_display_columns = ["rankings" if col == "scores" else col for col in display_columns]
+        code_sim_tab, doc_sim_tab, readme_sim_tab, requirement_sim_tab, repo_sim_tab, code_cluster_tab, topic_cluster_tab = st.tabs(
             ["Code_sim", "Docstring_sim", "Readme_sim", "Requirement_sim",
-             "Repository_sim", "Cluster_sim", "Same_cluster_sim"])
+             "Repository_sim", "Code_cluster_sim", "Topic_cluster_sim"])
 
         with code_sim_tab:
             if query_doc.code_embedding is not None:
                 code_sim_res = run_index_search(index, query_doc, "code_embedding", limit)
-                cluster_numbers = run_cluster_search(repo_clusters, code_sim_res["name"])
-                code_sim_res["cluster number"] = cluster_numbers
+                topic_cluster_numbers = run_topic_cluster_search(repo_topic_clusters, code_sim_res["name"])
+                code_sim_res["topic cluster"] = topic_cluster_numbers
+                code_cluster_numbers = run_code_cluster_search(repo_code_clusters, code_sim_res["name"])
+                code_sim_res["code cluster"] = code_cluster_numbers
                 st.dataframe(code_sim_res[display_columns])
             else:
                 st.error("No function code was extracted for this repository!")
@@ -393,8 +458,10 @@ if __name__ == "__main__":
         with doc_sim_tab:
             if query_doc.doc_embedding is not None:
                 doc_sim_res = run_index_search(index, query_doc, "doc_embedding", limit)
-                cluster_numbers = run_cluster_search(repo_clusters, doc_sim_res["name"])
-                doc_sim_res["cluster number"] = cluster_numbers
+                topic_cluster_numbers = run_topic_cluster_search(repo_topic_clusters, doc_sim_res["name"])
+                doc_sim_res["topic cluster"] = topic_cluster_numbers
+                code_cluster_numbers = run_code_cluster_search(repo_code_clusters, doc_sim_res["name"])
+                doc_sim_res["code cluster"] = code_cluster_numbers
                 st.dataframe(doc_sim_res[display_columns])
             else:
                 st.error("No function docstring was extracted for this repository!")
@@ -402,8 +469,10 @@ if __name__ == "__main__":
         with readme_sim_tab:
             if query_doc.readme_embedding is not None:
                 readme_sim_res = run_index_search(index, query_doc, "readme_embedding", limit)
-                cluster_numbers = run_cluster_search(repo_clusters, readme_sim_res["name"])
-                readme_sim_res["cluster number"] = cluster_numbers
+                topic_cluster_numbers = run_topic_cluster_search(repo_topic_clusters, readme_sim_res["name"])
+                readme_sim_res["topic cluster"] = topic_cluster_numbers
+                code_cluster_numbers = run_code_cluster_search(repo_code_clusters, readme_sim_res["name"])
+                readme_sim_res["code cluster"] = code_cluster_numbers
                 st.dataframe(readme_sim_res[display_columns])
             else:
                 st.error("No readme file was extracted for this repository!")
@@ -411,8 +480,10 @@ if __name__ == "__main__":
         with requirement_sim_tab:
             if query_doc.requirement_embedding is not None:
                 requirement_sim_res = run_index_search(index, query_doc, "requirement_embedding", limit)
-                cluster_numbers = run_cluster_search(repo_clusters, requirement_sim_res["name"])
-                requirement_sim_res["cluster number"] = cluster_numbers
+                topic_cluster_numbers = run_topic_cluster_search(repo_topic_clusters, requirement_sim_res["name"])
+                requirement_sim_res["topic cluster"] = topic_cluster_numbers
+                code_cluster_numbers = run_code_cluster_search(repo_code_clusters, requirement_sim_res["name"])
+                requirement_sim_res["code cluster"] = code_cluster_numbers
                 st.dataframe(requirement_sim_res[display_columns])
             else:
                 st.error("No requirement file was extracted for this repository!")
@@ -421,31 +492,34 @@ if __name__ == "__main__":
             if query_doc.repository_embedding is not None:
                 # Repo Sim tab
                 repo_sim_res = run_index_search(index, query_doc, "repository_embedding", limit)
-                cluster_numbers = run_cluster_search(repo_clusters, repo_sim_res["name"])
-                repo_sim_res["cluster number"] = cluster_numbers
+                topic_cluster_numbers = run_topic_cluster_search(repo_topic_clusters, repo_sim_res["name"])
+                repo_sim_res["topic cluster"] = topic_cluster_numbers
+                code_cluster_numbers = run_code_cluster_search(repo_code_clusters, repo_sim_res["name"])
+                repo_sim_res["code cluster"] = code_cluster_numbers
                 st.dataframe(repo_sim_res[display_columns])
             else:
                 st.error("No such useful information was extracted for this repository!")
 
-        with cluster_tab:
+        with code_cluster_tab:
             if query_doc.repository_embedding is not None:
-                cluster_df = run_similaritycal_search(index, repo_clusters, sim_cal_model,
-                                                      query_doc, cluster_number, limit,
-                                                      same_cluster=False)
-                cluster_numbers = run_cluster_search(repo_clusters, cluster_df["name"])
-                cluster_df["cluster number"] = cluster_numbers
-                st.dataframe(cluster_df[display_columns])
+                cluster_df = run_similaritycal_search(index, repo_code_clusters, sim_cal_model,
+                                                      query_doc, code_cluster_number, limit)
+                code_cluster_numbers = run_code_cluster_search(repo_code_clusters, cluster_df["name"])
+                cluster_df["code cluster"] = code_cluster_numbers
+                topic_cluster_numbers = run_topic_cluster_search(repo_topic_clusters, cluster_df["name"])
+                cluster_df["topic cluster"] = topic_cluster_numbers
+                st.dataframe(cluster_df[modified_display_columns])
             else:
                 st.error("No such useful information was extracted for this repository!")
 
-        with same_cluster_tab:
+        with topic_cluster_tab:
             if query_doc.repository_embedding is not None:
-                # Cluster tab and same cluster tab
-                same_cluster_df = run_similaritycal_search(index, repo_clusters, sim_cal_model,
-                                                           query_doc, cluster_number, limit,
-                                                           same_cluster=True)
-                same_cluster_numbers = run_cluster_search(repo_clusters, same_cluster_df["name"])
-                same_cluster_df["cluster number"] = same_cluster_numbers
-                same_cluster_tab.dataframe(same_cluster_df[display_columns])
+                cluster_df = run_similaritycal_search(index, repo_topic_clusters, sim_cal_model,
+                                                      query_doc, topic_cluster_number, limit)
+                topic_cluster_numbers = run_topic_cluster_search(repo_topic_clusters, cluster_df["name"])
+                cluster_df["topic cluster"] = topic_cluster_numbers
+                code_cluster_numbers = run_code_cluster_search(repo_code_clusters, cluster_df["name"])
+                cluster_df["code cluster"] = code_cluster_numbers
+                st.dataframe(cluster_df[modified_display_columns])
             else:
-                same_cluster_tab.error("No such useful information was extracted for this repository!")
+                topic_cluster_tab.error("No such useful information was extracted for this repository!")
 
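Two behavioural details of the new app.py are easy to miss in the diff above: a repository with no code embedding falls back to a 768-dimensional zero vector before code-cluster prediction, and the two cluster tabs now display integer rankings instead of raw scores. Here is a condensed, self-contained restatement of both; the helper names are hypothetical, while the fallback shape, model path, and ranking logic follow the diff.

import joblib
import numpy as np
import pandas as pd

def assign_code_cluster(code_embedding, kmeans_path="data/kmeans_model_code_unixcoder.pkl"):
    # Fall back to a zero vector when no function code was extracted
    emb = np.zeros((768,), dtype=np.float32) if code_embedding is None else code_embedding
    kmeans = joblib.load(kmeans_path)
    return int(kmeans.predict(emb.reshape(1, -1))[0])

def scores_to_rankings(df: pd.DataFrame, limit: int) -> pd.DataFrame:
    # Sort by similarity score, keep the top `limit` rows, then replace scores with 1-based ranks
    out = df.sort_values(by="scores", ascending=False).reset_index(drop=True).head(limit)
    out["rankings"] = out["scores"].rank(ascending=False).astype(int)
    return out.drop(columns="scores")

# Stand-in usage:
demo = pd.DataFrame({"name": ["a/x", "b/y", "c/z"], "scores": [0.91, 0.17, 0.66]})
print(scores_to_rankings(demo, limit=2))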
assets/Repository-Code Cluster Assignments.png ADDED
assets/Repository-Topic Cluster Assignments.png ADDED
common/__init__.py ADDED
File without changes
{data → common}/pair_classifier.py RENAMED
File without changes
{data → common}/repo_doc.py RENAMED
File without changes
data/{kmeans_model_scibert.pkl → kmeans_model_code_unixcoder.pkl} RENAMED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7b561ee3342b0b8646533e6b7ffd451234d76ce3695862fd17fad18787a3b47c
+oid sha256:bb534645bce9fb19975873003be27e0b386df7550693caed46ee0f1822b16533
 size 967215
data/kmeans_model_topic_scibert.pkl ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:48272b4172b3dba079348462044f72f19a004ff65d6cd9222ef424468261f1fb
+size 967215
data/repo_code_clusters.json ADDED
The diff for this file is too large to render. See raw diff
 
data/repo_code_clusters_test.json ADDED
The diff for this file is too large to render. See raw diff
 
data/{repo_clusters.json → repo_topic_clusters.json} RENAMED
File without changes
data/{repo_clusters_test.json → repo_topic_clusters_test.json} RENAMED
File without changes
requirements.txt CHANGED
@@ -9,4 +9,5 @@ tqdm
 scikit-learn
 nltk
 plotly
-joblib
+joblib
+matplotlib