Honglin Zhang
metadata
title: RepoSnipy
emoji: 🐉
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.31.1
app_file: app.py
pinned: true
license: mit

RepoSnipy 🐉

Neural search engine for discovering semantically similar Python repositories on GitHub.

Demo

TODO --- Update the gif file!!!

Searching an indexed repository:

Search Indexed Repo Demo

About

RepoSnipy is a neural search engine built with streamlit and docarray. You can query a public Python repository hosted on GitHub and find popular repositories that are semantically similar to it.
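As a rough illustration of what "semantically similar" can mean for embedding vectors, the sketch below ranks repositories by cosine similarity to a query embedding. This is a minimal example under stated assumptions, not RepoSnipy's actual search code: the cosine metric, the function names `cosine_similarity` and `top_k_similar`, the toy two-dimensional vectors, and the `org/...` repository names are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query, index, k=2):
    """Return the k (name, score) pairs with the highest similarity to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical toy index: repository name -> repository-level embedding.
index = {
    "org/a": [1.0, 0.0],
    "org/b": [0.0, 1.0],
    "org/c": [0.7, 0.7],
}

results = top_k_similar([1.0, 0.1], index, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

In practice the embeddings would have far more dimensions and the index would be queried through a vector store rather than a Python dictionary.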

Compared to the previous generation of RepoSnipy, the latest version adds the following features:

  • It uses RepoSim4Py, which is built on the RepoSim4Py pipeline, to create multi-level embeddings for Python repositories.
  • The multi-level embeddings cover code, doc, readme, requirement, and repository.
  • It uses the SciBERT model to analyse repository topics and to generate topic embeddings.
  • Merging multiple topics into one cluster: it uses a KMeans model (kmeans_model_topic_scibert) to analyse topic embeddings and to cluster repositories by topic.
  • Clustering by code snippets: it uses a KMeans model (kmeans_model_code_unixcoder) to analyse code embeddings and to cluster repositories by code snippets.
  • It uses the SimilarityCal model, a binary classifier, to score cluster similarity from repository-level embeddings and cluster assignments (topic or code cluster number). More specifically, SimilarityCal treats two repositories in the same cluster as label 1 and all other pairs as label 0. Its input is the concatenation of the two repositories' embeddings, and its output is a score indicating how similar or dissimilar the two repositories are.
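The pairing scheme described for SimilarityCal (concatenated embeddings as features, same-cluster membership as the label) can be sketched as follows. This is only an illustration of the idea: the helper name `build_pair_dataset`, the toy embeddings, and the `org/...` repository names are hypothetical, and the real model's training code may differ.

```python
from itertools import combinations


def build_pair_dataset(embeddings, clusters):
    """Build (features, labels) for every unordered pair of repositories.

    features: the two repository-level embeddings concatenated.
    labels:   1 if both repositories share a cluster, otherwise 0.
    """
    features, labels = [], []
    for repo_a, repo_b in combinations(sorted(embeddings), 2):
        features.append(embeddings[repo_a] + embeddings[repo_b])
        labels.append(1 if clusters[repo_a] == clusters[repo_b] else 0)
    return features, labels


# Hypothetical toy data: repository-level embeddings and cluster numbers.
embeddings = {
    "org/a": [0.1, 0.9],
    "org/b": [0.2, 0.8],
    "org/c": [0.9, 0.1],
}
clusters = {"org/a": 0, "org/b": 0, "org/c": 1}

features, labels = build_pair_dataset(embeddings, clusters)
for row, label in zip(features, labels):
    print(row, label)
```

A binary classifier trained on such pairs would then output a score for any two repositories, which is how the README describes the SimilarityCal output.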

We have created a vector dataset (stored as a docarray index) of approximately 9,700 GitHub Python repositories that had a license and more than 300 stars as of March 2024. The resulting clusters are stored in two JSON datasets (repo_topic_clusters and repo_code_clusters), each mapping repositories to clusters as key-value pairs.
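A repository-to-cluster mapping of this kind could be queried as in the sketch below. The exact JSON layout is an assumption here (repository name as key, cluster number as value), and the helper `repos_in_same_cluster` and the sample entries are hypothetical.

```python
import json


def repos_in_same_cluster(repo_clusters, repo_name):
    """Return the other repositories that share repo_name's cluster, sorted by name."""
    cluster = repo_clusters.get(repo_name)
    if cluster is None:
        # The repository is not in the mapping.
        return []
    return sorted(
        name
        for name, other_cluster in repo_clusters.items()
        if other_cluster == cluster and name != repo_name
    )


# Hypothetical content in the assumed layout: repository name -> cluster number.
raw = '{"org/a": 3, "org/b": 3, "org/c": 7}'
repo_topic_clusters = json.loads(raw)

peers = repos_in_same_cluster(repo_topic_clusters, "org/a")
print(peers)
```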

Dataset

As mentioned above, RepoSnipy needs the vector dataset, the cluster JSON datasets (repo_topic_clusters and repo_code_clusters), the KMeans models (kmeans_model_topic_scibert and kmeans_model_code_unixcoder), and the SimilarityCal model at startup. For convenience, we have uploaded them to the data folder of this repository.

License

Distributed under the MIT License. See LICENSE for more information.

Acknowledgments

The model and the fine-tuning dataset used: