---
title: RepoSnipy
emoji: 🐉
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: 1.31.1
app_file: app.py
pinned: true
license: mit
---

# RepoSnipy 🐉

[![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-md-dark.svg)](https://huggingface.co/spaces/Henry65/RepoSnipy)

Neural search engine for discovering semantically similar Python repositories on GitHub.

## Demo

**TODO: Update the gif file!!!**

Searching an indexed repository:

![Search Indexed Repo Demo](assets/search.gif)

## About

RepoSnipy is a neural search engine built with [streamlit](https://github.com/streamlit/streamlit) and [docarray](https://github.com/docarray/docarray). You can query a public Python repository hosted on GitHub and find popular repositories that are semantically similar to it.

Compared to the previous generation of [RepoSnipy](https://github.com/RepoAnalysis/RepoSnipy), the latest version adds the following features:

* It uses [RepoSim4Py](https://github.com/RepoMining/RepoSim4Py), which is based on the [RepoSim4Py pipeline](https://huggingface.co/Henry65/RepoSim4Py), to create multi-level embeddings for Python repositories.
* Multi-level embeddings: code, doc, readme, requirement, and repository.
* It uses the [SciBERT](https://arxiv.org/abs/1903.10676) model to analyse repository topics and to generate embeddings for topics.
* Mapping multiple topics to one cluster: it uses a KMeans model ([kmeans_model_topic_scibert](data/kmeans_model_topic_scibert.pkl)) to analyse topic embeddings and to cluster repositories by topic.
* Clustering by code snippets: it uses a KMeans model ([kmeans_model_code_unixcoder](data/kmeans_model_code_unixcoder.pkl)) to analyse code embeddings and to cluster repositories by code snippets.
* It uses the [SimilarityCal](data/SimilarityCal_model_NO1.pt) model, a binary classifier that calculates cluster similarity from repository-level embeddings and cluster assignments (topic or code cluster number). More precisely, the SimilarityCal model treats two repositories in the same cluster as label 1, and any other pair as label 0. The input feature is the concatenation of the two repositories' embeddings, and the target is the binary label described above. The output is a pair of scores indicating how similar or dissimilar the two repositories are.

We have created a [vector dataset](data/index.bin) (stored as a docarray index) of approximately 9700 GitHub Python repositories that had a license and over 300 stars as of March 2024. The resulting clusters are stored in two JSON datasets ([repo_topic_clusters](data/repo_topic_clusters.json) and [repo_code_clusters](data/repo_code_clusters.json)), each mapping a repository to its cluster as key-value pairs.

## Dataset

As mentioned above, RepoSnipy needs the [vector dataset](data/index.bin), the cluster JSON datasets ([repo_topic_clusters](data/repo_topic_clusters.json) and [repo_code_clusters](data/repo_code_clusters.json)), the KMeans models ([kmeans_model_topic_scibert](data/kmeans_model_topic_scibert.pkl) and [kmeans_model_code_unixcoder](data/kmeans_model_code_unixcoder.pkl)), and the [SimilarityCal](data/SimilarityCal_model_NO1.pt) model at startup. For your convenience, we have uploaded them to the [data](data) folder of this repository.

## License

Distributed under the MIT License. See [LICENSE](LICENSE) for more information.

## Acknowledgments

The models, fine-tuning dataset, and related projects used:

* [UniXCoder](https://arxiv.org/abs/2203.03850)
* [AdvTest](https://arxiv.org/abs/1909.09436)
* [SciBERT](https://arxiv.org/abs/1903.10676)
* [RepoSnipy (old version)](https://github.com/RepoAnalysis/RepoSnipy)
* [RepoSnipy HuggingFace Spaces (old version)](https://huggingface.co/spaces/Lazyhope/RepoSnipy)
* [RepoSim4Py](https://github.com/RepoMining/RepoSim4Py)
* [SimilarityCal](https://github.com/RepoMining/SimilarityCal)
* [RepoSnipy](https://github.com/RepoMining/RepoSnipy)
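
## Example: SimilarityCal input pairs

The About section describes the SimilarityCal input as the concatenation of two repositories' embeddings, labelled 1 when both repositories fall in the same cluster and 0 otherwise. The following is a minimal sketch of how such pairs could be assembled; it is not the project's actual training code. The function name `build_pairs`, the repository names, and the two-dimensional toy embeddings are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def build_pairs(embeddings, clusters):
    """Build (repo_a, repo_b, feature, label) tuples for every repository pair.

    embeddings: dict mapping repository name -> list of floats (hypothetical format)
    clusters:   dict mapping repository name -> cluster number
    """
    pairs = []
    for repo_a, repo_b in combinations(sorted(embeddings), 2):
        # Input feature: the two repository embeddings concatenated.
        feature = embeddings[repo_a] + embeddings[repo_b]
        # Binary label: 1 if both repositories share a cluster, else 0.
        label = 1 if clusters[repo_a] == clusters[repo_b] else 0
        pairs.append((repo_a, repo_b, feature, label))
    return pairs


# Toy data (assumed values, not taken from the real dataset).
embeddings = {
    "org/repo-a": [0.1, 0.2],
    "org/repo-b": [0.3, 0.4],
    "org/repo-c": [0.9, 0.8],
}
clusters = {"org/repo-a": 0, "org/repo-b": 0, "org/repo-c": 1}

pairs = build_pairs(embeddings, clusters)
for repo_a, repo_b, feature, label in pairs:
    print(repo_a, repo_b, len(feature), label)
```

Each feature is twice the length of a single embedding, so a classifier trained on these pairs would take an input of that doubled size.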