HenryStephen committed
Commit eea0bde • 1 Parent(s): dd1a4ad

Update README.md

Files changed (1)
  1. README.md +6 -72
README.md CHANGED
@@ -1,16 +1,5 @@
- ---
- title: RepoSnipy
- emoji: πŸ‰
- colorFrom: green
- colorTo: yellow
- sdk: streamlit
- sdk_version: 1.31.1
- app_file: app.py
- pinned: true
- license: mit
- ---
-
  # RepoSnipy πŸ‰
  Neural search engine for discovering semantically similar Python repositories on GitHub.

  ## Demo
@@ -29,71 +18,16 @@ Compared to the previous generation of [RepoSnipy](https://github.com/RepoAnalys
  * Multi-level embeddings --- code, docstring, readme, requirement, and repository.
  * It uses the [SciBERT](https://arxiv.org/abs/1903.10676) model to analyse repository topics and to generate embeddings for topics.
  * Transfer multiple topics into one cluster --- it uses a [KMeans](data/kmeans_model_scibert.pkl) model to analyse topic embeddings and to cluster repositories based on topics.
- * **SimilarityCal --- TODO update!!!**

  We have created a [vector dataset](data/index.bin) (stored as a docarray index) of approximately 9,700 GitHub Python repositories that have a license and over 300 stars as of February 2024. The corresponding clusters were put into a [JSON dataset](data/repo_clusters.json) (stored as repo-cluster key-value pairs).
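The topic-clustering step described above can be sketched roughly as follows. This is an illustration only, not the project's actual pipeline: the repository names are made up, and random vectors stand in for the SciBERT topic embeddings.

```python
# Sketch: cluster repositories by topic embeddings with KMeans, then store a
# repo -> cluster mapping in the same key-value shape as repo_clusters.json.
# Random vectors stand in for SciBERT topic embeddings; repo names are made up.
import json

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
repos = ["org/repo-a", "org/repo-b", "org/repo-c", "org/repo-d"]
topic_embeddings = rng.normal(size=(len(repos), 768))  # one vector per repo

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(topic_embeddings)

# repo-cluster pairs stored as key-values, as in the JSON dataset above
repo_clusters = {repo: int(c) for repo, c in zip(repos, cluster_ids)}
print(json.dumps(repo_clusters))
```

In the real pipeline the fitted model is pickled (`kmeans_model_scibert.pkl`) so the app can assign a query repository to an existing cluster at search time.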

- ## Installation
-
- ### Prerequisites
- * Python 3.11
- * pip
-
- ### Installation with code
- We recommend first creating a [conda](https://conda.io/projects/conda/en/latest/index.html) environment with `python 3.11`. Then you can download the repository:
- ```bash
- conda create --name py311 python=3.11
- conda activate py311
- git clone https://github.com/RepoMining/RepoSnipy
- ```
- After downloading the repository, you need to install the required packages. **Make sure the python and pip you use are both from the conda environment!**
- Then run the following:
- ```bash
- cd RepoSnipy
- pip install -r requirements.txt
- ```
-
- ### Usage
- Then run the app on your local machine using:
- ```bash
- streamlit run app.py
- ```
- or
- ```bash
- python -m streamlit run app.py
- ```
- Importantly, to avoid unnecessary conflicts (such as version or package-location conflicts), make sure the **streamlit you use is from the conda environment**!
-
- ### Dataset
- As mentioned above, RepoSnipy needs the [vector](data/index.bin) and [JSON](data/repo_clusters.json) datasets and the [KMeans](data/kmeans_model_scibert.pkl) model when it starts up. For your convenience, we have uploaded them to the [data](data) folder of this repository.
-
- For research purposes, we have provided the following scripts so you can recreate them:
- ```bash
- cd data
- python create_index.py       # creates the vector dataset (binary files)
- python generate_cluster.py   # creates the cluster model and information (KMeans model and JSON files representing repo-clusters)
- ```
-
- See the two scripts above for more details. Running them produces the following files:
- 1. Generated by [create_index.py](data/create_index.py):
- ```bash
- repositories.txt                      # the original repositories file
- invalid_repositories.txt              # the invalid repositories file
- filtered_repositories.txt             # the final repositories file, with duplicated and invalid repositories removed
- index{i}_{i * target_sub_length}.bin  # the sub-index files, where i is the number of sub-repositories and target_sub_length is the sub-repository length
- index.bin                             # the index file merged from the sub-index files, with numpy zero arrays removed
- ```
- 2. Generated by [generate_cluster.py](data/generate_cluster.py):
- ```
- repo_clusters.json        # a JSON file representing the repo-cluster dictionary
- kmeans_model_scibert.pkl  # a pickle file storing the KMeans model based on topic embeddings generated by the SciBERT model
- ```
-
- ## Evaluation
- **TODO ---- update!!!**
-
- The [evaluation script](evaluate.py) finds all combinations of repository pairs in the dataset and calculates the cosine similarity between their embeddings. It also checks whether they share at least one topic (other than `python` and `python3`). We then compare the two and use the ROC AUC score to evaluate embedding performance. The resulting dataframe containing all pairs' cosine similarities and topic similarities can be downloaded from [here](https://huggingface.co/datasets/Lazyhope/RepoSnipy_eval/tree/main), including both code embedding and docstring embedding evaluations. The resulting ROC AUC score is around 0.84 for code embeddings and around 0.81 for docstring embeddings.
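The evaluation idea above can be sketched as follows: score embedding quality by how well pairwise cosine similarity predicts whether two repositories share a topic. This is a minimal stand-alone illustration with made-up embeddings and topics, not the project's `evaluate.py`.

```python
# Sketch: treat "shares >= 1 topic" as the ground-truth label and cosine
# similarity as the score, then compute ROC AUC. All data is made up.
from itertools import combinations

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(6, 8))  # 6 repos, 8-dim embeddings
topics = [{"ml"}, {"ml"}, {"web"}, {"web"}, {"ml", "cli"}, {"cli"}]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scores, labels = [], []
for i, j in combinations(range(len(embeddings)), 2):
    scores.append(cosine(embeddings[i], embeddings[j]))
    labels.append(int(bool(topics[i] & topics[j])))  # share at least one topic?

auc = roc_auc_score(labels, scores)  # 1.0 = perfect ranking, 0.5 = chance
print(round(auc, 3))
```

With real SciBERT/UniXcoder embeddings this is where the reported ~0.84 (code) and ~0.81 (docstring) AUC figures would come from; the random embeddings here will hover near chance.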

  ## License
  # RepoSnipy πŸ‰
+ [![Open in Spaces](https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-md-dark.svg)](https://huggingface.co/spaces/Henry65/RepoSnipy)
  Neural search engine for discovering semantically similar Python repositories on GitHub.

  ## Demo

  * Multi-level embeddings --- code, docstring, readme, requirement, and repository.
  * It uses the [SciBERT](https://arxiv.org/abs/1903.10676) model to analyse repository topics and to generate embeddings for topics.
  * Transfer multiple topics into one cluster --- it uses a [KMeans](data/kmeans_model_scibert.pkl) model to analyse topic embeddings and to cluster repositories based on topics.
+ * It uses the [SimilarityCal](data/SimilarityCal_model_NO1.pt) model, a binary classifier that calculates repository similarity based on multi-level embeddings and clusters.
+ More specifically, the SimilarityCal model labels repository pairs from the same cluster as 1 and all other pairs as 0. Its input features are the concatenation of the two repositories' embeddings, paired with these binary labels.
+ The output of the SimilarityCal model is a score of how similar or dissimilar two repositories are.

  We have created a [vector dataset](data/index.bin) (stored as a docarray index) of approximately 9,700 GitHub Python repositories that have a license and over 300 stars as of February 2024. The corresponding clusters were put into a [JSON dataset](data/repo_clusters.json) (stored as repo-cluster key-value pairs).
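The pair-classifier idea behind SimilarityCal can be sketched as below. The actual architecture of `SimilarityCal_model_NO1.pt` is not documented here, so this small MLP (and the 768-dim embedding size) is only an assumed stand-in showing the input/output shape: two concatenated repository embeddings in, two class scores (different cluster / same cluster) out.

```python
# Sketch of a pair classifier over concatenated repository embeddings.
# Architecture and dimensions are assumptions, not the real SimilarityCal model.
import torch
import torch.nn as nn

EMB_DIM = 768  # assumed repository-embedding size

class PairClassifier(nn.Module):
    def __init__(self, emb_dim: int = EMB_DIM):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * emb_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 2),  # logits for label 0 (different) / 1 (same cluster)
        )

    def forward(self, emb_a: torch.Tensor, emb_b: torch.Tensor) -> torch.Tensor:
        # Input features: concatenation of the two repositories' embeddings
        return self.net(torch.cat([emb_a, emb_b], dim=-1))

model = PairClassifier()
a, b = torch.randn(1, EMB_DIM), torch.randn(1, EMB_DIM)
probs = torch.softmax(model(a, b), dim=-1)  # similarity/dissimilarity scores
print(probs.shape)  # torch.Size([1, 2])
```

Training such a classifier would use same-cluster pairs as positives (label 1) and cross-cluster pairs as negatives (label 0), matching the labelling scheme described above.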

+ ## Dataset
+ As mentioned above, RepoSnipy needs the [vector](data/index.bin) and [JSON](data/repo_clusters.json) datasets, the [KMeans](data/kmeans_model_scibert.pkl) model, and the [SimilarityCal](data/SimilarityCal_model_NO1.pt) model when it starts up. For your convenience, we have uploaded them to the [data](data) folder of this repository.

  ## License