Lazyhope committed on
Commit
df6d013
1 Parent(s): 16c089c

Update the model card and make the pipeline public

  1. README.md +82 -0
README.md CHANGED
@@ -1,3 +1,85 @@
1
  ---
2
  license: mit
3
+ tags:
4
+ - code-understanding
5
+ - unixcoder
6
  ---
7
+
8
+ # RepoSim
9
+
10
+ An approach for comparing the semantic similarity of Python repositories.
11
+
12
+ ## Model Details
13
+
14
+ **RepoSim** is a pipeline that creates embeddings for specified Python repositories on GitHub. For each repository, it extracts every function's source code and docstring, encodes each into an embedding, then averages them to obtain a mean code embedding and a mean docstring embedding. These repository-level embeddings can be used for tasks such as cosine similarity comparison.
15
+
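The averaging step described above can be sketched in plain Python. This is only an illustration, not the pipeline's actual code: `mean_embedding` is a hypothetical helper, and the toy 3-dimensional vectors stand in for real per-function embeddings.

```python
from statistics import fmean


def mean_embedding(embeddings):
    """Average a list of equal-length vectors element-wise.

    Hypothetical helper for illustration; not part of RepoSim.
    """
    if not embeddings:
        raise ValueError("need at least one embedding")
    # zip(*embeddings) groups the i-th component of every vector together.
    return [fmean(component) for component in zip(*embeddings)]


# Toy vectors standing in for the embeddings of two functions.
function_embeddings = [
    [1.0, 2.0, 3.0],
    [3.0, 4.0, 5.0],
]
print(mean_embedding(function_embeddings))  # [2.0, 3.0, 4.0]
```

For a repository with a single function, the mean is simply that function's embedding.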
16
+ ### Model Description
17
+
18
+ The model used by **RepoSim** is **UniXcoder** fine-tuned on the [code search](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder/downstream-tasks/code-search) task using the [AdvTest](https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv) dataset.
19
+
20
+ - **Pipeline developed by:** [Lazyhope](https://huggingface.co/Lazyhope)
21
+ - **Repository:** [RepoSim](https://github.com/RepoAnalysis/RepoSim)
22
+ - **Model type:** **code understanding**
23
+ - **Language(s):** **Python**
24
+ - **License:** **MIT**
25
+
26
+ ### Model Sources
27
+
28
+ - **Repository:** [UniXcoder](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder)
29
+ - **Paper:** [UniXcoder: Unified Cross-Modal Pre-training for Code Representation](https://arxiv.org/pdf/2203.03850.pdf)
30
+
31
+ ## Uses
32
+
33
+ The example below uses the RepoSim pipeline to generate embeddings for Python repositories on GitHub.
34
+
35
+ First, initialise the pipeline:
36
+ ```python
37
+ from transformers import pipeline
38
+
39
+ model = pipeline(model="Lazyhope/RepoSim", trust_remote_code=True)
40
+ ```
41
+ Then pass one repository (or several repositories in a tuple) as input; the result is a list of dictionaries, one per repository:
42
+ ```python
43
+ repo_infos = model("lazyhope/python-hello-world")
44
+ print(repo_infos)
45
+ ```
46
+ Output (long tensor values are truncated):
47
+ ```python
48
+ [{'name': 'lazyhope/python-hello-world',
49
+ 'topics': [],
50
+ 'license': 'MIT',
51
+ 'stars': 0,
52
+ 'code_embeddings': [["def main():\n print('Hello World!')",
53
+ [-2.0755109786987305,
54
+ 2.813878297805786,
55
+ 2.352170467376709, ...]]],
56
+ 'mean_code_embedding': [-2.0755109786987305,
57
+ 2.813878297805786,
58
+ 2.352170467376709, ...],
59
+ 'doc_embeddings': [['Prints hello world',
60
+ [-2.3749449253082275,
61
+ 0.5409570336341858,
62
+ 2.2958014011383057, ...]]],
63
+ 'mean_doc_embedding': [-2.3749449253082275,
64
+ 0.5409570336341858,
65
+ 2.2958014011383057, ...]}]
66
+ ```
67
+
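As a rough sketch of the cosine similarity comparison mentioned earlier, two repositories can be compared through their `mean_code_embedding` values. The repository names and the short vectors below are made up for illustration, and `cosine_similarity` is a hypothetical helper. The sketch also assumes the embeddings are plain lists of floats, so tensor outputs would need converting first.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    Hypothetical helper for illustration; not part of RepoSim.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Made-up entries shaped like the pipeline output above (vectors shortened).
repo_a = {"name": "owner/repo-a", "mean_code_embedding": [1.0, 0.0, 1.0]}
repo_b = {"name": "owner/repo-b", "mean_code_embedding": [1.0, 1.0, 0.0]}

score = cosine_similarity(
    repo_a["mean_code_embedding"], repo_b["mean_code_embedding"]
)
print(f"{repo_a['name']} vs {repo_b['name']}: {score:.2f}")
```

The same function works with `mean_doc_embedding` to compare repositories by their docstrings instead of their code.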
68
+ ## Training Details
69
+
70
+ Please refer to the original [UniXcoder](https://github.com/microsoft/CodeBERT/tree/master/UniXcoder/downstream-tasks/code-search) page for details on fine-tuning it for the code search task.
71
+
72
+ ## Evaluation
73
+
74
+ We used the [awesome-python](https://github.com/vinta/awesome-python) list, which contains over 500 Python repositories categorized by topic, to label similar repositories.
75
+ The evaluation metrics and results can be found in the RepoSim repository, under the [notebooks](https://github.com/RepoAnalysis/RepoSim/tree/main/notebooks) folder.
76
+
77
+ ## Acknowledgements
78
+ Many thanks to the authors of the UniXcoder model and the AdvTest dataset, as well as the awesome-python list for providing a useful baseline.
79
+ - **UniXcoder** (https://github.com/microsoft/CodeBERT/tree/master/UniXcoder)
80
+ - **AdvTest** (https://github.com/microsoft/CodeXGLUE/tree/main/Text-Code/NL-code-search-Adv)
81
+ - **awesome-python** (https://github.com/vinta/awesome-python)
82
+
83
+ ## Authors
84
+ - **Zihao Li** (https://github.com/lazyhope)
85
+ - **Rosa Filgueira** (https://www.rosafilgueira.com)