metadata
license: mit
datasets:
  - WhereIsAI/github-issue-similarity
language:
  - en
library_name: sentence-transformers
pipeline_tag: feature-extraction

WhereIsAI/UAE-Code-Large-V1

📢 WhereIsAI/UAE-Code-Large-V1 is licensed under MIT. Feel free to use it in any scenario. If you use it in an academic paper, we would greatly appreciate a citation. 👉 See the Citation section at the end of this page.

This model builds upon WhereIsAI/UAE-Large-V1 and is fine-tuned on the GIS (GitHub Issue Similarity) dataset using the AnglE loss (https://arxiv.org/abs/2309.12871). It can be used to measure the similarity of code snippets and of GitHub issues.

Results (test set):

  • Spearman correlation: 71.19
  • Accuracy: 84.37
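For readers unfamiliar with these two metrics, here is a minimal sketch of how they could be computed from predicted similarity scores and binary duplicate labels. It is not the evaluation script behind the numbers above: the scores and labels are made up, the 0.5 decision threshold is an assumption, and the reported figures are assumed to be the raw values multiplied by 100.

```python
# Toy illustration only: the scores and labels are made up, and the 0.5
# threshold is an assumption, not the protocol used for the reported results.

def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def accuracy(scores, labels, threshold=0.5):
    """Fraction of pairs whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == l for p, l in zip(preds, labels)) / len(labels)


pred_scores = [0.91, 0.15, 0.78, 0.40, 0.66, 0.22]  # hypothetical cosine similarities
gold_labels = [1, 0, 1, 0, 0, 0]                    # hypothetical duplicate labels

rho = spearman(pred_scores, gold_labels)
acc = accuracy(pred_scores, gold_labels)
print(f"Spearman correlation: {rho * 100:.2f}")
print(f"Accuracy: {acc * 100:.2f}")
```

If SciPy is available, `scipy.stats.spearmanr` computes the same correlation and can replace the hand-written `spearman` helper.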

Usage

1. angle-emb

You can use it via angle-emb as follows:

install:

python -m pip install -U angle-emb

example:

from scipy import spatial
from angle_emb import AnglE

model = AnglE.from_pretrained('WhereIsAI/UAE-Code-Large-V1').cuda()

quick_sort = '''# Approach 2: Quicksort using list comprehension

def quicksort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        left = [x for x in arr[1:] if x < pivot]
        right = [x for x in arr[1:] if x >= pivot]
        return quicksort(left) + [pivot] + quicksort(right)
 
# Example usage
arr = [1, 7, 4, 1, 10, 9, -2]
sorted_arr = quicksort(arr)
print("Sorted Array in Ascending Order:")
print(sorted_arr)'''


bubble_sort = '''def bubblesort(elements):
    # Looping from size of array from last index[-1] to index [0]
    for n in range(len(elements)-1, 0, -1):
        swapped = False
        for i in range(n):
            if elements[i] > elements[i + 1]:
                swapped = True
                # swapping data if the element is less than next element in the array
                elements[i], elements[i + 1] = elements[i + 1], elements[i]
        if not swapped:
            # exiting the function if we didn't make a single swap
            # meaning that the array is already sorted.
            return

elements = [39, 12, 18, 85, 72, 10, 2, 18]

print("Unsorted list is,")
print(elements)
bubblesort(elements)
print("Sorted Array is, ")
print(elements)'''

vecs = model.encode([
    'def echo(): print("hello world")',
    quick_sort,
    bubble_sort
])


print('cos sim (0, 1):', 1 - spatial.distance.cosine(vecs[0], vecs[1]))
print('cos sim (0, 2):', 1 - spatial.distance.cosine(vecs[0], vecs[2]))
print('cos sim (1, 2):', 1 - spatial.distance.cosine(vecs[1], vecs[2]))

output:

cos sim (0, 1): 0.34329649806022644
cos sim (0, 2): 0.3627094626426697
cos sim (1, 2): 0.6972219347953796
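The example subtracts from 1 because `scipy.spatial.distance.cosine` returns a cosine distance, which is 1 minus the cosine similarity. The sketch below writes the similarity out directly, to make explicit what the printed numbers mean. The three-dimensional vectors are hypothetical stand-ins, not real model embeddings.

```python
import math


def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors standing in for rows of `vecs`.
a = [1.0, 0.0, 1.0]
b = [1.0, 1.0, 0.0]
c = [2.0, 0.0, 2.0]  # same direction as `a`, different length

print('cos sim (a, b):', cosine_similarity(a, b))
print('cos sim (a, c):', cosine_similarity(a, c))
```

Cosine similarity depends only on the direction of the vectors, not their length, so `a` and `c` score as identical even though their magnitudes differ.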

2. sentence-transformers

You can also use it via sentence-transformers:

from scipy import spatial
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('WhereIsAI/UAE-Code-Large-V1').cuda()

quick_sort = '''# Approach 2: Quicksort using list comprehension

def quicksort(arr):
    if len(arr) <= 1:
        return arr
    else:
        pivot = arr[0]
        left = [x for x in arr[1:] if x < pivot]
        right = [x for x in arr[1:] if x >= pivot]
        return quicksort(left) + [pivot] + quicksort(right)
 
# Example usage
arr = [1, 7, 4, 1, 10, 9, -2]
sorted_arr = quicksort(arr)
print("Sorted Array in Ascending Order:")
print(sorted_arr)'''


bubble_sort = '''def bubblesort(elements):
    # Looping from size of array from last index[-1] to index [0]
    for n in range(len(elements)-1, 0, -1):
        swapped = False
        for i in range(n):
            if elements[i] > elements[i + 1]:
                swapped = True
                # swapping data if the element is less than next element in the array
                elements[i], elements[i + 1] = elements[i + 1], elements[i]
        if not swapped:
            # exiting the function if we didn't make a single swap
            # meaning that the array is already sorted.
            return

elements = [39, 12, 18, 85, 72, 10, 2, 18]

print("Unsorted list is,")
print(elements)
bubblesort(elements)
print("Sorted Array is, ")
print(elements)'''

vecs = model.encode([
    'def echo(): print("hello world")',
    quick_sort,
    bubble_sort
])


print('cos sim (0, 1):', 1 - spatial.distance.cosine(vecs[0], vecs[1]))
print('cos sim (0, 2):', 1 - spatial.distance.cosine(vecs[0], vecs[2]))
print('cos sim (1, 2):', 1 - spatial.distance.cosine(vecs[1], vecs[2]))

output:

cos sim (0, 1): 0.34329649806022644
cos sim (0, 2): 0.3627094626426697
cos sim (1, 2): 0.6972219347953796
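Beyond comparing pairs, a common use of such embeddings is to rank several candidate snippets or issues against one query. The following is a minimal sketch of that idea, assuming the embeddings are already computed. The vectors and the names `snippet_a`, `snippet_b`, and `snippet_c` are hypothetical placeholders, not outputs of this model.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rank_by_similarity(query, candidates):
    """Return (name, score) pairs sorted from most to least similar to `query`."""
    q = normalize(query)
    scored = []
    for name, vec in candidates.items():
        c = normalize(vec)
        scored.append((name, sum(a * b for a, b in zip(q, c))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors; in practice these would come from model.encode(...).
query_vec = [0.9, 0.1, 0.3]
candidate_vecs = {
    "snippet_a": [0.8, 0.2, 0.4],
    "snippet_b": [0.7, 0.1, 0.6],
    "snippet_c": [0.1, 0.9, 0.0],
}

ranking = rank_by_similarity(query_vec, candidate_vecs)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

Normalizing each vector once and then taking dot products gives the same scores as computing cosine similarity pair by pair, and it avoids recomputing norms when there are many candidates.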

Citation

@article{li2023angle,
  title={AnglE-optimized Text Embeddings},
  author={Li, Xianming and Li, Jing},
  journal={arXiv preprint arXiv:2309.12871},
  year={2023}
}