tomaarsen committed
Commit 95afd22
1 Parent(s): 9705163

Update code snippet to use sentence-level embeddings; update title


Hello!

## Pull Request overview
* Use sentence level embeddings (pooler output) in code snippet
* Use "CodeSage-Base" in the title rather than "CodeSage-Large"

## Details
Before, the snippet returned only token-level embeddings, which are generally less useful in practice. I think the pooler output will be more useful.
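To illustrate the shape difference this change is about, here is a minimal sketch with random tensors (not actual CodeSage outputs). Token-level output gives one vector per token; a pooled output collapses that into a single sentence-level vector. Mean pooling is shown here only as a stand-in for pooling in general; the model's `pooler_output` is its own learned pooling, not necessarily a mean.

```python
import torch

# Dummy token-level output: batch of 1, 13 tokens, hidden size 1024
# (mirrors the shapes in the README snippet; values are random, not from CodeSage)
last_hidden_state = torch.randn(1, 13, 1024)

# Token-level embeddings: one 1024-d vector per token
print(last_hidden_state.shape)  # torch.Size([1, 13, 1024])

# Collapsing over the token axis yields a single sentence-level vector,
# which is the kind of output the updated snippet's pooler_output provides.
sentence_embedding = last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 1024])
```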

- Tom Aarsen

Files changed (1)
  1. README.md +6 -6
README.md CHANGED
@@ -7,7 +7,7 @@ language:
 - code
 ---
 
-## CodeSage-Large
+## CodeSage-Base
 
 ### Model description
 CodeSage is a new family of open code embedding models with an encoder architecture that support a wide range of source code understanding tasks. It is introduced in the paper:
@@ -24,7 +24,7 @@ This checkpoint is first trained on code data via masked language modeling (MLM)
 ### How to use
 This checkpoint consists of an encoder (356M model), which can be used to extract code embeddings of 1024 dimension. It can be easily loaded using the AutoModel functionality and employs the Starcoder tokenizer (https://arxiv.org/pdf/2305.06161.pdf).
 
-```
+```python
 from transformers import AutoModel, AutoTokenizer
 
 checkpoint = "codesage/codesage-base"
@@ -33,10 +33,10 @@ device = "cuda" # for GPU usage or "cpu" for CPU usage
 tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
 model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
 
-inputs = tokenizer.encode("def print_hello_world():\tprint('Hello World!')", return_tensors="pt").to(device)
-embedding = model(inputs)[0]
-print(f'Dimension of the embedding: {embedding[0].size()}')
-# Dimension of the embedding: torch.Size([13, 1024])
+inputs = tokenizer("def print_hello_world():\tprint('Hello World!')", return_tensors="pt").to(device)
+embedding = model(**inputs).pooler_output
+print(f'Dimension of the embedding: {embedding.size()}')
+# Dimension of the embedding: torch.Size([1, 1024])
 ```
 
 ### BibTeX entry and citation info