hustcw/clap-asm · Problems with last_hidden_state

May 13, 2024

edited May 13, 2024

Hello

I am currently using the CLAP model from the Transformers library to compare embeddings between assembly code descriptions and text prompts. However, I'm encountering an issue where I can't access the last_hidden_state attribute from the model's output.

Is anyone else facing the same problem?

Thanks a lot.

hustcw

Owner May 13, 2024

Hi

I think this is because the model output defined in clap_modeling.py is a tensor, so it has no last_hidden_state attribute. If you want to use the CLAP model to compare embeddings between assembly code descriptions and text prompts, you can follow the sample code provided in the model card, which does not need last_hidden_state.

import json

import torch
from transformers import AutoModel, AutoTokenizer

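# use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")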

asm_tokenizer       = AutoTokenizer.from_pretrained("hustcw/clap-asm", trust_remote_code=True)
text_tokenizer      = AutoTokenizer.from_pretrained("hustcw/clap-text", trust_remote_code=True)
asm_encoder         = AutoModel.from_pretrained("hustcw/clap-asm", trust_remote_code=True).to(device)
text_encoder        = AutoModel.from_pretrained("hustcw/clap-text", trust_remote_code=True).to(device)

bubble_output       = "./CaseStudy/bubblesort.json"

# load bubblesort.json
with open(bubble_output) as fp:
    asm = json.load(fp)

prompts = [
    "This is a function related to bubble sort",
    "This is a function related to selection sort",
    "This is a function related to insertion sort",
    "This is a function related to merge sort",
    "This is a function related to quick sort",
    "This is a function related to radix sort",
    "This is a function related to shell sort",
    "This is a function related to counting sort",
    "This is a function related to bucket sort",
    "This is a function related to heap sort",
]

with torch.no_grad():
    asm_input = asm_tokenizer([asm], padding=True, pad_to_multiple_of=8, return_tensors="pt", verbose=False)
    asm_input = asm_input.to(device)
    asm_embedding = asm_encoder(**asm_input)

with torch.no_grad():
    text_input = text_tokenizer(prompts, padding=True, truncation=True, return_tensors='pt')
    text_input = text_input.to(device)
    text_embeddings = text_encoder(**text_input)

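# similarity between the assembly embedding and every prompt embedding
logits = torch.einsum("nc,ck->nk", [asm_embedding, text_embeddings.T])
# temperature-scaled softmax turns the similarities into one probability per prompt
preds = torch.softmax(logits / 0.07, dim=1).squeeze(0).tolist()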

print("bubblesort zeroshot:")
for i in range(len(prompts)):
    print(f"Probability: {round(preds[i]*100, 3)}%, Text: {prompts[i]}")
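
For readers who want to see the scoring step in isolation, here is a minimal pure-Python sketch of what the last few lines of the sample compute: a dot product between one assembly embedding and each prompt embedding, divided by the temperature 0.07 and passed through a softmax. The function name `zero_shot_probs` and the toy 2-D vectors are made up for illustration; real embeddings come from the encoders above.

```python
import math

def zero_shot_probs(query, candidates, temperature=0.07):
    """Softmax over temperature-scaled dot products (hypothetical helper)."""
    # one logit per candidate: dot(query, candidate) / temperature
    logits = [sum(q * c for q, c in zip(query, cand)) / temperature
              for cand in candidates]
    # subtract the largest logit before exponentiating for numerical stability
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

# toy vectors standing in for real embeddings (made-up numbers)
asm_vec = [1.0, 0.0]
text_vecs = [[0.9, 0.1], [0.1, 0.9]]
probs = zero_shot_probs(asm_vec, text_vecs)
print(probs)
```

The low temperature sharpens the distribution, so even a modest gap in similarity puts most of the probability on the best-matching prompt.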

joaogomes24

May 14, 2024

I am grateful for your quick answer, which proved to be of major help. I had assumed that the last_hidden_state would be the most effective means of obtaining the most accurate results.

Furthermore, I am currently developing AI tools for correcting assembly code.
I would be grateful if you could spare some time to discuss some ideas with me.
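
On the last_hidden_state question raised at the top of the thread, one defensive pattern is sketched below. It assumes, as the owner suggests, that the CLAP encoders return the embedding tensor directly, while other Transformers models return an output object with a last_hidden_state field. The helper name `pick_embedding` is hypothetical, and taking the first token's hidden state is just one common pooling choice, not something this repository prescribes.

```python
def pick_embedding(output):
    """Return an embedding from an encoder output (hypothetical helper)."""
    # a plain tensor has no last_hidden_state attribute, so getattr yields None
    hidden = getattr(output, "last_hidden_state", None)
    if hidden is None:
        # the output is already the embedding, use it as-is
        return output
    # otherwise pool by taking the first token's hidden state (an assumed choice)
    return hidden[:, 0]

# usage with the encoders from the sample code (not executed here):
# asm_embedding = pick_embedding(asm_encoder(**asm_input))
```

Comparing with `is None` rather than relying on truthiness avoids the ambiguity error that multi-element tensors raise in a boolean context.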

hustcw

Owner May 16, 2024

Sure, we can use this channel for discussion, or you can email me :)
