Remove input id tokens

#87
by satsat - opened

Hi,

Scenario: I have 5 files, each containing code. I am trying to evaluate each file and get recommendations from the StarCoder model.

Challenge: I can process each file on its own and get recommendations. But when I run all the files in a single loop, after the first file is encoded and decoded, the input_ids for the second file still include the tokens of the previous file. How do I remove the previous file's input_ids tokens?

for query in queries:  # queries: one prompt string per file (variable name assumed)
    input_ids: torch.Tensor = self.tokenizer.encode(query, max_length=7000, return_tensors='pt', truncation=True).to(self.device)
    print(len(input_ids[0]))

For example:
1st file: length of input_ids is 1111
2nd file (2nd iteration): length of input_ids is 3018 (but it should be 1907)

Please help with a solution for this. Thanks.
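
One thing worth checking: 3018 = 1111 + 1907, so the second length is exactly the sum of both files. tokenizer.encode keeps no state between calls, which suggests the `query` string itself may still contain the first file's text when the second file is encoded. The sketch below only illustrates that suspected pattern. It is not the actual code from this post: `toy_encode` is a hypothetical stand-in for `self.tokenizer.encode` (one id per whitespace-separated token), and the function and variable names are made up for the example.

```python
from typing import List


def toy_encode(text: str) -> List[int]:
    # Hypothetical stand-in for self.tokenizer.encode(...):
    # produces one id per whitespace-separated token.
    return [len(word) for word in text.split()]


def lengths_accumulating(files: List[str]) -> List[int]:
    # Suspected bug pattern: `query` is created once outside the loop
    # and appended to, so each iteration re-encodes all earlier files.
    query = ""
    lengths = []
    for code in files:
        query += code + "\n"
        lengths.append(len(toy_encode(query)))
    return lengths


def lengths_fresh(files: List[str]) -> List[int]:
    # Suggested pattern: rebuild `query` from scratch for every file,
    # so only the current file's text is encoded.
    lengths = []
    for code in files:
        query = code
        lengths.append(len(toy_encode(query)))
    return lengths


files = ["a b c", "d e"]
print(lengths_accumulating(files))  # [3, 5]  (second length = 3 + 2)
print(lengths_fresh(files))         # [3, 2]
```

If the real loop builds the prompt with `+=`, appends to a shared list, or stores it on `self`, resetting that variable at the top of each iteration should give the expected per-file length.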
