TREC DL 19 Metric Mismatch

#1
by krypticmouse - opened

Hi, I was trying to benchmark this model on the TREC DL19 track; I was mainly reranking a ranking I got from another model. However, the metrics seem low and don't even match the metrics of the original ranking:

NDCG@5: 0.6620280234158401
NDCG@10: 0.6530871918417204

I'm using pytrec_eval to evaluate; what could be the issue?
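
(For context, a rough sketch of how I'm computing the numbers with pytrec_eval; the toy qrels / run dicts below are just placeholders for the DL19 qrels and my reranked run:)

    import pytrec_eval

    # Placeholder qrels/run in pytrec_eval's nested-dict format; in practice they
    # are loaded from the DL19 qrels file and the reranked run.
    qrels = {'q1': {'d1': 3, 'd2': 0}}
    run = {'q1': {'d1': 12.3, 'd2': 7.1}}

    evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'ndcg_cut.5', 'ndcg_cut.10'})
    per_query = evaluator.evaluate(run)

    # Average the per-query scores to get the reported NDCG@5 / NDCG@10.
    ndcg5 = sum(q['ndcg_cut_5'] for q in per_query.values()) / len(per_query)
    ndcg10 = sum(q['ndcg_cut_10'] for q in per_query.values()) / len(per_query)
    print(ndcg5, ndcg10)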

Castorini org

Hi @krypticmouse,
just FYI, I have seen your comments. I am looking into it now and will get back to you ASAP.

Thanks a lot! Attaching the inference code I used. Maxlen is 180, bsize is 1, and the model is castorini/rankllama-v1-7b-lora-passage:

        import torch
        import tqdm
        from transformers import AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf', token="hf_...")
        tokenizer.add_special_tokens({'pad_token': '[PAD]'})
        model = self.get_model(self.model)  # self.model = 'castorini/rankllama-v1-7b-lora-passage'
        
        assert len(qids) == len(pids), (len(qids), len(pids))

        scores = []

        model.eval()
        with torch.inference_mode():
            with torch.cuda.amp.autocast():
                for offset in tqdm.tqdm(range(0, len(qids), self.bsize), disable=(not show_progress)):
                    endpos = offset + self.bsize

                    queries_ = [f'query: {self.queries[qid]}</s>' for qid in qids[offset:endpos]]
                    passages_ = [f'document: {self.collection[pid]}</s>' for pid in pids[offset:endpos]]

                    features = tokenizer(queries_, passages_, padding='longest', truncation=True,
                                return_tensors='pt', max_length=self.maxlen).to(self.device)

                    # one relevance logit per query-passage pair
                    batch_scores = model(**features).logits.view(-1).float()

                    scores.append(batch_scores)

        scores = torch.cat(scores)  # torch.tensor() on a list of tensors breaks once bsize > 1
        scores = scores.tolist()

Hi, @krypticmouse
could you try to change these two lines to:

queries_ = [f'query: {self.queries[qid]}' for qid in qids[offset:endpos]]
passages_ = [f'document: {self.collection[pid]}' for pid in pids[offset:endpos]]

I just noticed that in our implementation we ended up not using the </s> token's representation to compute relevance scores for the cross-encoder reranker, because it caused errors like https://github.com/microsoft/DeepSpeed/issues/4017 when fine-tuning on V100 machines with fp16 (this issue didn't happen with the bi-encoder).

We instead use the last token of the input sequence, i.e. the last token of the document, to compute the score. I will make updates accordingly.
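
Concretely, scoring a single query-passage pair without the trailing </s> then looks roughly like this (a sketch along the lines of the model card example; the get_model helper that merges the LoRA adapter and the sample query/passage strings are illustrative):

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    from peft import PeftConfig, PeftModel

    # Sketch of the LoRA-merging helper, along the lines of the model card.
    def get_model(peft_model_name):
        config = PeftConfig.from_pretrained(peft_model_name)
        base = AutoModelForSequenceClassification.from_pretrained(
            config.base_model_name_or_path, num_labels=1)
        model = PeftModel.from_pretrained(base, peft_model_name)
        return model.merge_and_unload().eval()

    tokenizer = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')
    model = get_model('castorini/rankllama-v1-7b-lora-passage')

    # No trailing </s>: the relevance score is read off the last input token.
    query = 'query: how do rerankers work'
    passage = 'document: A reranker rescores a candidate list produced by a first-stage retriever ...'
    features = tokenizer(query, passage, return_tensors='pt')
    with torch.inference_mode():
        score = model(**features).logits[0][0].item()
    print(score)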

Oh alright! Will try it out, thanks a lot!

Did you use this exact prompt for the benchmarks in the paper, or was it different?

Castorini org

same prompt.

Another potential difference: the MS MARCO passage corpus we use is the 'with title' version:
https://huggingface.co/datasets/Tevatron/msmarco-passage-corpus
https://arxiv.org/pdf/2304.12904.pdf
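
For illustration, the with-title passage texts can be built roughly like this (a sketch assuming the dataset exposes docid, title, and text fields):

    from datasets import load_dataset

    # Sketch: map docid -> "title text" using the with-title corpus
    # (field names 'docid', 'title', 'text' are assumptions).
    corpus = load_dataset('Tevatron/msmarco-passage-corpus', split='train')
    collection = {row['docid']: f"{row['title']} {row['text']}".strip() for row in corpus}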

Oh right, thanks a lot! I went ahead with these changes and they seem to have improved the results. Thanks for the help :)

Anything about how to do batch inference?

I am working on that, will get back in a day.
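
(In the meantime, a rough sketch of what batched scoring can look like, not the official batch-inference code; it reuses the tokenizer and merged model from the sketch above, and batch_queries / batch_passages stand in for parallel lists of query and passage strings:)

    # Pad a batch of (query, passage) pairs together and read one logit per pair.
    # Reuse </s> as the pad token so no new embedding rows are needed, and tell the
    # model which id is padding so it can locate the last real token of each row.
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id

    queries_ = [f'query: {q}' for q in batch_queries]        # placeholder list of query strings
    passages_ = [f'document: {p}' for p in batch_passages]   # placeholder parallel list of passages

    features = tokenizer(queries_, passages_, padding='longest', truncation=True,
                         max_length=180, return_tensors='pt')
    with torch.inference_mode():
        batch_scores = model(**features).logits.view(-1).float().tolist()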

Thanks a ton for the help! Really appreciate it :)

Castorini org

Hi @MrLight, @krypticmouse,
I was able to reproduce the performance, but the running time is a concern. The running time for DL19 was fair, but when scaling to large collections like NQ or FEVER in the BEIR benchmark, the estimated running time is 48+ hours. Any thoughts on this?

Hi @cramraj8, yes, a limitation of LLM-based models is the inference time. You would have to speed up the corpus encoding by using multiple GPUs in parallel; bf16 and FlashAttention-2 should also help with speed.
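
For example, the base model can be loaded in bf16 with FlashAttention-2 enabled, roughly like this (a sketch; it assumes the flash-attn package is installed and an Ampere-or-newer GPU):

    import torch
    from transformers import AutoModelForSequenceClassification

    # Sketch: load the base model in bf16 with FlashAttention-2,
    # then apply and merge the LoRA adapter as before.
    base = AutoModelForSequenceClassification.from_pretrained(
        'meta-llama/Llama-2-7b-hf',
        num_labels=1,
        torch_dtype=torch.bfloat16,
        attn_implementation='flash_attention_2',
    ).to('cuda')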

Ah sorry, are you running reranking or retrieval? How many GPUs are you running with?

Got it. I was doing reranking only, with 1 GPU. Earlier I was using the example code given on the model card page, which reranks one query at a time; that was very slow. But the batch inference codebase was pretty fast.
