Can you consider testing for pretraining on the test data with Min-K% Prob?

#369
by VatsaDev - opened

It would be nice if you used https://github.com/swj0419/detect-pretrain-code to check models for pretraining data: it can detect whether a model was trained on a given text, which would make it much easier to flag models that were trained on the test data. (A quick sketch of the score it computes is below.)
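
For context, here's a minimal sketch of what the Min-K% Prob score from that repo's paper (Shi et al., 2023) computes; the function name and signature are mine for illustration, not the repo's actual API:

import torch
import torch.nn.functional as F

def min_k_prob(model, tokenizer, text, k=0.2):
    # Min-K% Prob: average log-probability of the k% least-likely tokens
    # in `text` under the model. An unusually high (less negative) score
    # suggests the model saw the text during training.
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits  # [1, seq_len, vocab_size]
    # Log-prob the model assigned to each actual next token
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    # Mean over the k% lowest-probability tokens
    n = max(1, int(k * token_log_probs.numel()))
    return torch.topk(token_log_probs, n, largest=False).values.mean().item()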

Open LLM Leaderboard org

Thanks for the resource! We will look into it. Data contamination is indeed an issue when ranking models on a known set of benchmarks.

That said, I'm wondering whether this can be used with every dataset? If so, would you be interested in running this tool on well-known models from the leaderboard, for example Falcon, Llama, or Mistral with the GSM8K dataset?

While I'm not an author of the paper, the repo's main page shows it used on a variety of book content, and no limitations are listed, so I would say you could run it on the benchmark datasets.

For running models unquantized, I could handle anything that fits in 14 GB of VRAM, so probably under 7B parameters. Perhaps you could run this on the older benchmarks first, since those are more likely to have been trained on.

@VatsaDev I recommend testing Min-K% Prob with MMLU on Trurl-7B, Llama 2 7B, and a trusted finetune such as Orca 2.

Trurl-7B was trained on the MMLU test set per https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/202
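
To make that comparison concrete, here's how the runs could look using the min_k_prob sketch from the opening post (the model ids are my best guesses and may be gated or renamed):

from transformers import AutoModelForCausalLM, AutoTokenizer

text = "..."  # an MMLU test question, formatted exactly as in the benchmark
for name in ["Voicelab/trurl-2-7b", "meta-llama/Llama-2-7b-hf"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")
    print(name, min_k_prob(model, tokenizer, text))

A markedly higher score for Trurl-7B than for the Llama 2 baseline across many test items would support the contamination report.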

Min-K% Prob seems pretty easy to integrate into the Eleuther harness by changing the HF loglikelihood function to return all logits, then calculating Min-K% in the task (see the task-side sketch after the diff).

In lm_eval/models/huggingface.py

-                target_logits = torch.gather(
-                    log_softmax, 1, target_tokens.unsqueeze(-1)
-                ).squeeze(-1)
-                answer = (float(target_logits.sum()), bool(max_equal))
-                results.append(answer)
+                # `top_k` would be a new argument threaded through this method
+                if top_k is not None:
+                    # Extract the top-k log-probs and their token indices
+                    top_values, top_indices = torch.topk(log_softmax, top_k, dim=1)
+                    top_k_results = []
+                    for values, indices in zip(top_values, top_indices):
+                        # Decode each token id with the wrapper's HF tokenizer
+                        token_strings = [
+                            self.tokenizer.decode([int(idx)]) for idx in indices
+                        ]
+                        top_k_results.extend(zip(values.tolist(), token_strings))
+                    results.append(top_k_results)
+                else:
+                    # Unchanged default path: summed target log-prob + greedy match
+                    target_logits = torch.gather(
+                        log_softmax, 1, target_tokens.unsqueeze(-1)
+                    ).squeeze(-1)
+                    answer = (float(target_logits.sum()), bool(max_equal))
+                    results.append(answer)
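
Since Min-K% Prob only needs the log-probability of each target token, an even simpler variant of this patch could return target_logits.tolist() instead of just the sum; the task-side statistic is then a few lines (a sketch, with an illustrative name):

def min_k_score(token_log_probs, k=0.2):
    # Min-K% Prob statistic: mean of the k% lowest per-token log-probs
    n = max(1, int(len(token_log_probs) * k))
    return sum(sorted(token_log_probs)[:n]) / n

A task could compute this per document and threshold it, or compare it across models, to flag likely contamination.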
Open LLM Leaderboard org

Hi!
Most of the discussion about contamination techniques has been moved to this conversation; I'm closing this one to keep the issues readable. Feel free to port your comments over there!

clefourrier changed discussion status to closed
