Can you consider testing for pretraining on the test data with Min-K% Prob?

#369
by VatsaDev - opened

It would be nice if you used https://github.com/swj0419/detect-pretrain-code to check models for pretraining data: it can detect whether a model was trained on a given text, which would make it much easier to flag models that were trained on the test data. (A quick sketch of the score it computes is below.)
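
For context, here's a minimal sketch of what the Min-K% Prob score from that repo's paper (Shi et al., 2023) computes; the function name and signature are mine for illustration, not the repo's actual API:

import torch
import torch.nn.functional as F

def min_k_prob(model, tokenizer, text, k=0.2):
    # Min-K% Prob: average log-probability of the k% least-likely tokens
    # in `text` under the model. An unusually high (less negative) score
    # suggests the model saw the text during training.
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits  # [1, seq_len, vocab_size]
    # Log-prob the model assigned to each actual next token
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    # Mean over the k% lowest-probability tokens
    n = max(1, int(k * token_log_probs.numel()))
    return torch.topk(token_log_probs, n, largest=False).values.mean().item()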

Open LLM Leaderboard org

Thanks for the resource! We will look into it. Data contamination is indeed an issue when ranking models on a known set of benchmarks.

That said, I'm wondering whether this can be used with every dataset? If so, would you be interested in running this tool on well-known models from the leaderboard, for example Falcon, Llama, or Mistral with the GSM8K dataset?

While I'm not an author of the paper, the repo's main page shows it used on a variety of book content, and no limitations are listed, so I would say you could run it on the benchmark datasets.

For running models unquantized, I could handle anything that fits in 14 GB of VRAM, so probably under 7B parameters. Perhaps you could run this on the older benchmarks first, since those are more likely to have been trained on.

@VatsaDev I recommend testing Min-K% Prob with MMLU on Trurl-7B, Llama 2 7B, and a trusted finetune such as Orca 2.

Trurl-7B was trained on the MMLU test set per https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/202
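
To make that comparison concrete, here's how the runs could look using the min_k_prob sketch from the opening post (the model ids are my best guesses and may be gated or renamed):

from transformers import AutoModelForCausalLM, AutoTokenizer

text = "..."  # an MMLU test question, formatted exactly as in the benchmark
for name in ["Voicelab/trurl-2-7b", "meta-llama/Llama-2-7b-hf"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")
    print(name, min_k_prob(model, tokenizer, text))

A markedly higher score for Trurl-7B than for the Llama 2 baseline across many test items would support the contamination report.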

Min-K% Prob seems pretty easy to integrate into the Eleuther harness by changing the HF loglikelihood function to return all logits, then calculating Min-K% in the task (see the task-side sketch after the diff).

In lm_eval/models/huggingface.py

-                target_logits = torch.gather(
-                    log_softmax, 1, target_tokens.unsqueeze(-1)
-                ).squeeze(-1)
-                answer = (float(target_logits.sum()), bool(max_equal))
-                results.append(answer)
+                # `top_k` would be a new argument threaded through this method
+                if top_k is not None:
+                    # Extract the top-k log-probs and their token indices
+                    top_values, top_indices = torch.topk(log_softmax, top_k, dim=1)
+                    top_k_results = []
+                    for values, indices in zip(top_values, top_indices):
+                        # Decode each token id with the wrapper's HF tokenizer
+                        token_strings = [
+                            self.tokenizer.decode([int(idx)]) for idx in indices
+                        ]
+                        top_k_results.extend(zip(values.tolist(), token_strings))
+                    results.append(top_k_results)
+                else:
+                    # Unchanged default path: summed target log-prob + greedy match
+                    target_logits = torch.gather(
+                        log_softmax, 1, target_tokens.unsqueeze(-1)
+                    ).squeeze(-1)
+                    answer = (float(target_logits.sum()), bool(max_equal))
+                    results.append(answer)
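
Since Min-K% Prob only needs the log-probability of each target token, an even simpler variant of this patch could return target_logits.tolist() instead of just the sum; the task-side statistic is then a few lines (a sketch, with an illustrative name):

def min_k_score(token_log_probs, k=0.2):
    # Min-K% Prob statistic: mean of the k% lowest per-token log-probs
    n = max(1, int(len(token_log_probs) * k))
    return sum(sorted(token_log_probs)[:n]) / n

A task could compute this per document and threshold it, or compare it across models, to flag likely contamination.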
Open LLM Leaderboard org

Hi!
Most of the discussion about contamination techniques has been moved to this conversation; I'm closing this one to keep the issues readable. Feel free to port your comments over there!

clefourrier changed discussion status to closed
