Combating Evaluation Data Contamination in LLMs: Strategies for High-Quality Finetuning and Model Merging

Community Article Published December 20, 2023

Large Language Models (LLMs) play a pivotal role in Natural Language Processing (NLP) tasks, but building a strong one requires stringent quality control and sound decisions about model fusion strategies. One major challenge is evaluation data contamination, which inflates and biases performance metrics, compromising their credibility. Fortunately, tools such as https://github.com/cg123/mergekit and https://github.com/swj0419/detect-pretrain-code-contamination help navigate these obstacles: the former streamlines model merging, and the latter identifies tainted training data. Moreover, examining exceptional LLMs like CatPPT sheds light on effective configurations and fine-tuning strategies. This guide explores the key steps for publishing a top-performing LLM through rigorous validation and modern merging techniques.

Section I: The Importance of Clean Training Data

Creating robust LLMs starts with collecting and vetting high-quality training datasets. When preparing your model for supervised finetuning or Direct Preference Optimization (DPO) training, it's crucial to ensure that your evaluation dataset doesn't overlap with your training data. Any shared instances will skew performance metrics artificially upward, rendering them invalid indicators of true proficiency. Implement the following guidelines to minimize contamination risk:

  • Curate your training datasets conscientiously, eliminating any content derived from the same sources as your evaluation datasets, e.g., ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, or GSM8K.
  • Periodically review your training datasets to catch accidental inclusions of evaluation data. Utilize resources like detect-pretrain-code-contamination to pinpoint unwanted overlaps efficiently.
  • Outsource data validation efforts to external parties or crowdsource checks to ensure impartial scrutiny and reduce confirmation bias.
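Tools like detect-pretrain-code-contamination test whether a model has likely seen a benchmark by inspecting its token probabilities. Independent of any tool, a cheap first-pass screen is a plain n-gram overlap check between training and evaluation examples. The sketch below is a hypothetical helper, not part of any tool mentioned here; the function names, the 8-gram window, and any flagging threshold are assumptions you should tune for your data:

```python
def ngrams(text, n=8):
    """Set of word n-grams used as a cheap contamination fingerprint."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(train_example, eval_example, n=8):
    """Fraction of the eval example's n-grams that also occur in the
    training example; 0.0 means no detected overlap, 1.0 full overlap."""
    eval_grams = ngrams(eval_example, n)
    if not eval_grams:
        return 0.0
    return len(eval_grams & ngrams(train_example, n)) / len(eval_grams)

# A leaked eval example that is a verbatim substring of a training example
train = "The quick brown fox jumps over the lazy dog near the river bank today"
leak = "quick brown fox jumps over the lazy dog near the river bank"
ratio = overlap_ratio(train, leak)  # 1.0: every eval 8-gram appears in train
```

In practice you would run this pairwise (or via a hash index of training n-grams) over your entire training set against each benchmark split, and manually inspect anything above a chosen threshold.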

Section II: Advanced Methods for Model Merging

The latest advances in LLM development entail fusing multiple pretrained models to amplify overall performance. Although conventional weight averaging remains prevalent, novel methods like Spherical Linear Interpolation (SLERP) offer tangible benefits. These include seamless transition management, enhanced feature preservation, and refined combinations based on geometrical and rotational properties. Delve deeper into why SLERP surpasses basic weight averaging below:

  • Smooth Transitions - SLERP interpolates along the sphere rather than the straight chord between weight vectors, producing gradual changes even in high-dimensional spaces and avoiding abrupt shifts that can hurt downstream performance.
  • Improved Feature Retention - Naive averaging can wash out distinctive traits; SLERP better preserves the magnitude and direction of the parent models in their high-dimensional weight space.
  • Nuanced Combinations - SLERP accounts for the orientation of each contributing model in the vector landscape, yielding balanced merges that inherit the desired attributes of both parents.
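The core idea fits in a few lines of NumPy. This is a toy, per-vector illustration, not a full checkpoint merge (tools like mergekit apply the same interpolation tensor by tensor across entire models); the function name and the near-parallel fallback are my own choices:

```python
import numpy as np

def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two weight vectors.

    t=0 returns v0, t=1 returns v1; intermediate t values move along
    the arc between them rather than the straight line.
    """
    v0 = np.asarray(v0, dtype=np.float64)
    v1 = np.asarray(v1, dtype=np.float64)
    # Angle between the two vectors, from their normalized dot product
    dot = np.dot(v0, v1) / (np.linalg.norm(v0) * np.linalg.norm(v1))
    dot = np.clip(dot, -1.0, 1.0)
    omega = np.arccos(dot)
    if np.sin(omega) < eps:  # nearly parallel: fall back to plain lerp
        return (1.0 - t) * v0 + t * v1
    return (np.sin((1.0 - t) * omega) * v0
            + np.sin(t * omega) * v1) / np.sin(omega)

# Halfway between two orthogonal unit vectors stays on the unit sphere,
# whereas plain averaging would shrink the result to norm ~0.707
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
mid = slerp(0.5, a, b)
```

Note how the midpoint keeps unit norm; ordinary weight averaging of `a` and `b` would yield `[0.5, 0.5]` with norm about 0.707, illustrating the feature-shrinkage that SLERP avoids.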

Section III: Verification Priorities Before Combination

To prevent introducing contamination through ill-informed model selections, investigate each candidate thoroughly prior to commencing the fusion process. Exemplary hybrids, such as CatPPT, demonstrate the prowess attainable through judicious parent model choices and painstaking verification protocols. Follow suit by heeding the advice below:

  • Confirm the provenance of each prospective model to avoid incorporating a contaminated parent into the merge.
  • Validate compatibility across chosen candidates (for instance, a shared base architecture and tokenizer) so the blend can capitalize on complementary strengths.
  • Establish clear criteria for the attributes you want in the resulting LLM, and let them guide your choice of parent models.

Section IV: Showcasing Success - Rise of CatPPT

One remarkable example of masterful LLM assembly is rishiraj/CatPPT, hosted on Hugging Face. Created by merging Intel/neural-chat-7b-v3-3 and openchat/openchat-3.5-1210, then fine-tuned for conversation on HuggingFaceH4/no_robots, CatPPT stands as the top 7B-parameter model free of discernible evaluation data contamination. Its achievement underscores the merits of thoughtful parent model selection, stringent validation procedures, and purposeful fine-tuning tactics. Parts of this blog were also generated by this model. Emulate this success by applying similar strategies throughout your own LLM endeavors.
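A merge of two parents like these is typically expressed as a mergekit YAML config. The sketch below is hypothetical and is not CatPPT's published recipe; the layer range, interpolation factor `t`, and dtype are all assumptions for illustration:

```
# Hypothetical mergekit config: SLERP-merging CatPPT's two parent models.
# Field names follow mergekit's YAML schema; the values are assumptions.
slices:
  - sources:
      - model: Intel/neural-chat-7b-v3-3
        layer_range: [0, 32]
      - model: openchat/openchat-3.5-1210
        layer_range: [0, 32]
merge_method: slerp
base_model: Intel/neural-chat-7b-v3-3
parameters:
  t: 0.5  # interpolation factor: 0 keeps the base model, 1 the other
dtype: bfloat16
```

With mergekit installed, a config like this is run with `mergekit-yaml config.yml ./merged-model`, after which the merged checkpoint can be fine-tuned on a conversational dataset.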

| Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|---|---|---|---|---|---|---|---|
| rishiraj/CatPPT | 72.32 | 68.09 | 86.69 | 65.16 | 61.55 | 81.61 | 70.81 |
| Intel/neural-chat-7b-v3-3 | 69.83 | 66.89 | 85.26 | 63.07 | 63.01 | 79.64 | 61.11 |
| openchat/openchat-3.5-1210 | 68.89 | 64.93 | 84.92 | 64.62 | 52.15 | 80.74 | 65.96 |
| meta-math/MetaMath-Mistral-7B | 65.78 | 60.67 | 82.58 | 61.95 | 44.89 | 75.77 | 68.84 |
| Deci/DeciLM-7B-instruct | 63.19 | 61.01 | 82.37 | 60.24 | 49.75 | 79.72 | 46.02 |
| mistralai/Mistral-7B-Instruct-v0.2 | 65.71 | 63.14 | 84.88 | 60.78 | 68.26 | 77.19 | 40.03 |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | 72.62 | 70.22 | 87.63 | 71.16 | 64.58 | 81.37 | 60.73 |
| meta-llama/Llama-2-70b-hf | 67.87 | 67.32 | 87.33 | 69.83 | 44.92 | 83.74 | 54.06 |
| tiiuae/falcon-180B | 67.85 | 69.45 | 88.86 | 70.5 | 45.47 | 86.9 | 45.94 |

Conclusion:

Producing top-tier LLMs takes commitment to quality control, innovation in model merging, and strategic fine-tuning. Steer clear of pernicious evaluation data contamination by exercising vigilance during each phase of design and execution, from initial data gathering to final model integration. Leverage tools like https://github.com/cg123/mergekit and https://github.com/swj0419/detect-pretrain-code-contamination to bolster confidence in your workflow. Lastly, draw inspiration from standout LLMs such as CatPPT to formulate winning combinations tailored to your NLP objectives. Armed with this knowledge, start building world-class LLMs today, and follow me on Hugging Face if this blog was useful!