Update README.md
README.md
@@ -61,7 +61,25 @@ as chosen answers and [Sauerkraut-7b-HerO](https://huggingface.co/VAGOsolutions/
We found that simply translating the training data can lead to unnatural German phrasings.
Data augmentation techniques were used to ensure grammatical and syntactic correctness and a more natural German wording in our training data.

### Data Contamination Test Results
Some models on the HuggingFace leaderboard were affected by benchmark data leaking into their training data.
We checked our SauerkrautLM-DPO dataset for this problem with a dedicated detection tool [1], run on a smaller model.
The HuggingFace team used the same method [2, 3].

Our results, with `result < 0.1, %:` well below 0.9, indicate that our dataset is free from contamination.

*The data contamination test results for HellaSwag and Winogrande will be added once [1] supports them.*

| Dataset | ARC | MMLU | TruthfulQA | GSM8K |
|------------------------------|-------|-------|-------|-------|
| **SauerkrautLM-DPO** | result < 0.1, %: 0.0 | result < 0.1, %: 0.09 | result < 0.1, %: 0.13 | result < 0.1, %: 0.16 |

[1] https://github.com/swj0419/detect-pretrain-code-contamination

[2] https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/474#657f2245365456e362412a06

[3] https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/265#657b6debf81f6b44b8966230
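
The decision rule described above can be sketched in a few lines. This is only an illustration, not the actual pipeline of [1]: the function names are hypothetical, the per-example scores are made up, and the thresholds (0.1 per example, 0.9 overall) are taken from the text's interpretation of the `result < 0.1, %` statistic.

```python
def fraction_below(scores, score_threshold=0.1):
    """Fraction of per-example detection scores below `score_threshold`.

    This mirrors the `result < 0.1, %` statistic in the table above:
    `scores` would come from a membership-inference style test such as [1],
    where lower scores suggest an example was not seen during training.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(s < score_threshold for s in scores) / len(scores)


def likely_contaminated(scores, flag_threshold=0.9):
    # Rule of thumb from the text: only a fraction at or above ~0.9
    # would suggest the benchmark leaked into the training data.
    return fraction_below(scores) >= flag_threshold


# Hypothetical scores for four benchmark examples:
print(fraction_below([0.05, 0.2, 0.5, 0.05]))  # 0.5
print(likely_contaminated([0.2, 0.3, 0.4]))    # False
```

Under this reading, the reported values (0.0 to 0.16) sit far from the 0.9 flag threshold, which is what the section summarizes as "free from contamination".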
### Prompt Template:
```