From your experience, what would be a good methodology for using a 1048k-context model to filter pre-training data?

#8
by TylerRoost - opened

Idea: pack a set of documents into the model's long context window and have it select the best document in the set, using that selection as a proxy for high-quality pretraining data.
Secondary idea: have the model order the documents in such a set, and use that ordering as a curriculum for high-quality pretraining data.
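
To make both ideas concrete, here is a rough sketch of how I imagine it, not a tested pipeline. Everything named here is an assumption for illustration: `approx_tokens` is a crude whitespace count standing in for the model's real tokenizer, `ask_model` is a hypothetical callable that sends a prompt to the long-context model and returns its text reply, and the prompt wording and the comma-separated reply format are made up.

```python
from typing import Callable, List, Sequence

# Assumed budget; in practice leave headroom for the instructions and the reply.
CONTEXT_BUDGET = 1_048_000


def approx_tokens(text: str) -> int:
    """Crude token estimate (whitespace split); swap in the real tokenizer."""
    return len(text.split())


def pack_batches(docs: Sequence[str], budget: int = CONTEXT_BUDGET) -> List[List[int]]:
    """Greedily group document indices so each group fits within the budget."""
    batches: List[List[int]] = []
    current: List[int] = []
    used = 0
    for i, doc in enumerate(docs):
        n = approx_tokens(doc)
        if n > budget:
            continue  # a document that cannot fit on its own is skipped
        if current and used + n > budget:
            batches.append(current)
            current, used = [], 0
        current.append(i)
        used += n
    if current:
        batches.append(current)
    return batches


def build_prompt(docs: Sequence[str], batch: Sequence[int]) -> str:
    """Hypothetical prompt: number each document, ask for a best-first ordering."""
    parts = [f"[{i}]\n{docs[i]}" for i in batch]
    question = "List the document numbers from best to worst, comma-separated."
    return "\n\n".join(parts) + "\n\n" + question


def rank_batch(
    docs: Sequence[str],
    batch: Sequence[int],
    ask_model: Callable[[str], str],
) -> List[int]:
    """Return the batch indices ordered best-first according to the model reply."""
    reply = ask_model(build_prompt(docs, batch))
    order: List[int] = []
    for tok in reply.replace(",", " ").split():
        if tok.isdecimal() and int(tok) in batch and int(tok) not in order:
            order.append(int(tok))
    # Keep anything the model left out, in original order, so nothing is dropped.
    order += [i for i in batch if i not in order]
    return order
```

The first entry of each ranking would be the "best document" for the first idea, and the full ordering would be the curriculum for the second. Note that rankings are only comparable within a batch, so some way of comparing across batches would still be needed.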

Your thoughts?
