From your experience, what would be a good methodology for using a 1048k-context model to filter pre-training data?
#8, opened by TylerRoost
Idea: use a long-context model to select the best document from a set of documents that together fit in its context window, and treat the selected documents as a proxy for high-quality pretraining data.
Secondary idea: have the long-context model order the documents in such a set, and use that ordering as a curriculum for pretraining.
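To make both ideas concrete, here is a rough sketch of what I have in mind. Everything in it is an assumption for illustration: `ask_model` is a hypothetical callable that sends a prompt to the long-context model and returns its text reply, the 4-characters-per-token estimate stands in for the model's real tokenizer, and the prompt wording and the 4,096-token headroom are placeholders.

```python
import re
from typing import Callable, List, Sequence

CONTEXT_BUDGET = 1_048_576  # tokens, taken from the "1048k" in the title
RESERVED = 4_096            # hypothetical headroom for instructions and the reply


def approx_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token.
    # Swap in the model's actual tokenizer for real use.
    return max(1, len(text) // 4)


def pack_into_groups(
    docs: Sequence[str],
    budget: int = CONTEXT_BUDGET - RESERVED,
    count_tokens: Callable[[str], int] = approx_tokens,
) -> List[List[str]]:
    """Greedily pack consecutive documents into groups that fit one context window."""
    groups: List[List[str]] = []
    current: List[str] = []
    used = 0
    for doc in docs:
        n = count_tokens(doc)
        if n > budget:
            continue  # a document that cannot fit on its own is skipped in this sketch
        if current and used + n > budget:
            groups.append(current)
            current, used = [], 0
        current.append(doc)
        used += n
    if current:
        groups.append(current)
    return groups


def build_prompt(group: Sequence[str]) -> str:
    """Number the documents and ask for a ranking (placeholder wording)."""
    parts = [f"[DOC {i}]\n{doc}" for i, doc in enumerate(group)]
    parts.append(
        "Rank the documents above from highest to lowest quality as "
        "pretraining data. Answer with the document numbers only, "
        "comma-separated."
    )
    return "\n\n".join(parts)


def parse_ranking(reply: str, n_docs: int) -> List[int]:
    """Turn the reply into a permutation of range(n_docs).

    Out-of-range and repeated indices are ignored; documents the model
    did not mention are appended in their original order.
    """
    order: List[int] = []
    for match in re.findall(r"\d+", reply):
        i = int(match)
        if i < n_docs and i not in order:
            order.append(i)
    order += [i for i in range(n_docs) if i not in order]
    return order


def rank_group(group: Sequence[str], ask_model: Callable[[str], str]) -> List[str]:
    """Return the group's documents ordered best-first according to the model."""
    order = parse_ranking(ask_model(build_prompt(group)), len(group))
    return [group[i] for i in order]
```

For the first idea you would keep `rank_group(group, ask_model)[0]` from each group; for the second you would keep the whole returned list as the curriculum order for that group. How to merge the orderings across groups is left open here.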
Your thoughts?