Question about DFlash design choices: 8 layers and block size 8

#3
by K-Compression - opened

I noticed that most DFlash models use 5 layers and a block size of 16, while the GPT-OSS DFlash model uses 8 layers and a block size of 8. Was there a model-specific reason for this choice? In particular, did the relatively small hidden size or intermediate size of GPT-OSS make larger block sizes less effective, leading you to use more DFlash layers and smaller blocks instead? I'd also be curious to know whether you evaluated alternative configurations, and which trade-offs (e.g., quality, acceptance rate, or efficiency) influenced the final design.
