Why does throughput increase with longer context window?
Hello there,
Thanks for releasing the Jamba model! While reading the paper, I had a question about Figure 3(b): why does the throughput of all models increase with a longer context window? I can't make sense of it and would appreciate it if you could explain. Thank you.
Hi @jingyu-q,
As explained in the paper, the throughput in the graph is measured in tokens/second, so it's on a per-token basis. Generation time increases with longer context, but normalized to a per-token basis it's faster, and much faster for Jamba as the context window gets big.
Hi @roicohennn,
If it is on a per-token basis, and the setting fixes the output at 512 tokens, and generation time increases with longer context, shouldn't throughput (= 512 tokens / time_taken) be smaller with longer context?
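To make my confusion concrete, here is a tiny sketch of that arithmetic with made-up timings (nothing below is measured from Jamba):

```python
# A tiny sketch of the arithmetic in my question; the timings are invented,
# nothing here is measured from Jamba.

output_tokens = 512  # the setting fixes the output length at 512 tokens

# context length -> invented end-to-end generation time (seconds), growing
# with the context window
hypothetical_times = {1_000: 10.0, 32_000: 14.0, 128_000: 25.0}

for context_len, seconds in hypothetical_times.items():
    throughput = output_tokens / seconds  # 512 tokens / time_taken
    print(f"context={context_len:>7}  time={seconds:5.1f}s  throughput={throughput:6.1f} t/s")

# Throughput defined this way can only shrink as the time grows, which is why
# the rising curves in Figure 3(b) confuse me.
```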
The only thing I can imagine, looking at the graph, is that it's output tokens per second times the number of input tokens divided by some constant. I suggest the graph be labeled more clearly as to what throughput (t/s) actually means.
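As a purely speculative sketch of that guess (the constant `C` and the timings below are invented, not taken from the paper), a metric of that form would indeed rise with the context window while plain output tokens per second falls:

```python
# Speculative sketch of the formula I'm guessing at; C and the timings are
# invented, nothing is measured from Jamba.

output_tokens = 512
C = 1_000  # hypothetical normalizing constant

# context length -> invented end-to-end generation time (seconds)
hypothetical_times = {1_000: 10.0, 32_000: 14.0, 128_000: 25.0}

for context_len, seconds in hypothetical_times.items():
    output_only = output_tokens / seconds                  # plain output tokens/second
    guessed = (output_tokens / seconds) * context_len / C  # my guess for the plotted metric
    print(f"context={context_len:>7}  output-only={output_only:6.1f} t/s  "
          f"guessed metric={guessed:8.1f}")

# The guessed metric rises with context length while plain output tokens per
# second falls, so something of this shape would at least match the curves,
# but only a clearer axis label would tell us what the figure actually plots.
```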