Why does falcon-7b have 71 attention heads?

#100
by alpindale - opened

This makes tensor parallelism impossible in practice: sharding attention heads across GPUs requires the head count to be evenly divisible by the tensor-parallel degree, and since 71 is prime it can't be split evenly across any number of GPUs greater than one. This design choice doesn't make sense to me.
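For context, here's a minimal sketch of the divisibility check a tensor-parallel implementation performs before sharding attention heads across devices. The `shard_heads` helper is hypothetical, not code from Falcon or any particular framework, but frameworks like Megatron-LM assert the same condition:

```python
NUM_HEADS = 71  # falcon-7b's attention head count (prime)

def shard_heads(num_heads: int, tp_degree: int) -> int:
    """Return heads per GPU, mirroring the divisibility check a
    tensor-parallel framework does before splitting Q/K/V projections."""
    if num_heads % tp_degree != 0:
        raise ValueError(
            f"{num_heads} heads cannot be split evenly across "
            f"{tp_degree} GPUs (remainder {num_heads % tp_degree})"
        )
    return num_heads // tp_degree

# Because 71 is prime, every common tensor-parallel degree fails:
for tp in (2, 4, 8):
    try:
        shard_heads(NUM_HEADS, tp)
    except ValueError as e:
        print(e)
```

The only divisors of 71 are 1 and 71, so the heads can't be distributed evenly over 2, 4, or 8 GPUs, the configurations people typically use.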

I'd be interested to know as well.
