Model init does not correspond to paper's (and pre-trained weights') initialization scheme

#2
by jacobpfau - opened

The Eleuther-trained models seem to rely on the initialization schemes defined here, whereas the HF version does its initialization here; the two are very different. In practice, the Eleuther initialization scheme performs far better on my data. It would be great if the HF version could be updated to match the model's intended initialization scheme.

The significance of this difference can be seen by comparing model performance between a freshly initialized HF model:

from transformers import GPTNeoXConfig, GPTNeoXForCausalLM

config = GPTNeoXConfig.from_pretrained("EleutherAI/pythia-70m")
model = GPTNeoXForCausalLM(config).to("cuda")

and the step-0 checkpoint released by EleutherAI:

model = GPTNeoXForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m",
    revision="step0",
).to("cuda")
