
GPT2A-Pile-Test-285M

Use

Requires: transformers, einops

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "crumb/gpt2a-pile-test-285m"

# the repo ships a custom architecture, so remote code has to be trusted
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = "In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)

outputs = model.generate(input_ids, max_new_tokens=128, temperature=0.7, do_sample=True, top_p=0.95, repetition_penalty=1.1)
print(tokenizer.batch_decode(outputs)[0])
"""
<s> In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. It turns out that only a few people believed their father’s name had been made public by their mother and her father.

As the study finds, scientists discovered the secretive organic matter in the Andes mountains: “In a forest surrounded by a stream of lakes, an unseen rock formed with a series of tuberous orbits.”

They found that the mysterious bodies were buried in some parts of the mountain, known as the Andes mountain. The researchers then searched for the body, which they identified with the Butterfly.

The discovery of the body is the result of
"""

(it's... a little undertrained! that's okay!)

Parameter count

| param calculation | params |
|---|---|
| model | 809,579,521 |
| model - model.transformer.wte | 539,045,889 |
| model - model.transformer.wte[0] (llama2-70b embeddings without projection) | 547,435,521 |
| model - model.transformer.wte - model.lm_head | 268,505,089 |
| model - model.transformer.wte[0] - model.lm_head[1] (minus all params taken from llama2-70b) | 285,291,521 |
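
The counts above can be reproduced by summing p.numel() over the relevant modules. The sketch below assumes the module layout implied by the table (model.transformer.wte holding the frozen llama-2 embedding at index 0 plus a projection, model.lm_head holding a projection plus the frozen llama-2 head at index 1); treat it as illustrative rather than as the repo's own script.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "crumb/gpt2a-pile-test-285m", trust_remote_code=True
)

def n_params(module):
    # total parameter count of a module
    return sum(p.numel() for p in module.parameters())

total = n_params(model)
wte = n_params(model.transformer.wte)           # llama-2 embedding + projection
wte_llama = n_params(model.transformer.wte[0])  # llama-2-70b embedding alone (assumed index)
head = n_params(model.lm_head)                  # projection + llama-2 head
head_llama = n_params(model.lm_head[1])         # llama-2-70b head alone (assumed index)

print(f"model: {total:,}")
print(f"model - wte: {total - wte:,}")
print(f"model - wte[0]: {total - wte_llama:,}")
print(f"model - wte - lm_head: {total - wte - head:,}")
print(f"model - wte[0] - lm_head[1]: {total - wte_llama - head_llama:,}")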

Details

This model uses a custom architecture built for fast and efficient pretraining. Despite being grossly undertrained (~1 billion tokens from the Pile), it achieves an estimated Pile test loss of 2.484 and an estimated Pile BPB of 1.313 (compare to GPT-2-1.5B at 1.2253). Training took only 12 hours on a 2xA6000 machine, and longer runs are planned.

To aid efficiency, the token embedding and language modelling head are taken from llama-2-70b and adapted with linear projections into the model's embedding space (from 8,192 down to 1,024 dimensions). I expect leveraging pretrained embeddings in this way to become more prevalent as more people learn about the relevant modules in transformers, though it has long been a practice in some circles of hobbyists. The embedding and language modelling head are also commonly the two trainable modules with the most parameters (e.g. the embedding weight of llama-2-70b is a 32,000 x 8,192 tensor), so freezing them yields further speed and memory gains.
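
For intuition, here is a minimal, self-contained sketch of the frozen-embedding-plus-projection idea described above. The dimensions (32,000 x 8,192 for llama-2-70b, 1,024 for this model) come from the card; the module and variable names are made up for illustration and are not the repo's actual implementation.

import torch
import torch.nn as nn

VOCAB, D_LLAMA, D_MODEL = 32_000, 8_192, 1_024

# stand-ins for weights copied from llama-2-70b; in the real setup these are
# loaded from the pretrained checkpoint and kept frozen
llama_embed = nn.Embedding(VOCAB, D_LLAMA)
llama_head = nn.Linear(D_LLAMA, VOCAB, bias=False)
for p in list(llama_embed.parameters()) + list(llama_head.parameters()):
    p.requires_grad = False  # frozen: no gradients, no optimizer state

# small trainable linear projections bridge the 8,192-dim llama space and the
# 1,024-dim hidden space the rest of this model uses
wte = nn.Sequential(llama_embed, nn.Linear(D_LLAMA, D_MODEL))     # tokens -> 1,024-d
lm_head = nn.Sequential(nn.Linear(D_MODEL, D_LLAMA), llama_head)  # 1,024-d -> logits

tokens = torch.randint(0, VOCAB, (1, 8))
hidden = wte(tokens)      # (1, 8, 1024), what the transformer blocks would consume
logits = lm_head(hidden)  # (1, 8, 32000)
print(hidden.shape, logits.shape)

Only the two small projection matrices add trainable parameters; the large 32,000 x 8,192 embedding and head stay frozen, which is where the memory and speed savings come from.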

This model is a proof of concept to show that the methods behind it can work; it is meant for research, not for production environments. It may also produce harmful, offensive, and untrue content, reflecting the biases of the training data.
