GPT2A-Pile-Test-285M

Use

requires: transformers, einops, accelerate (needed for device_map="auto")

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "crumb/gpt2a-pile-test-285m"
# trust_remote_code is needed because the repo ships a custom architecture
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# move the prompt to whichever device the model was placed on
input_ids = tokenizer("In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.", return_tensors="pt").input_ids.to(model.device)
outputs = model.generate(input_ids, max_new_tokens=128, temperature=0.7, do_sample=True, top_p=0.95, repetition_penalty=1.1)
print(tokenizer.batch_decode(outputs)[0])
"""
<s> In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. It turns out that only a few people believed their father’s name had been made public by their mother and her father.

As the study finds, scientists discovered the secretive organic matter in the Andes mountains: “In a forest surrounded by a stream of lakes, an unseen rock formed with a series of tuberous orbits.”

They found that the mysterious bodies were buried in some parts of the mountain, known as the Andes mountain. The researchers then searched for the body, which they identified with the Butterfly.

The discovery of the body is the result of
"""

(it's... a little undertrained! that's okay!)

Parameter count

| param calculation | params |
|---|---|
| model | 809,579,521 |
| model - model.transformer.wte | 539,045,889 |
| model - model.transformer.wte[0] (llama2-70b embeddings without projection) | 547,435,521 |
| model - model.transformer.wte - model.lm_head | 268,505,089 |
| model - model.transformer.wte[0] - model.lm_head[1] (minus all params taken from llama2-70b) | 285,291,521 |
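The counts above can be reproduced with plain arithmetic. The sketch below is a rough check, not the author's script: it takes the vocabulary size and dimensions from the Details section (32,000 tokens, 8,192 and 1,024 dimensions) and assumes that each projection is a single linear layer with a bias and that the llama-2-70b head has no bias. Those layer shapes are an inference from the numbers lining up, not something the card states.

```python
# Sizes stated in the model card.
VOCAB = 32_000
LLAMA_DIM = 8_192
MODEL_DIM = 1_024
TOTAL = 809_579_521  # "model" row

# Parameters taken from llama-2-70b.
llama_embedding = VOCAB * LLAMA_DIM  # wte[0]
llama_head = VOCAB * LLAMA_DIM       # lm_head[1], assumed to have no bias

# Projections (assumption: one linear layer each, with a bias term).
embed_proj = LLAMA_DIM * MODEL_DIM + MODEL_DIM  # 8,192 -> 1,024
head_proj = MODEL_DIM * LLAMA_DIM + LLAMA_DIM   # 1,024 -> 8,192

wte = llama_embedding + embed_proj
lm_head = head_proj + llama_head

rows = {
    "model - wte": TOTAL - wte,
    "model - wte[0]": TOTAL - llama_embedding,
    "model - wte - lm_head": TOTAL - wte - lm_head,
    "model - wte[0] - lm_head[1]": TOTAL - llama_embedding - llama_head,
}

for name, count in rows.items():
    print(f"{name:30s} {count:,}")
```

Under these assumptions every row matches the table, and the last row is where the "285M" in the model name comes from.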

Details

This model uses a custom architecture built for fast and efficient pretraining. Despite being grossly undertrained (~1 billion tokens from the Pile), it achieves an estimated Pile test loss of 2.484 and an estimated Pile bits-per-byte (BPB) of 1.313 (compare GPT-2-1.5B at 1.2253). Training took only 12 hours on a 2xA6000 machine, and longer runs are to be expected.

To aid efficiency, the token embedding and language modelling head from llama-2-70b are taken and adapted with linear projections into the embedding space of the model (from 8,192 to 1,024 dimensions). I expect leveraging pretrained embeddings in this way to become more prevalent as more people learn about the modules in transformers, but it has long been a practice in some circles of hobbyists. The embedding and language modelling head are also commonly the two trainable modules with the most parameters (e.g. the embedding weight of llama-2-70b is a 32,000x8,192 tensor), so freezing them yields further speed and memory gains.
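As a rough illustration of the idea (not the model's actual code, which lives in the repo's custom modules), the frozen-embedding-plus-projection layout might be sketched in PyTorch as follows. The function name `build_projected_io` and the helper `count_params` are hypothetical, the `nn.Sequential` layout is inferred from the `wte[0]` and `lm_head[1]` indexing in the parameter table, and the demo uses tiny random matrices as stand-ins for the llama-2-70b weights.

```python
import torch
import torch.nn as nn


def build_projected_io(vocab_size: int, pretrained_dim: int, model_dim: int):
    """Return (wte, lm_head) built around frozen wide matrices (hypothetical helper)."""
    wte = nn.Sequential(
        nn.Embedding(vocab_size, pretrained_dim),  # stand-in for the pretrained embedding
        nn.Linear(pretrained_dim, model_dim),      # trainable down-projection
    )
    lm_head = nn.Sequential(
        nn.Linear(model_dim, pretrained_dim),               # trainable up-projection
        nn.Linear(pretrained_dim, vocab_size, bias=False),  # stand-in for the pretrained head
    )
    # In real use the pretrained weights would be copied into wte[0] and
    # lm_head[1] here; both are then frozen so only the projections train.
    wte[0].requires_grad_(False)
    lm_head[1].requires_grad_(False)
    return wte, lm_head


def count_params(module: nn.Module, trainable: bool) -> int:
    """Count parameters whose requires_grad flag equals `trainable`."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad == trainable)


# Tiny demo sizes so the sketch runs instantly (the card uses 32,000 / 8,192 / 1,024).
wte, lm_head = build_projected_io(vocab_size=50, pretrained_dim=16, model_dim=4)

input_ids = torch.randint(0, 50, (2, 5))
hidden = wte(input_ids)   # shape (2, 5, 4); the transformer blocks would run here
logits = lm_head(hidden)  # shape (2, 5, 50)

# Only the trainable projections are handed to the optimizer.
optimizer = torch.optim.AdamW(
    [p for m in (wte, lm_head) for p in m.parameters() if p.requires_grad],
    lr=1e-3,
)
```

Because the frozen matrices receive no gradients and carry no optimizer state, the memory saving grows with the vocabulary size and the width of the borrowed embeddings.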

This model is a proof-of-concept to show that the methods behind it can work. It is meant for research, not for production environments. It may also create harmful, offensive, and untrue content, reflecting the biases of the training data.
