gpt2-arxiv

A GPT-2 powered predictive keyboard trained on ~1.6M manuscript abstracts from arXiv. The training data comes from the Kaggle arXiv dataset: https://www.kaggle.com/datasets/Cornell-University/arxiv

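from transformers import pipeline
from transformers import GPT2TokenizerFast

# The model reuses the standard GPT-2 tokenizer.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
llm = pipeline('text-generation', model='pearsonkyle/gpt2-arxiv', tokenizer=tokenizer)

# Sample 5 continuations of the prompt, each up to 50 tokens including the prompt.
# Note: penalty_alpha is a contrastive-search setting and likely has no effect
# while do_sample=True.
texts = llm("Directly imaged exoplanets probe",
            max_length=50, do_sample=True, num_return_sequences=5,
            penalty_alpha=0.65, top_k=40, repetition_penalty=1.25,
            temperature=0.95)

# Print every returned sequence.
for text in texts:
    print(text['generated_text'] + '\n')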

Example generated text:

The reflectance of Earth's vegetation suggests that large, deciduous forest fires are composed of mostly dry, unprocessed material that is distributed in a nearly patchy fashion. The distributions of these fires are correlated with temperature, and also with vegetation...
Directly imaged exoplanets probe the atmospheres of giant planets. The detection of such planets requires high-quality imaging with high contrast and angular resolution, as well as
We can remotely sense an atmosphere by observing its reflected, transmitted, or emitted light in varying geometries. This light will contain information on the planetary conditions including atmospheric temperature and cloud properties, which is essential for understanding how the planet interacts with the atmosphere and how it affects the climate. The primary science objective of this paper is to develop a methodology that can be applied to any kind of observation and measurement data, and to provide a framework that enables the detection and characterization of the atmospheres of exoplanets
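
To turn generations like these into keyboard suggestions, one option is to keep only the first word each sample adds after the prompt and rank those words by frequency. The sketch below is a minimal illustration, not the author's implementation: `next_word_suggestions` is a hypothetical helper, and the sample strings are made up to stand in for the `generated_text` values returned by the pipeline.

```python
from collections import Counter


def next_word_suggestions(prompt, generations, n=3):
    """Rank the first word that each generation adds after the prompt.

    `generations` is a list of strings that each start with `prompt`,
    like the 'generated_text' values returned by the pipeline.
    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for text in generations:
        if not text.startswith(prompt):
            continue  # skip anything that does not continue the prompt
        continuation = text[len(prompt):].split()
        if continuation:
            # Strip punctuation so "young," and "young" count as the same word.
            word = continuation[0].strip(".,;:!?")
            if word:
                counts[word] += 1
    return [word for word, _ in counts.most_common(n)]


# Made-up stand-ins for pipeline output.
samples = [
    "Directly imaged exoplanets probe the atmospheres of giant planets.",
    "Directly imaged exoplanets probe the formation of planetary systems.",
    "Directly imaged exoplanets probe young, self-luminous companions.",
]
suggestions = next_word_suggestions("Directly imaged exoplanets probe", samples)
print(suggestions)
```

Drawing more samples per prompt gives a more stable ranking at the cost of latency, which matters for an interactive keyboard.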

Model description

GPT-2: 12-layer, 768-hidden, 12-heads, 117M parameters

Intended uses & limitations

Coming soon...

Predictive keyboard using text generation
Real-time reference recommendations using nearest neighbors of embeddings
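
The reference-recommendation idea can be sketched as a cosine-similarity search over abstract embeddings. This is only an illustration under assumptions: the vectors below are toy 3-dimensional values, whereas real ones would presumably be derived from the model's hidden states, and `cosine_similarity` and `nearest_neighbors` are hypothetical names, not part of this model or of Transformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_neighbors(query, corpus, k=2):
    """Return the k (paper_id, similarity) pairs most similar to `query`."""
    scored = [(pid, cosine_similarity(query, vec)) for pid, vec in corpus.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy embeddings keyed by made-up paper ids.
corpus = {
    "paper_a": [0.9, 0.1, 0.0],
    "paper_b": [0.0, 1.0, 0.0],
    "paper_c": [0.7, 0.3, 0.1],
}
query_embedding = [1.0, 0.2, 0.0]
neighbors = nearest_neighbors(query_embedding, corpus, k=2)
print(neighbors)
```

This brute-force search scans every vector per query. For a corpus on the scale of ~1.6M abstracts, real-time use would likely need a vectorized or approximate nearest-neighbor index instead.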

Be careful when generating long passages or when changing the sampling mode of the language model. It can produce statements that are factually wrong, e.g.,

The surface of Mars is composed of a thin layer of water ice, that was discovered by the Cassini spacecraft after its impact on the Earth's surface.
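
To make the effect of the sampling settings concrete, here is a small sketch of how temperature and top-k filtering reshape a next-token distribution before a token is drawn. It follows the standard definitions (divide the logits by the temperature, keep the k largest, renormalize with a softmax). The logits are made-up toy values and `top_k_probs` is a hypothetical helper, not the Transformers implementation.

```python
import math


def top_k_probs(logits, temperature=1.0, top_k=None):
    """Softmax over temperature-scaled logits, restricted to the top_k largest."""
    scaled = [x / temperature for x in logits]
    if top_k is not None and top_k < len(scaled):
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        # Tokens below the k-th largest logit get zero probability
        # (ties at the cutoff are all kept).
        scaled = [x if x >= cutoff else float("-inf") for x in scaled]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5, -1.0]
baseline = top_k_probs(toy_logits, temperature=1.0)
sharper = top_k_probs(toy_logits, temperature=0.5)
truncated = top_k_probs(toy_logits, temperature=0.95, top_k=2)
print(baseline, sharper, truncated, sep="\n")
```

A lower temperature or a smaller `top_k` concentrates probability on the most likely tokens, which typically makes output more conservative. It does not guarantee that the generated text is true.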

Training procedure

Training ran for ~49 hours on an NVIDIA RTX 3090 GPU, for a total of 1.25M iterations.
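
As a rough sanity check, the figures above imply the following throughput. The calculation assumes that an iteration is one optimizer step at the `train_batch_size` of 16 listed under Training hyperparameters, with no gradient accumulation; the model card does not state this explicitly.

```python
# Figures quoted in the model card.
hours = 49
iterations = 1_250_000
train_batch_size = 16  # from the hyperparameters section
num_epochs = 10

# Derived quantities (assumes one iteration == one batch of 16 sequences).
iterations_per_second = iterations / (hours * 3600)
sequences_seen = iterations * train_batch_size
sequences_per_epoch = sequences_seen / num_epochs

print(f"{iterations_per_second:.1f} iterations/s")
print(f"{sequences_seen / 1e6:.0f}M sequences in total")
print(f"{sequences_per_epoch / 1e6:.0f}M sequences per epoch")
```

Under these assumptions the run processed about 2M sequences per epoch, which is the same order of magnitude as the ~1.6M abstracts in the dataset.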

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 16
eval_batch_size: 4
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 10
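
For reference, a `linear` schedule decays the learning rate linearly from its initial value to zero over the training run. The sketch below assumes no warmup steps (the card lists none) and takes the 1.25M iterations from the training procedure as the total step count. `linear_lr` is a hypothetical helper, not the Transformers scheduler itself.

```python
def linear_lr(step, total_steps, base_lr=5e-5, warmup_steps=0):
    """Learning rate at `step` for linear warmup followed by linear decay to 0."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1_250_000  # assumed equal to the reported iteration count
start_lr = linear_lr(0, total_steps)
mid_lr = linear_lr(total_steps // 2, total_steps)
end_lr = linear_lr(total_steps, total_steps)
print(start_lr, mid_lr, end_lr)
```

With these values the rate starts at 5e-05, reaches half of that at the midpoint of training, and is zero at the final step.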

Framework versions

Transformers 4.25.1
PyTorch 1.13.1
Tokenizers 0.13.2