Conditional Pretraining of Large Language Models

Large language models (LLMs), such as OpenAI's ChatGPT and similar chatbot products from other organizations, have recently gained widespread adoption. These models can extend text or respond to instructions in a natural and helpful manner. The core technologies behind LLMs, namely the transformer architecture and the GPT decoder-only causal language model, have remained relatively unchanged for over five years. The surge in popularity of ChatGPT can instead be largely attributed to recent approaches that better align the output of LLMs with the intentions of users and service providers.

Primary Approaches for Aligning LLMs with Human Expectations

Supervised finetuning (SFT) on natural instructions
Reinforcement learning from human feedback (RLHF)

Conditional Pretraining: A Third Approach

Uses a large number of pretraining examples tagged with human-understandable labels
Leverages content tagging found in many online environments
Examples of commonly used tags:
- Suitable for work (SFW) and not suitable for work (NSFW)
- G, PG, PG-13, and R for television and movie content

Converting Existing Pretraining Data into Conditional Pretraining Data

Traditional pretraining involves predicting the subsequent word in minimally processed text.
Conditional pretraining prepends training examples with descriptive tags and a brief synopsis.
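
A minimal sketch of this conversion, assuming each raw document already has a list of tags and a synopsis available. The function name `to_conditional_example`, the record keys, and the blank-line separators are illustrative assumptions; the bracketed tag line and the `#` synopsis line follow the input format shown later in this document.

```python
def to_conditional_example(tags, synopsis, text):
    """Prepend a tag line and a synopsis line to one raw document."""
    tag_line = "[ " + ", ".join(tag.strip() for tag in tags) + "]"
    synopsis_line = "# " + synopsis.strip()
    # Assumption: blank lines separate the tag line, the synopsis, and the body.
    return "\n\n".join([tag_line, synopsis_line, text])


# Hypothetical records; in practice the tags and synopses would come from
# a tagging model or from metadata attached to the source documents.
records = [
    {
        "tags": ["ai", "open source"],
        "synopsis": "A news article about open source AI.",
        "text": "Body of the first document.",
    },
    {
        "tags": ["cooking", "recipes"],
        "synopsis": "A short recipe for bread.",
        "text": "Body of the second document.",
    },
]

conditional_examples = [
    to_conditional_example(r["tags"], r["synopsis"], r["text"]) for r in records
]
print(conditional_examples[0])
```

The training objective itself is unchanged: the model still predicts the next token, but every example now begins with a header it can learn to condition on.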

Transparency and Accountability

Many current LLMs rely on proprietary instruction data and reward models, which can hinder public review and discussion of how they handle sensitive topics.
Conditional pretraining tags are transparent and easily understood by auditors or end users.

Conditional Pretraining Example

Below is an example output from this conditional tagging model for a recent news article about LAION. Only text from the body of the article was used to generate these document tags.

[ artificial intelligence, open source, ai, open letter, open source ai, ai research]

# This article explains the importance of a CERN-like organization to coordinate efforts on the transparency of large-scale AI research and provides information about LAION.
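
If the tags and synopsis need to be handled programmatically, output in this shape can be split back into its parts. The sketch below is one possible approach, not part of the model's tooling: `parse_tagger_output` is a hypothetical helper, and it assumes the first bracketed line holds the tags and the first line starting with `#` holds the synopsis.

```python
def parse_tagger_output(output):
    """Split tagger output into a list of tags and a synopsis string."""
    tags, synopsis = [], ""
    for raw_line in output.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if not tags and line.startswith("[") and line.endswith("]"):
            # Drop the brackets, then split the comma-separated tags.
            tags = [t.strip() for t in line[1:-1].split(",") if t.strip()]
        elif not synopsis and line.startswith("#"):
            synopsis = line.lstrip("#").strip()
    return tags, synopsis


sample_output = """[ artificial intelligence, open source, ai, open letter, open source ai, ai research]

# This article explains the importance of a CERN-like organization to coordinate efforts on the transparency of large-scale AI research and provides information about LAION.
"""

tags, synopsis = parse_tagger_output(sample_output)
print(tags)
print(synopsis)
```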

How to Use the Model

Format your inputs like this:

[ tag1, tag2, tag3, tag_n]

# This is a short synopsis of what kind of text I want to generate.
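
A prompt in this format can be assembled with a small helper. This is a sketch under stated assumptions: `build_prompt` is a hypothetical name, the single blank line between the tag line and the synopsis mirrors the layout above, and rejecting commas, brackets, and newlines inside tags is a precaution added here rather than a documented requirement of the model.

```python
def build_prompt(tags, synopsis):
    """Build a conditional prompt: a bracketed tag line, then a '#' synopsis line."""
    cleaned = [tag.strip() for tag in tags if tag.strip()]
    if not cleaned:
        raise ValueError("at least one tag is required")
    for tag in cleaned:
        # Assumption: these characters would make the tag line ambiguous.
        if any(ch in tag for ch in ",[]\n"):
            raise ValueError(f"tag contains a reserved character: {tag!r}")
    # Collapse whitespace so the synopsis stays on a single line.
    synopsis_line = "# " + " ".join(synopsis.split())
    return "[ " + ", ".join(cleaned) + "]\n\n" + synopsis_line


prompt = build_prompt(
    ["tag1", "tag2", "tag3", "tag_n"],
    "This is a short synopsis of what kind of text I want to generate.",
)
print(prompt)
```

The resulting string can then be passed as the input text to whatever generation interface is used to run the model.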

Acknowledgement

Thank you to LAION and Stability.ai for support and compute resources to experiment with conditional pretraining.

Disclaimer

Conditional pretraining helps the user control the outputs of the model.
However, these models (and all language models) can still generate undesirable content.
So please enjoy and use with care!