2023, year of open LLMs

Published December 18, 2023

2023 has seen a surge of public interest in Large Language Models (LLMs), and now that most people have an idea of what they are and can do, the public debates around open versus closed source have reached a wide audience as well. At Hugging Face, we follow open models with great interest, as they allow research to be reproducible, empower the community to participate in the development of AI models, permit the easier scrutiny of model biases and limitations, and lower the overall carbon impact of our field by favoring checkpoint reuse (among many other benefits).

So let's do a retrospective of the year in open LLMs!

To keep this document manageable in length, we won't look at code models.

🍜 Recipe for a pretrained Large Language Model

First, how do you get a Large Language Model? (Feel free to skim this section if you already know!)

The model architecture (its code) describes its specific implementation and mathematical shape: it is a list of all its parameters, as well as how they interact with inputs. At the moment, most highly performing LLMs are variations on the "decoder-only" Transformer architecture (more details in the original transformers paper).

The training dataset contains all examples and documents on which the model is trained (aka the parameters are learned), therefore, the specific patterns learned. Most of the time, these documents contain text, either in natural language (ex: French, English, Chinese), a programming language (ex: Python, C), or any kind of structured data expressible as text (ex: tables in markdown or latex, equations, ...).

A tokenizer defines how the text from the training dataset is converted to numbers (as a model is a mathematical function and therefore needs numbers as inputs). Tokenization is done by transforming text into sub-units called tokens (which can be words, sub-words, or characters, depending on tokenization methods). The vocabulary size of the tokenizer indicates how many different tokens it knows, typically between 32k and 200k. The size of a dataset is often measured as the number of tokens it contains once split in a sequence of these individual, "atomistic" units, and these days range from several hundred billion tokens to several trillion tokens!

Training hyperparameters then define how the model is trained. How much should the parameters change to fit each new example? How fast should the model be updated?

Once these parameters have been selected, you only need 1) a lot of computing power to train the model and 2) competent (and kind) people to run and monitor the training. The training itself will consist in instantiating the architecture (creating the matrices on the hardware used for training) and running the training algorithm on the training dataset with the above mentioned hyperparameters. The result is a set of model weights. These are the model parameters after learning and what most people mean when discussing access to an open pretrained model. These weights can then be used for inference, i.e. for prediction on new inputs, for instance to generate text.

Pretrained LLMs can also be specialized or adapted for a specific task after pretraining, particularly when the weights are openly released. They are then used as a starting point for use cases and applications through a process called fine-tuning. Fine-tuning involves applying additional training steps on the model on a different –often more specialized and smaller– dataset to optimize it for a specific application. Even though this step has a cost in terms of compute power needed, it is usually much less costly than training a model from scratch, both financially and environmentally. This is one reason high-quality open-source pretrained models are very interesting, as they can be freely used and built upon by the community even when the practitioners have only access to a limited computing budget.

🗝️ 2022, from a race for size to a race for data

What open models were available to the community before 2023?

Until early 2022, the trend in machine learning was that the bigger a model was (i.e. the more parameters it had), the better its performance. In particular, it seemed that models going above specific size thresholds jumped in capabilities, two concepts which were dubbed emergent abilities and scaling laws. Pretrained open-source model families published in 2022 mostly followed this paradigm.

BLOOM (BigScience Large Open-science Open-access Multilingual Language Model) BLOOM is a family of models released by BigScience, a collaborative effort including 1000 researchers across 60 countries and 250 institutions, coordinated by Hugging Face, in collaboration with the French organizations GENCI and IDRIS. These models use decoder-only transformers, with minor modifications (post embedding normalization,[^1] and the use of ALiBi positional embeddings [^2]). The biggest model of this family is a 176B parameters model, trained on 350B tokens of multilingual data in 46 human languages and 13 programming languages. Most of the training data was released, and details of its sources, curation, and processing were published. It is the biggest open source massively multilingual model to date.
OPT (Open Pre-trained Transformer) The OPT model family was released by Meta. These models use a decoder-only transformers architecture, following the tricks of the GPT-3 paper (a specific weights initialization, pre-normalization), with some changes to the attention mechanism (alternating dense and locally banded attention layers). The biggest model of this family is a 175B parameters model trained on 180B tokens of data from mostly public sources (books, social data through Reddit, news, Wikipedia, and other various internet sources). This model family was of comparable performance to GPT-3 models, using coding optimization to make it less compute-intensive.
GLM-130B (General Language Model) GLM-130B was released by Tsinghua University and Zhipu.AI. It uses a full transformer architecture with some changes (post-layer-normalisation with DeepNorm, rotary embeddings). The 130B parameters model was trained on 400B tokens of English and Chinese internet data (The Pile, Wudao Corpora, and other Chinese corpora). It was also of comparable performance to GPT-3 models.
Smaller or more specialized open LLM Smaller open-source models were also released, mostly for research purposes: Meta released the Galactica series, LLM of up to 120B parameters, pre-trained on 106B tokens of scientific literature, and EleutherAI released the GPT-NeoX-20B model, an entirely open source (architecture, weights, data included) decoder transformer model trained on 500B tokens (using RoPE and some changes to attention and initialization), to provide a full artifact for scientific investigations.

These huge models were exciting but also very expensive to run! When performing inference (computing predictions from a model), the model needs to be loaded in memory, but a 100B parameters model will typically require 220GB of memory to be loaded (we explain this process below), which is very large, and not accessible to most organization and practitioners!

However, in March 2022, a new paper by DeepMind came out, investigating what the optimal ratio of tokens to model parameters is for a given compute budget. In other words, if you only have an amount X of money to spend on model training, what should the respective model and data sizes be? The authors found out that, overall, for the average compute budget being spent on LLMs, models should be smaller but trained on considerably more data. Their own model, Chinchilla (not open source), was a 70B parameters model (a third of the size of the above models) but trained on 1.4T tokens of data (between 3 and 4 times more data). It had similar or better performance than its bigger counterparts, both open and closed source.

This paradigm shift, while probably already known in closed labs took the open science community by storm.

🌊 2023, a year of open releases

The rise of small Large Language Models

2023 saw a wave of decoder style transformers arise, with new pretrained models released every month, and soon every week or even day: LLaMA (by Meta) in February, StableLM (by StabilityAI) and Pythia (by Eleuther AI) in April, MPT (by MosaicML) in May, X-GEN (by Salesforce) and Falcon (by TIIUAE) in June, Llama 2 (by Meta) in July, StableLM v2 (by StabilityAI) in August, Qwen (by Alibaba) and Mistral (by Mistral.AI) in September, Yi (by 01-ai) in November, DeciLM (by Deci), Phi-2, and SOLAR (by Upstage) in December.

All these releases a) included model weights (under varyingly open licenses) and b) had good performance for models on the smaller side (between 3B and 70B parameters), and therefore, they were instantly adopted by the community. Almost all of these models use the decoder transformer architecture, with various tweaks (ALiBi or RoPE, RMS pre-normalization, SwiGLU), as well as some changes to the attention functions (Flash-Attention, GQA, sliding windows) and different code base implementations to optimize for training or inference speed. These tweaks are likely to affect the performance and training speed to some extent; however, as all the architectures have been released publicly with the weights, the core differences that remain are the training data and the licensing of the models.

The first model family in this series was the LLaMA family, released by Meta AI. The explicit objective of the researchers was to train a set of models of various sizes with the best possible performances for a given computing budget. For one of the first times, the research team explicitly decided to consider not only the training budget but also the inference cost (for a given performance objective, how much does it cost to run inference with the model). In this perspective, they decided to train smaller models on even more data and for more steps than was usually done, thereby reaching higher performances at a smaller model size (the trade-off being training compute efficiency). The biggest model in the Llama 1 family is a 65B parameters model trained on 1.4T tokens, while the smaller models (resp. 6 and 13B parameters) were trained on 1T tokens. The small 13B LLaMA model outperformed GPT-3 on most benchmarks, and the biggest LLaMA model was state of the art when it came out. The weights were released with a non-commercial license though, limiting the adoption by the community.

The Pythia models were released by the open-source non-profit lab Eleuther AI, and were a suite of LLMs of different sizes, trained on completely public data, provided to help researchers to understand the different steps of LLM training.

The MPT models, which came out a couple of months later, released by MosaicML, were close in performance but with a license allowing commercial use, and the details of their training mix. The first MPT model was a 7B model, followed up by 30B versions in June, both trained on 1T tokens of English and code (using data from C4, CommonCrawl, The Stack, S2ORC).

The MPT models were quickly followed by the 7 and 30B models from the Falcon series, released by TIIUAE, and trained on 1 to 1.5T tokens of English and code (RefinedWeb, Project Gutemberg, Reddit, StackOverflow, Github, arXiv, Wikipedia, among other sources) - later in the year, a gigantic 180B model was also released. The Falcon models, data, and training process were detailed in a technical report and a later research paper.

Inheriting from the GPT-Neo-X model, StabilityAI released the StableLM-Base-Alpha models, a small (3B and 7B) pre-trained series using 1.5T tokens of an experimental dataset built on ThePile, followed by a v2 series with a data mix including RefinedWeb, RedPajama, ThePile, and undisclosed internal datasets, and lastly by a very small 3B model, the StableLM-3B-4e1T, complete with a detailed technical report.

Where previous models were mostly public about their data, from then on, following releases gave close to no information about what was used to train the models, and their efforts cannot be reproduced - however, they provide starting points for the community through the weights released.

Early in the summer came the X-Gen models from Salesforce, 7B parameters models trained on 1.5T tokens of "natural language and code", in several steps, following a data scheduling system (not all data is introduced at the same time to the model).

X-Gen was a bit over-shadowed by the much visible new LLaMA-2 family from Meta, a range of 7 to 70B models trained on 2T tokens "from publicly available sources", with a permissive community license and an extensive process of finetuning from human-preferences (RLHF), so-called alignment procedure.

A couple of months later, the first model from the newly created startup Mistral, the so-called Mistral-7B was released, trained on an undisclosed number of tokens from data "extracted from the open Web". The end of 2023 was busy with model releases with a second larger model from Mistral (Mixtral 8x7B), a first impressive model from Deci.AI called DeciLM as well as a larger merge of models from upstage, SOLAR also trained on undisclosed amount and sources of data. All these models carried steady increases on the leaderboards and open benchmarks.

In parallel, a notable event of the end of the year 2023 was the rise of performances and a number of models trained in China and openly released. Two bilingual English-Chinese model series were released: Qwen, from Alibaba, models of 7 to 70B parameters trained on 2.4T tokens, and Yi, from 01-AI, models of 6 to 34B parameters, trained on 3T tokens. The performance of these models was a step ahead of previous models both on open leaderboards like the Open LLM leaderboard and some of the most difficult benchmarks like Skill-Mix. Another strong contender from late 2023 was the DeepSeek coding model from DeepSeek AI trained from scratch on 2T tokens, with a composition of 87% code and 13% natural language in both English and Chinese (mostly a code model).

Dialog models everywhere

Compared to 2022, almost all pretrained models released in 2023 came with both a pre-trained version and a dialog-finetuned version, using one of several existing approaches. While approaches for adapting models to chat-setting were developed in 2022 and before, wide adoption of these techniques really took off in 2023, emphasizing the growing use of these chat models by the general public as well as the growing manual evaluation of the models by chatting with them ("vibe-check" evaluation). We detail the most well-known approaches to adapt pretrained models for chat here, but many variations exist!

Chat-based fine-tuning is a variant of supervised fine-tuning, where the annotated data is chat data (multiturn dialogue-like data, much like what you would find on social media) that you fine-tune your model on. You use the same technique as when training your model: for decoder transformers, you teach your model to predict the next words one by one (called an auto-regressive approach).

Instruction fine-tuning (IFT) follows the same approach but with instruction datasets, which contain a collection of query-like prompts plus answers (with optional additional input if needed). These datasets teach the models how to follow an instruction and can be human or LLM-generated. Using large-scale model-outputs synthetic datasets (datasets which are composed of model generations, e.g., generations from GPT-4 either from instructions of from interactions between users and said model) is one of the ways to accomplish instruction and chat finetuning. This is often called distillation as it involves taking the knowledge from a high-performing model to train or fine-tune a smaller model.

Both these methods are relatively easy to implement: you just need to find or generate related datasets and then fine-tune your model using the same technique as when training. A great number of instruct datasets were published last year, which improved model performance in dialogue-like setups. For more information on this topic, you can read an intro blog here. However, the models, though better, can still not match what humans expect.

Reinforcement learning from human feedback (RLHF) is a specific approach that aims to align what the model predicts to what humans like best (depending on specific criteria). It was (at the beginning of the year) a new technique for fine-tuning. From a given prompt, the model generates several possible answers; humans rank these answers; the rankings are used to train what is called a preference model (which learns to give a score reflecting human preference for answers); the preference model is then used to fine-tune the language model using reinforcement learning. For more detailed information, see this blog post, the original RLHF paper, or the Anthropic paper on RLHF. It's a costly method (annotating/ranking + training a new model + fine-tuning is quite expensive) that has been mostly used to align models for safety objectives. A less costly variation of this method has been developed that uses a high-quality LLM to rank model outputs instead of humans: reinforcement learning from AI feedback (RLAIF).

Direct preference optimization (DPO) is another variation of RLHF, but does not require the training and use of a separate preference model - the method requires the same human or AI ranking dataset but uses this data to update the model directly by looking at the difference between its original policy (way of predicting) and the optimal one (which would predict the best-ranked answers). In other words, the aligned model is also the preference model, which makes the optimization procedure a lot simpler while giving what seems to be equivalent final performances.

So, to come back to our wave of small open weights models from (mostly) private companies, a lot of them were released with fine-tuned counterparts: MPT-7B also came with an instruct and a chat version, instruct-tuned versions of Falcon and XGen models were released at the end of the year, Llama-2, Qwen and Yi were released with chat versions and DeciLM with an instruct version. The release of Llama-2 was particularly notable due to the strong focus on safety, both in the pretraining and fine-tuning models.

What about the community?

While chat models and instruction fine-tuned models were usually provided directly with new model releases, the community and researchers didn't take this for granted: a wide and healthy community of model fine-tuners bloomed over the fruitful grounds provided by these base models, with discussions spontaneously occurring on Reddit, Discord, the Hugging Face Hub, and Twitter. Community model releases were frequent, in parallel with the creation of new interesting datasets (also used to finetune models to ascertain their good performances and quality).

At the beginning of 2023, a few datasets for instruction/chat finetuning were already released. For instance, for human preferences, the WebGPT dataset by OpenAI, HH-RLHF dataset by Anthropic, and Summarize by OpenAI were pioneer in this direction. Examples of instruction datasets are the Public Pool of Prompts by BigScience, FLAN 1 and 2 by Google, Natural Instructions by AllenAI, Self Instruct, a framework to generate automatic instructions by researchers from different affiliations, SuperNatural instructions, an expert created instruction benchmark sometimes used as fine-tuning data, Unnatural instructions, an automatically generated instruction dataset by Tel Aviv University and Meta, among others.

❄️ Winter 2022/2023: In January this year, the Human ChatGPT Instruction corpus (HC3) was released by Chinese researchers from various institutions, and contained humans versus model answers to various questions. March was filled with releases: Stanford opened the Alpaca model, which was the first instruction-following LLaMA model (7B), and the associated dataset, 52K instructions generated with an LLM. LAION (a non profit open source lab) released the Open Instruction Generalist (OIG) dataset, 43M instructions both created with data augmentation and compiled from other pre-existing data sources. The same month, LMSYS org (at UC Berkeley) released Vicuna, also a LLaMA fine-tune (13B), this time on chat data: conversations between users and ChatGPT, shared publicly by the users themselves on ShareGPT. The Guanaco dataset, an extension of the Alpaca dataset (containing an added 500K entries in more languages), was also released, as well as the associated LLaMA-7B fine-tune.

🌱 Spring: In April, BAIR (Berkeley AI Research lab) released Koala, a chat-tuned LLaMA model, using several of the previous datasets (Alpaca, HH-RLHF, WebGPT, ShareGPT), and DataBricks released the Dolly dataset, a great human effort of 15K manually generated instructions as well as the associated model, a Pythia fine-tune. In May, Tsinghua University released UltraChat, a dataset of 1.5M conversations containing instructions, and UltraLLaMA, a fine-tune on said dataset. Microsoft then released the GPT4-LLM dataset/framework to generate instructions with GPT4, and in June, Microsoft research shared a new method, Orca, to construct instruction datasets by using the reasoning trace of larger models (which explain their step by step reasoning) - it was soon reproduced by the community (notably Alignmentlab.ai), who created Open Orca datasets, several million of entries, then used to fine-tune a number of models (Llama, Mistral, ...). In May and June, Camel-AI released a number of instruction or chat datasets on different topics (more than 20K examples in each domain, physics, biology, chemistry, ...) obtained with GPT4. In June, too, the Airoboros framework to fine-tune models using model-generated data (following the self-instruct approach) was released, along with a number of instruct datasets.

🌻Summer: In August, UltraLM (a high-performing chat fine-tune of LLaMA) was released by OpenBMB, a Chinese non-profit, and in September, they released the associated preference dataset UltraFeedback, a feedback dataset of inputs compared by GPT4 (with annotations). Throughout the summer, NousResearch, a collective, released several fine-tunes (notably the Hermes and Capybara collections) based on several private and public instruct datasets. In September, a student team from Tsinghua University released OpenChat, a LLaMA fine-tune using a new RL finetuning strategy, and Intel released an Orca style DPO dataset.

🍂 Autumn: In October, Hugging Face released Zephyr, a Mistral fine-tune using DPO and AIF on UltraChat and UltraFeedback, and community members released OpenHermes 2, a Mistral-7B fine-tuned on 900K entries either from the web or generated with Axolotl. Lmsys released LMSYS-Chat-1M, real-life user conversations with 25 LLMs. In November, OpenBuddy released OpenBuddy-Zephyr, a Zephyr fine-tuned on multi-turn dialogue data, and Argilla released Notus, a DPO fine-tune of Zephyr. NVIDIA released HelpSteer, an alignment fine-tuning dataset providing prompts, associated model responses, and grades of said answers on several criteria, while Microsoft Research released the Orca-2 model, a Llama 2 fine-tuned on a new synthetic reasoning dataset and Intel Neural Chat, a Mistral fine-tune on Orca and with DPO. In December, Berkeley released Starling, a RLAIF fine-tuned of Open-Chat, and the associated dataset, Nectar, 200K entries of comparison data.

As we can see, this whole year's development relies both on the creation of new datasets through the use of high-quality pretrained LLMs, as well as on all the open models released by the community, making the field go forward by leaps and bounds! And if you now see one of these names in a model name, you'll be able to get an idea of where it's coming from 🤗

Note: Some more specialized datasets (such as MetaMath or MathInstruct math problem fine-tuning datasets, Evol-Instruct, math and code instructions, CodeAlpaca and CodeCapybara code instructions) were also released, but we won't cover them in detail here, though they have also been used to improve model performance on specific tasks. You can also see the awesome instructions dataset for a compilation of other relevant datasets.

Democratizing access

Note: A number of tools also emerged to support inference and deployment for more beginner users, such as llama.cpp, ollama, text-generation-inference, vllm, among others. They are out of scope for this document.

Merging: Extreme customization

In a typical open-source fashion, one of the landmark of the community is model/data merging. With each merge/commit, it can be more difficult to trace both the data used (as a number of released datasets are compilations of other datasets) and the models' history, as highly performing models are fine-tuned versions of fine-tuned versions of similar models (see Mistral's "child models tree" here). In this summary, we haven't had the time yet to talk about this amazing technique, so let's spend a couple of final words on it.

But what does it mean to merge a model?

Model merging is a way to fuse the weights of different models together in a single model to (ideally) combine the respective strengths of each model in a unified single model. A few techniques exist to do so that have been extended and often published mostly in community forums, a striking case of fully decentralized research happening all over the world between a community of practitioners, researchers, and hobbyists. One of the simplest published methods consists in averaging the parameters of a set of models sharing a common architecture (example 1, example 2) but more complex parameter combinations exist, such as determining which parameters are the most influential in each model for a given task (weighted averaging), or considering parameters interference between models before selecting which parameters to keep when merging (ties merging). For a good overview of the litterature, you can check this cool paper collection!

These techniques allow anybody to easily generate combinations of models and are made especially easy by the fact that most models are nowadays variations on the same architecture. That's the reason some models submitted to the open LLM leaderboard have names such as llama2-zephyr-orca-ultra. This particular example is likely a merge of llama2 and zephyr models, fine-tuned on orca and ultra datasets. Usually, more details are to be found in the respective model card on the Hugging Face hub.

PEFT: Personalization at the tip of your fingers

Sometimes, you may want more controlled personalization, without enough memory to load a whole model in memory to fine tune it. Did you know that you don't need to use an entire model when fine-tuning?

You might want to use what is called parameter efficient fine-tuning (PEFT). This technique first freezes up the parameters of your pretrained model of interest, then adds a number of new parameters on top of it, called the adapters. What you then fine-tune on your task are only the (lightweight) adapter weights, considerably smaller than the original model. You then just need to share your small adapter weights (and the base model)! You'll find a list of interesting approaches for PEFT here.

Quantization: Models running everywhere

We've seen that well-performing models now come in all shapes and sizes… but even then, it doesn't mean that they are accessible to all! A 30B parameters model can require more than 66G of RAM just to load in memory (not even use), and not everyone in the community has the hardware necessary to do so.

That's where quantization comes in! Quantization is a special technique which reduces a model's size by changing the precision of its parameters.

What does it mean?

In a computer, numbers are stored with a given precision (such as float32, float16, int8, and so forth). A precision indicates both the number type (is it a floating point number or an integer) as well as on how much memory the number is stored: float32 stores floating point numbers on 32 bits. For a more in-depth explanation, see this link. So, the higher the precision, the more physical memory a number takes, as it will be stored on more bits.

So, if you reduce the precision, you reduce the memory each model parameter takes in storage, therefore reducing the model size! This also means that you reduce... the actual precision of the computations, which can reduce the model's performance. However, we found out that on bigger models, this performance degradation is actually very limited.

To go back to our above example, our 30B parameters model in float16 requires a bit less than 66G of RAM, in 8bit it only requires half that, so 33G of RAM, and it 4bit we reach even half of this, so around 16G of RAM, making it considerably more accessible.

There are many ways to go from one precision to another, with many different "translation" schemes existing, each with its own benefits and drawbacks. Popular approaches include bitsandbytes, GPTQ, and AWQ. Some users, such as TheBloke, are even converting popular models to make them accessible to the community. All are very recent and still developing, and we hope to see even more progress on this as time goes on.

What's next?

The year is not over yet! And these final ~~months~~ ~~days~~ hours have already come with the share of surprises: will a new architecture finally overperform the simple and efficient Transformer?

New releases include

A mixture of experts:
- Mixtral, the model is made of 8 sub-models (transformer decoders), and for each input, a router picks the 2 best sub-models and sums their outputs.
Several state space models (models that map input to output through a latent space and which can expressed as either an RNN or a CNN depending on the tasks, this resource is great at explaining state models if you want more information):
- Mamba, a state space model with an added selection mechanism
- Striped Hyena, a state space model with fast convolutions kernel

It's still a bit too early to say if these new approaches will take over the Transformer, but state space models are quite promising!

Takeaways

This year has seen a rise of open releases from all kinds of actors (big companies, start ups, research labs), which empowered the community to start experimenting and exploring at a rate never seen before.
Model announcement openness has seen ebbs and flow, from early releases this year being very open (dataset mixes, weights, architectures) to late releases indicating nothing about their training data, therefore being unreproducible.
Open models emerged from many new places, including China, with several new actors positioning themselves as strong contenders in the LLM game.
Personalization possibilities reached an all-time high, with new strategies for fine-tuning (RLHF, adapters, merging), which are only at their beginning.
Smaller model sizes and upgrades in quantization made LLMs really accessible to many more people!
New architectures have also appeared - will they finally replace the Transformer?

That's it folks! I hope you enjoyed this year's review, learned a thing or two, and feel as enthusiastic as me about how much of AI progress now relies on open source and community effort! 🤗

[^1]: Post embedding normalisation is a trick to make learning more stable. [^2]: ALiBi positional embeddings introduce a penalty when tokens too far away in a sequence are connected together by the model (where normal positional embeddings would just store information about the order and respective position of tokens in a sequence).