Fast, High-Fidelity LLM Decoding with Regex Constraints

Community Article Published February 23, 2024

image/png

Constraining the text generated by an LLM is essential for developing robust LLM applications. For example, a developer may require LLM responses to adhere to a specific JSON or YAML schema so that they are comprehensive and reliably parseable.

Powerful tools and methods (Beurer-Kellner et al., 2022; Lundberg & Ribeiro, 2023; Willard & Louf, 2023; Zheng et al., 2023) have been published in the past year to address this need. Notably, the open source library Outlines\texttt{Outlines} implements an efficient method (Willard & Louf, 2023) that ensures compliance with an arbitrary regex.

In this blog post, I propose two novel, fast, high-fidelity alternatives to this method. Specifically, I will:

  • Present the principle of Outlines\texttt{Outlines} and discuss how allowing improper token sequences (i.e. token sequences that do not correspond to the tokenization of a string) can skew the probability distribution of the LLM and hinder the decoding speed;
  • Introduce DirectMerge\texttt{DirectMerge}, a method that maintains guaranteed compliance while only generating proper token sequences. DirectMerge\texttt{DirectMerge} is valid for all tokenizers based on a merge table and in particular for the very popular Byte Pair Encoding tokenizers (Sennrich et al., 2016);
  • Propose CartesianMerge\texttt{CartesianMerge}, a variant of DirectMerge\texttt{DirectMerge} that scales better with the complexity of the regex;
  • Discuss the execution time of CartesianMerge\texttt{CartesianMerge} and show how this execution time can be significantly reduced in the case of a JSON or YAML schema constraint.

This blog post comes with a technical appendix formalizing and proving the algorithms. Additionally, a Python notebook with implementations of DirectMerge\texttt{DirectMerge} and CartesianMerge\texttt{CartesianMerge}, along with additional empirical results, is available.

Deterministic Finite Automata and Constrained Decoding

As a preprocessing step, Outlines\texttt{Outlines} converts a character-level deterministic finite automaton (DFA) testing whether a string matches a regex (cf. Figure 1a) into a token-level DFA testing whether a token sequence is decoded in a string matching the regex (cf. Figure 1b):

  • the states, including the start state and the accept states, remain the same;
  • the alphabet of the token-level DFA is the tokenizer's vocabulary;
  • the transition function of the token-level DFA is the extended transition function of the character-level DFA restricted to the tokenizer's vocabulary. For example with the example of Figure 1b, there is an abb\texttt{abb} transition from 0\texttt{0} to 1\texttt{1} because if we start from 0\texttt{0} on the character-level DFA and successively see a\texttt{a}, b\texttt{b} and b\texttt{b}, we'll end up in state 1\texttt{1}.

image/png

Figure 1. Top left: character-level DFA for a*b*\texttt{a*b*}, top right: token-level DFA returned by Outlines\texttt{Outlines} for the same regex, bottom: token masking at decoding time.

At decoding time, the DFA is used to determine, for each new token, which potential tokens are allowed. For this and starting from the initial state of the DFA, we determine the allowed tokens with the outgoing transitions from the current state, apply the corresponding mask to the next token probabilities and renormalize these probabilities. We can then sample a new token and update the state of the DFA.

For example, let's imagine that the tokens aaaa\texttt{aaaa}, aa\texttt{aa} and ab\texttt{ab} have already been generated. Starting from state 0\texttt{0}, we remained in state 0\texttt{0} with aaaa\texttt{aaaa} and aa\texttt{aa} but transitioned to state 1\texttt{1} with ab\texttt{ab}. The tokens now allowed are simply those with an outgoing transition from state 1\texttt{1}, i.e. b\texttt{b} and bb\texttt{bb}. We then consider the next token probabilities of these two tokens and we renormalize them so that they sum to 1.

Three Challenges due to Improper Token Sequences

The token sequences potentially generated with Outlines\texttt{Outlines} are exactly those that are decoded into a string matching the regex. This includes token sequences that do not result from the tokenization of a string. We call them improper token sequences below.

Consider the regex: "boolean: ((true)(false))"\texttt{"boolean: ((true)}\mid\texttt{(false))"}. While there are only two strings matching this regex, the corresponding DFA returned by Outlines\texttt{Outlines} has 16 states and 48 transitions (Figure 2). For example, the first token is chosen among boolean\texttt{boolean}, bool\texttt{bool}, bo\texttt{bo} and b\texttt{b}.

image/png

Figure 2. DFA built by Outlines\texttt{Outlines} for the "boolean: ((true)|(false))"\texttt{"boolean: ((true)|(false))"} regex

In contrast, as we will see in the following sections, the DFA built by DirectMerge\texttt{DirectMerge} and CartesianMerge\texttt{CartesianMerge} only accepts the token sequences resulting from the tokenization of compliant strings and it has only 2 states and 4 transitions (Figure 3).

image/png

Figure 3. DFA built by DirectMerge\texttt{DirectMerge} and CartesianMerge\texttt{CartesianMerge} for the "boolean: ((true)|(false))"\texttt{"boolean: ((true)|(false))"} regex

Challenge #1: Skewed Probability Distribution

The permissiveness of Outlines\texttt{Outlines} with improper token sequences may significantly distort the probability distribution of the LLM, beyond what is strictly necessary to enforce the regex constraint.

To illustrate this, let's imagine that we generate a continuation of the prompt Question: What is the first name of the US president who was a member of the Sigma Alpha Epsilon fraternity?\nAnswer: with the regex constraint "( William)( Theodore)"\texttt{"( William)}\mid \texttt{( Theodore)"} (the true answer is William like William McKinley, Theodore Roosevelt's predecessor).

When the text generation is unconstrained, we can compute the probability along all token sequences leading to one of the two possible answers. If we do so with Mistral-7B\texttt{Mistral-7B} and a temperature equal to 1, we get P(" William")P(" William")+P(" Theodore")=0.85879\frac{P(\texttt{" William"})}{P(\texttt{" William"})+P(\texttt{" Theodore"})} = 0.85879 and notice that almost all the probability mass for these two answers comes from proper token sequences (99.8586% for " Theodore"\texttt{" Theodore"} and 99.9997% for " William"\texttt{" William"}).

If we constrain the text generation with Outlines\texttt{Outlines} and multinomial sampling as a decoding strategy, any prefix of " William"\texttt{" William"} and " Theodore"\texttt{" Theodore"} will be accepted as the first token and determine the answer (if this first token is not the blank space). For example, the token " The"\texttt{" The"} can be selected as the first token and in this case, the answer will necessarily be " Theodore"\texttt{" Theodore"}. As a result, the probability to generate " Theodore"\texttt{" Theodore"} will be the sum of generating " T"\texttt{" T"}, " Th"\texttt{" Th"}, " The"\texttt{" The"}, etc. as the first token, even though these first tokens, except " Theod"\texttt{" Theod"} almost never lead to " Theodore"\texttt{" Theodore"} and " Theodore"\texttt{" Theodore"} will be significantly boosted by the sole virtue of having a very common token, " The"\texttt{" The"}, as a prefix. Indeed, P(" William")P(" William")+P(" Theodore")=0.52852\frac{P(\texttt{" William"})}{P(\texttt{" William"})+P(\texttt{" Theodore"})} = 0.52852 in this case.

With DirectMerge\texttt{DirectMerge} and CartesianMerge\texttt{CartesianMerge}, only the tokens corresponding to the tokenization of the possible answers are taken into account (e.g. " The"\texttt{" The"} is ignored). The conditional probability, P(" William")P(" William")+P(" Theodore")=0.85872\frac{P(\texttt{" William"})}{P(\texttt{" William"})+P(\texttt{" Theodore"})} = 0.85872, is very close the one in the unconstrained case, which makes sense given that the LLM spontaneously generates proper token sequences a large majority of the time.

Challenge #2: Self-Intoxication

Another concern is that the tokens generated are fed back to the LLM to select the following ones. During its pretraining and depending whether subword regularization techniques like BPE-Dropout (Provilkov et al., 2020) have been used, an LLM was never or rarely exposed to improper token sequences so this LLM may not properly interpret the newly added tokens and this can impact the quality of its output.

Challenge #3: Missed Speedup Opportunities

As suggested by two recent blog posts, Outlines\texttt{Outlines} offers opportunities to accelerate decoding.

Let's go back to the DFA provided by DirectMerge\texttt{DirectMerge} and CartesianMerge\texttt{CartesianMerge} on Figure 3. There is only one outgoing transition from states 0\texttt{0} and 1\texttt{1}. This means that we don't even need to compute the next token probability when we reach these states since there is only one possible token. We can then generate the whole text in just one LLM call. Meanwhile, the DFA built with DirectMerge\texttt{DirectMerge} for the same regex on Figure 3 has 16 states but only 3 states with one outgoing transition (states 6, 7 and 13). There are then much fewer opportunities to skip computationally expensive LLM calls. Moreover, there are many potential paths that are much longer than three transitions. This can also slow down the decoding.

Ensuring Both Compliance and Proper Tokenization

We can now focus on DirectMerge\texttt{DirectMerge}. Even though its aims are slightly different, DirectMerge\texttt{DirectMerge} shares the same overall approach as Outlines\texttt{Outlines}: it derives a token-level DFA from a character-level DFA recognizing the regex and this token-level DFA is used to identify allowed tokens at decoding time. DirectMerge\texttt{DirectMerge} applies to tokenizers based on a merge table, like Byte Pair Encoding tokenizers (Sennrich et al., 2016). Encoding a string with such tokenizers consists of converting the string into a sequence of single-letter tokens and then performing an ordered list of merge operations. Each of these merge operations is defined by two tokens (e.g. a\texttt{a} and b\texttt{b}) and merge all the corresponding pairs of consecutive tokens, starting from the left of the current token sequence.

Initial string aaaabaacac\texttt{aaaabaacac}
Conversion to a sequence of single-letter tokens a a a a b a a c a c\texttt{a a a a b a a c a c}
Token sequence after merge(a,b)\texttt{merge}(\texttt{a}, \texttt{b}) a a a ab a a c a c\texttt{a a a }\underline{\texttt{ab}}\texttt{ a a c a c}
Token sequence after merge(a,a)\texttt{merge}(\texttt{a}, \texttt{a}) aa a ab aa c a c\underline{\texttt{aa}}\texttt{ a ab }\underline{\texttt{aa}}\texttt{ c a c}
Final token sequence after merge(a,c)\texttt{merge}(\texttt{a}, \texttt{c}) aa a ab aa c ac\texttt{aa a ab aa c }\underline{\texttt{ac}}
Table 1. Tokenization of the string aaaabaacac\texttt{aaaabaacac} with the merge table [(a,b),(a,a),(a,c)][(\texttt{a}, \texttt{b}), (\texttt{a}, \texttt{a}), (\texttt{a}, \texttt{c})]. The order of the merge operations and the order of the letters in the string both affect the final token sequence.

With DirectMerge\texttt{DirectMerge}, we start with the character-level DFA and we apply a series of simple transformations for each merge operation. Let us consider the case of an (a,b)(a, b) merge operation with aba \neq b. DirectMerge\texttt{DirectMerge} affects each state SS with one or several incoming aa transitions and one outgoing bb transition and its impact depends on two criteria:

  1. Has SS an x(xa)x\:(x \neq a) incoming transition or is SS the start state?
  2. Has SS a y(yb)y\:(y \neq b) outgoing transitions or is SS an accept state?

More precisely and if the outgoing bb transition from SS leads to SbS_b, the transformations applied to state SS are the following:

  • Replace each incoming aa transition by an abab transition to SbS_b
  • If both criteria are false (cf. Figure 4a):
    • Remove SS
  • If Criterion 1 is false and Criterion 2 is true:
    • Remove the outgoing bb transition
  • If Criterion 1 is true and Criterion 2 is false:
    • Remove all incoming aa transitions
  • If both criteria are true (cf. Figure 4c):
    • Add a state SS'
    • Redirect all incoming aa transitions to SS'
    • Copy all y(yb)y\:(y \neq b) transitions of SS and add them to SS'
    • Make SS' an accept state if SS is an accept state

image/png

Figure 4. Examples of the transformations to apply in case of an (a,b)(a, b) merge operation with aba \neq b. Left: situation before DirectMerge\texttt{DirectMerge}, right: situation after DirectMerge\texttt{DirectMerge}.

The case of merge operations with two identical tokens is more challenging because not all pairs of matching tokens should be merged. For example, on the fourth row of Table 1 above, the second and third aa are not merged because the merge operation is applied left-to-right and the second aa is already merged with the first aa. You can find the description of DirectMerge\texttt{DirectMerge} in the case a=ba = b and the correctness proof for both cases in the technical appendix.

Scaling with the Complexity of the Regex

DirectMerge\texttt{DirectMerge} guarantees both compliance with the regex constraint and proper tokenization. However, DirectMerge\texttt{DirectMerge} removes or adds states and transitions and in some unfavorable circumstances, the number of states and transitions may become overwhelmingly large.

Fortunately, with CartesianMerge\texttt{CartesianMerge}, we can efficiently simulate the DFA created by DirectMerge\texttt{DirectMerge} without explicitly building it. We can indeed notice that the target language is the intersection of the set of all token sequences decoded into strings matching the regex and the set of all proper token sequences. Both of these sets are regular languages and we already know how to build the corresponding DFAs, with Outlines\texttt{Outlines} for the first one and with DirectMerge\texttt{DirectMerge} applied to the ".*"\texttt{".*"} regex for the second one. Furthermore, as opposed to the DFA returned by DirectMerge\texttt{DirectMerge}, they will not become uncontrolably large as the regex becomes more complex. Outlines\texttt{Outlines} does not add any state to the character-level DFA and the DFA returned by DirectMerge\texttt{DirectMerge} with ".*"\texttt{".*"} happens to have interesting properties that make it easy to compute and compress (cf. the technical appendix for more details).

image/png

Figure 5. The target language is the intersection of two regular languages and we already know how to build the DFAs for both of them.

Once we have built these two DFAs, we can keep track of their respective states at decoding time and the allowed tokens are in the intersection of their respective sets of allowed tokens.

However, knowing which tokens are valid transitions in both DFAs is not enough because we also need to check that the tokens generated can eventually lead to a token sequence decoded into a string matching the regex. This was not an issue before because both Outlines\texttt{Outlines} and DirectMerge\texttt{DirectMerge} generate token-level DFAs whose states are all relevant (i.e. on a path between the start state and an accept state) as long as the initial character-level DFA only has relevant states. The pairs of states that are valid can be computed with a breadth-first exploration starting from the pair of start states and then from the pairs of accept states with reverse transitions.

Is it Fast Enough?

Table 2 reports the execution time of the various steps of CartesianMerge\texttt{CartesianMerge} for a sample of JSON strings from the glaive-function-calling-v2\texttt{glaive-function-calling-v2} dataset. The delay of the first step is negligible because this activity only needs to be executed once for a given tokenizer and its outcome will be useful for all future regex. The second and fourth steps also add little overhead while the third step is the performance bottleneck.

Step Activity Mean duration (seconds) Comment
1 Compute the DFA recognizing proper token sequences with DirectMerge\texttt{DirectMerge} (preprocessing) 20.9 Only once for a given tokenizer
2 Compute the DFA recognizing the token sequences decoded into strings matching the regex with Outlines\texttt{Outlines} (preprocessing) 0.687 Only once for a given regex
3 Identify relevant pairs of states (preprocessing) 27.3 Only once for a given regex
4 Build the token mask (decoding) 0.00153 At each decoding step
Table 2. Execution time of the various steps of CartesianMerge\texttt{CartesianMerge} on a sample of 1000 JSON strings from the glaive-function-calling-v2\texttt{glaive-function-calling-v2} dataset.

The latency of the third step remains limited if the regex constraint is used a large number of times but it may become an issue when a new regex is expected to be provided only at the time of decoding. It would typically be the case if an LLM provider offers a feature similar to OpenAI's function calling. However, when the real goal is to adhere to a JSON or YAML schema, we can significantly shorten the preprocessing.

Oftentimes in practical situations, there will be parts of the regex surrounded by two delimiters with which they cannot merge. This is particularly interesting for us because we then know that the portion of the DFA representing these parts is unaffected by all other parts of the regex. Let's call these situations Vegas configurations (because what happens between the delimiters stays between the delimiters).

For example with the Mistral-7B tokenizer, whatever string starting with a blank space and delimited by a colon and a line break is a Vegas configuration and this is of course very useful for YAML schemas.

Vegas configurations can be leveraged in two ways, as illustrated by Figure 6:

  • if we realize that the same Vegas configuration appears several times in the regex, we can compute and store the corresponding portion of the DFA only once;
  • if the variable parts of the regex we use often correspond to a limited number of Vegas configuration, we can build a library of these Vegas configurations, store the corresponding portions of the DFA and directly use them needed. With JSON or YAML schemas, the library would typically include the Vegas configurations of the common types (strings, numbers, dates...).

image/png

Figure 6. The DFA corresponding to Vegas configurations, i.e. repeated parts surrounded by two delimiters with which they cannot merge, can be computed and stored only once. Top: naive implementation of CartesianMerge\texttt{CartesianMerge}, bottom: implementation of CartesianMerge\texttt{CartesianMerge} taking advantage of Vegas configurations.

Conclusion

In this blog post, I introduced two alternatives to Outlines\texttt{Outlines}. DirectMerge\texttt{DirectMerge} and CartesianMerge\texttt{CartesianMerge} also guarantee compliance with a regex constraint while accelerating decoding and mitigating the risks of distortions of the language model distribution.

It is unclear how often such distortions may happen. However it is easy to find examples of them and should they happen in practice, they would probably go unnoticed. Given that uncertainty and the decoding speedup, CartesianMerge\texttt{CartesianMerge} should be a no-regret option for most constrained decoding needs.

Beyond constrained decoding, it would also be interesting to assess whether the fact that models sometimes generate improper token sequences affect their performance. If it were the case, CartesianMerge\texttt{CartesianMerge} may be used to enforce proper tokenization.

Thanks to the authors of the papers cited above, in particular to Luca Beurer-Kellner for a helpful discussion, and to the contributors of the various resources used (inter alia, outlines\texttt{outlines}, interegular\texttt{interegular}, transformers\texttt{transformers} and glaive-function-calling-v2\texttt{glaive-function-calling-v2}).

If you want to cite this blog post, you are welcome to use the following BibTeX entry:

@misc{Tran-Thien_2024,
title={Fast, High-Fidelity LLM Decoding with Regex Constraints},
url={https://vivien000.github.io/blog/journal/llm-decoding-with-regex-constraints.html},
journal={Unsupervised Thoughts (blog)},
author={Tran-Thien, Vivien},
year={2024}
}

This blog post was initially published on my personal blog.