Open Australian Legal LLM ⚖️

The Open Australian Legal LLM is the largest open source language model trained on Australian law.

The model has over 1.5 billion parameters and was trained on roughly 218,000 laws, regulations and decisions across six Australian jurisdictions from the Open Australian Legal Corpus. Its size, together with the richness and quality of that training data, makes it well suited for finetuning on a diverse range of downstream natural language processing tasks applied to the Australian legal domain, including text generation, text completion and question answering.

To ensure its accessibility to as wide an audience as possible, the model is issued under the Apache Licence 2.0.

Usage 👩‍💻

The code snippet below demonstrates just one of the many ways in which the model may be accessed:

>>> from transformers import pipeline, set_seed

>>> set_seed(42) # We set a seed for reproducibility.
>>> generator = pipeline('text-generation', model='umarbutler/open-australian-legal-llm')

>>> response = generator('Section 51 of the Constitution provides', max_length=55)
>>> print(response[0]['generated_text'])

Creation 🧪

The following cleaning procedures were applied to all 218,340 laws, regulations and decisions in version 4.2.0 of the Open Australian Legal Corpus:

  1. Non-breaking spaces were replaced with regular spaces;
  2. Carriage returns followed by newlines were replaced with newlines;
  3. Whitespace was removed from lines consisting entirely of whitespace;
  4. Newlines and whitespace preceding newlines were removed from the end of texts;
  5. Newlines and whitespace succeeding newlines were removed from the beginning of texts; and
  6. Spaces and tabs were removed from the end of lines.
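
The six steps above can be sketched as a single function. This is a minimal illustration under one reading of the list, not the author's actual cleaning code: the function name `clean_text` and the specific regular expressions are assumptions, and steps 4 and 5 in particular could reasonably be implemented differently.

```python
import re


def clean_text(text: str) -> str:
    """Hypothetical helper applying the six cleaning steps in the listed order."""
    # 1. Replace non-breaking spaces with regular spaces.
    text = text.replace("\u00a0", " ")
    # 2. Replace carriage returns followed by newlines with newlines.
    text = text.replace("\r\n", "\n")
    # 3. Empty out lines that consist entirely of whitespace.
    text = re.sub(r"^[^\S\n]+$", "", text, flags=re.MULTILINE)
    # 4. Remove trailing newlines, and any whitespace before them, from the end.
    text = re.sub(r"\s*\n\Z", "", text)
    # 5. Remove leading newlines, and any whitespace after them, from the start.
    text = re.sub(r"\A\n\s*", "", text)
    # 6. Remove spaces and tabs from the end of every line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    return text


raw = "\n  \nSection\u00a01  \r\n\t\r\nText\t \n \n"
print(repr(clean_text(raw)))  # 'Section 1\n\nText'
```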

After cleaning, texts with fewer than 128 characters and those with duplicate XXH3 128-bit hashes were removed, leaving 218,207 documents. These documents were then used to train a GPT2-like tokenizer, after which they were split into blocks 512 tokens long, with the tokenizer's end-of-sequence token ('<|endoftext|>') used both as a delimiter and to pad the end of the final block. An attention mask was applied to the end-of-sequence tokens used as padding, barring the first such token. The resulting blocks were then randomly shuffled and split into a training dataset of 1,966,867 chunks and a validation dataset of 218,541 chunks.
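
The filtering and chunking described above might look roughly like the sketch below. Several things here are assumptions made for illustration: the function names are hypothetical, documents are assumed to be packed into one continuous token stream, and `hashlib.blake2b` with a 16-byte digest stands in for XXH3 128-bit so that the example needs only the standard library (the `xxhash` package would be the closer match).

```python
import hashlib


def filter_texts(texts, min_chars=128):
    """Drop texts shorter than min_chars and exact duplicates (by 128-bit hash)."""
    seen, kept = set(), []
    for text in texts:
        # Stand-in hash: BLAKE2b truncated to 128 bits, NOT the XXH3 used for the model.
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=16).digest()
        if len(text) >= min_chars and digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept


def pack_into_blocks(docs, eos_id, block_size=512):
    """Pack tokenised documents into fixed-size blocks with EOS as the delimiter."""
    stream = []
    for i, doc in enumerate(docs):
        if i > 0:
            stream.append(eos_id)  # EOS separates consecutive documents.
        stream.extend(doc)

    blocks = []
    for start in range(0, len(stream), block_size):
        ids = stream[start:start + block_size]
        mask = [1] * len(ids)
        n_pad = block_size - len(ids)
        if n_pad:
            # Pad with EOS; only the first padding token remains attended to.
            ids = ids + [eos_id] * n_pad
            mask = mask + [1] + [0] * (n_pad - 1)
        blocks.append({"input_ids": ids, "attention_mask": mask})
    return blocks


# Toy example with a block size of 4 and 0 as the end-of-sequence id.
for block in pack_into_blocks([[1, 2, 3], [4, 5]], eos_id=0, block_size=4):
    print(block)
```

Leaving the first padding token unmasked lets the model still learn that the final document ends with an end-of-sequence token, while the remaining padding is ignored.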

GPT2-XL was used as a base model. Input embeddings for tokens shared between the vocabulary trained on the Corpus and that of GPT2 were preserved but moved to their new positions. Embeddings for tokens unique to the new vocabulary were set to the average embedding weights.
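
The embedding transfer can be illustrated with plain Python lists. This is a sketch only: `remap_embeddings` is a hypothetical name, vocabularies are assumed to map tokens to contiguous ids starting at 0, and "the average embedding weights" is assumed to mean the mean of all rows of the original embedding matrix.

```python
def remap_embeddings(old_vocab, old_embeddings, new_vocab):
    """Build an embedding matrix for new_vocab from the old model's embeddings."""
    dim = len(old_embeddings[0])
    # Mean of every row in the old matrix, used for tokens the old model never saw.
    mean = [sum(row[d] for row in old_embeddings) / len(old_embeddings) for d in range(dim)]

    new_embeddings = [list(mean) for _ in range(len(new_vocab))]
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            # Shared token: keep its embedding, but at its new position.
            new_embeddings[new_id] = list(old_embeddings[old_id])
    return new_embeddings


old_vocab = {"the": 0, "law": 1}
old_embeddings = [[1.0, 3.0], [3.0, 5.0]]
new_vocab = {"law": 0, "tribunal": 1, "the": 2}
print(remap_embeddings(old_vocab, old_embeddings, new_vocab))
# [[3.0, 5.0], [2.0, 4.0], [1.0, 3.0]]
```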

The model was trained with the following hyperparameters for the first 100,290 steps:

| Hyperparameter | Value |
| --- | --- |
| Sequence length | 512 |
| Epochs | 1 |
| Optimiser | AdamW |
| Learning rate | 1e-4 |
| Learning rate scheduler | Linear with warmup |
| Batch size | 6 |
| Weight decay | 0.01 |
| Warmup ratio | 0.06 |
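
For readers unfamiliar with a linear schedule with warmup, the sketch below shows one common formulation: the learning rate rises linearly from zero to its peak over the warmup steps and then decays linearly to zero. The function name and the `total_steps` value in the example are hypothetical, and the exact rounding of the warmup step count varies between libraries, so this should not be read as reproducing the training run.

```python
def linear_warmup_lr(step, total_steps, peak_lr=1e-4, warmup_ratio=0.06):
    """Learning rate at a given step under a linear warmup then linear decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale up linearly from 0 to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale down linearly from peak_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Illustrative only: 1,000 total steps gives 60 warmup steps at a ratio of 0.06.
for step in (0, 30, 60, 530, 1000):
    print(step, linear_warmup_lr(step, total_steps=1000))
```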

After training on two RTX A6000s for ~120,050 steps over a period of 91 hours, the vast.ai instance hosting the model crashed. Fortunately, a checkpoint had been saved at step 100,290 (~60% of an epoch), although the optimiser's state was mistakenly not downloaded. The model was subsequently moved to a new instance where it was trained on an L40 for a further 133,711 steps (~40% of an epoch) with the following hyperparameters (changes emphasised):

| Hyperparameter | Value |
| --- | --- |
| Sequence length | 512 |
| Epochs | 1 |
| Optimiser | AdamW |
| Learning rate | **4.255e-5** |
| Learning rate scheduler | **Linear** |
| Batch size | **3** |
| Weight decay | 0.01 |
| Warmup ratio | **0.00** |

Naturally, as the optimiser state had been lost, the model's learning rate descended more slowly than it had previously. Nevertheless, after completing an epoch of training, the model achieved a validation loss of 2.04.

Benchmarks 📊

Tested against version 2.0.0 of the Open Australian Legal QA dataset, the model achieved a perplexity of 8.01, outperforming all known language models for Australian law.

| Model | Parameters | Perplexity |
| --- | --- | --- |
| Open Australian Legal LLM | 1.5B | 8.01 |
| Open Australian Legal Phi 1.5 | 1.3B | 8.69 |
| Open Australian Legal GPT2 | 124M | 16.37 |
| Open Australian Legal DistilGPT2 | 88.2M | 23.9 |
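
As a reminder of what the metric measures, perplexity is the exponential of the average negative log-likelihood a model assigns to each token, so lower is better. The sketch below shows that standard definition only; the function name is hypothetical and the exact evaluation procedure used for the figures above (windowing, tokenisation, averaging across documents) is not described here.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns every token a probability of 1/4 has a perplexity of about 4.
print(perplexity([math.log(0.25)] * 4))
```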

Limitations 🚧

Although the model has not been tested for bias, one would expect it to exhibit many, if not all, of the biases of GPT2-XL.

One might also expect the model to exhibit a bias towards the type of language employed in laws, regulations and decisions (its source material) as well as towards Commonwealth and New South Wales law (the largest sources of documents in the Open Australian Legal Corpus at the time of the model's creation).

Finally, it is worth noting that the model may lack knowledge of Victorian, Northern Territory and Australian Capital Territory law, as licensing restrictions prevented their inclusion in the training data.

Licence 📜

To ensure its accessibility to as wide an audience as possible, the model is issued under the Apache Licence 2.0.

Citation 🔖

If you've relied on the model for your work, please cite:

@misc{butler-2023-open-australian-legal-llm,
    author = {Butler, Umar},
    year = {2023},
    title = {Open Australian Legal LLM},
    publisher = {Hugging Face},
    version = {1.0.0},
    url = {https://huggingface.co/umarbutler/open-australian-legal-llm}
}

Acknowledgements 🙏

In the spirit of reconciliation, the author acknowledges the Traditional Custodians of Country throughout Australia and their connections to land, sea and community. He pays his respect to their Elders past and present and extends that respect to all Aboriginal and Torres Strait Islander peoples today.

The author thanks the sources of the Open Australian Legal Corpus for making their data available under open licences.

The author also acknowledges the developers of the many Python libraries relied upon in the training of the model, as well as the makers of GPT2, which the model was built atop.

Finally, the author is eternally grateful for the endless support of his wife and her willingness to put up with many a late night spent writing code and quashing bugs.
