---
language:
- en
license: apache-2.0
library_name: transformers
base_model: gpt2-xl
tags:
- law
- legal
- australia
- generated_from_trainer
datasets:
- umarbutler/open-australian-legal-corpus
widget:
- text: 'Under the Crimes Act'
- text: 'A restraint of trade is'
- text: 'Section 51 of the Constitution provides'
- text: "'Unsatisfactory professional conduct' includes"
metrics:
- perplexity
model-index:
- name: open-australian-legal-llm
  results:
  - task:
      type: text-generation
      name: Text generation
    dataset:
      type: umarbutler/open-australian-legal-qa
      name: Open Australian Legal QA
      split: train
      revision: b53a24f8edf5eb33d033a53b5b53d0a4a220d4ae
    metrics:
    - type: perplexity
      value: 8.015031389864035
      name: Perplexity
    source:
      name: lmppl
      url: https://github.com/asahi417/lmppl
---
# Open Australian Legal LLM ⚖️
The Open Australian Legal LLM is the largest open source language model trained on Australian law.

The model has over 1.5 billion parameters. Its size, together with the richness and quality of its training data (roughly 70,000 laws, regulations and decisions across six Australian jurisdictions from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus)), makes it well suited for finetuning on a diverse range of downstream natural language processing tasks applied to the Australian legal domain, including text generation, text completion and question answering.

To ensure its accessibility to as wide an audience as possible, the model is issued under the [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0.html).

## Usage 👩‍💻
The code snippet below demonstrates just one of the many ways in which the model may be accessed:
```python
>>> from transformers import pipeline, set_seed
>>> set_seed(42) # We set a seed for reproducibility.
>>> generator = pipeline('text-generation', model='umarbutler/open-australian-legal-llm')
>>> response = generator('Section 51 of the Constitution provides', max_length=55)
>>> print(response[0]['generated_text'])
```

## Creation 🧪
The following cleaning procedures were applied to all 218,340 laws, regulations and decisions in version 4.2.0 of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus):
1. Non-breaking spaces were replaced with regular spaces;
1. Carriage returns followed by newlines were replaced with newlines;
1. Whitespace was removed from lines comprised entirely of whitespace;
1. Newlines and whitespace preceding newlines were removed from the end of texts;
1. Newlines and whitespace succeeding newlines were removed from the beginning of texts; and
1. Spaces and tabs were removed from the end of lines.

After cleaning, texts with fewer than 128 characters and those with duplicate XXH3 128-bit hashes were removed, leaving 218,207 documents. These documents were then used to pretrain a [GPT2](https://huggingface.co/gpt2-xl)-like tokenizer, after which they were split into blocks 512 tokens long, with the tokenizer's end-of-sequence token (`<|endoftext|>`) being used as a delimiter as well as to pad the end of the final block. An attention mask was applied to the end-of-sequence tokens used as padding, barring the first such token.

The resulting blocks were subsequently randomly shuffled and split into a training dataset of 1,966,867 chunks and a validation dataset of 218,541.

[GPT2-XL](https://huggingface.co/gpt2-xl) was used as a base model. Input embeddings for tokens shared between the vocabulary trained on the Corpus and that of [GPT2](https://huggingface.co/gpt2-xl) were preserved but moved to their new positions. Embeddings for unique tokens were set to the average embedding weights.

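
The cleaning, filtering and embedding steps described in this section can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the author's actual code: the function names are hypothetical, `hashlib.blake2b` with a 16-byte digest stands in for the XXH3 128-bit hash (which needs a third-party package), the "average embedding weights" are assumed to be the mean over all rows of the base model's embedding matrix, and embeddings are shown as nested lists rather than tensors.

```python
import hashlib
import re


def clean_text(text: str) -> str:
    """Apply the six cleaning steps from the list above, in order."""
    text = text.replace("\u00a0", " ")  # 1. non-breaking spaces -> regular spaces
    text = text.replace("\r\n", "\n")   # 2. carriage return + newline -> newline
    # 3. empty out lines that consist entirely of (non-newline) whitespace
    text = re.sub(r"^[^\S\n]+$", "", text, flags=re.MULTILINE)
    # 4. strip newlines, and whitespace preceding them, from the end of the text
    text = re.sub(r"(?:[^\S\n]*\n)+\Z", "", text)
    # 5. strip newlines, and whitespace succeeding them, from the start of the text
    text = re.sub(r"\A(?:\n[^\S\n]*)+", "", text)
    # 6. strip spaces and tabs from the end of every line
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    return text


def filter_and_deduplicate(texts, min_chars=128):
    """Drop texts shorter than min_chars and texts whose hash was already seen."""
    seen = set()
    kept = []
    for text in texts:
        if len(text) < min_chars:
            continue
        # Stand-in for XXH3-128 (assumption): any 128-bit content hash works here.
        digest = hashlib.blake2b(text.encode("utf-8"), digest_size=16).digest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    return kept


def remap_embeddings(old_vocab, old_embeddings, new_vocab):
    """Move shared tokens' embeddings to their new ids; new tokens get the mean row.

    old_vocab and new_vocab map token strings to row indices.
    """
    dim = len(old_embeddings[0])
    mean = [sum(row[d] for row in old_embeddings) / len(old_embeddings) for d in range(dim)]
    new_embeddings = [list(mean) for _ in new_vocab]
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            new_embeddings[new_id] = list(old_embeddings[old_id])
    return new_embeddings
```
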
The model was trained with the following hyperparameters for the first 100,290 steps:
| Hyperparameter | Value |
| --- | --- |
| Sequence length | 512 |
| Epochs | 1 |
| Optimiser | AdamW |
| Learning rate | 1e-4 |
| Learning rate scheduler | Linear with warmup |
| Batch size | 6 |
| Weight decay | 0.01 |
| Warmup ratio | 0.06 |

After training on two RTX A6000s for \~120,050 steps over a period of 91 hours, the [vast.ai](https://vast.ai) instance hosting the model crashed. Fortunately, a checkpoint had been saved at step 100,290 (\~60% of an epoch), although the optimiser's state was mistakenly not downloaded. The model was subsequently moved to a new instance where it was trained on an L40 for a further 133,711 steps (\~40% of an epoch) with the following hyperparameters (changes emphasised):
| Hyperparameter | Value |
| --- | --- |
| Sequence length | 512 |
| Epochs | 1 |
| Optimiser | AdamW |
| Learning rate | *4.255e-5* |
| Learning rate scheduler | *Linear* |
| Batch size | *3* |
| Weight decay | 0.01 |
| Warmup ratio | *0.00* |

Naturally, as the optimiser state had been lost, the model's learning rate descended more slowly than it had previously. Nevertheless, after completing an epoch of training, the model was able to achieve a validation loss of 2.04.

## Benchmarks 📊
Tested against version 2.0.0 of the [Open Australian Legal QA](https://huggingface.co/datasets/umarbutler/open-australian-legal-qa) dataset, the model achieved a perplexity of 8.01, outperforming all known language models for Australian law.

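
For readers unfamiliar with the metric, perplexity is the exponential of the mean per-token negative log-likelihood, so lower is better. The sketch below shows that definition only; it is not the `lmppl` implementation used for the figures in this card, and the function name and inputs are hypothetical. Under the assumption that a loss is a mean per-token cross-entropy in nats, the same formula converts that loss into a perplexity on the data it was measured on.

```python
import math


def perplexity(token_log_probs):
    """Return exp(-mean(log p)) for a sequence of natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("at least one token log-probability is required")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that assigns probability 0.25 to every token is, in effect,
# choosing uniformly among four options at each step.
uniform_over_four = perplexity([math.log(0.25)] * 4)
```
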
| Model | Parameters | Perplexity |
| --- | --- | --- |
| **Open Australian Legal LLM** | **1.5B** | **8.01** |
| [Open Australian Legal Phi 1.5](https://huggingface.co/umarbutler/open-australian-legal-phi-1_5) | 1.3B | 8.69 |
| [Open Australian Legal GPT2](https://huggingface.co/umarbutler/open-australian-legal-gpt2) | 124M | 16.37 |
| [Open Australian Legal DistilGPT2](https://huggingface.co/umarbutler/open-australian-legal-distilgpt2) | 88.2M | 23.9 |

## Limitations 🚧
Although the model has not been tested for bias, one would expect it to exhibit many, if not all, of the same biases as [GPT2-XL](https://huggingface.co/gpt2-xl).

One might also expect the model to exhibit a bias towards the type of language employed in laws, regulations and decisions (its source material) as well as towards Commonwealth and New South Wales law (the largest sources of documents in the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) at the time of the model's creation).

Finally, it is worth noting that the model may lack knowledge of Victorian, Northern Territory and Australian Capital Territory law as licensing restrictions prevented their inclusion in the training data.

## Licence 📜
To ensure its accessibility to as wide an audience as possible, the model is issued under the [Apache Licence 2.0](https://www.apache.org/licenses/LICENSE-2.0.html).

## Citation 🔖
If you've relied on the model for your work, please cite:
```bibtex
@misc{butler-2023-open-australian-legal-llm,
    author = {Butler, Umar},
    year = {2023},
    title = {Open Australian Legal LLM},
    publisher = {Hugging Face},
    version = {1.0.0},
    url = {https://huggingface.co/umarbutler/open-australian-legal-llm}
}
```

## Acknowledgements 🙏
In the spirit of reconciliation, the author acknowledges the Traditional Custodians of Country throughout Australia and their connections to land, sea and community.
He pays his respect to their Elders past and present and extends that respect to all Aboriginal and Torres Strait Islander peoples today.

The author thanks the sources of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) for making their data available under open licences.

The author also acknowledges the developers of the many Python libraries relied upon in the training of the model, as well as the makers of [GPT2](https://huggingface.co/gpt2-xl), which the model was built atop.

Finally, the author is eternally grateful for the endless support of his wife and her willingness to put up with many a late night spent writing code and quashing bugs.