⚠️ This model has been superseded by the Open Australian Legal LLM, the largest open source language model trained on Australian law. You are encouraged to use that model instead. ⚠️

Open Australian Legal Phi-1.5 ⚖️

Open Australian Legal Phi-1.5 is an open source Phi-1.5 model trained on Australian law.

Naturally, as a finetune of Phi-1.5, the model may be used for any of the tasks for which Phi-1.5 is suitable, including text generation, text completion and question answering.

Trained on roughly 45,000 laws, regulations and decisions, comprising 422,373,888 tokens, taken from the Open Australian Legal Corpus, the model is intended specifically to be finetuned for downstream natural language processing tasks applied to the Australian legal domain.

The model is issued under the same licence as its parent model, namely the Microsoft Research License.

Usage 👩‍💻

The code snippet below demonstrates just one of the many ways in which the model may be accessed:

>>> from transformers import set_seed, AutoModelForCausalLM, AutoTokenizer, pipeline

>>> set_seed(42) # We set a seed for reproducibility.
>>> model = AutoModelForCausalLM.from_pretrained('umarbutler/open-australian-legal-phi-1_5', trust_remote_code=True) # `trust_remote_code=True` is required to load Phi 1.5.
>>> tokenizer = AutoTokenizer.from_pretrained('umarbutler/open-australian-legal-phi-1_5')
>>> generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
>>> generator('Section 51 of the Constitution provides', max_length=24)
[{'generated_text': 'Section 51 of the Constitution provides that the Parliament may make laws for the peace, order and good government of the Commonwealth.'}]

Creation 🧪

50,000 laws, regulations and decisions were randomly sampled from the Open Australian Legal Corpus, excluding duplicate texts and documents that, when stripped of leading and trailing whitespace, were less than 128 characters long. The following cleaning procedures were then applied:

Non-breaking spaces were replaced with regular spaces;
Carriage returns followed by newlines were replaced with newlines;
Whitespace was removed from lines consisting entirely of whitespace;
Newlines and whitespace preceding newlines were removed from the end of texts;
Newlines and whitespace succeeding newlines were removed from the beginning of texts; and
Spaces and tabs were removed from the end of lines.
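
As a rough illustration only, the six cleaning steps listed above could be sketched in Python as follows. The function name `clean_text` and the specific regular expressions are assumptions made for this example and are not taken from the original training code; in particular, the handling of the beginning and end of texts is one literal reading of the wording above.

```python
import re


def clean_text(text: str) -> str:
    """Apply the six cleaning steps in the order they are listed (hypothetical helper)."""
    # 1. Replace non-breaking spaces with regular spaces.
    text = text.replace("\u00a0", " ")
    # 2. Replace carriage return + newline pairs with a single newline.
    text = text.replace("\r\n", "\n")
    # 3. Empty out lines that consist entirely of (non-newline) whitespace.
    text = re.sub(r"^[^\S\n]+$", "", text, flags=re.MULTILINE)
    # 4. Remove trailing newlines, plus any whitespace preceding them, from the end of the text.
    text = re.sub(r"\s*\n\Z", "", text)
    # 5. Remove leading newlines, plus any whitespace following them, from the start of the text.
    text = re.sub(r"\A\n\s*", "", text)
    # 6. Remove spaces and tabs from the end of every line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    return text


raw = "\r\n \u00a0\r\n  Section 1\u00a0 \t\r\n   \r\nBody\t \r\n\r\n"
print(repr(clean_text(raw)))
```

Under these assumptions, internal blank lines are preserved while surrounding blank lines and trailing spaces are dropped.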

After cleaning, the documents were packed into blocks of 512 tokens each, with Phi-1.5's end-of-sequence token ('<|endoftext|>') used both as a delimiter between documents and as padding for the end of the final block. These blocks were then randomly shuffled and split into a training dataset of 742,454 blocks and a validation dataset of 82,495 blocks, or 380,136,448 and 42,237,440 tokens, respectively.
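
The packing step might look something like the sketch below. It assumes that documents were concatenated into a single token stream with the end-of-sequence token appended after each document, which is only one plausible reading of the description; `pack_into_blocks` and its parameters are hypothetical names, and plain integers stand in for real token IDs.

```python
from typing import Iterable


def pack_into_blocks(
    docs: Iterable[list[int]], eos_id: int, block_size: int = 512
) -> list[list[int]]:
    """Concatenate tokenised documents and cut the stream into fixed-size blocks."""
    stream: list[int] = []
    for token_ids in docs:
        stream.extend(token_ids)
        stream.append(eos_id)  # Assumption: EOS is appended after every document.

    blocks = [stream[i : i + block_size] for i in range(0, len(stream), block_size)]

    # Pad the final block with EOS so that every block has the same length.
    if blocks and len(blocks[-1]) < block_size:
        blocks[-1] = blocks[-1] + [eos_id] * (block_size - len(blocks[-1]))
    return blocks


# Toy example with a block size of 4 and an EOS ID of 0.
print(pack_into_blocks([[1, 2, 3], [4, 5], [6]], eos_id=0, block_size=4))
# → [[1, 2, 3, 0], [4, 5, 0, 6], [0, 0, 0, 0]]

# The reported block counts are consistent with the reported token counts.
assert 742_454 * 512 == 380_136_448
assert 82_495 * 512 == 42_237_440
```

Because every block is exactly 512 tokens long, the token counts quoted above follow directly from the block counts.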

The training dataset was subsequently fed to Phi-1.5 with the following hyperparameters:

Hyperparameter	Value
Sequence length	512
Epochs	1
Optimiser	AdamW
Learning rate	2e-5
Learning rate scheduler	Linear with warmup
Batch size per device	4
Weight decay	0.1
Warmup ratio	0.03
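
To make the "linear with warmup" schedule concrete, here is a minimal pure-Python sketch of how the learning rate would evolve. It assumes that the value 0.03 in the table is a fraction of the total number of steps rather than an absolute step count, and it uses the 185,614 steps reported for the training run; the function `linear_warmup_lr` is hypothetical and is not the scheduler implementation actually used.

```python
import math


def linear_warmup_lr(
    step: int, total_steps: int, base_lr: float = 2e-5, warmup_ratio: float = 0.03
) -> float:
    """Learning rate at `step`: linear ramp up to `base_lr`, then linear decay to zero."""
    # Assumption: 0.03 is a warmup fraction, converted to steps by rounding up.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 185_614
for step in (0, 2_000, 5_569, 100_000, total_steps):
    print(step, linear_warmup_lr(step, total_steps))
```

Under this reading, the learning rate peaks at 2e-5 after 5,569 steps and then decays linearly to zero by the final step.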

After training for 1 epoch, or 185,614 steps, over a period of ~16 hours on a single GeForce RTX 4090, the model achieved a validation loss of 2.21.
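
The figures above can be sanity-checked with a little arithmetic. The sketch below assumes a single device with no gradient accumulation, and that the validation loss is a mean per-token cross-entropy measured in nats; neither assumption is stated explicitly in this card, and the throughput figure is only as precise as the "~16 hours" estimate.

```python
import math

train_blocks = 742_454      # Training blocks of 512 tokens each.
batch_size = 4              # Batch size per device (one device assumed).
train_tokens = 380_136_448  # Tokens in the training dataset.

# One optimiser step per batch, with a smaller final batch.
steps = math.ceil(train_blocks / batch_size)
print(steps)  # → 185614

# Perplexity implied by the validation loss, if the loss is in nats per token.
perplexity = math.exp(2.21)
print(round(perplexity, 2))

# Approximate training throughput over roughly 16 hours.
tokens_per_second = train_tokens / (16 * 3600)
print(round(tokens_per_second))
```

The step count matches the 185,614 steps reported, which is consistent with an effective batch size of 4 blocks per step.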

Limitations 🚧

Although the model has not been tested for bias, one would expect it to exhibit many, if not all, of the same biases as Phi-1.5.

One might also expect the model to exhibit a bias towards the type of language employed in laws, regulations and decisions (its source material) as well as towards Commonwealth and New South Wales law (the largest sources of documents in the Open Australian Legal Corpus at the time of the model's creation).

Finally, it is worth noting that the model may lack knowledge of Victorian, Northern Territory and Australian Capital Territory law, as licensing restrictions prevented their inclusion in the training data.

Licence 📜

The model is issued under the same licence as its parent model, namely the Microsoft Research License.

Citation 🔖

If you've relied on the model for your work, please cite:

@misc{butler-2023-open-australian-legal-phi-1.5,
    author = {Butler, Umar},
    year = {2023},
    title = {Open Australian Legal Phi-1.5},
    publisher = {Hugging Face},
    version = {1.0.0},
    url = {https://huggingface.co/umarbutler/open-australian-legal-phi-1_5}
}

Acknowledgements 🙏

In the spirit of reconciliation, the author acknowledges the Traditional Custodians of Country throughout Australia and their connections to land, sea and community. He pays his respect to their Elders past and present and extends that respect to all Aboriginal and Torres Strait Islander peoples today.

The author thanks the sources of the Open Australian Legal Corpus for making their data available under open licences.

The author also acknowledges the developers of the many Python libraries relied upon in the training of the model, as well as the makers of Phi-1.5, which the model was built atop.

Finally, the author is eternally grateful for the endless support of his wife and her willingness to put up with many a late night spent writing code and quashing bugs.
