---
language:
- en
license: other
license_name: microsoft-research-license
license_link: https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx
library_name: transformers
base_model: microsoft/phi-1_5
tags:
- law
- legal
- australia
- generated_from_trainer
datasets:
- umarbutler/open-australian-legal-corpus
inference: false
---
⚠️ This model has been superseded by the [Open Australian Legal LLM](https://huggingface.co/umarbutler/open-australian-legal-llm), the largest open source language model trained on Australian law. You are encouraged to use that model instead. ⚠️

# Open Australian Legal Phi-1.5 ⚖️
Open Australian Legal Phi-1.5 is an open source [Phi-1.5](https://huggingface.co/microsoft/phi-1_5) model trained on Australian law.

Naturally, as a finetune of [Phi-1.5](https://huggingface.co/microsoft/phi-1_5), the model may be used for any of the tasks for which [Phi-1.5](https://huggingface.co/microsoft/phi-1_5) is suitable, including text generation, text completion and question answering.

Trained on roughly 45,000 laws, regulations and decisions, comprising 422,373,888 tokens, taken from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus), the model is intended specifically to be finetuned for downstream natural language processing tasks applied to the Australian legal domain.

The model is issued under the same licence as its parent model, namely the [Microsoft Research License](https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx).

## Usage 👩‍💻
The code snippet below demonstrates just one of the many ways in which the model may be accessed:
```python
>>> from transformers import set_seed, AutoModelForCausalLM, AutoTokenizer, pipeline

>>> set_seed(42) # We set a seed for reproducibility.
>>> model = AutoModelForCausalLM.from_pretrained('umarbutler/open-australian-legal-phi-1_5', trust_remote_code=True) # `trust_remote_code=True` is required to load Phi 1.5.
>>> tokenizer = AutoTokenizer.from_pretrained('umarbutler/open-australian-legal-phi-1_5')
>>> generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
>>> generator('Section 51 of the Constitution provides', max_length=24)
[{'generated_text': 'Section 51 of the Constitution provides that the Parliament may make laws for the peace, order and good government of the Commonwealth.'}]
```

## Creation 🧪
50,000 laws, regulations and decisions were randomly sampled from the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus), excluding duplicate texts and documents that, when stripped of leading and trailing whitespace, were less than 128 characters long. The following cleaning procedures were then applied:
1. Non-breaking spaces were replaced with regular spaces;
1. Return carriages followed by newlines were replaced with newlines;
1. Whitespace was removed from lines comprised entirely of whitespace;
1. Newlines and whitespace preceding newlines were removed from the end of texts;
1. Newlines and whitespace succeeding newlines were removed from the beginning of texts; and
1. Spaces and tabs were removed from the end of lines.
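The cleaning steps above can be sketched as a single Python function. This is illustrative only: the function name and the exact regular expressions are assumptions, not the author's published code.

```python
import re

def clean(text: str) -> str:
    """A sketch of the six cleaning steps described above, applied in order."""
    text = text.replace('\xa0', ' ')                    # 1. non-breaking spaces -> regular spaces
    text = text.replace('\r\n', '\n')                   # 2. carriage return + newline -> newline
    text = re.sub(r'^[ \t]+$', '', text, flags=re.M)    # 3. blank out whitespace-only lines
    text = re.sub(r'\s+$', '', text)                    # 4. strip trailing newlines/whitespace
    text = re.sub(r'^\s+', '', text)                    # 5. strip leading newlines/whitespace
    text = re.sub(r'[ \t]+$', '', text, flags=re.M)     # 6. strip trailing spaces/tabs per line
    return text
```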

After cleaning, the documents were packed into blocks of 512 tokens, with [Phi-1.5](https://huggingface.co/microsoft/phi-1_5)'s end-of-sequence token ('<|endoftext|>') used both as a delimiter between documents and to pad the end of the final block. These blocks were then randomly shuffled and split into a training dataset of 742,454 blocks and a validation dataset of 82,495 blocks, or 380,136,448 and 42,237,440 tokens, respectively.
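The blocking step can be sketched as below. The function and parameter names are illustrative; in practice the tokeniser would be Phi-1.5's and `eos_id` the id of its '<|endoftext|>' token.

```python
def pack(documents, tokenize, eos_id, block_size=512):
    """Concatenate tokenised documents, delimited by the end-of-sequence token,
    then split the stream into fixed-size blocks, padding the last with eos_id."""
    ids = []
    for doc in documents:
        ids.extend(tokenize(doc))
        ids.append(eos_id)  # delimit each document with the end-of-sequence token
    if len(ids) % block_size:  # pad the final partial block
        ids.extend([eos_id] * (block_size - len(ids) % block_size))
    return [ids[i:i + block_size] for i in range(0, len(ids), block_size)]
```

Note that the block counts are consistent with the reported token counts: 742,454 × 512 = 380,136,448 and 82,495 × 512 = 42,237,440.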

The training dataset was subsequently used to train [Phi-1.5](https://huggingface.co/microsoft/phi-1_5) with the following hyperparameters:
| Hyperparameter | Value |
| --- | --- |
| Sequence length | 512 |
| Epochs | 1 |
| Optimiser | AdamW |
| Learning rate | 2e-5 |
| Learning rate scheduler | Linear with warmup |
| Batch size per device | 4 |
| Weight decay | 0.1 |
| Warmup ratio | 0.03 |

After training for 1 epoch, or 185,614 steps, over a period of ~16 hours on a single GeForce RTX 4090, the model achieved a validation loss of 2.21.
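As a sanity check on the figures above, the reported step count follows directly from the block count and batch size (assuming one optimiser step per batch of 4 blocks on the single GPU, with no gradient accumulation):

```python
import math

train_blocks = 742_454  # training blocks of 512 tokens
batch_size = 4          # per-device batch size, single device
steps = math.ceil(train_blocks / batch_size)
print(steps)            # the reported 185,614 steps for 1 epoch
```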

## Limitations 🚧
Although the model has not been tested for bias, one would expect it to exhibit many, if not all, of the biases of [Phi-1.5](https://huggingface.co/microsoft/phi-1_5).

One might also expect the model to exhibit a bias towards the type of language employed in laws, regulations and decisions (its source material) as well as towards Commonwealth and New South Wales law (the largest sources of documents in the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) at the time of the model's creation).

Finally, it is worth noting that the model may lack knowledge of Victorian, Northern Territory and Australian Capital Territory law, as licensing restrictions prevented their inclusion in the training data.

## Licence 📜
The model is issued under the same licence as its parent model, namely the [Microsoft Research License](https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx).

## Citation 🔖
If you've relied on the model for your work, please cite:
```bibtex
@misc{butler-2023-open-australian-legal-phi-1.5,
    author = {Butler, Umar},
    year = {2023},
    title = {Open Australian Legal Phi-1.5},
    publisher = {Hugging Face},
    version = {1.0.0},
    url = {https://huggingface.co/umarbutler/open-australian-legal-phi-1_5}
}
```

## Acknowledgements 🙏
In the spirit of reconciliation, the author acknowledges the Traditional Custodians of Country throughout Australia and their connections to land, sea and community. He pays his respect to their Elders past and present and extends that respect to all Aboriginal and Torres Strait Islander peoples today.

The author thanks the sources of the [Open Australian Legal Corpus](https://huggingface.co/datasets/umarbutler/open-australian-legal-corpus) for making their data available under open licences.

The author also acknowledges the developers of the many Python libraries relied upon in the training of the model, as well as the makers of [Phi-1.5](https://huggingface.co/microsoft/phi-1_5), which the model was built atop.

Finally, the author is eternally grateful for the endless support of his wife and her willingness to put up with many a late night spent writing code and quashing bugs.