language:
  - en
license: other
license_name: microsoft-research-license
license_link: https://huggingface.co/microsoft/phi-1_5/resolve/main/Research%20License.docx
library_name: transformers
base_model: microsoft/phi-1_5
tags:
  - law
  - legal
  - australia
  - generated_from_trainer
datasets:
  - umarbutler/open-australian-legal-corpus
inference: false
metrics:
  - perplexity
model-index:
  - name: open-australian-legal-llm
    results:
      - task:
          type: text-generation
          name: Text generation
        dataset:
          type: umarbutler/open-australian-legal-qa
          name: Open Australian Legal QA
          split: train
          revision: b53a24f8edf5eb33d033a53b5b53d0a4a220d4ae
        metrics:
          - type: perplexity
            value: 8.693482443009522
            name: Perplexity
        source:
          name: lmppl
          url: https://github.com/asahi417/lmppl

⚠️ This model has been superseded by the Open Australian Legal LLM, the largest open source language model trained on Australian law. You are encouraged to use that model instead. ⚠️

Open Australian Legal Phi-1.5 ⚖️

Open Australian Legal Phi-1.5 is an open source Phi-1.5 model trained on Australian law.

Naturally, as a finetune of Phi-1.5, the model may be used for any of the tasks for which Phi-1.5 is suitable, including text generation, text completion and question answering.

Trained on roughly 45,000 laws, regulations and decisions, comprising 422,373,888 tokens, taken from the Open Australian Legal Corpus, the model is intended specifically to be finetuned for downstream natural language processing tasks applied to the Australian legal domain.

The model is issued under the same licence as its parent model, namely the Microsoft Research License.

Usage 👩‍💻

The code snippet below demonstrates just one of the many ways in which the model may be accessed:

>>> from transformers import set_seed, AutoModelForCausalLM, AutoTokenizer, pipeline

>>> set_seed(42) # We set a seed for reproducibility.
>>> model = AutoModelForCausalLM.from_pretrained('umarbutler/open-australian-legal-phi-1_5', trust_remote_code=True) # `trust_remote_code=True` is required to load Phi 1.5.
>>> tokenizer = AutoTokenizer.from_pretrained('umarbutler/open-australian-legal-phi-1_5')
>>> generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
>>> generator('Section 51 of the Constitution provides', max_length=24)
[{'generated_text': 'Section 51 of the Constitution provides that the Parliament may make laws for the peace, order and good government of the Commonwealth.'}]

Creation 🧪

50,000 laws, regulations and decisions were randomly sampled from the Open Australian Legal Corpus, excluding duplicate texts and documents that, when stripped of leading and trailing whitespace, were fewer than 128 characters long. The following cleaning procedures were then applied:

  1. Non-breaking spaces were replaced with regular spaces;
  2. Carriage returns followed by newlines were replaced with newlines;
  3. Whitespace was removed from lines consisting entirely of whitespace;
  4. Newlines and whitespace preceding newlines were removed from the end of texts;
  5. Newlines and whitespace succeeding newlines were removed from the beginning of texts; and
  6. Spaces and tabs were removed from the end of lines.
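
The six steps above could be sketched in Python roughly as follows. This is a minimal illustration rather than the author's actual code: the function name `clean_text` and the exact regular expressions are assumptions, and the steps are simply applied in the order listed.

```python
import re


def clean_text(text: str) -> str:
    """Apply the six cleaning steps in order (hypothetical helper)."""
    # 1. Replace non-breaking spaces with regular spaces.
    text = text.replace('\u00a0', ' ')
    # 2. Replace carriage returns followed by newlines with newlines.
    text = text.replace('\r\n', '\n')
    # 3. Remove whitespace from lines consisting entirely of whitespace.
    text = re.sub(r'^[^\S\n]+$', '', text, flags=re.MULTILINE)
    # 4. Remove newlines, and whitespace preceding them, from the end of the text.
    text = re.sub(r'(?:[^\S\n]*\n)+\Z', '', text)
    # 5. Remove newlines, and whitespace succeeding them, from the beginning of the text.
    text = re.sub(r'\A(?:\n[^\S\n]*)+', '', text)
    # 6. Remove spaces and tabs from the end of lines.
    text = re.sub(r'[ \t]+$', '', text, flags=re.MULTILINE)
    return text


print(repr(clean_text('\n  Section\u00a01 \r\n   \r\nText\t\n\n')))
```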

After cleaning, the documents were added to blocks 512 tokens long, with Phi-1.5's end-of-sequence token ('<|endoftext|>') being used as a delimiter as well as to pad the end of the final block. These blocks were then randomly shuffled and split into a training dataset of 742,454 blocks and a validation dataset of 82,495 blocks, or 380,136,448 and 42,237,440 tokens, respectively.
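
As a rough illustration of the blocking and splitting described above, the sketch below packs already-tokenised documents into fixed-length blocks, using an end-of-sequence token ID both as a delimiter and as padding for the final block, then shuffles and splits the blocks. The function names, the placement of the delimiter after each document, the 10% validation fraction (inferred from the reported block counts), the seed and the toy token IDs are all assumptions; the author's actual implementation may differ.

```python
import random


def pack_into_blocks(documents, eos_id, block_size=512):
    """Concatenate token ID lists into blocks of `block_size` tokens.

    `eos_id` is appended after every document as a delimiter and is also
    used to pad the final, partially filled block (an assumed reading of
    the description above).
    """
    stream = []
    for tokens in documents:
        stream.extend(tokens)
        stream.append(eos_id)

    # Pad the stream so that its length is a multiple of `block_size`.
    remainder = len(stream) % block_size
    if remainder:
        stream.extend([eos_id] * (block_size - remainder))

    return [stream[i:i + block_size] for i in range(0, len(stream), block_size)]


def shuffle_and_split(blocks, validation_fraction=0.1, seed=42):
    """Shuffle blocks and split them into training and validation sets."""
    blocks = list(blocks)
    random.Random(seed).shuffle(blocks)
    n_validation = round(len(blocks) * validation_fraction)
    return blocks[n_validation:], blocks[:n_validation]


# Toy example with a block size of 4 and a hypothetical EOS ID of 0.
blocks = pack_into_blocks([[1, 2, 3], [4, 5], [6]], eos_id=0, block_size=4)
train_blocks, validation_blocks = shuffle_and_split(blocks)
```

With the 824,949 blocks implied by the figures above (742,454 + 82,495), a 10% validation fraction rounds to 82,495 blocks, which is consistent with the reported split.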

The training dataset was subsequently fed to Phi-1.5 with the following hyperparameters:

| Hyperparameter | Value |
| --- | --- |
| Sequence length | 512 |
| Epochs | 1 |
| Optimiser | AdamW |
| Learning rate | 2e-5 |
| Learning rate scheduler | Linear with warmup |
| Batch size per device | 4 |
| Weight decay | 0.1 |
| Warmup steps | 0.03 |
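
For intuition, a linear schedule with warmup could look like the sketch below. It assumes that the value 0.03 listed above is a warmup ratio (a fraction of the total training steps) rather than a step count, that the learning rate decays linearly to zero after warmup, and that training runs for the 185,614 steps reported in this card. The function name `learning_rate_at` is hypothetical.

```python
import math

PEAK_LEARNING_RATE = 2e-5
TOTAL_STEPS = 185_614  # Step count reported in this card.
WARMUP_RATIO = 0.03    # Assumption: the listed 0.03 is a ratio, not a step count.
WARMUP_STEPS = math.ceil(TOTAL_STEPS * WARMUP_RATIO)


def learning_rate_at(step: int) -> float:
    """Return the learning rate at `step` under linear warmup then linear decay."""
    if step < WARMUP_STEPS:
        return PEAK_LEARNING_RATE * step / WARMUP_STEPS
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LEARNING_RATE * remaining / (TOTAL_STEPS - WARMUP_STEPS)


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(f'step {step:>7,}: lr = {learning_rate_at(step):.2e}')
```

Under these assumptions the rate climbs from zero to 2e-5 over roughly the first 3% of training and then falls back to zero by the final step.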

After training for 1 epoch, or 185,614 steps, over a period of ~16 hours on a single GeForce RTX 4090, the model achieved a validation loss of 2.21.
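
The step count and the reported loss can be sanity-checked with a little arithmetic. The sketch below assumes no gradient accumulation (so that one step consumes one batch of 4 blocks on the single GPU) and that the validation loss is a mean per-token cross-entropy in nats, in which case its exponential is the corresponding perplexity on the validation blocks. That figure is measured on a different dataset from the Open Australian Legal QA perplexity listed in the metadata, so the two are not directly comparable.

```python
import math

TRAINING_BLOCKS = 742_454
BATCH_SIZE_PER_DEVICE = 4
DEVICES = 1  # Single GPU per this card; no gradient accumulation is an assumption.
VALIDATION_LOSS = 2.21

# One optimiser step consumes one batch; the last batch may be partially filled.
steps_per_epoch = math.ceil(TRAINING_BLOCKS / (BATCH_SIZE_PER_DEVICE * DEVICES))

# Assumption: the loss is a mean per-token cross-entropy in nats.
validation_perplexity = math.exp(VALIDATION_LOSS)

print(f'Steps per epoch: {steps_per_epoch:,}')
print(f'Validation perplexity: {validation_perplexity:.2f}')
```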

Limitations 🚧

Although the model has not been tested for bias, one would expect it to exhibit many, if not all, of the same biases as Phi-1.5.

One might also expect the model to exhibit a bias towards the type of language employed in laws, regulations and decisions (its source material) as well as towards Commonwealth and New South Wales law (the largest sources of documents in the Open Australian Legal Corpus at the time of the model's creation).

Finally, it is worth noting that the model may lack knowledge of Victorian, Northern Territory and Australian Capital Territory law, as licensing restrictions prevented the inclusion of such law in the training data.

Licence 📜

The model is issued under the same licence as its parent model, namely the Microsoft Research License.

Citation 🔖

If you've relied on the model for your work, please cite:

@misc{butler-2023-open-australian-legal-phi-1.5,
    author = {Butler, Umar},
    year = {2023},
    title = {Open Australian Legal Phi-1.5},
    publisher = {Hugging Face},
    version = {1.0.0},
    url = {https://huggingface.co/umarbutler/open-australian-legal-phi-1_5}
}

Acknowledgements 🙏

In the spirit of reconciliation, the author acknowledges the Traditional Custodians of Country throughout Australia and their connections to land, sea and community. He pays his respect to their Elders past and present and extends that respect to all Aboriginal and Torres Strait Islander peoples today.

The author thanks the sources of the Open Australian Legal Corpus for making their data available under open licences.

The author also acknowledges the developers of the many Python libraries relied upon in the training of the model, as well as the makers of Phi-1.5, which the model was built atop.

Finally, the author is eternally grateful for the endless support of his wife and her willingness to put up with many a late night spent writing code and quashing bugs.