Lawma 70B

Lawma 70B is a fine-tune of Llama 3 70B Instruct on 260 legal classification tasks derived from Supreme Court and Songer Court of Appeals databases. Lawma was fine-tuned on over 500k task examples, totalling 2B tokens. As a result, Lawma 70B outperforms GPT-4 on 95% of these legal classification tasks, on average by over 17 accuracy points. See our arXiv preprint and GitHub repository for more details.

Evaluations

We report mean classification accuracy across the 260 legal classification tasks that we consider. We use the standard MMLU multiple-choice prompt, and evaluate models zero-shot. You can find our evaluation code here.

| Model | All tasks | Supreme Court tasks | Court of Appeals tasks |
|---|---|---|---|
| Lawma 70B | 81.9 | 84.1 | 81.5 |
| Lawma 8B | 80.3 | 82.4 | 79.9 |
| GPT-4 | 62.9 | 59.8 | 63.4 |
| Llama 3 70B Inst | 58.4 | 47.1 | 60.3 |
| Mixtral 8x7B Inst | 43.2 | 24.4 | 46.4 |
| Llama 3 8B Inst | 42.6 | 32.8 | 44.2 |
| Majority classifier | 41.7 | 31.5 | 43.5 |
| Mistral 7B Inst | 39.9 | 19.5 | 43.4 |
| Saul 7B Inst | 34.4 | 20.2 | 36.8 |
| LegalBert | 24.6 | 13.6 | 26.4 |
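As a rough illustration of how a "mean accuracy across tasks" figure can be computed, the sketch below averages per-task accuracies with equal weight per task. Whether the reported numbers weight tasks equally or by number of examples is an assumption here, and the task names and predictions are toy placeholders, not real results.

```python
from statistics import mean


def task_accuracy(predictions, labels):
    """Fraction of examples in one task where the predicted choice equals the label."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def mean_task_accuracy(results):
    """Unweighted mean of per-task accuracies, in percent.

    `results` maps a task name to a (predictions, labels) pair.
    Equal weighting per task is an assumption of this sketch.
    """
    per_task = {name: task_accuracy(p, y) for name, (p, y) in results.items()}
    return 100 * mean(list(per_task.values())), per_task


# Toy data with hypothetical task names.
toy_results = {
    "issue_area": (["C", "A", "B", "C"], ["C", "A", "A", "C"]),  # 3 of 4 correct
    "disposition": (["A", "B"], ["A", "A"]),                     # 1 of 2 correct
}
overall, per_task = mean_task_accuracy(toy_results)
print(f"{overall:.1f}")  # 62.5
```

Averaging per task rather than per example keeps tasks with many examples from dominating the headline number.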

FAQ

What are the Lawma models useful for? We recommend using the Lawma models only for the legal classification tasks that they were fine-tuned on. The models were fine-tuned on multiple-choice questions, not on general instructions, so they only output multiple-choice letters (e.g., A, B, C) or numbers. The main takeaway of our paper is that specializing models leads to large improvements in performance. We therefore strongly recommend that practitioners further fine-tune Lawma on the actual tasks the models will be used for. Relatively few examples (i.e., dozens or hundreds) may already lead to large gains in performance.
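Since the models expect multiple-choice input, a small helper for assembling prompts can be handy when preparing inference or fine-tuning data. The sketch below mirrors the layout of the prompt in the "Example use" section of this card (opinion text, a blank line, a `Question:` line, lettered choices, and a trailing `Answer:`). The function name `build_prompt` is hypothetical, and whether this layout matches the training format exactly is an assumption.

```python
from string import ascii_uppercase


def build_prompt(opinion_text, question, choices):
    """Assemble a multiple-choice prompt: text, question, lettered choices, 'Answer:'."""
    if not choices:
        raise ValueError("at least one choice is required")
    if len(choices) > len(ascii_uppercase):
        raise ValueError("too many choices to label with single letters")
    lines = [opinion_text.strip(), "", f"Question: {question}"]
    # Label the choices A, B, C, ... in the order given.
    lines += [f"{letter}. {choice}" for letter, choice in zip(ascii_uppercase, choices)]
    lines.append("Answer:")
    return "\n".join(lines)


prompt = build_prompt(
    "Some opinion text.",
    "What is the issue area of the decision?",
    ["Criminal Procedure", "Civil Rights", "First Amendment"],
)
print(prompt)
```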

Should I use Lawma 8B or 70B? Lawma 70B outperforms Lawma 8B on about 67% of the tasks, on average by a small amount, typically between 1 and 2 accuracy points. Therefore, practitioners may prefer to use Lawma 8B for its significantly cheaper inference and fine-tuning, with little cost in terms of model performance.

What legal classification tasks is Lawma fine-tuned on? We consider almost all of the variables of the Supreme Court and Songer Court of Appeals databases. Our reasons to study these legal classification tasks are both technical and substantive. From a technical machine learning perspective, these tasks provide highly non-trivial classification problems where even the best models leave much room for improvement. From a substantive legal perspective, efficient solutions to such classification problems have rich and important applications in legal research.

Example use

from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "ricdomolm/lawma-70b"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(model_dir)

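def generate_response(input_text):
    # Tokenize the prompt; the tokenizer returns input_ids and attention_mask.
    inputs = tokenizer(input_text, return_tensors="pt")
    # Greedy decoding (do_sample=False) keeps the predicted label deterministic.
    outputs = model.generate(**inputs, max_length=2048, do_sample=False)
    # The decoded string contains the prompt followed by the generated answer.
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    return response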

input_text = """This case may seem at first blush too inconsequential to find its way into our books, but the issue it presents is of no small constitutional significance.
Appellant Paul Robert Cohen was convicted in the Los Angeles Municipal Court of violating that part of California Penal Code § 415 which prohibits “maliciously and willfully disturb[ing] the peace or quiet of any neighborhood or person... by... offensive conduct...” He was given 30 days’ imprisonment. The facts upon which his conviction rests are detailed in the opinion of the Court of Appeal of California, Second Appellate District, as follows:
“On April 26, 1968, the defendant was observed in the Los Angeles County Courthouse in the corridor outside of division 20 of the municipal court wearing a jacket bearing the words ‘F the Draft’ which were plainly visible. There were women and children present in the corridor. The defendant was arrested. The defendant testified that he wore the jacket knowing that the words were on the jacket as a means of informing the public of the depth of his feelings against the Vietnam War and the draft.
“The defendant did not engage in, nor threaten to engage in, nor did anyone as the result of his conduct in fact commit or threaten to commit any act of violence. The defendant did not make any loud or unusual noise, nor was there any evidence that he uttered any sound prior to his arrest.” 1 Cal. App. 3d 94, 97-98, 81 Cal. Rptr. 503, 505 (1969).
In affirming the conviction the Court of Appeal held that “offensive conduct” means “behavior which has a tendency to provoke others to acts of violence or to in turn disturb the peace,” and that the State had proved this element because, on the facts of this case, “[i]t was certainly reasonably foreseeable that such conduct might cause others to rise up to commit a violent act against the person of the defendant or attempt to forceably remove his jacket.” 1 Cal. App. 3d, at 99-100, 81 Cal.

Question: What is the issue area of the decision?
A. Criminal Procedure
B. Civil Rights
C. First Amendment
D. Due Process
E. Privacy
F. Attorneys
G. Unions
H. Economic Activity
I. Judicial Power
J. Federalism
K. Interstate Relations
L. Federal Taxation
M. Miscellaneous
N. Private Action
Answer:"""

output = generate_response(input_text)
print(output)
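Because `generate_response` decodes the full sequence, its return value contains the prompt followed by the model's continuation. A minimal sketch for pulling out just the predicted choice is shown below. The helper name `extract_choice` is hypothetical, and it assumes the prompt ends with `Answer:` and that the model replies with a single capital letter or a number, as described in the FAQ.

```python
import re


def extract_choice(response, marker="Answer:"):
    """Return the letter or number that follows the last `marker`, or None if absent."""
    _, found, tail = response.rpartition(marker)
    if not found:
        return None
    # Accept a single capital letter or a run of digits, standing alone as a token.
    match = re.match(r"\s*([A-Z]|\d+)\b", tail)
    return match.group(1) if match else None


print(extract_choice("Question: ...\nA. Criminal Procedure\nC. First Amendment\nAnswer: C"))  # C
print(extract_choice("Question: ...\nAnswer: 12"))  # 12
print(extract_choice("no marker in this text"))  # None
```

Returning `None` on unparseable output makes it easy to count such cases separately instead of silently scoring them as a wrong label.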

Citation

This model was trained for the project

Lawma: The Power of Specialization for Legal Tasks. Ricardo Dominguez-Olmedo, Vedant Nanda, Rediet Abebe, Stefan Bechtold, Christoph Engel, Jens Frankenreiter, Krishna Gummadi, Moritz Hardt, and Michael Livermore. 2024.

Please cite as:

@misc{dominguezolmedo2024lawmapowerspecializationlegal,
      title={Lawma: The Power of Specialization for Legal Tasks}, 
      author={Ricardo Dominguez-Olmedo and Vedant Nanda and Rediet Abebe and Stefan Bechtold and Christoph Engel and Jens Frankenreiter and Krishna Gummadi and Moritz Hardt and Michael Livermore},
      year={2024},
      eprint={2407.16615},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2407.16615}, 
}