metadata
license: mit
datasets:
  - ymoslem/Law-StackExchange
language:
  - en
metrics:
  - f1
base_model:
  - google/gemma-2-2b
library_name: mlx
tags:
  - legal
widget:
  - text: |
      <start_of_turn>user
      ## Instructions
      You are a helpful AI assistant.
      ## User
      How to make scrambled eggs?<end_of_turn>
      <start_of_turn>model

shellzero/gemma2-2b-ft-law-data-tag-generation

This model was converted to MLX format from google/gemma-2-2b. Refer to the original model card for more details on the model.

pip install mlx-lm

The model was LoRA fine-tuned with MLX for 1500 steps on the ymoslem/Law-StackExchange dataset and on synthetic data generated with GPT-4o and GPT-35-Turbo, using the prompt format below.

This fine-tune was one of our best runs on this data and achieved a high F1 score on our evaluation dataset. It was built as part of the Nvidia hackathon.

def format_prompt(system_prompt: str, title: str, question: str) -> str:
    """Format a title and question into the prompt layout used for fine-tuning."""
    return """<bos><start_of_turn>user
## Instructions
{}
## User
TITLE:
{}
QUESTION:
{}<end_of_turn>
<start_of_turn>model
""".format(
        system_prompt, title, question
    )

Here's an example of the system_prompt from the dataset:

Read the following title and question about a legal issue and assign the most appropriate tag to it. All tags must be in lowercase, ordered lexicographically and separated by commas.
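The output format that this system prompt asks for (lowercase, lexicographically ordered, comma-separated) can be illustrated with a small helper. This is only a sketch: `format_tags` is a hypothetical name that is not part of the dataset or of mlx_lm, and joining with a bare comma (no space) is an assumption, since the exact separator used in the training data is not stated here.

```python
def format_tags(tags: list[str]) -> str:
    """Join tags as the system prompt describes: lowercase, sorted, comma-separated."""
    # Strip whitespace, lowercase, and drop duplicates and empty entries.
    cleaned = {tag.strip().lower() for tag in tags if tag.strip()}
    # Assumption: tags are joined with a bare comma; the dataset may use ", " instead.
    return ",".join(sorted(cleaned))


print(format_tags(["Copyright", "contract-law ", "copyright"]))
# prints: contract-law,copyright
```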

Loading the model using mlx_lm

from mlx_lm import generate, load

# format_prompt is defined above; system_prompt, title, and question are strings you supply.
model, tokenizer = load("shellzero/gemma2-2b-ft-law-data-tag-generation")
response = generate(
    model,
    tokenizer,
    prompt=format_prompt(system_prompt, title, question),
    verbose=True,  # Print the prompt and response
    temp=0.0,  # Greedy decoding
    max_tokens=32,
)
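The generated `response` is a single string of tags, so it usually needs to be split before it can be compared with reference tags. The sketch below shows one possible way to do that and to score a single prediction with a set-based F1. Both `parse_tags` and `tag_f1` are hypothetical helpers, and the F1 variant is an assumption: how the F1 score reported for this model was actually computed is not documented here.

```python
def parse_tags(response: str) -> set[str]:
    """Split a generated tag string into a set of lowercase tags."""
    text = response.strip()
    # Keep only the first line; the model may keep generating after the tag list.
    first_line = text.splitlines()[0] if text else ""
    # Assumption: a trailing end-of-turn marker may appear in the raw output.
    first_line = first_line.replace("<end_of_turn>", "")
    return {tag.strip().lower() for tag in first_line.split(",") if tag.strip()}


def tag_f1(predicted: set[str], reference: set[str]) -> float:
    """Set-based F1 between predicted and reference tags for one question."""
    if not predicted and not reference:
        return 1.0  # Convention chosen here: two empty sets count as a perfect match.
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


predicted = parse_tags("contract-law, copyright<end_of_turn>")
print(tag_f1(predicted, {"copyright", "trademark"}))
# prints: 0.5
```

Averaging `tag_f1` over an evaluation set gives a per-example (sample-averaged) F1; micro- or macro-averaged F1 over tags would be computed differently.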