Aalap: A 32K context length Indian legal LLM

Aalap (Assistant for Legal and Paralegal functions in India) is an instruction fine-tuned version of Mistral 7B that can perform specific legal tasks in the Indian context. Details about which legal tasks Aalap can perform, and about its training dataset, can be found here.

This research model intends to show that we can develop tasks for the legal domain and teach LLMs to do them at an affordable cost. Details about the dataset, model training, and evaluations can be found in the paper. Aalap's training and evaluation code can be found in the repo.

What are Aalap's intended uses?

From the evaluation results, we can conclude that, for the tasks present in the training data, Aalap performs comparably to 'gpt-3.5-turbo'. But on the AIBE exam and LegalBench data, Aalap does no better than the Mistral 7B base model. Hence, Aalap is not a general-purpose Indian legal LLM; it does well only within the constraints of the specific legal tasks it was trained on.

Model Details

Aalap is a fine-tuned version of Mistral 7B. Aalap's training data is a synthetic dataset created to develop explanation-generation and legal task-solving capabilities.

Please refer to the Aalap paper for details on the model architecture.
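As a quick sanity check on the 32K context length mentioned above, the context window reported by the model config can be inspected without downloading the weights. This is a minimal sketch assuming the transformers library; max_position_embeddings is the standard field on Mistral-style configs:

import transformers

# Load only the config (no weights) and read the advertised context window.
config = transformers.AutoConfig.from_pretrained("opennyaiorg/Aalap-Mistral-7B-v0.1-bf16")
print(config.max_position_embeddings)  # expected to be 32768 for a 32K-context model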

License

Apache 2.0

Getting started with Aalap

Inference with Hugging Face library

import random

import torch
import transformers
from datasets import load_dataset  # needed for the dataset samples below

# Run on GPU when available, otherwise fall back to CPU.
if torch.cuda.is_available():
    torch.set_default_device("cuda")
else:
    torch.set_default_device("cpu")

model = transformers.AutoModelForCausalLM.from_pretrained("opennyaiorg/Aalap-Mistral-7B-v0.1-bf16", device_map='auto')
tokenizer = transformers.AutoTokenizer.from_pretrained("opennyaiorg/Aalap-Mistral-7B-v0.1-bf16")


## load the dataset to test on samples
dataset_pandas = load_dataset("opennyaiorg/aalap_instruction_dataset", split='test').to_pandas()
def get_sample(task=None, skip_keyword=None):
    # Pick a random test example, optionally restricted to one task
    # or excluding tasks whose names contain skip_keyword.
    if task is None:
        task_list = list(set(dataset_pandas['task'].to_list()))
        if skip_keyword:
            task_list = [i for i in task_list if skip_keyword not in i]
        task = random.choice(task_list)
    print('Random sample selected from', task)
    filtered = dataset_pandas[dataset_pandas.task == task]['combined_input_prompt'].to_list()
    # Split the stored example into the prompt (up to [/INST]) and the reference response.
    choice = random.choice(filtered)
    prompt = choice.split('[/INST]')[0] + '[/INST]'
    response = choice.split('[/INST]')[-1]
    return prompt, response


## Provide your custom prompt
user_prompt = 'Who are you and what are your capabilities?'
system_prompt = '''You are a helpful assistant.'''
prompt = f'<s> [INST] <<SYS>> {system_prompt} <</SYS>> {user_prompt} [/INST]'

inputs = tokenizer(prompt, return_tensors='pt')
output_ids = model.generate(inputs["input_ids"], max_new_tokens=512)  # max_new_tokens chosen for illustration
answer = tokenizer.batch_decode(output_ids)[0]

print("Custom prompt answer:\n", answer)

## Test on a random data point from dataset
prompt, response = get_sample(task=None, skip_keyword=None)

inputs = tokenizer(prompt, return_tensors='pt')
output_ids = model.generate(inputs["input_ids"], max_new_tokens=512)
answer = tokenizer.batch_decode(output_ids)[0]

print("Random prompt answer:\n", answer)

Limitations:

  1. We synthetically generated data for various legal tasks and used a law-relevant portion of the Orca dataset to teach the LLM to generate explanations and perform certain tasks.
  2. The model is not optimized for chat and supports only single-turn conversations.
  3. The model has not been trained with RLHF or DPO. It can sometimes generate wrong information, so it is recommended that an expert review its answers before they are used.
  4. Beyond explanation generation, the model inherits the capabilities and limitations of its base model (Mistral 7B).

Bias, Risks, and Limitations

Aalap, built upon Mistral 7B, retains many of its limitations, as well as the common limitations of other large language models and limitations caused by its own training process, including:

Data Biases: Large language models, trained on extensive data, can inadvertently carry biases present in the source data. Consequently, the models may generate outputs that could be potentially biased or unfair.

Lack of Contextual Understanding: Despite their impressive capabilities in language understanding and generation, these models exhibit limited real-world understanding, resulting in potential inaccuracies or nonsensical responses.

Lack of Transparency: Due to their complexity and size, large language models can act as “black boxes”, making it difficult to comprehend the rationale behind specific outputs or decisions. We have tried to address this by generating explanations, but some limitations remain.

Content Harms: Large language models can cause various types of content harms. It is important to be aware of them when using these models and to take action to prevent them; we recommend leveraging the content moderation services provided by various companies and institutions. We also hope for better regulations and standards around content harms for AI technologies from government and technology leaders in the future, and we value and acknowledge the important role that the research and open-source communities can play in this direction.

Hallucination: It is important to be cautious and not rely entirely on a language model for critical decisions or information with deep impact, as it is not obvious how to prevent these models from fabricating content. Moreover, it is unclear whether small models are more susceptible to hallucination in ungrounded generation use cases because of their smaller size and hence reduced memorization capacity. This is an active research topic, and we hope there will be more rigorous measurement, understanding, and mitigation of it.

Potential for Misuse: Without suitable safeguards, there is a risk that these models could be maliciously used for generating disinformation or harmful content.

Data Distribution: Aalap's performance is likely to correlate strongly with the distribution of the tuning data. This correlation might limit its accuracy in areas underrepresented in the training dataset, such as summary generation, reasoning, and in-context learning.

System messages: Aalap demonstrates variance in performance depending on the system instructions.
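As an illustration, the same user question can be wrapped with different system instructions using the prompt template from the Getting started section and the outputs compared. The build_prompt helper below is a hypothetical convenience for this sketch, not part of the released code:

def build_prompt(system_prompt, user_prompt):
    # Hypothetical helper mirroring the template used in the example above.
    return f'<s> [INST] <<SYS>> {system_prompt} <</SYS>> {user_prompt} [/INST]'

question = 'Summarize the key facts of this case.'
for system in ['You are a helpful assistant.', 'You are an experienced Indian legal assistant.']:
    print(build_prompt(system, question))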

Zero-Shot Settings: Aalap was trained on data that mostly simulates zero-shot settings. While the model demonstrates good performance in zero-shot settings, it does not show the same gains from few-shot learning that other models do.

Synthetic data: As Aalap is trained on synthetic data, it could inherit both the advantages and shortcomings of the models and methods used for data generation.

Citation

@misc{tiwari2024aalap,
      title={Aalap: AI Assistant for Legal & Paralegal Functions in India}, 
      author={Aman Tiwari and Prathamesh Kalamkar and Atreyo Banerjee and Saurabh Karn and Varun Hemachandran and Smita Gupta},
      year={2024},
      eprint={2402.01758},
      archivePrefix={arXiv},
      primaryClass={cs.CY}
}