Zyphen-CO-Legal-Autocomplete Micro

Overview

Zyphen-CO-Legal-Autocomplete Micro is an experimental language model fine-tuned to suggest autocomplete text for legal professionals. It is trained on legal judgments from 100 Colombian jurisdictions and aims to make legal drafting faster and more accurate. The current training set is limited; as testing continues, additional data sources will be added to broaden coverage and improve reliability.

Features

Experimental Model: Currently in the testing phase with foundational training on 100 Colombian jurisdictions.
Domain-Specific Expertise: Tailored to the Colombian legal framework so that suggestions stay relevant to local legal contexts.
Efficient Inference: Optimized with LoRA adapters and 4-bit quantization to minimize memory usage and accelerate response times.
Scalable Architecture: Designed to handle extensive legal documents with support for up to 30,000 tokens in context.
Seamless Integration: Loads through the standard Hugging Face transformers API, so it can be embedded into existing legal applications and workflows.
Spanish-Language Support: Understands and generates content in Spanish.
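
The 30,000-token limit listed above suggests that long prompts may need trimming before generation. The following is a minimal sketch, assuming the limit covers the prompt plus the generated tokens (the feature list does not say which); fit_prompt_to_context and MAX_CONTEXT_TOKENS are hypothetical names, not part of the model's API.

```python
# Assumption: 30,000 tokens is the total context window (prompt plus
# generated tokens); the feature list only says "up to 30,000 tokens".
MAX_CONTEXT_TOKENS = 30_000


def fit_prompt_to_context(input_ids, max_new_tokens=100,
                          max_context=MAX_CONTEXT_TOKENS):
    """Keep the most recent prompt tokens so generation still fits.

    `input_ids` is a plain list of token ids. For autocomplete, the text
    closest to the cursor matters most, so tokens are dropped from the left.
    """
    if max_new_tokens >= max_context:
        raise ValueError("max_new_tokens must be smaller than max_context")
    budget = max_context - max_new_tokens
    return input_ids[-budget:] if len(input_ids) > budget else input_ids
```

Dropping tokens from the left is one possible design choice; a document-level tool might instead prefer to keep the heading of the judgment and cut from the middle.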

Installation

Ensure you have the necessary libraries installed:

pip install transformers huggingface_hub torch

Loading the Model

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the tokenizer and model from Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("AcropolisLabs/Zyphen-CO-Legal-Autocomplete-micro")
model = AutoModelForCausalLM.from_pretrained("AcropolisLabs/Zyphen-CO-Legal-Autocomplete-micro")

# Move the model to GPU for faster inference (if available)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

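# Define the autocomplete function
def generate_autocomplete(prompt, max_new_tokens=100):
    """Return the prompt followed by a sampled continuation."""
    # Tokenize the prompt and move the tensors to the model's device
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    # Sample with top-k and nucleus (top-p) filtering at a moderate temperature
    outputs = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        temperature=0.7,
    )
    # The decoded text includes the original prompt, not only the new tokens
    return tokenizer.decode(outputs[0], skip_special_tokens=True)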

# Example usage
prompt = "En el presente fallo de la Corte Suprema de Justicia se dispuso que"
suggestion = generate_autocomplete(prompt)
print("Autocomplete Suggestion:", suggestion)
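
Because the decoded output contains the prompt as well as the continuation, an editor integration will usually want only the new text. The helper below is a hypothetical post-processing sketch (extract_completion is not part of the model or of transformers). It strips the prompt and, optionally, stops at the first sentence boundary.

```python
import re


# Hypothetical helper for illustration; not part of the published model.
def extract_completion(full_text, prompt, stop_at_sentence=True):
    """Return only the newly generated text that follows `prompt`."""
    # Decoded output normally starts with the prompt; fall back to the full
    # text if detokenization altered it (e.g. whitespace normalization).
    if full_text.startswith(prompt):
        completion = full_text[len(prompt):]
    else:
        completion = full_text
    if stop_at_sentence:
        # Cut after the first period followed by whitespace or end of text.
        match = re.search(r"\.(\s|$)", completion)
        if match:
            completion = completion[:match.start() + 1]
    return completion.strip()
```

The sentence rule is deliberately naive: legal abbreviations such as "art. 5" would end the suggestion early, so a production version would likely need a smarter boundary check.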

Data Preparation

The model was trained on a curated dataset comprising legal judgments from 100 Colombian jurisdictions. The data underwent the following preprocessing steps:

Filtering: Excluding unavailable or irrelevant content to ensure data quality.
Chunking: Splitting extensive texts into manageable segments, each appended with an end-of-sequence token to facilitate coherent text generation.
Tokenization: Converting textual data into tokens using the unsloth/Qwen2.5-0.5B tokenizer, optimized for efficient processing.
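
The filtering and chunking steps above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline actually used: the chunk size, the word-based splitting, and the literal end-of-sequence string are all hypothetical, and in practice the string would be read from tokenizer.eos_token.

```python
# Illustrative values only: the source does not state the chunk length or
# how chunk boundaries were chosen.
CHUNK_WORDS = 512             # hypothetical chunk size, in words
EOS_TOKEN = "<|endoftext|>"   # assumption; in practice use tokenizer.eos_token


def filter_documents(texts):
    """Drop missing or blank entries (the 'Filtering' step)."""
    return [t for t in texts if t and t.strip()]


def chunk_text(text, chunk_words=CHUNK_WORDS, eos_token=EOS_TOKEN):
    """Split one document into word-based chunks, each ending with EOS.

    Note: str.split() collapses runs of whitespace, including newlines.
    """
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) + eos_token
            for i in range(0, len(words), chunk_words)]


def prepare_corpus(texts, chunk_words=CHUNK_WORDS, eos_token=EOS_TOKEN):
    """Filter the raw documents, then chunk each remaining one."""
    chunks = []
    for text in filter_documents(texts):
        chunks.extend(chunk_text(text, chunk_words, eos_token))
    return chunks
```

The resulting strings would then be passed to the tokenizer, matching the order of the steps listed above.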

Future Plans: As testing progresses, the dataset will be expanded to include additional jurisdictions and more comprehensive legal documents to enhance the model's robustness and applicability.

Fine-Tuning Details

Base Model: unsloth/Qwen2.5-0.5B
Adapter Method: LoRA (Low-Rank Adaptation) with the following configurations:
- Rank (r): 16
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Alpha (lora_alpha): 16
- Dropout (lora_dropout): 0
Training Parameters:
- Batch Size: 1
- Gradient Accumulation Steps: 4
- Learning Rate: 2e-4
- Optimizer: AdamW with 8-bit precision
- Weight Decay: 0.01
- Scheduler: Linear
- Total Training Steps: 500
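
A few quantities follow directly from the settings above. The sketch below assumes a single training device, the standard LoRA scaling of lora_alpha / r, and no learning-rate warmup (warmup is not mentioned in the list). The dictionary keys and the linear_lr helper are illustrative names, although r, lora_alpha, lora_dropout, and target_modules match the field names used by PEFT's LoraConfig.

```python
# Values copied from the Fine-Tuning Details list.
LORA = {
    "r": 16,
    "lora_alpha": 16,
    "lora_dropout": 0.0,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
}
TRAIN = {
    "batch_size": 1,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-4,
    "weight_decay": 0.01,
    "max_steps": 500,
}

# Sequences contributing to each optimizer update (single device assumed).
effective_batch = TRAIN["batch_size"] * TRAIN["gradient_accumulation_steps"]
# Total training sequences processed over the whole run.
sequences_seen = effective_batch * TRAIN["max_steps"]
# Standard LoRA scales the adapter update by alpha / r.
lora_scaling = LORA["lora_alpha"] / LORA["r"]


def linear_lr(step, base_lr=TRAIN["learning_rate"],
              total_steps=TRAIN["max_steps"], warmup_steps=0):
    """Linear decay to zero; warmup_steps=0 is an assumption."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / (total_steps - warmup_steps))


print(effective_batch, sequences_seen, lora_scaling, linear_lr(250))
```

Under these assumptions each update averages gradients over 4 sequences, the run processes 2,000 sequences in total, and the adapter update is applied at scale 1.0.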

Model Evaluation

Zyphen-CO-Legal-Autocomplete Micro has undergone preliminary evaluation of its ability to generate contextually relevant and legally accurate autocomplete suggestions within the 100 jurisdictions it was trained on. Feedback from legal professionals indicates promising utility, and further assessments will identify areas for improvement as the dataset expands.

Usage Guidelines

To integrate Zyphen-CO-Legal-Autocomplete Micro into your applications:

Import the Model and Tokenizer: As demonstrated in the Loading the Model section.
Generate Autocomplete Suggestions: Call the generate_autocomplete function with an appropriate legal prompt.
Integrate with User Interfaces: Embed the autocomplete functionality in your legal software tools.
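
For the user-interface step, one possible pattern is to wrap the generation function so that very short inputs are skipped and repeated prompts are served from a cache. This is a sketch only: make_cached_autocomplete, cache_size, and min_chars are hypothetical names, and generate_fn stands for any prompt-to-text callable such as generate_autocomplete.

```python
from functools import lru_cache


def make_cached_autocomplete(generate_fn, cache_size=256, min_chars=10):
    """Wrap a prompt -> text callable for use behind a text editor."""

    @lru_cache(maxsize=cache_size)
    def _cached(prompt):
        return generate_fn(prompt)

    def autocomplete(prompt):
        # Skip the model call until the user has typed enough context.
        if len(prompt.strip()) < min_chars:
            return ""
        return _cached(prompt)

    return autocomplete
```

Because generate_autocomplete samples its output, caching pins the first sample returned for each prompt; whether that is desirable depends on the application.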

Contributions

Contributions are highly valued! If you have suggestions, improvements, or encounter any issues, please submit an issue or a pull request. Ensure that your contributions align with the project's focus on legal domain expertise and efficiency.

Contact

For inquiries or support, please contact support@acropolisis.com.

Acknowledgements

Unsloth: For their efficient model optimization techniques.
Hugging Face: For providing robust tools and an excellent community.
TRL: For their insightful fine-tuning methodologies.
Colombian Legal Institutions: For providing comprehensive legal data essential for training.
