MrAl3x0 committed
Commit 3712c2c · 1 Parent(s): 6d05e12

style: make entire codebase PEP 8 compliant using Ruff
.gitignore CHANGED
@@ -1,40 +1,22 @@
-# Python
 __pycache__/
 *.pyc
+*.so
 *.egg-info/
 .venv/
 .pytest_cache/
 .mypy_cache/
-*.so
 build/
 dist/
 htmlcov/
 .coverage.*
 .nox/
 .tox/
-pip-log.txt
-pip-delete-this-directory.txt
 .ipynb_checkpoints/
-
-# Environment
 .env
-
-# VSCode
-.vscode/*.code-workspace
-.vscode/*.log
-.vscode/*.vsix
-.vscode/*.bak
-.vscode/.history/
-.vscode/extensions.json
-.vscode/launch.json
-.vscode/settings.json
-
-# Operating System Files
+.vscode/
 .DS_Store
 Thumbs.db
 desktop.ini
-
-# Logs and temporary files
 *.log
 *.tmp
 *.bak
README.md CHANGED
@@ -1,136 +1,134 @@
-# LexAI Demo
+# LexAI
 
 ## AI-Powered Legal Research Assistant
 
-This repository hosts a demonstration of **LexAI**, an AI-powered legal research assistant designed to provide relevant legal information based on user queries and specified locations. This project serves as a proof of concept, showcasing the integration of large language models (LLMs) with local embedding data for specialized information retrieval.
+LexAI is an AI assistant that delivers jurisdiction-specific legal information by integrating OpenAI's language models with local vector embeddings. The system uses semantic search to surface relevant legal references and provides a web interface for users to query the model interactively.
 
-### Features
-
-![LexAI Demo Screenshot](assets/screenshot.png)
-
-- **AI-Powered Responses**: Utilizes OpenAI's GPT-4 model to generate natural language responses to legal queries.
-- **Location-Specific Information**: Provides legal information tailored to specific jurisdictions (currently Boulder County, Colorado, and Denver, Colorado).
-- **Semantic Search**: Employs embeddings and vector similarity search to find the most relevant legal documents.
-- **Interactive Web Interface**: Built with Gradio for an easy-to-use, browser-based demonstration.
+![LexAI Screenshot](assets/screenshot.png)
+
+---
+
+## Features
+
+- **GPT-4 Integration**: Uses OpenAI's GPT-4 to generate concise, relevant legal responses.
+- **Jurisdiction-Specific Search**: Preloaded embeddings for Boulder County and Denver, Colorado.
+- **Semantic Search Engine**: Uses cosine similarity for embedding-based document retrieval.
+- **Modern Web Interface**: Built with Gradio for real-time interaction.
+- **Modular Design**: Separation of logic for UI, inference, and API handling.
+- **Fully Tested**: Includes unit tests for embedding loading, matching logic, and OpenAI API integration.
 
 ---
 
-### Getting Started
-
-Follow these steps to set up and run the LexAI demo on your local machine.
-
-#### 1. Clone the Repository
+## Getting Started
+
+### 1. Clone the Repository
 
 ```bash
-git clone https://github.com/alexulanch/lexai-demo.git
-cd lexai-demo
+git clone https://github.com/alexulanch/lexai.git
+cd lexai
 ```
 
-This project uses [Git LFS](https://git-lfs.github.com/) to manage the embedding data.
-
-If you’re **not using the provided dev container**, install Git LFS before cloning:
+### 2. Install Git LFS (if needed)
+
+This project uses [Git LFS](https://git-lfs.github.com/) for storing large `.npz` embedding files.
 
 ```bash
 git lfs install
 git lfs pull
 ```
----
-
-#### 2. Install Dependencies
-
-Install the required Python packages using pip. The dependencies are: `pandas`, `numpy`, `openai`, `gradio`, `scipy`, and `python-dotenv`.
+
+### 3. Install Python Dependencies
 
 ```bash
 pip install -r requirements.txt
 ```
 
----
-
-#### 3. Configure Your OpenAI API Key
-
-This application relies on the OpenAI API. You will need an API key to access the models used for embeddings and chat completions (e.g., `text-embedding-ada-002`, `gpt-4`).
-
-**Using a `.env` file (recommended for local development):**
-
-1. Create a file named `.env` in the root directory of the project.
-2. Add your API key to the file like this:
-
-```dotenv
-OPENAI_API_KEY="your_openai_api_key_here"
-```
-
-3. Ensure `.env` is listed in `.gitignore` to avoid committing it by mistake.
+### 4. Configure OpenAI API Key and Embedding Paths
+
+Create a `.env` file in the root directory:
+
+```dotenv
+OPENAI_API_KEY=your_openai_api_key_here
+BOULDER_EMBEDDINGS_PATH=lexai/data/boulder_embeddings.npz
+DENVER_EMBEDDINGS_PATH=lexai/data/denver_embeddings.npz
+```
 
 ---
 
-#### 4. Run the Application
-
-Start the Gradio app:
+## Running the App
 
 ```bash
 python -m lexai
 ```
 
-You’ll see a local URL like `http://127.0.0.1:7860` — open it in your browser to use LexAI Demo.
+Then open `http://127.0.0.1:7860` in your browser.
 
 ---
 
-### Project Structure
+## Project Structure
 
 ```
-lexai-demo/
-├── .devcontainer/            # Dev container config for VS Code
-│   └── devcontainer.json
-├── lexai/                    # Main Python package
+.
+├── LICENSE
+├── README.md
+├── assets
+│   └── screenshot.png
+├── lexai
 │   ├── __init__.py
-│   ├── __main__.py           # Gradio app entry point
-│   ├── config.py             # Global app config and constants
-│   ├── core/                 # Core logic components
-│   │   ├── data_loader.py    # Loads embedding data
-│   │   └── matcher.py        # Semantic search logic
-│   └── services/             # External API integrations
-│       └── openai_client.py  # Interacts with OpenAI API
-├── pyproject.toml            # Project metadata and build config
-├── requirements.txt          # Python dependencies
-└── .gitignore                # Files/directories Git should ignore
+│   ├── __main__.py
+│   ├── config.py
+│   ├── core
+│   │   ├── __init__.py
+│   │   ├── data_loader.py
+│   │   ├── match_engine.py
+│   │   └── matcher.py
+│   ├── data
+│   │   ├── boulder_embeddings.npz
+│   │   └── denver_embeddings.npz
+│   ├── models
+│   │   └── embedding_model.py
+│   ├── services
+│   │   └── openai_client.py
+│   └── ui
+│       ├── __init__.py
+│       └── gradio_interface.py
+├── pyproject.toml
+├── pytest.ini
+├── requirements.txt
+└── tests
+    ├── __init__.py
+    ├── test_data_loader.py
+    ├── test_matcher.py
+    └── test_openai_client.py
 ```
 
 ---
 
-### Usage
-
-1. Enter your legal question in the "Query" textbox.
-2. Select the desired "Location" (Boulder or Denver) from the dropdown.
-3. Click "Submit" to get an AI-generated response and relevant legal references.
-4. Use the "Clear" button to reset.
-5. Explore the example queries provided.
-
----
-
-### Error Handling
-
-The app handles several common error cases:
-
-- `Invalid OpenAI API key...`: Check your `.env` file or environment variable setup.
-- `OpenAI API Error...`: Rate limits, network issues, etc.
-- `File Error...`: Missing or unreadable `.npz` embedding files.
-- `Input Error...`: Malformed or missing user input.
-
----
-
-### Contributing
-
-Contributions are welcome! Open an issue or pull request with ideas or fixes.
+## Testing
+
+LexAI includes a full suite of unit tests using `pytest`.
+
+To run the tests:
+
+```bash
+pytest
+```
+
+Tests are located in the `tests/` directory and cover:
+
+- Embedding data loading
+- Semantic similarity matching
+- OpenAI API interaction
 
 ---
 
-### License
+## License
 
-MIT
+MIT License
 
 ---
 
-### Acknowledgements
+## Acknowledgements
 
 - Built with [Gradio](https://gradio.app)
 - Powered by [OpenAI](https://openai.com)
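The semantic search the README describes ranks documents by cosine similarity between a query embedding and precomputed document embeddings. A minimal sketch of that ranking step, using tiny made-up 3-dimensional vectors in place of real OpenAI embeddings (which are much higher-dimensional):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Toy stand-ins for real embedding vectors.
doc_embeddings = np.array([
    [1.0, 0.0, 0.0],   # doc 0
    [0.0, 1.0, 0.0],   # doc 1
    [0.9, 0.1, 0.0],   # doc 2: nearly parallel to doc 0
])
query = np.array([1.0, 0.05, 0.0])

# Cosine distance = 1 - cosine similarity; smaller means more similar.
distances = cdist(query.reshape(1, -1), doc_embeddings, metric="cosine")[0]
top2 = np.argsort(distances)[:2]
# docs 0 and 2 point almost the same direction as the query, so they rank first
```

The same `cdist(..., metric="cosine")` + `argsort` idiom appears in `lexai/core/matcher.py`.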
assets/screenshot.png CHANGED

Git LFS Details (before)
  • SHA256: dd2bd4dcccaddbeffda844540cc35ad9775c36fabda0ed5afd5d0be0f856552a
  • Pointer size: 131 Bytes
  • Size of remote file: 177 kB

Git LFS Details (after)
  • SHA256: 78f6b31d42be479f16bf8cf654968120c769199bdc9f538dee9e74874126da5d
  • Pointer size: 131 Bytes
  • Size of remote file: 175 kB
lexai/__main__.py CHANGED
@@ -1,165 +1,16 @@
 import logging
-import os
-import openai
-import gradio as gr
-from dotenv import load_dotenv
 
-if not os.getenv("OPENAI_API_KEY"):
-    load_dotenv(override=True)
+from lexai.ui.gradio_interface import build_interface
 
-from lexai.config import (
-    LOCATION_INFO,
-    APP_DESCRIPTION,
-    AI_ROLE_TEMPLATE,
-)
-
-from lexai.core.data_loader import load_embeddings_data
-from lexai.core.matcher import find_top_matches
-from lexai.services.openai_client import get_embedding, get_chat_completion
-
-logging.basicConfig(
-    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
-)
-
-
-def generate_matches(query: str, location: str) -> str:
-    """
-    Generate legal information matches based on the user's query and location.
-
-    This function orchestrates the process of generating legal information matches
-    using OpenAI's models and local embedding data. It returns an HTML response
-    containing the AI-generated response and references to relevant legal information.
-
-    Parameters
-    ----------
-    query : str
-        The user's query for legal information.
-    location : str
-        The location for which the user is seeking legal information.
-        Possible values are "Boulder" and "Denver".
-
-    Returns
-    -------
-    str
-        An HTML response containing the AI-generated response and references
-        to relevant legal information. In case of an error, an error message
-        is returned in HTML format.
-    """
-    try:
-        logging.info(f"Generating embedding for query: '{query}'")
-        query_embedding = get_embedding(query)
-
-        location_data = LOCATION_INFO.get(location)
-        if not location_data:
-            logging.error(f"No data found for location '{location}'.")
-            raise ValueError(
-                f"No data found for location '{location}'. Please select a valid location."
-            )
-
-        npz_file = location_data["npz_file"]
-        role_description_base = location_data["role_description"]
-
-        logging.info(f"Loading embeddings data from: {npz_file}")
-        embeddings, jurisdiction_data = load_embeddings_data(npz_file)
-
-        logging.info("Finding top matches...")
-        top_matches = find_top_matches(
-            query_embedding, embeddings, jurisdiction_data, num_matches=3
-        )
-
-        full_ai_role = f"{role_description_base}\n{AI_ROLE_TEMPLATE}"
-        top_matches_str = str(
-            top_matches
-        )
-
-        logging.info("Getting chat completion from OpenAI...")
-        ai_message = get_chat_completion(full_ai_role, top_matches_str, query)
-
-        html_response = "<p><strong>Response:</strong></p><p>" + ai_message + "</p>"
-        html_references = "<p><strong>References:</strong></p><ul>"
-        for match in top_matches:
-            url = match.get("url", "#")
-            title = match.get("title", "No Title")
-            subtitle = match.get("subtitle", "No Subtitle")
-            html_references += (
-                f'<li><a href="{url}" target="_blank">{title}: {subtitle}</a></li>'
-            )
-        html_references += "</ul>"
-
-        logging.info("Successfully generated response and references.")
-        return html_response + html_references
-
-    except openai.AuthenticationError:
-        logging.error("OpenAI Authentication Error: Invalid API key provided.")
-        return """<p style="font-family: Arial, sans-serif; font-size: 16px; color: #d9534f;">
-        <strong>Error:</strong> Invalid OpenAI API key. Please ensure your `OPENAI_API_KEY` environment variable is correctly set.
-        </p>"""
-    except openai.OpenAIError as e:
-        logging.error(f"OpenAI API Error: {e}")
-        return f"""<p style="font-family: Arial, sans-serif; font-size: 16px; color: #d9534f;">
-        <strong>OpenAI API Error:</strong> {str(e)}
-        </p>"""
-    except FileNotFoundError as e:
-        logging.error(f"File Not Found Error: {e}")
-        return f"""<p style="font-family: Arial, sans-serif; font-size: 16px; color: #d9534f;">
-        <strong>File Error:</strong> {str(e)} Please ensure embedding files are correctly placed.
-        </p>"""
-    except ValueError as e:
-        logging.error(f"Value Error: {e}")
-        return f"""<p style="font-family: Arial, sans-serif; font-size: 16px; color: #333;">
-        <strong>Input Error:</strong> {str(e)}
-        </p>"""
-    except Exception as e:
-        logging.exception(
-            "An unexpected error occurred during generate_matches.")
-        return f"""<p style="font-family: Arial, sans-serif; font-size: 16px; color: #d9534f;">
-        <strong>Notice:</strong> An unexpected error occurred while processing your request. Please see the details below:
-        <br>{str(e)}
-        </p>"""
-
-
-with gr.Blocks(title="LexAI") as iface:
-    gr.HTML("<h1 style='text-align: center;'>LexAI</h1>")
-    gr.Markdown(APP_DESCRIPTION)
-
-    with gr.Row():
-        with gr.Column(scale=2):
-            query_input = gr.Textbox(
-                label="Query", lines=3, placeholder="Enter your legal question here...")
-            location_input = gr.Dropdown(choices=list(
-                LOCATION_INFO.keys()), label="Location", value=list(LOCATION_INFO.keys())[0])
-            with gr.Row():
-                clear_btn = gr.Button("Clear", variant="secondary")
-                submit_btn = gr.Button("Submit", variant="primary")
-        with gr.Column(scale=3):
-            response_output = gr.HTML(
-                value="<p><strong>Response:</strong></p>",
-                show_label=False
-            )
-            gr.Button("Flag", variant="secondary")
-
-    def handle_submit(query, location):
-        return gr.update(value=generate_matches(query, location))
-
-    def handle_clear():
-        return gr.update(value="<p><strong>Response:</strong></p>")
-
-    submit_btn.click(fn=handle_submit, inputs=[
-        query_input, location_input], outputs=[response_output])
-    clear_btn.click(fn=handle_clear, outputs=[response_output])
-
-    gr.Examples(
-        examples=[
-            ["Is it legal for me to use rocks to construct a cairn in an outdoor area?", "Boulder"],
-            ["Is it legal to possess a dog and take ownership of it as a pet?", "Denver"],
-            ["Am I allowed to go shirtless in public spaces?", "Boulder"],
-            ["What is the maximum height I can legally build a structure?", "Denver"],
-            ["Is it legal to place indoor furniture on an outdoor porch?", "Boulder"],
-            ["Can I legally graze livestock like llamas on public land?", "Denver"],
-        ],
-        inputs=[query_input, location_input]
-    )
+
+def main():
+    logging.basicConfig(level=logging.INFO)
+    logging.getLogger("httpx").setLevel(logging.WARNING)
+
+    logging.info("Launching LexAI...")
+    iface = build_interface()
+    iface.launch()
+
 
 if __name__ == "__main__":
-    logging.info("Starting LexAI Gradio application...")
-    iface.launch()
+    main()
 
lexai/config.py CHANGED
@@ -1,18 +1,22 @@
-MODEL_ENGINE = "text-embedding-ada-002"
+import os
+
+EMBEDDING_MODEL = "text-embedding-ada-002"
 
 LOCATION_INFO = {
     "Boulder": {
-        "npz_file": "lexai/data/boulder_embeddings.npz",
+        "npz_file": os.getenv(
+            "BOULDER_NPZ_FILE", "lexai/data/boulder_embeddings.npz"
+        ),
         "role_description": (
-            "You are an AI-powered legal assistant specializing in the jurisdiction of "
-            "Boulder County, Colorado."
+            "You are an AI-powered legal assistant specializing in the jurisdiction "
+            "of Boulder County, Colorado."
         ),
     },
     "Denver": {
-        "npz_file": "lexai/data/denver_embeddings.npz",
+        "npz_file": os.getenv("DENVER_NPZ_FILE", "lexai/data/denver_embeddings.npz"),
         "role_description": (
-            "You are an AI-powered legal assistant specializing in the jurisdiction of "
-            "Denver, Colorado."
+            "You are an AI-powered legal assistant specializing in the jurisdiction "
+            "of Denver, Colorado."
         ),
     },
 }
@@ -20,13 +24,12 @@ LOCATION_INFO = {
 APP_DESCRIPTION = (
     "LexAI is an AI-powered legal research app designed to assist individuals, "
     "including law enforcement officers, legal professionals, and the general public, "
-    "in accessing accurate legal information. The app covers various jurisdictions "
-    "and ensures that users can stay informed and confident, regardless of their location. "
-    "This demo is meant to serve as a proof of concept."
+    "in accessing jurisdiction-specific legal information. While LexAI aims to provide "
+    "useful and relevant results, it does not constitute legal advice. Its output may "
+    "not always be accurate or up to date. Users should verify information "
+    "independently and consult qualified legal professionals when needed."
 )
 
-OPENAI_API_KEY_PLACEHOLDER = "Enter your OpenAI API key"
-
 GPT4_MODEL = "gpt-4"
 GPT4_TEMPERATURE = 0.7
 GPT4_MAX_TOKENS = 120
@@ -34,9 +37,11 @@ GPT4_TOP_P = 1
 GPT4_FREQUENCY_PENALTY = 0
 GPT4_PRESENCE_PENALTY = 0
 
-AI_ROLE_TEMPLATE = """
-Your expertise lies in providing accurate and timely information on the laws and regulations specific to your jurisdiction.
-Your role is to assist individuals, including law enforcement officers, legal professionals, and the general public,
-in understanding and applying legal standards within this jurisdiction. You are knowledgeable, precise, and always
-ready to offer guidance on legal matters. Your max_tokens is set to 120 so keep your response below that.
-"""
+AI_ROLE_TEMPLATE = (
+    "Your expertise lies in providing accurate and timely information on the laws and "
+    "regulations specific to your jurisdiction. Your role is to assist individuals, "
+    "including law enforcement officers, legal professionals, and the general public. "
+    "You help them understand and apply legal standards within this jurisdiction. You "
+    "are knowledgeable, precise, and always ready to offer guidance on legal matters. "
+    "Your max_tokens is set to 120, so keep your response below that."
+)
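The refactored config reads each embedding path with `os.getenv(name, default)`, so a deployment can point at different `.npz` files without code changes. The same pattern in isolation (the path values here are illustrative):

```python
import os

DEFAULT_PATH = "lexai/data/boulder_embeddings.npz"

# Start clean so the default branch is exercised first.
os.environ.pop("BOULDER_NPZ_FILE", None)

# Falls back to the bundled file unless the environment overrides it.
assert os.getenv("BOULDER_NPZ_FILE", DEFAULT_PATH) == DEFAULT_PATH

# An override set before the module is imported wins over the default.
os.environ["BOULDER_NPZ_FILE"] = "/mnt/data/boulder.npz"
assert os.getenv("BOULDER_NPZ_FILE", DEFAULT_PATH) == "/mnt/data/boulder.npz"
```

Note that because the lookup happens at module import time, any `.env` loading has to occur before `lexai.config` is first imported.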
lexai/core/data_loader.py CHANGED
@@ -1,9 +1,10 @@
+import os
+
 import numpy as np
 import pandas as pd
-import os
 
 
-def load_embeddings_data(npz_file_path: str) -> tuple[np.ndarray, pd.DataFrame]:
+def load_embeddings(npz_file_path: str) -> tuple[np.ndarray, pd.DataFrame]:
     """
     Loads embeddings and associated jurisdiction data from a .npz file.
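`load_embeddings` reads a `.npz` archive holding the embedding matrix alongside per-row jurisdiction metadata. A self-contained sketch of the underlying NumPy round-trip (the array key `embeddings` is illustrative, not necessarily the key the real files use):

```python
import os
import tempfile

import numpy as np

embeddings = np.random.rand(4, 8).astype(np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_embeddings.npz")
    # savez stores each keyword argument as a named array in one archive.
    np.savez(path, embeddings=embeddings)

    with np.load(path) as archive:
        loaded = archive["embeddings"]

assert loaded.shape == (4, 8)
assert np.allclose(loaded, embeddings)
```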
 
lexai/core/match_engine.py CHANGED
@@ -1,62 +1,87 @@
 import logging
-import os
+from html import escape
+
 import openai
-from dotenv import load_dotenv
 
-from lexai.config import LOCATION_INFO, AI_ROLE_TEMPLATE
+from lexai.config import AI_ROLE_TEMPLATE, LOCATION_INFO
 from lexai.core.data_loader import load_embeddings
 from lexai.core.matcher import find_top_matches
-from lexai.services.openai_client import get_embedding, get_chat_completion
+from lexai.services.openai_client import get_chat_completion, get_embedding
 
 logger = logging.getLogger(__name__)
 
-if not os.getenv("OPENAI_API_KEY"):
-    load_dotenv(override=True)
-
 
 def generate_matches(query: str, location: str) -> str:
-    try:
-        location_data = LOCATION_INFO.get(location)
-        if not location_data:
-            raise ValueError(f"Invalid location: '{location}'")
+    if location not in LOCATION_INFO:
+        logger.error(f"Invalid location: {location}")
+        return (
+            "<p><strong>Input Error:</strong> "
+            f"Invalid location: '{escape(location)}'</p>"
+        )
 
+    try:
         query_embedding = get_embedding(query)
-        embeddings, metadata_df = load_embeddings(location_data["npz_file"])
+        location_data = LOCATION_INFO[location]
+        embeddings, metadata = load_embeddings(location_data["npz_file"])
 
-        if embeddings.shape[0] != len(metadata_df):
+        if embeddings.shape[0] != len(metadata):
             raise ValueError(
-                "Mismatch between number of embeddings and metadata entries")
+                "Mismatch between number of embeddings and metadata entries"
+            )
+
+        top_matches = find_top_matches(query_embedding, embeddings, metadata)
 
-        top_matches = find_top_matches(
-            query_embedding, embeddings, metadata_df)
-        system_prompt = f"{location_data['role_description']}\n{AI_ROLE_TEMPLATE}"
+        system_prompt = (
+            f"{location_data['role_description']}\n{AI_ROLE_TEMPLATE}"
+        )
         ai_response = get_chat_completion(
             system_prompt, str(top_matches), query)
 
-        response_html = f"<p><strong>Response:</strong></p><p>{ai_response}</p>"
+        response_html = (
+            "<p><strong>Response:</strong></p>"
+            f"<p>{escape(ai_response)}</p>"
+        )
         reference_html = "<p><strong>References:</strong></p><ul>"
 
        for match in top_matches:
-            url = match.get("url", "#")
-            title = match.get("title", "Untitled")
-            subtitle = match.get("subtitle", "")
-            reference_html += f'<li><a href="{url}" target="_blank">{title}: {subtitle}</a></li>'
+            url = escape(match["url"])
+            title = escape(match["title"])
+            subtitle = escape(match["subtitle"])
+            reference_html += (
+                f'<li><a href="{url}" target="_blank">'
+                f"{title}: {subtitle}</a></li>"
+            )
 
         reference_html += "</ul>"
         return response_html + reference_html
 
     except openai.AuthenticationError:
         logger.error("Invalid OpenAI API key.")
-        return "<p style='color: #d9534f;'><strong>Error:</strong> Invalid OpenAI API key.</p>"
+        return (
+            "<p style='color: #d9534f;'><strong>Error:</strong> "
+            "Invalid OpenAI API key.</p>"
+        )
     except openai.OpenAIError as e:
         logger.error(f"OpenAI API Error: {e}")
-        return f"<p style='color: #d9534f;'><strong>OpenAI Error:</strong> {e}</p>"
+        return (
+            "<p style='color: #d9534f;'><strong>OpenAI Error:</strong> "
+            f"{escape(str(e))}</p>"
+        )
     except FileNotFoundError as e:
         logger.error(f"File not found: {e}")
-        return f"<p style='color: #d9534f;'><strong>File Error:</strong> {e}</p>"
+        return (
+            "<p style='color: #d9534f;'><strong>File Error:</strong> "
+            f"{escape(str(e))}</p>"
+        )
     except ValueError as e:
         logger.error(f"Value error: {e}")
-        return f"<p><strong>Input Error:</strong> {e}</p>"
+        return (
+            "<p><strong>Input Error:</strong> "
+            f"{escape(str(e))}</p>"
+        )
     except Exception as e:
         logger.exception("Unhandled exception during generate_matches.")
-        return f"<p style='color: #d9534f;'><strong>Unexpected error:</strong> {e}</p>"
+        return (
+            "<p style='color: #d9534f;'><strong>Unexpected error:</strong> "
+            f"{escape(str(e))}</p>"
+        )
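A notable change in `match_engine.py` is routing every interpolated value through `html.escape`, so a document title or error message containing markup is rendered as literal text instead of being injected into the response HTML. The behavior in isolation:

```python
from html import escape

title = '<script>alert("x")</script> Municipal Code'
li = f"<li>{escape(title)}</li>"

# Angle brackets and quotes become entities (&lt;, &gt;, &quot;), so the
# browser shows the text verbatim rather than interpreting it as markup.
assert "&lt;script&gt;" in li
assert "<script>" not in li
```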
lexai/core/matcher.py CHANGED
@@ -1,7 +1,9 @@
+from typing import Any
+
 import numpy as np
 import pandas as pd
 from scipy.spatial.distance import cdist
-from typing import Any
+
 
 def find_top_matches(
     query_embedding: np.ndarray,
@@ -30,10 +32,22 @@
     A list of dictionaries, where each dictionary represents a top match
     and contains its 'url', 'title', 'subtitle', and 'content'.
     """
+    if jurisdiction_data.empty or embeddings.shape[0] == 0:
+        return []
+
+    if jurisdiction_data.shape[0] != embeddings.shape[0]:
+        raise ValueError(
+            "Number of embeddings and metadata entries must match.")
+
+    if query_embedding.ndim != 1 or query_embedding.shape[0] != embeddings.shape[1]:
+        raise ValueError(
+            "Query embedding must match the dimensionality of the embeddings."
+        )
 
-    distances = cdist(query_embedding.reshape(1, -1), embeddings, metric="cosine")[0]
-    indices = np.argsort(distances)[:num_matches]
-    subset: pd.DataFrame = jurisdiction_data.loc[indices]
-    top_matches: list[dict[str, Any]] = subset.to_dict("records")
+    distances = cdist(query_embedding.reshape(1, -1),
+                      embeddings, metric="cosine")[0]
+    safe_num_matches = min(num_matches, len(jurisdiction_data))
+    indices = np.argsort(distances)[:safe_num_matches]
+    subset = jurisdiction_data.iloc[indices]
 
-    return top_matches
+    return subset.to_dict("records")
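The guards added to `find_top_matches` (shape checks, `min(num_matches, ...)` clamping, and positional `iloc` instead of label-based `loc`) can be exercised with synthetic data. This sketch mirrors the function's logic rather than importing it:

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

embeddings = np.eye(3)  # three orthogonal unit vectors
metadata = pd.DataFrame(
    {"title": ["a", "b", "c"]},
    index=[10, 20, 30],  # non-default index: loc[1] would fail, iloc[1] works
)
query = np.array([0.0, 1.0, 0.0])

distances = cdist(query.reshape(1, -1), embeddings, metric="cosine")[0]
k = min(5, len(metadata))            # requesting more matches than rows is safe
indices = np.argsort(distances)[:k]  # positional indices, smallest distance first
records = metadata.iloc[indices].to_dict("records")

assert records[0]["title"] == "b"    # the exact match ranks first
assert len(records) == 3
```

The non-default index shows why the switch to `iloc` matters: `argsort` returns positions, not labels, so `loc` on those values would select the wrong rows or raise a `KeyError`.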
lexai/models/embedding_model.py DELETED
File without changes
lexai/services/openai_client.py CHANGED
@@ -1,14 +1,16 @@
1
- from openai import OpenAI
2
- import numpy as np
3
  import os
 
 
 
 
4
  from lexai.config import (
5
- MODEL_ENGINE,
 
 
6
  GPT4_MODEL,
 
7
  GPT4_TEMPERATURE,
8
- GPT4_MAX_TOKENS,
9
  GPT4_TOP_P,
10
- GPT4_FREQUENCY_PENALTY,
11
- GPT4_PRESENCE_PENALTY,
12
  )
13
 
14
  API_KEY = os.getenv("OPENAI_API_KEY")
@@ -17,10 +19,7 @@ client = OpenAI(api_key=API_KEY)
17
 
18
  def get_embedding(text: str) -> np.ndarray:
19
  """
20
- Generates an embedding for the given text using OpenAI's text-embedding model.
21
-
22
- The OpenAI API key is loaded from the OPENAI_API_KEY environment variable
23
- to authenticate the request.
24
 
25
  Parameters
26
  ----------
@@ -30,67 +29,61 @@ def get_embedding(text: str) -> np.ndarray:
30
  Returns
31
  -------
32
  np.ndarray
33
- A NumPy array representing the embedding of the input text.
34
 
35
  Raises
36
  ------
37
  openai.AuthenticationError
38
- If the OPENAI_API_KEY environment variable is not set or is invalid.
39
  openai.OpenAIError
40
- If there's another issue with the OpenAI API call, such as network problems.
41
  """
42
- response = client.embeddings.create(input=text, model=MODEL_ENGINE)
43
  return np.array(response.data[0].embedding)
44
 
45
 
46
- def get_chat_completion(role_description: str, top_matches_str: str, query: str) -> str:
 
 
 
 
47
  """
48
- Generates a chat completion response using OpenAI's GPT-4 model.
49
-
50
- The OpenAI API key is loaded from the OPENAI_API_KEY environment variable
51
- to authenticate the request. The function constructs a conversation history
52
- with system and user roles to provide context to the language model.
53
 
54
  Parameters
55
  ----------
56
  role_description : str
57
- The system role description for the AI assistant, defining its persona
58
- and limitations.
59
  top_matches_str : str
60
- A string representation of the top legal information matches. This is
61
- provided as system context to help the AI formulate relevant responses.
62
  query : str
63
- The user's direct query or question.
64
 
65
  Returns
66
  -------
67
  str
68
- The AI-generated response message from the chat completion.
69
 
70
  Raises
71
  ------
72
  openai.AuthenticationError
73
- If the OPENAI_API_KEY environment variable is not set or is invalid.
74
  openai.OpenAIError
75
- If there's an issue with the OpenAI API call, such as rate limiting,
76
- or other API-related errors.
77
  """
78
-
79
- response = client.chat.completions.create(model=GPT4_MODEL,
80
- messages=[
81
- {"role": "system",
82
- "content": role_description.strip()},
83
- {"role": "system",
84
- "content": top_matches_str},
85
- {"role": "user",
86
- "content": query},
87
- {"role": "assistant",
88
- "content": ""},
89
- ],
90
- temperature=GPT4_TEMPERATURE,
91
- max_tokens=GPT4_MAX_TOKENS,
92
- top_p=GPT4_TOP_P,
93
- frequency_penalty=GPT4_FREQUENCY_PENALTY,
94
- presence_penalty=GPT4_PRESENCE_PENALTY)
95
 
96
  return response.choices[0].message.content.strip()
 
 
 
 import os
+
+import numpy as np
+from openai import OpenAI
+
 from lexai.config import (
+    EMBEDDING_MODEL,
+    GPT4_FREQUENCY_PENALTY,
+    GPT4_MAX_TOKENS,
     GPT4_MODEL,
+    GPT4_PRESENCE_PENALTY,
     GPT4_TEMPERATURE,
     GPT4_TOP_P,
 )
 
 API_KEY = os.getenv("OPENAI_API_KEY")
 
 def get_embedding(text: str) -> np.ndarray:
     """
+    Generates an embedding for the given text using OpenAI's embedding model.
 
     Parameters
     ----------
 
     Returns
     -------
     np.ndarray
+        A NumPy array representing the embedding.
 
     Raises
     ------
     openai.AuthenticationError
+        If the API key is not set or invalid.
     openai.OpenAIError
+        For other API-related issues.
     """
+    response = client.embeddings.create(input=text, model=EMBEDDING_MODEL)
     return np.array(response.data[0].embedding)
 
 
+def get_chat_completion(
+    role_description: str,
+    top_matches_str: str,
+    query: str,
+) -> str:
     """
+    Generates a chat completion using OpenAI's GPT-4 model.
 
     Parameters
     ----------
     role_description : str
+        Description of the assistant's persona and context.
     top_matches_str : str
+        Summary of top legal matches used to guide the assistant.
     query : str
+        The user's legal query.
 
     Returns
     -------
     str
+        The AI-generated response.
 
     Raises
     ------
     openai.AuthenticationError
+        If the API key is not set or invalid.
     openai.OpenAIError
+        For other API-related issues.
     """
+    response = client.chat.completions.create(
+        model=GPT4_MODEL,
+        messages=[
+            {"role": "system", "content": role_description.strip()},
+            {"role": "system", "content": top_matches_str},
+            {"role": "user", "content": query},
+            {"role": "assistant", "content": ""},
+        ],
+        temperature=GPT4_TEMPERATURE,
+        max_tokens=GPT4_MAX_TOKENS,
+        top_p=GPT4_TOP_P,
+        frequency_penalty=GPT4_FREQUENCY_PENALTY,
+        presence_penalty=GPT4_PRESENCE_PENALTY,
+    )
 
     return response.choices[0].message.content.strip()
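The reformatted `get_chat_completion` above keeps the RAG message layout: the persona and the retrieved matches go in as system messages, followed by the user's question. A minimal offline sketch of that layout, with a hypothetical stub standing in for the OpenAI client (the stub class and the injected `completions` parameter are illustrative, not part of the module):

```python
class StubCompletions:
    """Hypothetical stand-in for client.chat.completions, so this runs offline."""

    @staticmethod
    def create(**kwargs):
        # Report how many messages were assembled instead of calling the API.
        message = type("Msg", (), {"content": f"Received {len(kwargs['messages'])} messages. "})
        choice = type("Choice", (), {"message": message})
        return type("Resp", (), {"choices": [choice]})


def get_chat_completion(role_description, top_matches_str, query, completions):
    # Same message layout as the module; the client is injected here for testability.
    response = completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": role_description.strip()},
            {"role": "system", "content": top_matches_str},
            {"role": "user", "content": query},
            {"role": "assistant", "content": ""},
        ],
    )
    return response.choices[0].message.content.strip()


answer = get_chat_completion(
    "You are LexAI, a legal research assistant.",
    "1. Municipal parking code ...",
    "Can I park overnight on a residential street?",
    StubCompletions,
)
print(answer)  # Received 4 messages.
```

Injecting the client this way is the same seam `tests/test_openai_client.py` exploits with `@patch("lexai.services.openai_client.client")`.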
lexai/ui/gradio_interface.py CHANGED
@@ -1,12 +1,14 @@
-import gradio as gr
 import logging
-from lexai.config import LOCATION_INFO, APP_DESCRIPTION
 from lexai.core.match_engine import generate_matches
 
 logger = logging.getLogger(__name__)
 
 
-def launch_interface():
     with gr.Blocks(title="LexAI") as iface:
         gr.HTML("<h1 style='text-align: center;'>LexAI</h1>")
         gr.Markdown(APP_DESCRIPTION)
@@ -56,5 +58,5 @@ def launch_interface():
         inputs=[query_input, location_input]
     )
 
-    logger.info("Launching LexAI interface...")
-    iface.launch()
 
 import logging
+
+import gradio as gr
+
+from lexai.config import APP_DESCRIPTION, LOCATION_INFO
 from lexai.core.match_engine import generate_matches
 
 logger = logging.getLogger(__name__)
 
 
+def build_interface():
     with gr.Blocks(title="LexAI") as iface:
         gr.HTML("<h1 style='text-align: center;'>LexAI</h1>")
         gr.Markdown(APP_DESCRIPTION)
 
         inputs=[query_input, location_input]
     )
 
+    logger.info("LexAI interface built.")
+    return iface
pyproject.toml CHANGED
@@ -1,31 +1,34 @@
 [project]
 name = "lexai"
 version = "0.1.0"
-description = "A demo of LexAI, an AI legal assistant that delivers accurate legal information."
 readme = "README.md"
 requires-python = ">=3.8"
 license = { text = "MIT" }
 authors = [
     { name = "Alex Ulanch", email = "alexulanch@gmail.com" },
 ]
 keywords = ["AI", "Legal", "Gradio", "OpenAI", "RAG"]
 classifiers = [
-    "Programming Language :: Python :: 3",
-    "License :: OSI Approved :: MIT License",
-    "Operating System :: OS Independent",
-    "Development Status :: 3 - Alpha",
-    "Intended Audience :: Developers",
-    "Topic :: Scientific/Engineering :: Artificial Intelligence",
-    "Topic :: Software Development :: Libraries :: Application Frameworks",
 ]
 
 dependencies = [
-    "pandas",
-    "numpy",
-    "openai",
-    "gradio",
-    "scipy",
-    "python-dotenv",
 ]
 
 [project.urls]
@@ -42,11 +45,11 @@ include = ["lexai*"]
 
 [tool.black]
 line-length = 88
-target-version = ['py38']
 include = '\.pyi?$'
 exclude = '''
 /(
-    \.git
   | \.venv
   | \.mypy_cache
   | \.pytest_cache
@@ -60,20 +63,26 @@ exclude = '''
 '''
 
 [tool.isort]
-known_local_folder = ["lexai"]
 profile = "black"
-line_length = 88
 known_first_party = ["lexai"]
-skip_glob = ["**/data/*"]
 multi_line_output = 3
 include_trailing_comma = true
 force_grid_wrap = 0
 use_parentheses = true
 ensure_newline_before_comments = true
 
 [tool.pytest.ini_options]
 minversion = "6.0"
 addopts = "-ra -q"
-testpaths = [
-    "tests",
-]
 
 [project]
 name = "lexai"
 version = "0.1.0"
+description = "LexAI is an AI legal assistant that provides accurate, location-specific legal information in a clear and accessible format."
 readme = "README.md"
 requires-python = ">=3.8"
 license = { text = "MIT" }
+
 authors = [
     { name = "Alex Ulanch", email = "alexulanch@gmail.com" },
 ]
+
 keywords = ["AI", "Legal", "Gradio", "OpenAI", "RAG"]
+
 classifiers = [
+    "Programming Language :: Python :: 3",
+    "License :: OSI Approved :: MIT License",
+    "Operating System :: OS Independent",
+    "Development Status :: 3 - Alpha",
+    "Intended Audience :: Developers",
+    "Topic :: Scientific/Engineering :: Artificial Intelligence",
+    "Topic :: Software Development :: Libraries :: Application Frameworks"
 ]
 
 dependencies = [
+    "pandas",
+    "numpy",
+    "openai",
+    "gradio",
+    "scipy",
+    "python-dotenv"
 ]
 
 [project.urls]
 
 [tool.black]
 line-length = 88
+target-version = ["py38"]
 include = '\.pyi?$'
 exclude = '''
 /(
+    \.git
   | \.venv
   | \.mypy_cache
   | \.pytest_cache
 '''
 
 [tool.isort]
 profile = "black"
 known_first_party = ["lexai"]
+known_local_folder = ["lexai"]
+line_length = 88
 multi_line_output = 3
 include_trailing_comma = true
 force_grid_wrap = 0
 use_parentheses = true
 ensure_newline_before_comments = true
+skip_glob = ["**/data/*"]
 
 [tool.pytest.ini_options]
 minversion = "6.0"
 addopts = "-ra -q"
+testpaths = ["tests"]
+
+[tool.ruff]
+line-length = 88
+target-version = "py38"
+exclude = ["data", "build", "dist"]
+
+[tool.ruff.lint]
+select = ["E", "F", "W", "I"]
tests/test_data_loader.py CHANGED
@@ -1,7 +1,8 @@
-import pytest
 import numpy as np
 import pandas as pd
-from pathlib import Path
 
 from lexai.core.data_loader import load_embeddings
 
@@ -46,12 +47,12 @@ def test_load_embeddings_success(temp_npz_file):
 
 
 def test_load_embeddings_missing_key(broken_npz_missing_embeddings):
-    with pytest.raises(KeyError, match="Missing key 'embeddings'"):
         load_embeddings(broken_npz_missing_embeddings)
 
 
 def test_load_metadata_missing_key(broken_npz_missing_columns):
-    with pytest.raises(KeyError, match="Missing key 'titles'"):
         load_embeddings(broken_npz_missing_columns)
 
 
+from pathlib import Path
+
 import numpy as np
 import pandas as pd
+import pytest
 
 from lexai.core.data_loader import load_embeddings
 
 
 def test_load_embeddings_missing_key(broken_npz_missing_embeddings):
+    with pytest.raises(KeyError, match="Missing key"):
         load_embeddings(broken_npz_missing_embeddings)
 
 
 def test_load_metadata_missing_key(broken_npz_missing_columns):
+    with pytest.raises(KeyError, match="Missing key"):
         load_embeddings(broken_npz_missing_columns)
 
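The relaxed `match="Missing key"` pattern above only assumes that the loader raises a `KeyError` whose message starts with that prefix. A self-contained sketch of a loader honoring that contract (this `load_embeddings` is illustrative, not the module's implementation):

```python
import tempfile
from pathlib import Path

import numpy as np


def load_embeddings(path: Path) -> np.ndarray:
    """Illustrative loader: require an 'embeddings' key, as the tests expect."""
    with np.load(path) as archive:
        if "embeddings" not in archive:
            raise KeyError(f"Missing key 'embeddings' in {path.name}")
        return archive["embeddings"]


with tempfile.TemporaryDirectory() as tmp:
    # A well-formed archive loads normally.
    good = Path(tmp) / "good.npz"
    np.savez(good, embeddings=np.zeros((2, 3), dtype=np.float32))
    loaded = load_embeddings(good)

    # An archive without the expected key raises the KeyError the tests match on.
    bad = Path(tmp) / "bad.npz"
    np.savez(bad, titles=np.array(["a", "b"]))
    try:
        load_embeddings(bad)
        error_message = None
    except KeyError as err:
        error_message = str(err)

print(loaded.shape)  # (2, 3)
print("Missing key" in error_message)  # True
```

Matching only the stable prefix keeps the tests green even if the loader's message later names a different key.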
tests/test_main.py DELETED
File without changes
tests/test_matcher.py CHANGED
@@ -1,29 +1,46 @@
-import pytest
 import numpy as np
 import pandas as pd
 
 from lexai.core.matcher import find_top_matches
 
 
 @pytest.fixture
 def sample_embeddings():
-    return np.array([
-        [1.0, 0.1, 0.1],
-        [0.8, 0.3, 0.2],
-        [0.5, 0.5, 0.5],
-        [0.1, 0.1, 1.0],
-        [0.0, 0.0, 0.0]
-    ], dtype=np.float32)
 
 
 @pytest.fixture
 def sample_jurisdiction_data():
-    return pd.DataFrame({
-        "url": ["url1", "url2", "url3", "url4", "url5"],
-        "title": ["Title 1", "Title 2", "Title 3", "Title 4", "Title 5"],
-        "subtitle": ["Subtitle A", "Subtitle B", "Subtitle C", "Subtitle D", "Subtitle E"],
-        "content": ["Content X", "Content Y", "Content Z", "Content W", "Content V"]
-    })
 
 
 @pytest.fixture
@@ -41,35 +58,69 @@ def empty_jurisdiction_data():
     return pd.DataFrame(columns=["url", "title", "subtitle", "content"])
 
 
-def test_returns_expected_number_of_matches(sample_query_embedding, sample_embeddings, sample_jurisdiction_data):
     matches = find_top_matches(
-        sample_query_embedding, sample_embeddings, sample_jurisdiction_data, num_matches=3)
     assert len(matches) == 3
-    assert [m["title"] for m in matches] == ["Title 1", "Title 2", "Title 3"]
 
 
-def test_returns_all_available_matches_if_less_than_requested(sample_query_embedding, sample_embeddings, sample_jurisdiction_data):
     matches = find_top_matches(
-        sample_query_embedding, sample_embeddings, sample_jurisdiction_data, num_matches=10)
     assert len(matches) == len(sample_embeddings)
     assert matches[0]["title"] == "Title 1"
 
 
-def test_returns_empty_list_for_empty_embeddings(sample_query_embedding, empty_embeddings, empty_jurisdiction_data):
     matches = find_top_matches(
-        sample_query_embedding, empty_embeddings, empty_jurisdiction_data, num_matches=3)
     assert matches == []
 
 
-def test_returns_empty_list_for_empty_jurisdiction_data(sample_query_embedding, sample_embeddings, empty_jurisdiction_data):
     matches = find_top_matches(
-        sample_query_embedding, sample_embeddings, empty_jurisdiction_data, num_matches=3)
     assert matches == []
 
 
-def test_output_contains_expected_keys(sample_query_embedding, sample_embeddings, sample_jurisdiction_data):
     matches = find_top_matches(
-        sample_query_embedding, sample_embeddings, sample_jurisdiction_data, num_matches=1)
     match = matches[0]
     assert set(match.keys()) == {"url", "title", "subtitle", "content"}
     assert match["url"] == "url1"
@@ -78,24 +129,36 @@ def test_output_contains_expected_keys(sample_query_embedding, sample_embeddings
     assert match["content"] == "Content X"
 
 
-def test_handles_single_embedding_and_row(sample_query_embedding, sample_embeddings, sample_jurisdiction_data):
     matches = find_top_matches(
         sample_query_embedding,
-        sample_embeddings[0:1],
-        sample_jurisdiction_data.iloc[0:1],
-        num_matches=3
     )
     assert len(matches) == 1
     assert matches[0]["title"] == "Title 1"
 
 
-def test_raises_for_invalid_query_embedding_shape(sample_embeddings, sample_jurisdiction_data):
-    bad_embedding = np.array([1.0, 2.0], dtype=np.float32)
-    with pytest.raises(ValueError, match="same number of columns"):
-        find_top_matches(bad_embedding, sample_embeddings,
-                         sample_jurisdiction_data, num_matches=1)
 
-    scalar_embedding = np.array(1.0, dtype=np.float32)
     with pytest.raises(ValueError):
-        find_top_matches(scalar_embedding, sample_embeddings,
-                         sample_jurisdiction_data, num_matches=1)
 
 import numpy as np
 import pandas as pd
+import pytest
 
 from lexai.core.matcher import find_top_matches
 
 
 @pytest.fixture
 def sample_embeddings():
+    return np.array(
+        [
+            [1.0, 0.1, 0.1],
+            [0.8, 0.3, 0.2],
+            [0.5, 0.5, 0.5],
+            [0.1, 0.1, 1.0],
+            [0.0, 0.0, 0.0],
+        ],
+        dtype=np.float32,
+    )
 
 
 @pytest.fixture
 def sample_jurisdiction_data():
+    return pd.DataFrame(
+        {
+            "url": ["url1", "url2", "url3", "url4", "url5"],
+            "title": ["Title 1", "Title 2", "Title 3", "Title 4", "Title 5"],
+            "subtitle": [
+                "Subtitle A",
+                "Subtitle B",
+                "Subtitle C",
+                "Subtitle D",
+                "Subtitle E",
+            ],
+            "content": [
+                "Content X",
+                "Content Y",
+                "Content Z",
+                "Content W",
+                "Content V",
+            ],
+        }
+    )
 
 
 @pytest.fixture
 
     return pd.DataFrame(columns=["url", "title", "subtitle", "content"])
 
 
+def test_returns_expected_number_of_matches(
+    sample_query_embedding, sample_embeddings, sample_jurisdiction_data
+):
     matches = find_top_matches(
+        sample_query_embedding,
+        sample_embeddings,
+        sample_jurisdiction_data,
+        num_matches=3,
+    )
     assert len(matches) == 3
+    assert [match["title"] for match in matches] == [
+        "Title 1",
+        "Title 2",
+        "Title 3",
+    ]
 
 
+def test_returns_all_available_matches_if_less_than_requested(
+    sample_query_embedding, sample_embeddings, sample_jurisdiction_data
+):
     matches = find_top_matches(
+        sample_query_embedding,
+        sample_embeddings,
+        sample_jurisdiction_data,
+        num_matches=10,
+    )
     assert len(matches) == len(sample_embeddings)
     assert matches[0]["title"] == "Title 1"
 
 
+def test_returns_empty_list_for_empty_embeddings(
+    sample_query_embedding, empty_embeddings, empty_jurisdiction_data
+):
     matches = find_top_matches(
+        sample_query_embedding,
+        empty_embeddings,
+        empty_jurisdiction_data,
+        num_matches=3,
+    )
     assert matches == []
 
 
+def test_returns_empty_list_for_empty_jurisdiction_data(
+    sample_query_embedding, sample_embeddings, empty_jurisdiction_data
+):
     matches = find_top_matches(
+        sample_query_embedding,
+        sample_embeddings,
+        empty_jurisdiction_data,
+        num_matches=3,
+    )
     assert matches == []
 
 
+def test_output_contains_expected_keys(
+    sample_query_embedding, sample_embeddings, sample_jurisdiction_data
+):
     matches = find_top_matches(
+        sample_query_embedding,
+        sample_embeddings,
+        sample_jurisdiction_data,
+        num_matches=1,
+    )
     match = matches[0]
     assert set(match.keys()) == {"url", "title", "subtitle", "content"}
     assert match["url"] == "url1"
 
     assert match["content"] == "Content X"
 
 
+def test_handles_single_embedding_and_row(
+    sample_query_embedding, sample_embeddings, sample_jurisdiction_data
+):
     matches = find_top_matches(
         sample_query_embedding,
+        sample_embeddings[:1],
+        sample_jurisdiction_data.iloc[:1],
+        num_matches=3,
+    )
     assert len(matches) == 1
     assert matches[0]["title"] == "Title 1"
 
 
+def test_raises_for_invalid_query_embedding_shape(
+    sample_embeddings, sample_jurisdiction_data
+):
+    invalid_vector = np.array([1.0, 2.0], dtype=np.float32)
+    with pytest.raises(ValueError, match="dimensionality of the embeddings"):
+        find_top_matches(
+            invalid_vector,
+            sample_embeddings,
+            sample_jurisdiction_data,
+            num_matches=1,
+        )
 
+    scalar_value = np.array(1.0, dtype=np.float32)
     with pytest.raises(ValueError):
+        find_top_matches(
+            scalar_value,
+            sample_embeddings,
+            sample_jurisdiction_data,
+            num_matches=1,
+        )
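The fixtures and the new `match="dimensionality of the embeddings"` pattern above imply a cosine-similarity ranking with shape validation. A standalone sketch of that contract (illustrative only; the real `find_top_matches` also returns URL, subtitle, and content metadata per match):

```python
import numpy as np


def top_matches(query: np.ndarray, embeddings: np.ndarray, titles, k: int):
    """Illustrative cosine-similarity ranking, mirroring what the tests assume."""
    if embeddings.size == 0:
        return []  # no corpus -> no matches, as the empty-fixture tests expect
    if query.ndim != 1 or query.shape[0] != embeddings.shape[1]:
        raise ValueError("query must match the dimensionality of the embeddings")
    # Cosine similarity, guarding against zero-norm rows like [0.0, 0.0, 0.0].
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
    scores = embeddings @ query / np.where(norms == 0, 1.0, norms)
    order = np.argsort(scores)[::-1][:k]
    return [titles[i] for i in order]


embeddings = np.array([[1.0, 0.1, 0.1], [0.8, 0.3, 0.2], [0.1, 0.1, 1.0]])
titles = ["Title 1", "Title 2", "Title 3"]
print(top_matches(np.array([1.0, 0.0, 0.0]), embeddings, titles, k=2))
# ['Title 1', 'Title 2']
```

The row closest in direction to the query ranks first, which is why the fixture's `[1.0, 0.1, 0.1]` row ("Title 1") consistently tops the expected results.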
tests/test_matching.py DELETED
File without changes
tests/test_openai_client.py CHANGED
@@ -1,6 +1,8 @@
 
 
 import numpy as np
-from unittest.mock import patch, MagicMock
-from lexai.services.openai_client import get_embedding, get_chat_completion
 
 
 @patch("lexai.services.openai_client.client")
 
+from unittest.mock import MagicMock, patch
+
 import numpy as np
+
+from lexai.services.openai_client import get_chat_completion, get_embedding
 
 
 @patch("lexai.services.openai_client.client")